What's the Clojure data science toolkit?

didibus · February 24, 2019, 6:23am

So a lot of people are wondering, what is the Clojure data-science toolkit?

Because I’m new to data science, I’ll first ask: what is the Python data-science toolkit?

A quick google came up with: https://jakevdp.github.io/PythonDataScienceHandbook/

Where it seems the common toolkit would be:

IPython shell

From the looks of it, IPython just looks like nREPL to me, or any of the many Clojure REPLs, with some additional syntax to make Python more REPL-friendly.

Is there anything the IPython shell has that nREPL, for example, doesn’t?

Jupyter notebook with IPython kernel

Well, this seems covered quite well in Clojure by using Jupyter with the IClojure or Clojupyter kernel.

Or, we can even use Oz or Saite instead of Jupyter for a graphical output of our Clojure REPL.

Even simpler, one can make use of JavaFX or Swing, or just CIDER’s support for rendering images in the CIDER REPL.

Is there anything that IPython Jupyter Kernel has that IClojure or Clojupyter does not? Or that can’t be done as easily with Oz, Saite, or a JavaFX/Swing solution?

NumPy

This seems to address the following: efficient storage and manipulation of numerical arrays

Now, out of the box, Clojure excels at manipulating data through its standard persistent data structures, its core sequence functions, and transducers. But it is not efficient at it. Well, it’s pretty damn fast, but not as fast as NumPy.

For that, it seems we want raw arrays. In Clojure, that would be all the -array functions like make-array, into-array, amap, areduce, aget, aset, etc.

These would be similar to the array module in Python 3.

This would count as efficient storage, but NumPy adds efficient computation over them.
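
To make the storage-versus-computation distinction concrete, here is a minimal sketch using only Python's standard `array` module (no NumPy). The variable names are made up for illustration. It shows that a typed array gives compact storage, much like a Java `double[]` behind Clojure's `aget`/`aset`, but that arithmetic over it still has to be written as an element-by-element loop:

```python
from array import array

# A typed array stores raw C doubles contiguously, much like a Java double[].
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Storage is compact: every element occupies exactly `itemsize` bytes.
bytes_used = values.itemsize * len(values)

# There is no vectorised arithmetic: `values * 2` repeats the array
# (sequence repetition) instead of doubling each element.
repeated = values * 2

# Element-wise computation has to be spelled out as an explicit loop,
# which is the part NumPy replaces with optimised native code.
doubled = array("d", (x * 2 for x in values))
total = sum(values)
```

The Clojure equivalent of that explicit loop would be `amap` or `areduce` over a primitive array.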

Now, the options I see in Clojure are core.matrix, Neanderthal, nd4j, or roll your own over standard Java arrays.

Neanderthal seems like the way to go, except for one thing: you can’t use it without setting up BLAS, and that’s annoying when you don’t care about the utmost performance. A pure Java backend for it would be nice in that regard.

Anyways, I think Neanderthal is just the way to go here. It uses BLAS and LAPACK, which are the state of the art, and highly funded. In my opinion, we should all just contribute to the Patreon and have Dragan keep maintaining it and making it better.

SciPy

Now, the way I see this, it is a higher-level API of common functions over fast numerics. Neanderthal has some of them, but maybe not all that SciPy brings to the table, and I feel this piece is missing in Clojure.

So I ask, what do people use here? Is there anything in Clojure that adds easy to use APIs and common operations over Neanderthal? And this assumes Neanderthal itself is missing some, and maybe it isn’t?

Pandas

Pandas turns NumPy ndarrays into row- or column-indexed structures. These resemble your typical relational database table.
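
As a rough illustration of what "row- or column-indexed" means, here is a minimal standard-library sketch. It is not how Pandas is implemented, and the data and the helper names (`cell`, `row`, `column_mean`) are hypothetical. Each column is a typed array addressed by name, and a label-to-position map plays the role of the row index:

```python
from array import array

# Hypothetical data: each column is a typed array, addressed by name.
columns = {
    "height_cm": array("d", [170.0, 182.5, 165.0]),
    "weight_kg": array("d", [65.0, 80.0, 54.5]),
}

# Row labels map to positions, playing the role of an index.
index = {"alice": 0, "bob": 1, "carol": 2}

def cell(row_label, column_name):
    """Look up a single value by row label and column name."""
    return columns[column_name][index[row_label]]

def row(row_label):
    """Return one row as a dict of column name to value."""
    pos = index[row_label]
    return {name: col[pos] for name, col in columns.items()}

def column_mean(column_name):
    """Aggregate over a whole column, like a query over a table column."""
    col = columns[column_name]
    return sum(col) / len(col)
```

With this layout, `cell("bob", "weight_kg")` reads one value and `column_mean("height_cm")` aggregates a column. What Pandas adds on top is doing such operations in fast native code, plus joins, grouping, and so on.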

Now, my knowledge of available Clojure libs that fill this use case is low. I can see using DataScript or even an in-memory DB like H2. But Pandas isn’t just a way to perform joins and queries over tabular data. It combines that with fast parallel matrix operations.

Again, I think it would be nice to see something here that is built on top of Neanderthal. That said, I think Apache Arrow is a good contender here as well. But how Apache Arrow and Neanderthal would interact is something I’m not sure about.

Matplotlib

Plotting seems to be another aspect that Clojure already has covered. The Vega/Lite offerings like Oz or Hanami are quite complete. And simpler alternatives like clj-xchart or Incanter can also be used for quite a few plots.

Is there anything Matplotlib has that these can’t do?

Seaborn

Seaborn adds specialised plots aimed at helping you build ML models. I’m not familiar with these, so I don’t know what exists in Clojure.

Scikit-learn

This is a set of ML algorithms modeled on top of NumPy and Pandas. If you follow my logic, you’ll see that it would mean we’d have something on top of Neanderthal and maybe Apache Arrow for this. Which I don’t think we do.

I’m not sure what exists here in Clojure either. We would want four things:

a. A wide range of ML model classes
b. The ability to perform manual and automated hyperparameter tuning.
c. Some support for feature extraction of common encodings for say categorical, text, image, sound, etc.
d. A common API and a common data representation, so that changing model class is easy.
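
Point (d) is the easiest one to show in code. Here is a minimal sketch in plain Python, with two toy model classes I made up for illustration (they are not from scikit-learn or any real library). Both expose the same `fit`/`predict` pair over the same data representation, so the evaluation code does not care which one it gets:

```python
from statistics import mean

class MeanModel:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

class NearestNeighborModel:
    """Toy model: predicts the target of the closest training row."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        predictions = []
        for query in X:
            # Position of the training row closest to the query row.
            nearest = min(
                range(len(self.X_)),
                key=lambda i: squared_distance(query, self.X_[i]),
            )
            predictions.append(self.y_[nearest])
        return predictions

def mean_abs_error(model, X_train, y_train, X_test, y_test):
    """Works with any object that offers fit(X, y) and predict(X)."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    return mean(abs(p - t) for p, t in zip(predictions, y_test))

# Hypothetical data: rows are feature lists, targets are plain numbers.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 2.0, 3.0]
X_test = [[0.9], [2.2]]
y_test = [1.0, 2.0]

# Swapping the model class is a one-word change.
errors = {
    cls.__name__: mean_abs_error(cls(), X_train, y_train, X_test, y_test)
    for cls in (MeanModel, NearestNeighborModel)
}
```

The same loop over model classes is also where manual hyperparameter tuning (point b) would go: try each candidate and keep the one with the lowest error. In Clojure, I would expect a protocol or a multimethod to play the role of the shared `fit`/`predict` interface.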

Some things I’ve seen which come close:

Spark SQL and MLlib (also overlaps with Pandas)
PMML and PFA and JPMML evaluator
h2o.ai has Java APIs
XGBoost (doesn’t seem to have as many model classes though)
Java-ML
RapidMiner (has Java APIs)
Weka (through Java APIs)

Statsmodels

I don’t really understand what this does, but it seems to relate to explainability of ML models? Since I don’t really know what it does, I have no idea if Clojure has anything of that sort.

Keras

For deep learning, Clojure seems covered as well: we can use DeepLearning4J or MXNet.

Is there anything Keras has that isn’t covered by these?

daslu · February 24, 2019, 9:15am

Hi, @didibus, interesting, what a thorough list!!

See also this related list.

It would be a great idea to keep such lists in mind while going into use cases (e.g., translating existing Kaggle kernels to Clojure), and seeing which kinds of functionality are actually important.

Regarding visualization, the libraries you mentioned have huge potential. Still, to get an idea what is missing, you may look into galleries like these:
htmlwidgets | ggplot2 | R in general
and think, which of these visual elements are already easy to implement, and which are not.

Regarding scikit-learn, it is worth watching fastmath and tech.ml (under construction, growing fast).

Regarding NumPy and SciPy, it is worth looking into the discussion in this Zulip topic.

Carsten_Behring · February 24, 2019, 2:54pm

I want to praise Emacs org mode for doing data science in Clojure. Being able to use CIDER in the Clojure blocks is such a huge advantage compared to using any of the Clojure notebook solutions.
On top of this, org mode is the only true multi-language solution if you want to use Clojure, R and Python in the same environment.
The biggest issue with org mode is, in my view, the complexity of getting it to work with “interactive” HTML-based visualizations. But it can be done, either by creating SVG (which can be rendered inside Emacs) or by “injecting” the HTML during the org->html export.

Regarding the “list” of python tools, it is probably correct to say, that there is not such a simple list in Clojure. I tend to combine in org mode, “whatever I find”, taking from:

existing Clojure libs
existing Java libs
existing R libs
existing Python libs

linpengcheng · February 24, 2019, 3:15pm

@Carsten_Behring
You can use any programming language (Clojure, R, Python, etc.) and any markup language (org, md, rst, etc.) in any editor (Notepad++, Emacs, etc.).

Markdown literate programming is more suitable for programming than org.

I use R (or another external system) as a database service, queried through RDSL, so everything is no different from pure Clojure. Although I use R, I don’t write R code; the whole project is a pure Clojure project.

daslu · February 24, 2019, 9:17pm

@linpengcheng interesting, what is RDSL ?

linpengcheng · February 25, 2019, 12:33am

@daslu
It’s not an open source library. I like the DSL solutions of Datomic, Hiccup and HoneySQL, so for all my external systems I follow their design ideas when designing a DSL, and I remember that Lisp tends to use DSLs as libraries. This is my design reference model:
Everything is RMDB

daslu · February 25, 2019, 5:30am

@linpengcheng is your DSL similar to the one used in gg4clj?

… and what is RMDB?

linpengcheng · February 25, 2019, 5:43am

@daslu
All DSLs based on Hiccup’s ideas are similar, and the only data structures in Clojure that are suitable as DSLs are vectors and hash-maps.

gg4clj has not reached a production-ready state; it is very alpha, and it is no longer developed.

system · August 26, 2019, 5:43pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.