So a lot of people are wondering, what is the Clojure data-science toolkit?
Because I’m new to data science, I’m first asking: what is the Python data-science toolkit?
A quick Google search came up with: https://jakevdp.github.io/PythonDataScienceHandbook/
Where it seems the common toolkit would be:
- IPython shell
From the looks of it, IPython looks like nREPL to me, or any of the many Clojure REPLs, with some additional syntax to make Python more REPL-friendly.
Is there anything the IPython shell has that nREPL wouldn’t, for example?
- Jupyter notebook with IPython kernel
Well, this seems covered quite well in Clojure by using Jupyter with the IClojure or Clojupyter kernel.
Or, we can even use Oz or Saite instead of Jupyter for graphical output from our Clojure REPL.
Even simpler, one can make use of JavaFX or Swing, or just CIDER’s support for rendering images in the CIDER REPL.
Is there anything that the IPython Jupyter kernel has that IClojure or Clojupyter does not? Or that can’t be done as easily with Oz, Saite, or a JavaFX/Swing solution?
- NumPy
This seems to address the following: efficient storage and manipulation of numerical arrays.
Now, de facto, Clojure excels at manipulating data through its standard persistent data structures, its core sequence functions, and transducers. It’s pretty damn fast at it, but not as fast as NumPy.
For that, it seems we want raw arrays. In Clojure, that would be all the -array functions like make-array, into-array, amap, areduce, aget, aset, etc.
These would be similar to Python 3’s array module.
This would count as efficient storage, but NumPy adds efficient computation over them.
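To make the storage-versus-computation distinction concrete, here is a minimal sketch using only Python's standard-library array module, with no NumPy involved. The names values, payload_bytes, and scaled are made up for illustration. It shows that a typed array gives compact storage, while any computation over it is still an explicit per-element loop, which is the gap NumPy fills.

```python
from array import array

# A typed array stores raw C doubles back to back,
# much like a Java double[] created with Clojure's double-array.
values = array("d", (float(i) for i in range(1000)))

# Storage: itemsize is the width of one element in bytes,
# so the payload is simply itemsize * number of elements.
payload_bytes = values.itemsize * len(values)

# Computation: the array module has no vectorised multiply,
# so scaling every element is a per-element loop in the interpreter.
scaled = array("d", (x * 2.0 for x in values))

print(values.itemsize, payload_bytes, scaled[10])
```

The Clojure equivalent of that last step would be amap or areduce over a primitive array, which is also a per-element loop, just a compiled one.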
Now, the options I see in Clojure are core.matrix, Neanderthal, ND4J, or rolling your own over standard Java arrays.
Neanderthal seems like the way to go, except for one thing: you can’t use it without setting up BLAS, and that’s annoying when you don’t care about the utmost performance. A pure Java backend for it would be nice in that regard.
Anyway, I think Neanderthal is the way to go here. It uses BLAS and LAPACK, which are state of the art and well funded. In my opinion, we should all just contribute to the Patreon so that Dragan can keep maintaining it and making it better.
- SciPy
Now, the way I see this, it is a higher-level API of common functions over fast numerics. Neanderthal has some of these, but maybe not everything SciPy brings to the table, and I feel this one is missing in Clojure.
So I ask, what do people use here? Is there anything in Clojure that adds easy-to-use APIs and common operations over Neanderthal? This assumes Neanderthal itself is missing some, and maybe it isn’t.
- Pandas
Pandas turns NumPy ndarrays into row- and column-indexed structures. These resemble your typical relational database table.
Now, my knowledge of available Clojure libs that fill this use case is low. I can see using DataScript or even an in-memory DB like H2. But Pandas isn’t just a way to perform joins and queries over tabular data. It combines that with fast vectorised operations over whole columns.
Again, I think it would be nice to see something here that is built on top of Neanderthal. That said, I think Apache Arrow is a good contender here as well. But how Apache Arrow and Neanderthal would interact is something I’m not sure about.
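To illustrate what a row- and column-indexed structure means, here is a rough sketch in plain Python. It is a toy, not how Pandas is actually implemented, and every name in it (labels, row_index, columns, row, big) is hypothetical. Each column is a typed array, and a label index maps row labels to positions.

```python
from array import array

# Hypothetical toy table: one typed array per column, plus an index
# that maps each row label to its position in every column.
labels = ["a", "b", "c"]
row_index = {label: pos for pos, label in enumerate(labels)}
columns = {
    "price": array("d", [1.5, 2.0, 4.0]),
    "qty": array("d", [10.0, 0.0, 3.0]),
}

# Column-wise computation: derive a new column from two existing ones.
columns["total"] = array(
    "d", (p * q for p, q in zip(columns["price"], columns["qty"]))
)


def row(label):
    """Row lookup by label: find the position once, then read each column."""
    pos = row_index[label]
    return {name: col[pos] for name, col in columns.items()}


# Query: labels of the rows whose total exceeds a threshold.
big = [label for label in labels if columns["total"][row_index[label]] > 10.0]

print(row("c"), big)
```

A Clojure version could keep the same shape, a map of column name to primitive array or Neanderthal vector, which is roughly why building such a thing on top of Neanderthal or Arrow seems plausible to me.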
- Matplotlib
Plotting seems to be another aspect that Clojure already has covered. The Vega and Vega-Lite offerings like Oz or Hanami are quite complete. And simpler alternatives like clj-xchart or Incanter can also be used for quite a few plots.
Is there anything Matplotlib has that these can’t do?
- Seaborn
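Seaborn is a higher-level statistical plotting library built on top of Matplotlib, with plots such as distribution and regression plots that help when exploring data before building ML models. I’m not familiar with these, so I don’t know what exists in Clojure.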
- Scikit-learn
This is a set of ML algorithms built on top of NumPy and SciPy, and it works with Pandas data as well. If you follow my logic, you’ll see that it would mean we’d have something on top of Neanderthal and maybe Apache Arrow for this, which I don’t think we do.
I’m not sure what exists here in Clojure either. We would want four things:
a. A wide range of ML model classes.
b. The ability to perform manual and automated hyperparameter tuning.
c. Some support for feature extraction and common encodings for, say, categorical, text, image, and sound data.
d. A common API and a common data representation, so that changing model class is easy.
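Points (b) and (d) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the API of scikit-learn or of any library listed here. The Model protocol, the two toy model classes, and grid_search are all hypothetical names. Because both models share fit and predict, swapping the model class or tuning a parameter needs no other code changes.

```python
from itertools import product
from statistics import mean
from typing import Protocol, Sequence


class Model(Protocol):
    """Hypothetical shared interface: every model class fits and predicts."""

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> "Model": ...

    def predict(self, xs: Sequence[float]) -> Sequence[float]: ...


class MeanModel:
    """Baseline that always predicts the mean of the training targets."""

    def fit(self, xs, ys):
        self.mean_ = mean(ys)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


class NearestNeighbours:
    """Averages the targets of the k training points closest to each query."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, xs, ys):
        self.points_ = list(zip(xs, ys))
        return self

    def predict(self, xs):
        out = []
        for x in xs:
            nearest = sorted(self.points_, key=lambda p: abs(p[0] - x))[: self.k]
            out.append(mean(y for _, y in nearest))
        return out


def mean_squared_error(model: Model, xs, ys) -> float:
    """Average squared difference between predictions and targets."""
    return mean((p - y) ** 2 for p, y in zip(model.predict(xs), ys))


def grid_search(make_model, grid, train, valid):
    """Tries every parameter combination; keeps the lowest validation error."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        model = make_model(**params).fit(*train)
        err = mean_squared_error(model, *valid)
        if best is None or err < best[0]:
            best = (err, params)
    return best


# Tiny made-up data set: y = x squared.
train = ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 4.0, 9.0])
valid = ([0.5, 2.5], [0.25, 6.25])

baseline_error = mean_squared_error(MeanModel().fit(*train), *valid)
best_error, best_params = grid_search(
    NearestNeighbours, {"k": [1, 2, 3]}, train, valid
)
print(baseline_error, best_error, best_params)
```

In Clojure the same shape would probably be a protocol with fit and predict functions, with the data representation from the NumPy and Pandas sections passed through it.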
Some things I’ve seen which come close:
- Spark SQL and MLlib (also overlaps with Pandas)
- PMML and PFA and JPMML evaluator
- h2o.ai has Java APIs
- XGBoost (doesn’t seem to have as many model classes though)
- Java-ML
- RapidMiner (has Java APIs)
- Weka (through Java APIs)
- Statsmodels
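I don’t have a good grasp of this one. From what I can tell, it provides classical statistical models (such as linear regression and time-series models) along with statistical tests, rather than ML model explainability. I haven’t looked at it closely, so I have no idea if Clojure has anything of that sort.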
- Keras
For deep learning, Clojure seems covered here as well: we can use Deeplearning4j or MXNet.
Is there anything Keras has that isn’t covered by these?