Online meeting: clojure data science

cnuernber · January 17, 2019, 3:26pm

@alanmarazzi: Fantastic writeup; I think we both may end up going to fastmath!

I am not sure overlap really matters all that much, aside from documentation. I think, just in general, we should avoid criticizing architectures and stick to talking about features. We have enough tools for basic data science. My first step were this my job would be to evaluate exactly what we do have at the moment.

Reference Datasets & Solutions

What I would like to see is a set of datasets, starting with straight classification and regression. Ideally we can have datasets that have very different base attributes small-N, large feature dimension, very large N (larger than fits in ram), etc. We then solve these and get to great results; maybe we should pick from kaggle where we have examples of the best practices or at least the practices that are effective. I would stay away from computer vision, personally, as this is really time consuming and I do not think it matches the types of problems most clojurists are going to encounter. For that matter, after writing the majority of Cortex, I tend to avoid NN architectures but to each their own.

We then all try our different methods and talk about the results. Ideally we can develop best practices in e.g. visualization during this pathway. It really doesn’t matter who does best but why they do best does matter. And an object evaluation of the end result plus great visualization. I agree that clojurescript should present an advantage here but perhaps also beefing up the jupyter support for clojure would help.

So that is one direction; have a set of datasets and solutions so we can quantitatively compare the toolkits and bring together some of the really interesting pieces we have on the table right now. Note that I am most interested in classification and regression but others may differ (clustering, ranking, anomaly detection, the list is infinite). With this done well I think most clojurists can just cut&paste their way through their own particular exploration.

Missing Feature Dependency Graph

Another (parallel) pathway is, given aggregate total of everything done and accessible in clojure, what is missing? We can then just walk down the dependency graph, figure out good ways to get each thing and just sort of slowly over time fill in the things that are nice. Scikit wasn’t built in a day. Right now I feel like good hyperparameter optimization is a large missing piece and I am very doubtful at this point that anything is better than the best toolkits for python; this includes anglican based systems.

Speaking of which, potentially the best way to get a lot of the missing pieces is just to shell out to python. We can get multithreading and such by just using many shells and while I feel like right now if everyone had a rock they would be throwing it at me it just kind of makes sense. In this way we get great coverage in a format that matches the rest of the world. And getting good at this pathway guarantees us access to some of the cutting edge stuff that we won’t get any other way. If we have to install scala to use mxnet we can sure has hell install the python subsystems to get everything else on the planet. Finally we also help people to want to do data science in clojure not be marooned on a clojure-only island by working towards deeper python ML integration.

Final Thoughts

My feeling is that our community is very very small. Building bridges is going to get us a lot farther than building islands.