Online meeting: clojure data science

Hi, author of kixi.stats here. I fully support this initiative, count me in.

3 Likes

Hey! Love this goal you set!

I also wanted to mention flare, a dynamic tensor graph library in Clojure (think PyTorch, DyNet, etc.), and koala, both created by Aria Haghighi (Prismatic -> Facebook -> Amperity), who I think should definitely be part of your core team.

As for me, I’d be glad to join the user discussion :slight_smile:

2 Likes

I’m excited about the prospects of a data science initiative in Clojure, and all the leads mentioned so far here. I’m a software engineer at a university and also a PhD candidate in data science, particularly system and control theory. My colleagues are all in Python, while I try to eke by with DeepLearning4j and Incanter, and hopefully soon Neanderthal. My work is centrally NLP-related. I’m not sure what I have to contribute to this effort other than moral support and deep interest (I’m very much a newbie at the data science stuff), but I’m all for it!

1 Like

I just noticed that I dropped NLP as a topic from the list I made, but it would probably make more sense to consider it as its own issue. If you know of Clojure tools for NLP, don’t hesitate to share them with us and I’ll add the topic to the list.

2 Likes

Happy to participate. There are quite a few Clojure MXNet contributors now as well. It would be great for everyone to get involved.

Thank you for getting this initiative started.

4 Likes

@alanmarazzi: Fantastic writeup; I think we both may end up going with fastmath!

I am not sure overlap really matters all that much, aside from documentation. I think, just in general, we should avoid criticizing architectures and stick to talking about features. We have enough tools for basic data science. My first step, were this my job, would be to evaluate exactly what we have at the moment.

Reference Datasets & Solutions

What I would like to see is a set of datasets, starting with straight classification and regression. Ideally we can have datasets with very different base attributes: small N, large feature dimension, very large N (larger than fits in RAM), etc. We then solve these and get great results; maybe we should pick from Kaggle, where we have examples of the best practices, or at least the practices that are effective. I would stay away from computer vision, personally, as it is really time consuming and I do not think it matches the types of problems most Clojurists are going to encounter. For that matter, after writing the majority of Cortex, I tend to avoid NN architectures, but to each their own.
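As a concrete sketch of the shared scaffolding this would need (my own sketch, nothing that exists yet): if a reference dataset is just a seq of maps, a deterministic train/test split is enough for everyone to evaluate against the same rows.

```clojure
(ns split-example
  (:import [java.util ArrayList Collections Random]))

(defn train-test-split
  "Shuffle `rows` deterministically with `seed`, then split off
  `test-frac` of them as the held-out test set."
  [rows test-frac seed]
  (let [lst (ArrayList. ^java.util.Collection rows)
        _   (Collections/shuffle lst (Random. (long seed)))
        v   (vec lst)
        n   (long (* test-frac (count v)))]
    {:test  (subvec v 0 n)
     :train (subvec v n)}))

;; usage on a toy "dataset" of 100 rows; the same seed always
;; yields the same split, so results stay comparable across toolkits
(def rows (map (fn [i] {:x i :y (* 2 i)}) (range 100)))
(let [{:keys [train test]} (train-test-split rows 0.2 42)]
  [(count train) (count test)])
;; => [80 20]
```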

We then all try our different methods and talk about the results. Ideally we can develop best practices in e.g. visualization along the way. It really doesn’t matter who does best, but why they do best does matter, along with an objective evaluation of the end result plus great visualization. I agree that ClojureScript should present an advantage here, but perhaps beefing up the Jupyter support for Clojure would also help.

So that is one direction: have a set of datasets and solutions so we can quantitatively compare the toolkits and bring together some of the really interesting pieces we have on the table right now. Note that I am most interested in classification and regression, but others may differ (clustering, ranking, anomaly detection; the list is infinite). With this done well, I think most Clojurists can just cut-and-paste their way through their own particular exploration.

Missing Feature Dependency Graph

Another (parallel) pathway is: given the aggregate total of everything done and accessible in Clojure, what is missing? We can then just walk down the dependency graph, figure out good ways to get each thing, and slowly fill in the gaps over time. Scikit wasn’t built in a day. Right now I feel like good hyperparameter optimization is a large missing piece, and I am very doubtful at this point that anything is better than the best Python toolkits; this includes Anglican-based systems.
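To make the hyperparameter point concrete, here is a hedged sketch of the smallest useful thing in that space: a random-search driver in plain Clojure. The `train-and-score` function is a hypothetical stand-in for whatever toolkit (Smile, XGBoost, ...) actually fits and scores a model; the toy objective below exists only so the example runs.

```clojure
(defn random-search
  "Sample `n` configs from `space` (a map of param -> vector of candidates),
  score each with `train-and-score`, and return the best-scoring one."
  [space n train-and-score]
  (->> (repeatedly n #(into {} (map (fn [[k vs]] [k (rand-nth vs)])) space))
       (map (fn [params] {:params params :score (train-and-score params)}))
       (apply max-key :score)))

;; toy stand-in objective: pretend deeper trees and smaller eta score higher
(random-search {:depth [2 4 6 8] :eta [0.3 0.1 0.03]}
               20
               (fn [{:keys [depth eta]}] (- depth eta)))
```

Real hyperparameter optimization needs much more (Bayesian search, early stopping, parallelism), which is exactly the gap being described; this only shows how small the entry point could be.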

Speaking of which, potentially the best way to get a lot of the missing pieces is just to shell out to Python. We can get multithreading and such by just using many shells, and while I feel like everyone with a rock would be throwing it at me right now, it just kind of makes sense. In this way we get great coverage in a format that matches the rest of the world, and getting good at this pathway guarantees us access to some of the cutting-edge stuff that we won’t get any other way. If we have to install Scala to use MXNet, we can sure as hell install the Python subsystems to get everything else on the planet. Finally, by working towards deeper Python ML integration, we help people who want to do data science in Clojure rather than be marooned on a Clojure-only island.
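The shell-out idea can be sketched in a few lines with `clojure.java.shell` from the standard library. This assumes a `python3` on the PATH; the snippet passed in is a placeholder, and each call spawns a fresh process, which is also how the "many shells = free parallelism" point works.

```clojure
(require '[clojure.java.shell :as sh]
         '[clojure.string :as str])

(defn py!
  "Run `code` in a fresh python3 process and return its trimmed stdout.
  Throws with stderr attached if the process exits non-zero."
  [code]
  (let [{:keys [exit out err]} (sh/sh "python3" "-c" code)]
    (if (zero? exit)
      (str/trim out)
      (throw (ex-info "python failed" {:err err})))))

(py! "print(sum(range(10)))")
;; => "45"
```

A real bridge would exchange structured data (JSON, or Arrow as mentioned later in this thread) rather than raw stdout, but the process boundary is the same.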

Final Thoughts

My feeling is that our community is very very small. Building bridges is going to get us a lot farther than building islands.

2 Likes

Thanks @otfrom!

@cnuernber is here, and I let @mikera know through twitter. Will you please point me to anyone else relevant? (or invite them?)

Interested as well. Great summary from @alanmarazzi .

When I first started Clojure I was missing Python dataframes a lot. With a bit more experience I’m ok with manipulating data the Clojure way, and it’s really the plotting / display capabilities that I find very limited. You can get results, but you can’t present them easily. Related is the lack of a simple web framework. I’d like to crunch data in Clojure and be able to display tables and charts easily in the browser, with limited web development knowledge. Oz I think is a nice step in that direction.
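For the "charts in the browser with limited web dev knowledge" wish, the Oz workflow mentioned above is roughly this (a sketch based on the metasoarous/oz README; the spec below is my own example, and assumes Oz is on the classpath): a chart is just a Vega-Lite spec written as Clojure data, and `oz/view!` pushes it to a browser tab.

```clojure
(require '[oz.core :as oz])

;; a bar chart is plain data: values, a mark, and encodings
(def bar-spec
  {:data {:values [{:item "a" :n 28} {:item "b" :n 55} {:item "c" :n 43}]}
   :mark "bar"
   :encoding {:x {:field "item" :type "nominal"}
              :y {:field "n" :type "quantitative"}}})

;; opens (or live-updates) a browser tab showing the chart
(oz/view! bar-spec)
```

No HTML, routing, or JS is involved from the user's side, which is exactly the gap being described.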

2 Likes

Thanks! I guess that since Tomasz (fastmath’s creator) is among the starters of this thing, he will be happy about all this love for his library! Anyway, both of us were looking at tech.smile and tech.xgboost and we really liked the data serialization and deserialization, so probably the best way to go would be to take the best ideas from many implementations and either come up with something new or merge them into one of the existing libraries.

This is exactly one of the ideas we had: give people more examples and best practices to draw from. I agree on the NN part, but I guess the MXNet people can help in that direction!

I tend to use scikit-learn as an example more for its scope, consistency, docs and tutorials than about code quality (as you said it wasn’t built in a day, and in my opinion it is clear by just looking at the code). My point here being that we might cover most of scikit’s scope with many different libraries but it would be very nice to have integrated examples and a consistent API.

I have mixed feelings about this :laughing:, but I won’t deny the fact that Python right now is the de-facto lingua franca for machine learning. It might even be a way to cover some “holes” until something else comes out of the community.

I can only agree with you, and I would like to thank everyone for the great answers that came out of this! One of the main reasons I love Clojure is its community!

3 Likes

Actually, I think the plotting/display capabilities are well underway, thanks largely to the amazing work of the IDL folks who have given us Vega and Vega-Lite.

For plotting and charting with zero web dev knowledge, have a look at Saite, a general exploratory visualization app built on Hanami. You can work in the client (browser) and/or from an IDE/editor of your choice on the server side. Saite is evolving toward a new notebook capability. There are a number of issues about this on the GitHub, and if you think that sounds like an idea worth pursuing, I’d appreciate input on those issues (or others of your own).

If you want to create your own domain-specific (or other general-purpose) visualization app, look at Hanami. Of course that would involve web dev knowledge. Some new features supporting client-only apps are coming. There is a lot of documentation now, with a lot more coming.

2 Likes

One thing I would like to point out. This:

make the plotting experience seamless: …
result would be a bar chart with reasonable defaults.

is totally on target.

This, however:

(bar my-data)

is a bad idea. Trying to cover all the various cases of this via a traditional functional or class/method-based API is far too constraining and limited. Even worse would be resorting to macros. A much better approach is to just use data and, since we are using a great functional language, data transformations. That’s the route Hanami has taken.
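The data-over-functions point can be sketched in a few lines (my own illustration, not Hanami's actual API): when a chart is a plain Vega-Lite map, "customizing" it is ordinary `assoc`/`update` rather than an ever-growing argument list to a `bar` function.

```clojure
;; a base chart is just data
(def base
  {:mark "bar"
   :encoding {:x {:field "x" :type "nominal"}
              :y {:field "y" :type "quantitative"}}})

(defn with-data
  "Attach inline rows to a spec."
  [spec rows]
  (assoc spec :data {:values rows}))

;; a horizontal variant is one transformation away: swap the axes
(def horizontal
  (update base :encoding (fn [{:keys [x y]}] {:x y :y x})))

;; arbitrary tweaks compose with ordinary map functions, no new API needed
(-> base
    (with-data [{:x "a" :y 1} {:x "b" :y 3}])
    (assoc-in [:encoding :y :aggregate] "sum"))
```

Every variation a keyword-argument API would have to anticipate is instead reachable with the map operations every Clojure programmer already knows.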

1 Like

@alanmarazzi: Thanks for the kind comments and seeing good discussion early is heartening.

I only have a small thing to add: I took some time this morning to clearly explain the points behind the tech.ml-base system in the readme.

There is more to those systems than serialization; it comes down to being able to represent various parts of the problem as data (and datasets) that you can visualize in the REPL, and some details around how models are represented generically.

Anyway, again, great thread!

3 Likes

Haha :slight_smile: (Tomasz here). Yes, I’m really happy. I also didn’t know about tech.smile’s existence, so we have another island. However, fastmath contains almost-raw bindings to Smile and certain parts of Apache Commons (besides classification, I also added a bunch of clustering algorithms). I definitely want to treat it as a step towards building a richer environment. Let’s start thinking of bridges now.

2 Likes

Thanks, everybody, for your kind comments. It seems good to continue this discussion a little bit further, before going back to discuss the format of a meeting, etc.

@alanmarazzi, @cnuernber – that was really enlightening, and I like the visions that you both suggested.

Here are some more suggested directions, that can possibly fit in.

grammar of graphics
R’s ggplot2 can work on pure JVM (through Renjin), and is easy to access from Clojure. I’ll write some examples soon.
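A hedged sketch of the plumbing for this route (ggplot2 itself would additionally need the Renjin ggplot2 artifact on the classpath; this assumes only the `org.renjin:renjin-script-engine` dependency): Renjin registers as a standard `javax.script` engine, so Clojure can evaluate R directly in-process.

```clojure
(import '[javax.script ScriptEngineManager])

;; look up the Renjin R engine by name (nil if the dependency is missing)
(def r (.getEngineByName (ScriptEngineManager.) "Renjin"))

;; evaluate R in the same JVM; results come back as Renjin SEXP objects
(.eval r "x <- c(1, 2, 3, 4)")
(.eval r "mean(x)")   ; a Renjin vector wrapping the double 2.5
```

Being in-process (no Rserve, no external R install) is what makes this attractive from Clojure.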

interactive visualizations
Vega and Vega-Lite, Oz and Saite are wonderful.
For a richer collection of visual elements, I personally use some combination of hiccup, rojure, rmarkdown, htmlwidgets and crosstalk. I’ll write some examples soon. This is quite handy, but complicated and has some limitations. Here ClojureScript could really shine with much simpler solutions.

calling python
It should be feasible to create a solution with, e.g., HyREPL for sending commands and Apache Arrow for sharing memory. That way we can talk with live Python sessions (which is more efficient when one needs to repeatedly access a large data structure).

probabilistic programming
A complete probabilistic programming library is arguably one of the important missing pieces. Anglican is useful and very expressive, but does not support the fastest sampling methods (those that require differentiation). As for Bayadera, @draganrocks can probably comment.
In the short run, maybe wrapping a non-clojure library would be the easiest. There are several very interesting candidates here.
In the long run, let us consider writing something more complete in Clojure. We have been discussing this with David Tolpin, who coded Anglican, and has some interesting new ideas.

1 Like

Bayadera is a (Bayesian-opinionated) probabilistic data analysis environment (that implements GPU MCMC engines as an important implementation detail). Arguably, Bayesian data analysis is the future of small structured data analysis, but whether it will be mainstream is yet to be seen. Currently, the major issue with the (C++) tools is that they are super slow. Bayadera’s engine brings that down to an acceptable level, but, as probabilistic data analysis is still niche, it takes a lot of work to convince other people to try it. I wish I had more time to document it and demonstrate how I use it, but I can’t commit infinite resources to open-source work, and this has to compete with other things on my TODO. Additionally, the target audience usually doesn’t have GPUs, and creates script-based R analyses… there are still a lot of steps for that community to become interested in scaling server-based software. Still, there are lots of interesting things there; since I have use for it, I am very interested in the area, and I think it might be expanding. It is clearly useful in practice, especially for decision analysis and social sciences.

Probabilistic programming is something else and, in my opinion, is something that has yet to be shown practically useful outside programming-language research. I get the idea of PP, but I think that it does not scale: it’s super, super slow even for basic models. It might be useful for some things such as hyperparameter search (even that is still research, though), but I cannot see how this can work as a data analysis tool. The common confusion of PP and Bayesian data analysis is due to PP often using methods such as MCMC and creating prior/posterior/evidence models etc., which is also done in BDA. However, while I can see how this can work in practice when searching numerical distributions, I’m skeptical it will work acceptably for distributions of random program code output, especially if that is then used for numerical data analysis. Anyway, PP might be interesting or not for different people, but I’d say that this is not a pressing issue for Clojure as a “data science” or ML platform.

PS. It seems to me now that this might sound a bit dismissive of PP in general and Anglican in particular. That was not my intention. Anglican is quite an interesting thing to try, and everyone can get a lot of intuition and fun from it. In addition, there is the Practical Probabilistic Programming book that can make it easier to learn (it’s in Scala, but I believe it can be used to get many of the ideas). I’m just skeptical towards the usefulness of the whole PP concept as a general data analysis tool applied in practice vs other methods. But for learning and fun, I can even recommend it!

3 Likes

Thanks, @draganrocks, this explanation helps a lot. You are right, the distinction is important.

The way I see it, PP is just about being very expressive in describing probabilistic situations. At least for small problems, I find it quite useful. For anyone here who is curious, the interactive book/tutorial by Goodman and Tenenbaum is probably one of the most accessible ways to see what it is about. Anglican feels similar from the user perspective.

Anyway, personally, I really hope to be able to look into Bayadera and understand it better.

1 Like

Vega and Vega-Lite are entirely based on the grammar of graphics. The main practical differences between Vega/Vega-Lite and ggplot2 are that the former are dynamic, interactive, and run on JS (browser-based being the most important). You could make a reasonable argument that this means they supersede ggplot2 and that any future work should target them. Indeed, the R folks are working on supporting Vega and Vega-Lite at a level comparable to Hanami/Saite/Altair (Python). They are not there yet, but they realize this is the future.

WRT ‘other visual elements’, Saite, Hanami, and Oz support hiccup - Saite and Hanami directly support this on the browser side as well. And Saite/Hanami also support full re-com.

OK, it appears I have reached some crazy “you can only reply a maximum of 3 times” limit on this silly platform. But I wanted to say that, WRT PP, I basically agree with Dragan’s comments.

Since it appears I no longer can add any more to the conversation, I guess I may have to just abandon this effort after all…

4 Likes

Hi all, long time lurker here. I feel like I can contribute to this thread. I’m somewhere in the middle between data eng and data sci, and have been doing Python for a while. I’ve been meaning to get into Clojure, but always failed to learn by building apps, so I decided to learn it by doing data science and documenting my journey.

Following the “building bridges” analogy, I feel something like Chris Albon’s Python data sci how-tos (scroll down to Data Wrangling) would help a lot of people get started. I’m sure there are a ton of people landing on that page while getting started in the field, plus using it for reference. Also, see his ML cards.

In general, the whole thread sounds great - I’m definitely in, however much I can help.

2 Likes

@cnuernber: I’ve been watching the ‘tech’ libraries for a while, though a lot of it is beyond me at the moment. Looking forward to the meeting.

2 Likes

I’m very happy to see this initiative! My involvement with data science is probably more peripheral than for some folks here, but I’m still very interested. I’m a philosopher of science who sometimes works with real data, but more often works with data that was generated from simulations. I also use OCaml and NetLogo quite a bit, and a lot of my work doesn’t involve coding or data at all, so I come and go from Clojure depending on the projects I’m working on.

(If anyone’s interested in discussing agent-based/individual-based modeling with Clojure, it probably doesn’t belong in a data science thread, but please contact me for a side conversation. I’ve been involved with agent-based modeling for a while, and since Clojure is one of my favorite languages, I like doing ABM work in it when that makes sense.)

2 Likes