Online meeting: clojure data science

cnuernber · January 17, 2019, 12:16am

@daslu: Yes, some overlap for sure, this is why we need to talk about this stuff openly and in a group!

daslu · January 17, 2019, 8:14am

Thanks @jsa-aerial, @Christopher_Small, your comments help!

(btw Dragan wrote on another platform that he is interested)

Let us discuss the format further (timing, platform, structure, etc.), and wait for some response from other timezones.

kamil · January 17, 2019, 9:15am

Hey,

I did a thing using Clojure MXNet. I think the general idea is interesting and I want to do more data science in Clojure.

Definitely interested in an event.

Kamil Hryniewicz

draganrocks · January 17, 2019, 9:36am

This is a great initiative. I’m in.

otfrom · January 17, 2019, 10:27am

The cortex people should be here as well

alanmarazzi · January 17, 2019, 11:30am

Hey everyone! I’m the author of clj-boost and one of the people involved in this together with @daslu.

@cnuernber your work is quite impressive; I didn’t know about it when I developed clj-boost. I’ll be very happy to ditch clj-boost in favor of something better for the community, and I’m very happy we will be able to discuss these things all together!

About me

I’m currently a data scientist/engineer at a large Italian insurance company, but next month I’ll move into management at a new Fintech/Bank. I’ll always be involved in data science and I want reliable, simple and production ready stuff to move at a faster pace.

About Clojure

I discovered Clojure a couple years back and I’m currently moving from doing these things with Python to a full-stack Clojure experience. I think that there is a very high potential for doing data science with Clojure, but there are missing nuts and bolts here and there.

Are we scientists yet?

I really like how the Nim community is dealing with the same sorts of problems we’re facing, so I’ll try the same thing here to foster discussion. We might want to move these things in their own topic in the future or on other platforms, but that’s not the point right now.

The structure of this:

Name of the problem - data science is a stack of problems and one must have solutions to all of them to really be productive
Notable examples - what’s considered standard nowadays in other languages
Status - the current status of the matter
Forward - what is needed moving forward

Multidimensional arrays, Linear-algebra

Generic computation libraries. Here we should strive for the best: both GPU and CPU capability, multidimensional arrays, broadcasting, etc

Notable examples

Numpy
ND4J

Status

There are many libraries popping up at various levels of maturity; some of them are:

Forward

I think we can all agree that this degree of spread is not good: all these libraries represent time and resources that could have been spent on moving other parts of the stack forward. We should settle on one or two of them and move on.

Plotting

Plotting is important for both analysis and presentation of results. Thanks to ClojureScript we may well have an edge over other languages here.

Notable examples

Status

Here there are many libraries as well; some of them are:

Forward

In this area taste is really important, so it’s normal to have more spread across different libraries. What we should do is work on what is already available and make the plotting experience seamless:

(bar my-data)
;=> nil

The result would be a bar chart with reasonable defaults.

Geospatial library

Deal with coordinates on a map.

Notable examples

GeoPy
mapnik

Status

Not much that I’m aware of:

geo

Forward

This is another area where Clojure could shine thanks to its concurrency model. The fact that it would be easy to work with Spark or Onyx is certainly a plus.

Dataframe or similar

Today’s data scientists are used to working with tabular data, so we have to deal with it.

Notable examples

Status

Not good: there are lots of stumps here and there but nothing has ever caught on. Some examples:

Forward

Here I would move toward wrapping Arrow, which has the potential to become the standard in the near future, but anything that works is very welcome!

Statistics & probprog

Very important as the base for ML systems and evaluation of models.

Notable examples

Status

There are already many examples:

Forward

What is missing here is the tooling: we need more abstractions over basic functionality. For instance, a function to compute the ROC-AUC score for model validation.

Also better docs and examples of what is achievable with these libraries.
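To make the ROC-AUC example concrete, here is a minimal sketch in plain Clojure. It is not taken from any of the libraries discussed in this thread; the name `roc-auc` and the convention that labels are `0`/`1` are assumptions made for illustration. It uses the pairwise definition of AUC (the share of positive/negative pairs that are ranked correctly), which is simple to read but quadratic in the number of examples.

```clojure
(defn roc-auc
  "Area under the ROC curve via pairwise comparison: the fraction of
  (positive, negative) pairs in which the positive example receives the
  higher score, with ties counted as half a pair.
  `labels` are assumed to be 0 or 1, `scores` are numbers.
  Returns nil when only one class is present (AUC is undefined then)."
  [labels scores]
  (let [pairs (map vector labels scores)
        pos   (for [[l s] pairs :when (= 1 l)] s)
        neg   (for [[l s] pairs :when (= 0 l)] s)]
    (when (and (seq pos) (seq neg))
      (/ (reduce + (for [p pos, n neg]
                     (cond (> p n)  1.0
                           (== p n) 0.5
                           :else    0.0)))
         (* (count pos) (count neg))))))

;; Usage: three of the four positive/negative pairs are ordered correctly.
(roc-auc [0 0 1 1] [0.1 0.4 0.35 0.8])
;; => 0.75
```

A production version would sort the scores once instead of comparing every pair, but the result should be the same.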

Machine learning

General modeling, the aim should be to have something simple, usable, reliable and with a consistent interface.

Notable examples

Status

Something is moving lately in this area:

Forward

As stated earlier, whether we pursue an R model (with many small libraries) or the scikit-learn way (one big framework with batteries included), the important thing should be to have a common interface to algorithms and utilities. This would be the opposite of what happens in the R world.
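One way such a common interface could look is a small protocol that any library can implement. The sketch below is hypothetical: `Model`, `fit`, `predict`, `MeanRegressor`, `MajorityClassifier` and `fit-predict` are names invented here, not an existing Clojure API, and the two models are trivial baselines chosen only to keep the example short.

```clojure
(defprotocol Model
  (fit [model xs ys] "Returns a trained copy of the model.")
  (predict [model xs] "Returns one prediction per row of xs."))

;; Hypothetical baseline: always predicts the mean of the training targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [this _xs ys]
    (assoc this :mean (/ (reduce + ys) (double (count ys)))))
  (predict [_this xs]
    (repeat (count xs) mean)))

;; Hypothetical baseline: always predicts the most frequent training label.
(defrecord MajorityClassifier [label]
  Model
  (fit [this _xs ys]
    (assoc this :label (key (apply max-key val (frequencies ys)))))
  (predict [_this xs]
    (repeat (count xs) label)))

(defn fit-predict
  "Works with anything that satisfies Model, whichever library it comes from."
  [model train-xs train-ys test-xs]
  (predict (fit model train-xs train-ys) test-xs))

;; Usage: the calling code is identical for both models.
(fit-predict (->MeanRegressor nil) [[1] [2] [3]] [1 2 3] [[4] [5]])
;; => (2.0 2.0)
(fit-predict (->MajorityClassifier nil) [[1] [2] [3]] [:a :b :a] [[4]])
;; => (:a)
```

The same contract could be expressed with multimethods or plain maps of functions; the point is only that utilities such as cross-validation can be written once against the shared interface.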

Deep learning

Important for computer vision, NLP and other problems.

Notable examples

Status

We’re pretty much covered, especially thanks to @Carin_Meier’s work; what can really be improved are docs, examples and tutorials.

Forward

Just build on what’s already there

Disclaimer

None of the lists should be considered complete; they are just some examples. Of course these are my opinions, but everything is amendable by the community and I would really love to get a productive discussion about these topics. If you think something is missing, wrong, misplaced or anything else, just let the community know!
Yeah, I know about Incanter. I didn’t mention it on purpose, but if someone thinks that it is current and useful we can surely discuss it.

henrygarner · January 17, 2019, 11:58am

Hi, author of kixi.stats here. I fully support this initiative, count me in.

leontalbot · January 17, 2019, 12:50pm

Hey! Love this goal you set!

Wanted to mention also flare, a Dynamic Tensor Graph library in Clojure (think PyTorch, DyNet, etc.), and koala, both created by Aria Haghighi (Prismatic -> Facebook -> Amperity), who I think should for sure be part of your core team.

As for me, I’d be glad to join the user discussion

Webdev_Tory · January 17, 2019, 1:29pm

I’m excited about the prospects of a data science initiative in Clojure, and all the leads mentioned so far here. I’m a software engineer at a university and also a PhD Candidate in data science, particularly system and control theory. My colleagues are all in Python, while I try to eke by with DeepLearning4j and Incanter, and hopefully soon Neanderthal. My work is centrally NLP-related. I’m not sure what I have to contribute to this effort other than moral support and deep interest (I’m very much a newbie at the data science stuff), but I’m all for it!

alanmarazzi · January 17, 2019, 1:38pm

I just noticed that I left NLP out of the list I made; it would probably make sense to treat it as its own topic. If you know of Clojure tools for NLP, don’t hesitate to share them with us and I’ll add the topic to the list.

gigasquid · January 17, 2019, 2:15pm

Happy to participate. There are quite a few Clojure MXNet contributors now as well. It would be great for everyone to get involved.

Thank you for getting this initiative started.

cnuernber · January 17, 2019, 3:26pm

@alanmarazzi: Fantastic writeup; I think we both may end up going to fastmath!

I am not sure overlap really matters all that much, aside from documentation. I think, just in general, we should avoid criticizing architectures and stick to talking about features. We have enough tools for basic data science. My first step, were this my job, would be to evaluate exactly what we do have at the moment.

Reference Datasets & Solutions

What I would like to see is a set of datasets, starting with straight classification and regression. Ideally we can have datasets that have very different base attributes: small-N, large feature dimension, very large N (larger than fits in RAM), etc. We then solve these and get to great results; maybe we should pick from Kaggle, where we have examples of the best practices or at least the practices that are effective. I would stay away from computer vision, personally, as this is really time consuming and I do not think it matches the types of problems most Clojurists are going to encounter. For that matter, after writing the majority of Cortex, I tend to avoid NN architectures, but to each their own.

We then all try our different methods and talk about the results. Ideally we can develop best practices in e.g. visualization along the way. It really doesn’t matter who does best, but why they do best does matter, along with an objective evaluation of the end result plus great visualization. I agree that ClojureScript should present an advantage here, but perhaps beefing up the Jupyter support for Clojure would also help.

So that is one direction; have a set of datasets and solutions so we can quantitatively compare the toolkits and bring together some of the really interesting pieces we have on the table right now. Note that I am most interested in classification and regression but others may differ (clustering, ranking, anomaly detection, the list is infinite). With this done well I think most clojurists can just cut&paste their way through their own particular exploration.
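As a rough illustration of what "quantitatively compare the toolkits" could mean in code, here is a minimal sketch in plain Clojure. Everything in it is an assumption made for the example: the names `accuracy` and `compare-toolkits`, the shape of a dataset (a map with `:train-x`, `:train-y`, `:test-x`, `:test-y`), and the idea that a toolkit is wrapped as a function that takes training data and returns a prediction function.

```clojure
(defn accuracy
  "Fraction of predictions that equal the true labels."
  [truth predicted]
  (/ (count (filter true? (map = truth predicted)))
     (double (count truth))))

(defn compare-toolkits
  "Scores every toolkit on every dataset.
  `datasets` maps a name to {:train-x .. :train-y .. :test-x .. :test-y ..};
  `toolkits` maps a name to (fn [train-x train-y]) returning (fn [test-x]).
  Both shapes are assumptions of this sketch, not an existing convention."
  [datasets toolkits]
  (for [[dataset-name {:keys [train-x train-y test-x test-y]}] datasets
        [toolkit-name train] toolkits]
    {:dataset  dataset-name
     :toolkit  toolkit-name
     :accuracy (accuracy test-y ((train train-x train-y) test-x))}))

;; Usage with a made-up toy dataset and a majority-class baseline "toolkit".
(def majority-baseline
  (fn [_train-x train-y]
    (let [label (key (apply max-key val (frequencies train-y)))]
      (fn [test-x] (repeat (count test-x) label)))))

(compare-toolkits
 {:toy {:train-x [[0] [1] [2]] :train-y [:a :a :b]
        :test-x  [[3] [4]]     :test-y  [:a :b]}}
 {:majority majority-baseline})
;; => ({:dataset :toy, :toolkit :majority, :accuracy 0.5})
```

The result is just a sequence of maps, so it can be sorted, tabulated or plotted with whatever tools the group settles on.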

Missing Feature Dependency Graph

Another (parallel) pathway is: given the aggregate total of everything done and accessible in Clojure, what is missing? We can then just walk down the dependency graph, figure out good ways to get each thing, and slowly fill in the things that are nice over time. Scikit wasn’t built in a day. Right now I feel like good hyperparameter optimization is a large missing piece, and I am very doubtful at this point that anything is better than the best toolkits for Python; this includes Anglican-based systems.

Speaking of which, potentially the best way to get a lot of the missing pieces is just to shell out to Python. We can get multithreading and such by just using many shells, and while I feel like right now, if everyone had a rock, they would be throwing it at me, it just kind of makes sense. In this way we get great coverage in a format that matches the rest of the world. And getting good at this pathway guarantees us access to some of the cutting-edge stuff that we won’t get any other way. If we have to install Scala to use MXNet, we can sure as hell install the Python subsystems to get everything else on the planet. Finally, by working towards deeper Python ML integration, we also help people who want to do data science in Clojure not be marooned on a Clojure-only island.
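For readers who have not tried it, shelling out can be sketched with nothing more than `clojure.java.shell`, which ships with Clojure. The helper names below (`python-command`, `run-python`, `python-mean`) are hypothetical, and the sketch assumes a `python3` executable is on the PATH; it calls Python's standard-library `statistics` module only because that needs no extra installation.

```clojure
(require '[clojure.java.shell :as shell]
         '[clojure.string :as str])

(defn python-command
  "Builds the argument vector for running a snippet of Python source.
  Assumes `python3` is available on the PATH."
  [source]
  ["python3" "-c" source])

(defn run-python
  "Runs `source` in a fresh Python process and returns its trimmed stdout.
  Throws if the process exits with a non-zero status."
  [source]
  (let [{:keys [exit out err]} (apply shell/sh (python-command source))]
    (if (zero? exit)
      (str/trim out)
      (throw (ex-info "Python process failed" {:exit exit :err err})))))

(defn python-mean
  "Hypothetical helper: asks Python's `statistics.mean` for the mean of
  a collection of numbers and parses the printed result."
  [numbers]
  (Double/parseDouble
   (run-python (str "import statistics; print(statistics.mean(["
                    (str/join "," numbers)
                    "]))"))))
```

Each call starts its own process, so something like `(pmap python-mean [[1 2 3] [4 5 6]])` runs several shells in parallel. Passing data as text on the command line only works for small inputs; larger data would need files or a shared format. When used from a script, note that `clojure.java.shell/sh` uses futures internally, so calling `(shutdown-agents)` at the end avoids a delay on exit.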

Final Thoughts

My feeling is that our community is very very small. Building bridges is going to get us a lot farther than building islands.

daslu · January 17, 2019, 5:20pm

Thanks @otfrom!

@cnuernber is here, and I let @mikera know through twitter. Will you please point me to anyone else relevant? (or invite them?)

alex314159 · January 17, 2019, 6:00pm

Interested as well. Great summary from @alanmarazzi .

When I first started Clojure I was missing Python dataframes a lot. With a bit more experience I’m ok with manipulating data the Clojure way, and it’s really the plotting / display capabilities that I find very limited. You can get results, but you can’t present them easily. Related is the lack of a simple web framework. I’d like to crunch data in Clojure and be able to display tables and charts easily in the browser, with limited web development knowledge. Oz I think is a nice step in that direction.

alanmarazzi · January 17, 2019, 6:30pm

Thanks! I guess that since Tomasz (fastmath creator) is among the starters of this thing he will be happy about all this love for his library! Anyway both of us were looking at tech.smile and tech.xgboost and we were really liking the data serialization and deserialization, so probably the best way to go would be to take the best ideas from many implementations and either come up with something new or merge them into one of the existing libraries.

This is exactly one of the ideas we had: give people more examples and best practices to draw from. I agree on the NN part, but I guess the MXNet people can help in that direction!

I tend to use scikit-learn as an example more for its scope, consistency, docs and tutorials than about code quality (as you said it wasn’t built in a day, and in my opinion it is clear by just looking at the code). My point here being that we might cover most of scikit’s scope with many different libraries but it would be very nice to have integrated examples and a consistent API.

I have mixed feelings about this, but I won’t deny the fact that Python right now is the de facto lingua franca for machine learning. It might even be a way to cover some “holes” until something else comes out of the community.

I can only agree with you, and I would like to thank everyone for the great response to this! One of the main reasons I love Clojure is its community!

jsa-aerial · January 17, 2019, 8:05pm

Actually, I think the plotting/display capabilities are well underway. Thanks largely to the amazing work of the IDL folks who have given us Vega and Vega-Lite.

For plotting and charting with zero web dev knowledge, have a look at Saite, a general exploratory visualization app built on Hanami. You can work in the client (browser) and / or from an IDE/editor of your choice on the server side. Saite is evolving toward a new notebook capability. There are a number of issues about this (on the github) and if you think that sounds like an idea worth pursuing, I’d appreciate input on those issues (or others of your own).

If you want to create your own domain specific (or other general purpose) visualization app, look at Hanami. Of course that would involve web dev knowledge. Some new features supporting client only apps are coming. A lot of documentation now, but a lot more coming.

jsa-aerial · January 17, 2019, 8:13pm

One thing I would like to point out. This

make the plotting experience seamless: …
result would be a bar chart with reasonable defaults.

is totally on target.

This, however:

(bar my-data)

is a bad idea. Trying to cover all the various cases of this via a traditional functional or class/method based api is far too constraining and limited. Even worse would be resorting to macros. A much better approach is to just use data and, since we are using a great functional language, data transformations. That’s the route Hanami has taken.
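To illustrate the general idea of "just use data", here is a minimal sketch in plain Clojure. It is not Hanami's API (Hanami has its own template and substitution mechanism, which is not shown here); `deep-merge`, `bar-defaults` and `bar-spec` are names invented for this example, and the field names "x" and "y" are placeholders. The chart is an ordinary map in the shape of a Vega-Lite spec, and customization is an ordinary map merge rather than a growing list of function arguments.

```clojure
(defn deep-merge
  "Recursively merges maps; on conflict the right-most value wins."
  [& ms]
  (if (every? map? ms)
    (apply merge-with deep-merge ms)
    (last ms)))

(def bar-defaults
  "Reasonable defaults for a bar chart, expressed as a Vega-Lite style map."
  {:mark     "bar"
   :encoding {:x {:field "x" :type "nominal"}
              :y {:field "y" :type "quantitative"}}})

(defn bar-spec
  "Returns a chart spec as plain data. `overrides` is any map that is
  deep-merged over the defaults, so every part of the spec stays reachable."
  ([rows] (bar-spec rows {}))
  ([rows overrides]
   (deep-merge bar-defaults {:data {:values rows}} overrides)))

;; Defaults only:
(bar-spec [{:x "a" :y 1} {:x "b" :y 3}])

;; Changing one nested detail is a data transformation, not a new API:
(get-in (bar-spec [] {:encoding {:y {:field "total"}}}) [:encoding :y])
;; => {:field "total", :type "quantitative"}
```

The resulting map can be handed to any Vega-Lite renderer; the rendering step is left out because it depends on the tool in use.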

cnuernber · January 17, 2019, 8:44pm

@alanmarazzi: Thanks for the kind comments and seeing good discussion early is heartening.

I only have a small thing to add, that I took some time this morning to clearly explain the points behind the tech.ml-base system on the readme.

There is more to those systems than serialization: it comes down to being able to represent various parts of the problem as data (and datasets) that you can visualize in the REPL, plus some details around how models are represented generically.

Anyway, again, great thread!

generateme · January 17, 2019, 9:11pm

Haha (Tomasz here). Yes, I’m really happy. I also didn’t know about the existence of tech.smile, so we have another island. However, fastmath contains almost raw bindings to Smile and to certain elements of Apache Commons (besides classification I also added a bunch of clustering algorithms). I definitely want to treat it as a step toward building a richer environment. Let’s start thinking about bridges now.

daslu · January 17, 2019, 11:23pm

Thanks, everybody, for your kind comments. It seems good to continue this discussion a little bit further, before going back to discuss the format of a meeting, etc.

@alanmarazzi, @cnuernber – that was really enlightening, and I like the visions that you both suggested.

Here are some more suggested directions, that can possibly fit in.

grammar of graphics
R’s ggplot2 can work on pure JVM (through Renjin), and is easy to access from Clojure. I’ll write some examples soon.

interactive visualizations
Vega and Vega-Lite, Oz and Saite are wonderful.
For a richer collection of visual elements, personally I use some combination of hiccup, rojure, rmarkdown, htmlwidgets and crosstalk. I’ll write some examples soon. This is quite handy, but complicated and has some limitations. Here clojurescript could really shine with much simpler solutions.

calling python
It should be feasible to create a solution with, e.g., HyREPL for sending commands, and Apache Arrow for sharing memory. Thus, we can talk with live python sessions (which is more efficient when one needs to repeatedly access a large data structure).

probabilistic programming
A complete probabilistic programming library is arguably one of the important missing pieces. Anglican is useful and very expressive, but does not support the fastest sampling methods (that require differentiation). About Bayadera, probably @draganrocks can comment.
In the short run, maybe wrapping a non-clojure library would be the easiest. There are several very interesting candidates here.
In the long run, let us consider writing something more complete in Clojure. We have been discussing this with David Tolpin, who coded Anglican, and has some interesting new ideas.