Online meeting: clojure data science

daslu · January 16, 2019, 10:11pm

Dear Friends,

Recently, several of us have had some discussions about pushing forward the libraries and tooling for data science in clojure.

It seems that an online meeting could be a good step to create some coordination among the people acting in this field.

Here, we wish to discuss such a possible meeting.

Background

Some exciting developments are taking place. Fastmath, clj-boost, tech.xgboost, tech.smile, tech.ml-base, clojure-mxnet, tvm-clj, kixi.stats, oz, hanami, saite, anglican, metaprob, uncomplicate, IClojure, lein-jupyter, etc., all are under development, or at least got some recent attention. Also, nice surprises are in the oven and coming soon.

Several pieces are arguably still missing, before clojure can become a beginner friendly tool for data scientists to just take and play with. Tooling, documentation, tutorials, a more complete dataframe library, standard machine-learning APIs, interop with other languages, tools for conducting machine learning experiments, and some critical mass of popularity, all seem essential to make the situation friendlier.

Moreover, new opportunities arise. GraalVM, Apache Arrow, Gandiva, Renjin all promise beautiful benefits, and are just waiting for a functional lisp that can bring them some joy.

Possibly, better coordination and some discussion of priorities could help us go forward wisely.

The Goal

Let us ask ourselves the following: can we make Clojure a beginner friendly tool for data science by 2020?

Suggesting an online meeting

Online chat and email can be just fine for discussions of design and getting some advice on development.

However, for finer coordination, it could be nice if people knew each other better, and possibly felt more comfortable about cooperation. Maybe a good meeting is a step in that direction.

Here is a suggested concept (please comment to make it better):

An online meeting can go more or less like this.

1st part: learning together - everybody is invited. Everybody introduces themselves briefly, and then 3-4 people show short demos they have prepared.
2nd part: discussing development - library developers are invited. We conduct a focused discussion around 2-3 specific topics.
Repeat every month or so.

What do you think?

Please tell me if you have any thoughts about the concept, and if you may like to participate.

About myself
Hi, just a little bit about myself. In the last 6 years, I have used Clojure as my main tool for data science at my workplaces. I never managed to open-source anything serious, but I am now working on several small libraries. These days, my main efforts are in organizing a local workshop, where we meet every 3 weeks, learn together and take on some projects.

cnuernber · January 16, 2019, 10:39pm

Hey, author of tvm-clj here.

There are a few libraries missing:

Those two are based off of a general framework, tech.ml-base.

Using all of this together means you can generically gridsearch a classification or regression problem.
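To make the idea of a generic grid search concrete, here is a minimal sketch in plain Clojure. It does not use the tech.ml-base API; `param-grid`, `grid-search` and `toy-error` are hypothetical names made up for illustration, and the scoring function is a toy stand-in for "train a model with these options and return its validation error".

```clojure
(defn param-grid
  "Expands a map of option -> candidate values into a seq of option maps,
  one per combination."
  [grid]
  (reduce (fn [acc [k vs]]
            (for [m acc, v vs] (assoc m k v)))
          [{}]
          grid))

(defn grid-search
  "Scores every combination with `score-fn` (assumption: lower is better)
  and returns the best options together with their score."
  [score-fn grid]
  (->> (param-grid grid)
       (map (fn [opts] {:options opts :score (score-fn opts)}))
       (apply min-key :score)))

;; Toy stand-in for training a model and measuring its validation error.
(defn toy-error [{:keys [depth eta]}]
  (+ (Math/abs (- (double depth) 4.0))
     (Math/abs (- (double eta) 0.1))))

(grid-search toy-error {:depth [2 4 6] :eta [0.01 0.1 0.3]})
;; => {:options {:depth 4, :eta 0.1}, :score 0.0}
```

Because the search only sees a scoring function and a map of candidate values, the same code works for a classifier or a regressor, which is the sense of "generic" here.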

Next big step I have been researching a bit is building out a Gaussian-process-based hyperparam search system. I believe this is bopp but I want to be sure and there are simpler versions of that system.

Note that the above libraries are based on the same basis as tvm-clj, native buffers, so things like this come for free.

Interesting to note that clj-boost and the tech system both have basically the same dataset abstraction. The extra piece is the generic bindings to the gridsearch system so you can do hyperparam opt.

Also, smile has a ton of tools and is great for plain data exploration, especially mixed with a bit of incanter.

daslu · January 16, 2019, 10:54pm

Thanks, @cnuernber!

I updated the post.

BTW there is some overlap not only with clj-boost, but also with fastmath.

Very curious to look into your libraries!

jsa-aerial · January 16, 2019, 11:26pm

Hi,

Author of Hanami and Saite here. I think the basic idea sounds good, but would really need to have a ‘critical mass’ of participants to make it work. Not entirely sure who and how many that would be. Certainly Dragan would be important to have. Also not sure how easy/likely getting that ‘critical mass’ would be.

In any event, count me as interested!

Christopher_Small · January 17, 2019, 12:09am

Sounds like a great idea!

I think there’s a lot of value in building community around data science in Clojure, because its biggest problem right now (imo) is that it’s an awfully select (though growing) bunch. Creating a sense of space around this intersection will help others outside our bubble know that there’s even a bubble to be had here, why it’s worth it, and where to get started. There’s a real challenge in articulating the value proposition, and making the case for an underdog, and it’s worth us joining together to make it, and coordinate our resources in fleshing out the missing pieces of the Clojure + Data Science puzzle.

Thanks for bringing this idea to the community! You can count me in.

Christopher Small

cnuernber · January 17, 2019, 12:16am

@daslu: Yes, some overlap for sure, this is why we need to talk about this stuff openly and in a group!

daslu · January 17, 2019, 8:14am

Thanks @jsa-aerial, @Christopher_Small, your comments help!

(btw Dragan wrote on another platform that he is interested)

Let us discuss the format further (timing, platform, structure, etc.), and wait for some responses from other timezones.

kamil · January 17, 2019, 9:15am

Hey,

I did a thing using Clojure MXNet. I think the general idea is interesting and I want to do more data science in Clojure.

Definitely interested in an event.

Kamil Hryniewicz

draganrocks · January 17, 2019, 9:36am

This is a great initiative. I’m in.

otfrom · January 17, 2019, 10:27am

The cortex people should be here as well

alanmarazzi · January 17, 2019, 11:30am

Hey everyone! I’m the author of clj-boost and one of the people involved in this together with @daslu.

@cnuernber your work is quite impressive, I didn’t know about it when I developed clj-boost. I’ll be very happy to ditch clj-boost in favor of something better for the community, and I’m very happy we will be able to discuss these things all together!

About me

I’m currently a data scientist/engineer at a large Italian insurance company, but next month I’ll move into management at a new Fintech/Bank. I’ll always be involved in data science and I want reliable, simple and production ready stuff to move at a faster pace.

About Clojure

I discovered Clojure a couple years back and I’m currently moving from doing these things with Python to a full-stack Clojure experience. I think that there is a very high potential for doing data science with Clojure, but there are missing nuts and bolts here and there.

Are we scientists yet?

I really like how the Nim community is dealing with the same sorts of problems we’re facing, so I’ll try the same thing here to foster discussion. We might want to move these things in their own topic in the future or on other platforms, but that’s not the point right now.

The structure of this:

Name of the problem - data science is a stack of problems and one must have solutions to all of them to really be productive
Notable examples - what’s considered standard nowadays in other languages
Status - the current status of the matter
Forward - what is needed moving forward

Multidimensional arrays, Linear-algebra

Generic computation libraries. Here we should strive for the best: both GPU and CPU capability, multidimensional arrays, broadcasting, etc

Notable examples

Numpy
ND4J

Status

There are many libraries popping up at various levels of maturity, some of them are:

Forward

I think we can all agree that this degree of spread is not good: all these libraries represent time and resources that might have been spent on advancing other parts of the stack. We should settle on one or two of them and move on.

Plotting

Plotting is important for both analysis and presentation of results. Thanks to Clojurescript we probably have an edge over other languages here.

Notable examples

Status

Here there are many libraries as well, some of them are:

Forward

In this area taste is really important so it’s more normal to have more spread over different libraries. What we should do is to work on what is already available and make the plotting experience seamless:

(bar my-data)
;=> nil

The result would be a bar chart with reasonable defaults.

Geospatial library

Deal with coordinates on a map.

Notable examples

GeoPy
mapnik

Status

Not much that I’m aware of:

geo

Forward

This is another area where Clojure could shine thanks to its concurrency model. The fact that it would be easy to work with Spark or Onyx is certainly a plus.

Dataframe or similar

Today’s data scientists are used to working with tabular data; we have to deal with it.

Notable examples

Status

Not good: there are lots of stumps here and there but nothing has ever caught on. Some examples:

Forward

Here I would move on to wrapping Arrow, which has the potential to become the standard in the near future, but anything that works is very welcome!

Statistics & probprog

Very important as the base for ML systems and evaluation of models.

Notable examples

Status

There are already many examples:

Forward

What is missing here is the tooling: we need more abstractions over basic functionality. For instance a function to get the ROC-AUC score for model validation.

Also better docs and examples of what is achievable with these libraries.
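As an illustration of the kind of helper meant here, this is a minimal sketch of a ROC-AUC function in plain Clojure. It is not taken from any of the libraries in this thread; `roc-auc` is a hypothetical name, and the implementation uses the pairwise (Mann-Whitney) definition of AUC, which is quadratic in the number of examples and assumes both classes are present.

```clojure
(defn roc-auc
  "Area under the ROC curve: the probability that a randomly chosen
  positive example scores higher than a randomly chosen negative one,
  with ties counted as half. `labels` are 0/1, `scores` are model outputs.
  Assumes at least one positive and one negative label."
  [labels scores]
  (let [pairs (map vector labels scores)
        pos   (for [[l s] pairs :when (= 1 l)] s)
        neg   (for [[l s] pairs :when (= 0 l)] s)
        wins  (for [p pos, n neg]
                (cond (> p n)  1.0
                      (== p n) 0.5
                      :else    0.0))]
    (/ (reduce + wins)
       (* (count pos) (count neg)))))

(roc-auc [0 0 1 1] [0.1 0.4 0.35 0.8])
;; => 0.75
```

A production version would sort the scores once instead of comparing every pair, but the pairwise form is easier to check by hand.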

Machine learning

General modeling, the aim should be to have something simple, usable, reliable and with a consistent interface.

Notable examples

Status

Something is moving lately in this area:

Forward

As stated earlier, whether we pursue the R model (many small libraries) or the scikit-learn way (one big framework with batteries included), the important thing should be to have a common interface to algorithms and utilities. This would be the opposite of what happens in the R world.
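One way to picture a common interface is a small protocol that every algorithm implements. The sketch below is only an assumption about what such an interface could look like; `Model`, `fit`, `predict`, `MeanRegressor` and `fit-predict` are hypothetical names and do not come from any existing library.

```clojure
(defprotocol Model
  "Hypothetical common interface for trainable models."
  (fit [model xs ys] "Returns a trained copy of the model.")
  (predict [model xs] "Returns one prediction per row of xs."))

;; A trivial baseline that always predicts the mean of the training targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [this _xs ys]
    (assoc this :mean (/ (reduce + ys) (double (count ys)))))
  (predict [_this xs]
    (repeat (count xs) mean)))

(defn fit-predict
  "Works for anything that satisfies Model, whichever library provides it."
  [model train-xs train-ys test-xs]
  (predict (fit model train-xs train-ys) test-xs))

(fit-predict (->MeanRegressor nil) [[1] [2] [3]] [10.0 20.0 30.0] [[4] [5]])
;; => (20.0 20.0)
```

Utilities such as cross-validation or grid search can then be written once against the protocol, regardless of whether the algorithms live in one framework or in many small libraries.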

Deep learning

Important for computer vision, NLP and other problems.

Notable examples

Status

We’re pretty much covered especially thanks to @Carin_Meier’s work, what can be really improved are docs, examples and tutorials.

Forward

Just build on what’s already there

Disclaimer

None of the lists are to be considered complete, they are just some examples. Of course these are my opinions, but everything is amendable by the community and I would really love to get a productive discussion about these topics. If you think something is missing, wrong, misplaced or anything else just let the community know!
Yeah, I know about Incanter, I didn’t mention it on purpose, but if someone thinks that it is current and useful we can surely discuss it

henrygarner · January 17, 2019, 11:58am

Hi, author of kixi.stats here. I fully support this initiative, count me in.

leontalbot · January 17, 2019, 12:50pm

Hey! Love this goal you set!

Wanted to mention also flare a Dynamic Tensor Graph library in Clojure (think PyTorch, DynNet, etc.) and koala, both created by Aria Haghighi (Prismatic -> Facebook -> Amperity) which I think for sure should be part of your core team.

As for me, I’d be glad to join the user discussion

Webdev_Tory · January 17, 2019, 1:29pm

I’m excited about the prospects of a data science initiative in Clojure, and all the leads mentioned so far here. I’m a software engineer at a university and also a PhD Candidate in data science, particularly system and control theory. My colleagues are all in Python, while I try to eke by with DeepLearning4j and Incanter, and hopefully soon Neanderthal. My work is centrally NLP-related. I’m not sure what I have to contribute to this effort other than moral support and deep interest (I’m very much a newbie at the data science stuff), but I’m all for it!

alanmarazzi · January 17, 2019, 1:38pm

I just saw that I ditched NLP as a topic in the list I made, but it would probably make more sense to consider it as its own issue. If you know Clojure stuff for NLP don’t hesitate to share with us and I’ll add the topic in the list

gigasquid · January 17, 2019, 2:15pm

Happy to participate. There are quite a few Clojure MXNet contributors now as well. It would be great for everyone to get involved.

Thank you for getting this initiative started.

cnuernber · January 17, 2019, 3:26pm

@alanmarazzi: Fantastic writeup; I think we both may end up going to fastmath!

I am not sure overlap really matters all that much, aside from documentation. I think, just in general, we should avoid criticizing architectures and stick to talking about features. We have enough tools for basic data science. My first step, were this my job, would be to evaluate exactly what we do have at the moment.

Reference Datasets & Solutions

What I would like to see is a set of datasets, starting with straight classification and regression. Ideally we can have datasets that have very different base attributes: small N, large feature dimension, very large N (larger than fits in RAM), etc. We then solve these and get to great results; maybe we should pick from kaggle where we have examples of the best practices or at least the practices that are effective. I would stay away from computer vision, personally, as this is really time consuming and I do not think it matches the types of problems most clojurists are going to encounter. For that matter, after writing the majority of Cortex, I tend to avoid NN architectures but to each their own.

We then all try our different methods and talk about the results. Ideally we can develop best practices in e.g. visualization during this pathway. It really doesn’t matter who does best, but why they do best does matter, as does an objective evaluation of the end result plus great visualization. I agree that clojurescript should present an advantage here but perhaps also beefing up the jupyter support for clojure would help.

So that is one direction; have a set of datasets and solutions so we can quantitatively compare the toolkits and bring together some of the really interesting pieces we have on the table right now. Note that I am most interested in classification and regression but others may differ (clustering, ranking, anomaly detection, the list is infinite). With this done well I think most clojurists can just cut&paste their way through their own particular exploration.

Missing Feature Dependency Graph

Another (parallel) pathway is, given aggregate total of everything done and accessible in clojure, what is missing? We can then just walk down the dependency graph, figure out good ways to get each thing and just sort of slowly over time fill in the things that are nice. Scikit wasn’t built in a day. Right now I feel like good hyperparameter optimization is a large missing piece and I am very doubtful at this point that anything is better than the best toolkits for python; this includes anglican based systems.

Speaking of which, potentially the best way to get a lot of the missing pieces is just to shell out to python. We can get multithreading and such by just using many shells, and while I feel like right now if everyone had a rock they would be throwing it at me, it just kind of makes sense. In this way we get great coverage in a format that matches the rest of the world. And getting good at this pathway guarantees us access to some of the cutting edge stuff that we won’t get any other way. If we have to install scala to use mxnet, we can sure as hell install the python subsystems to get everything else on the planet. Finally, by working towards deeper python ML integration, we also help people who want to do data science in clojure not be marooned on a clojure-only island.
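For what it is worth, shelling out can be sketched in a few lines with `clojure.java.shell`. This is only a rough illustration under assumptions: it expects a `python3` executable on the PATH, passes the program as a one-line string, and `python-command` and `run-python` are hypothetical helper names.

```clojure
(require '[clojure.java.shell :as shell]
         '[clojure.string :as str])

(defn python-command
  "Builds the argument vector for running a one-line Python program.
  Assumes a `python3` executable is on the PATH."
  [code]
  ["python3" "-c" code])

(defn run-python
  "Runs `code` in a fresh Python process and returns its trimmed stdout,
  throwing if the process exits with a non-zero status."
  [code]
  (let [{:keys [exit out err]} (apply shell/sh (python-command code))]
    (if (zero? exit)
      (str/trim out)
      (throw (ex-info "Python process failed" {:exit exit :err err})))))

;; Each call is an independent process, so several can run in parallel:
(comment
  (doall (pmap run-python ["print(1 + 1)" "print(2 * 21)"])))
```

Real use would need a proper data exchange format between the two sides (files, JSON, Arrow, and so on) rather than parsing printed strings.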

Final Thoughts

My feeling is that our community is very very small. Building bridges is going to get us a lot farther than building islands.

daslu · January 17, 2019, 5:20pm

Thanks @otfrom!

@cnuernber is here, and I let @mikera know through twitter. Will you please point me to anyone else relevant? (or invite them?)

alex314159 · January 17, 2019, 6:00pm

Interested as well. Great summary from @alanmarazzi.

When I first started Clojure I was missing Python dataframes a lot. With a bit more experience I’m ok with manipulating data the Clojure way, and it’s really the plotting / display capabilities that I find very limited. You can get results, but you can’t present them easily. Related is the lack of a simple web framework. I’d like to crunch data in Clojure and be able to display tables and charts easily in the browser, with limited web development knowledge. Oz I think is a nice step in that direction.
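As a small illustration of manipulating data "the Clojure way", here is a sketch of a group-and-sum over a sequence of maps, roughly what a dataframe group-by would do. The `rows` data and the `sum-by` helper are made up for the example.

```clojure
;; Toy data: each row is just a map.
(def rows
  [{:city "Rome"  :sales 100}
   {:city "Milan" :sales 250}
   {:city "Rome"  :sales 50}])

(defn sum-by
  "Groups rows by `group-key` and sums `value-key` within each group."
  [group-key value-key rows]
  (into {}
        (map (fn [[g rs]] [g (reduce + (map value-key rs))]))
        (group-by group-key rows)))

(sum-by :city :sales rows)
;; => {"Rome" 150, "Milan" 250}
```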

alanmarazzi · January 17, 2019, 6:30pm

Thanks! I guess that since Tomasz (fastmath creator) is among the starters of this thing he will be happy about all this love for his library! Anyway both of us were looking at tech.smile and tech.xgboost and we were really liking the data serialization and deserialization, so probably the best way to go would be to take the best ideas from many implementations and either come up with something new or merge them into one of the existing libraries.

This is exactly one of the ideas we had: give people more examples and best practices to draw from. I agree on the NN part, but I guess the MXNet people can help in that direction!

I tend to use scikit-learn as an example more for its scope, consistency, docs and tutorials than about code quality (as you said it wasn’t built in a day, and in my opinion it is clear by just looking at the code). My point here being that we might cover most of scikit’s scope with many different libraries but it would be very nice to have integrated examples and a consistent API.

I have mixed feelings about this, but I won’t deny the fact that Python right now is the de-facto lingua franca for machine learning. It might even be a way to cover some “holes” until something else comes out of the community.

I can only agree with you, and I would like to thank everyone for the great answer out of this! One of the main reasons I love Clojure is because of its community!