Online meeting: Clojure data science

Hey everyone! I’m the author of clj-boost and one of the people involved in this together with @daslu.

@cnuernber your work is quite impressive; I didn’t know about it when I developed clj-boost. I’ll be very happy to ditch clj-boost in favor of something better for the community, and I’m very happy we will be able to discuss these things together!

About me

I’m currently a data scientist/engineer at a large Italian insurance company, but next month I’ll move into management at a new Fintech/Bank. I’ll always be involved in data science, and I want reliable, simple and production-ready tools so I can move at a faster pace.

About Clojure

I discovered Clojure a couple of years back and I’m currently moving from doing these things with Python to a full-stack Clojure experience. I think there is a lot of potential for doing data science with Clojure, but there are missing nuts and bolts here and there.

Are we scientists yet?

I really like how the Nim community is dealing with the same sorts of problems we’re facing, so I’ll try the same thing here to foster discussion. We might want to move these things into their own topic in the future, or onto other platforms, but that’s not the point right now.

The structure of this:

  • Name of the problem - data science is a stack of problems and one must have solutions to all of them to really be productive
  • Notable examples - what’s considered standard nowadays in other languages
  • Status - the current status of the matter
  • Forward - what is needed moving forward

Multidimensional arrays, Linear-algebra

Generic computation libraries. Here we should strive for the best: both GPU and CPU capability, multidimensional arrays, broadcasting, etc.

Notable examples

Status

There are many libraries popping up at various levels of maturity. Some of them are:

Forward

I think we can all agree that this degree of spread is not good: all these libraries represent time and resources that could have been spent on moving other parts of the stack forward. We should settle on one or two of them and move on.

Plotting

Plotting is important for both analysis and presentation of results. Thanks to ClojureScript we probably have an edge over other languages here.

Notable examples

Status

There are many libraries here as well. Some of them are:

Forward

In this area taste is really important, so it’s natural to have more spread across different libraries. What we should do is work on what is already available and make the plotting experience seamless:

(bar my-data)
;=> nil

The result would be a bar chart with reasonable defaults.
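To make the idea concrete, here is a minimal sketch of what such a `bar` function could look like. It assumes Vega-Lite as the rendering target and assumes the data is a sequence of maps with `:x` and `:y` keys; neither choice comes from the post, and the function is purely hypothetical. Unlike the call above, it returns the chart spec as plain data instead of rendering it, so any viewer could display it.

```clojure
;; Hypothetical helper, not part of any existing library.
;; Builds a Vega-Lite bar chart spec (a plain Clojure map) with defaults
;; that can be overridden through an options map.
(defn bar
  ([data] (bar data {}))
  ([data {:keys [x y] :or {x :x y :y}}]
   {:data     {:values (vec data)}
    :mark     "bar"
    :encoding {:x {:field (name x) :type "nominal"}
               :y {:field (name y) :type "quantitative"}}}))

;; Usage: defaults first, then custom field names.
(def my-data [{:x "a" :y 3} {:x "b" :y 7}])

(bar my-data)
(bar [{:city "Rome" :claims 12}] {:x :city :y :claims})
```

Returning data rather than drawing directly keeps the function easy to test and lets the same spec be rendered from a REPL, a notebook or a ClojureScript front end.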

Geospatial library

Deal with coordinates on a map.

Notable examples

Status

Not much that I’m aware of:

Forward

This is another area where Clojure could shine thanks to its concurrency model. The fact that it would be easy to work with Spark or Onyx is certainly a plus.

Dataframe or similar

Today’s data scientists are used to working with tabular data, so we have to deal with it.

Notable examples

Status

Not good: there are lots of stumps here and there but nothing has ever caught on. Some examples:

Forward

Here I would go for wrapping Arrow, which has the potential to become the standard in the near future, but anything that works is very welcome!

Statistics & probprog

Very important as the base for ML systems and evaluation of models.

Notable examples

Status

There are already many examples:

Forward

What is missing here is the tooling: we need more abstractions over basic functionality, for instance a function to get the ROC-AUC score for model validation.

Also better docs and examples of what is achievable with these libraries.
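As an illustration of the kind of helper meant by the ROC-AUC example, here is a minimal sketch in plain Clojure. The name `roc-auc` and its signature are assumptions, not an existing API. It uses the pairwise definition of AUC (the probability that a random positive is scored above a random negative, with ties counted as half), which is simple but quadratic, so a real library would likely use a rank-based implementation.

```clojure
;; Hypothetical helper, not taken from any existing library.
;; labels: sequence of 0/1 class labels
;; scores: sequence of model scores, same length as labels
;; Returns the AUC as a double, or nil when one of the classes is missing.
(defn roc-auc
  [labels scores]
  (let [pairs (map vector labels scores)
        pos   (for [[l s] pairs :when (= 1 l)] s)
        neg   (for [[l s] pairs :when (= 0 l)] s)
        ;; 1 for a correctly ordered pair, 0.5 for a tie, 0 otherwise
        wins  (for [p pos, n neg]
                (cond (> p n)  1.0
                      (== p n) 0.5
                      :else    0.0))]
    (when (and (seq pos) (seq neg))
      (/ (reduce + wins)
         (* (count pos) (count neg))))))

(roc-auc [0 0 1 1] [0.1 0.4 0.35 0.8])
;; => 0.75
```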

Machine learning

General modeling. The aim should be to have something simple, usable and reliable, with a consistent interface.

Notable examples

Status

Something is moving lately in this area:

Forward

As stated earlier, whether we pursue the R model (many small libraries) or the scikit-learn way (one big framework with batteries included), the important thing is to have a common interface to algorithms and utilities. That would be the opposite of what happens in the R world.
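One way to picture such a common interface is a small protocol that every algorithm implements, in the spirit of scikit-learn's fit/predict pair. The sketch below is only an assumption about what that could look like: the `Model` protocol and the toy `MeanRegressor` are hypothetical names made up for this example.

```clojure
;; Hypothetical common interface; names are illustrative only.
(defprotocol Model
  (fit [model xs ys] "Returns a new model trained on features xs and targets ys.")
  (predict [model xs] "Returns one prediction per row of xs."))

;; Toy baseline that always predicts the mean of the training targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [this _xs ys]
    (assoc this :mean (/ (reduce + ys) (double (count ys)))))
  (predict [_ xs]
    (repeat (count xs) mean)))

;; Any model implementing the protocol is used in exactly the same way.
(-> (->MeanRegressor nil)
    (fit [[1] [2] [3]] [2 4 6])
    (predict [[10] [20]]))
;; => (4.0 4.0)
```

Because `fit` returns a new immutable value instead of mutating the model, trained models can be compared, stored and passed around like any other Clojure data, whether they come from one big framework or from many small libraries.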

Deep learning

Important for computer vision, NLP and other problems.

Notable examples

Status

We’re pretty much covered, especially thanks to @Carin_Meier’s work. What can really be improved are docs, examples and tutorials.

Forward

Just build on what’s already there.

Disclaimer

None of the lists are to be considered complete; they are just some examples. Of course these are my opinions, but everything is open to amendment by the community and I would really love to have a productive discussion about these topics. If you think something is missing, wrong, misplaced or anything else, just let the community know!
Yeah, I know about Incanter, I didn’t mention it on purpose, but if someone thinks that it is current and useful we can surely discuss it :smile:
