Hey everyone! I’m the author of clj-boost and one of the people involved in this together with @daslu.
@cnuernber your work is quite impressive, I didn’t know about it when I developed clj-boost. I’ll be very happy to ditch clj-boost in favor of something better for the community, and I’m very happy we will be able to discuss these things together!
About me
I’m currently a data scientist/engineer at a large Italian insurance company, but next month I’ll move into management at a new Fintech/Bank. I’ll always be involved in data science and I want reliable, simple and production-ready tools so I can move at a faster pace.
About Clojure
I discovered Clojure a couple years back and I’m currently moving from doing these things with Python to a full-stack Clojure experience. I think that there is a very high potential for doing data science with Clojure, but there are missing nuts and bolts here and there.
Are we scientists yet?
I really like how the Nim community is dealing with the same sorts of problems we’re facing, so I’ll try the same thing here to foster discussion. We might want to move these things into their own topic in the future, or onto other platforms, but that’s not the point right now.
The structure of this:
- Name of the problem - data science is a stack of problems and one must have solutions to all of them to really be productive
- Notable examples - what’s considered standard nowadays in other languages
- Status - the current status of the matter
- Forward - what is needed moving forward
Multidimensional arrays, Linear-algebra
Generic computation libraries. Here we should strive for the best: both GPU and CPU support, multidimensional arrays, broadcasting, etc.
Notable examples
Status
There are many libraries popping up at various levels of maturity. Some of them are:
Forward
I think we can all agree that this degree of spread is not good: all these libraries represent time and resources that could have been spent on moving other parts of the stack forward. We should settle on one or two of them and move on.
Plotting
Plotting is important for both analysis and presentation of results. Thanks to ClojureScript we may well have an edge over other languages here.
Notable examples
Status
Here there are many libraries as well. Some of them are:
Forward
In this area taste is really important, so it’s normal to see more spread across different libraries. What we should do is work on what is already available and make the plotting experience seamless:
(bar my-data)
;=> nil
The result would be a bar chart with reasonable defaults.
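To make the idea concrete, here is a minimal sketch of how such a `bar` function could be built. It assumes Vega-Lite as the target format, which is my choice for illustration and not something settled in this post; `bar-spec` and `render!` are hypothetical names, and the defaults are arbitrary.

```clojure
;; Hypothetical helper: turns a map of label -> number into a Vega-Lite
;; bar chart spec expressed as plain Clojure data.
(defn bar-spec
  "Builds a bar chart spec from `data`, a map of label -> number.
  `opts` may override the default :width and :height."
  ([data] (bar-spec data {}))
  ([data {:keys [width height] :or {width 400 height 300}}]
   {:width    width
    :height   height
    :data     {:values (mapv (fn [[k v]] {:label (str k) :value v}) data)}
    :mark     "bar"
    :encoding {:x {:field "label" :type "nominal"}
               :y {:field "value" :type "quantitative"}}}))

;; Side-effecting wrapper matching the `(bar my-data) ;=> nil` shape.
;; `render!` stands in for whatever viewer is in use (assumption);
;; by default it just prints the spec.
(defn bar
  ([data] (bar data prn))
  ([data render!]
   (render! (bar-spec data))
   nil))

(bar {"a" 3 "b" 5})
```

Keeping the spec as plain data means the same function could feed a browser view through ClojureScript or a notebook on the JVM, with only `render!` changing.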
Geospatial library
Deal with coordinates on a map.
Notable examples
Status
Not much that I’m aware of:
Forward
This is another area where Clojure could shine thanks to its concurrency model. The fact that it would be easy to work with Spark or Onyx is certainly a plus.
Dataframe or similar
Today’s data scientists are used to working with tabular data, and we have to deal with that.
Notable examples
Status
Not good: there are lots of partial attempts here and there, but nothing has ever caught on. Some examples:
Forward
Here I would focus on wrapping Arrow, which has the potential to become the standard in the near future, but anything that works is very welcome!
Statistics & probprog
Very important as the base for ML systems and evaluation of models.
Notable examples
Status
There are already many examples:
Forward
What is missing here is the tooling: we need more abstractions over basic functionality. For instance, a function to compute the ROC-AUC score for model validation.
We also need better docs and examples of what is achievable with these libraries.
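As an illustration of the kind of utility I mean, here is a minimal sketch of a ROC-AUC function in plain Clojure. It uses the pairwise definition (the probability that a random positive is scored above a random negative, with ties counting as half), so it is quadratic in the worst case and meant for clarity, not speed. The name `roc-auc` and the 0/1 label encoding are assumptions for this example, not an existing library API.

```clojure
(defn roc-auc
  "Area under the ROC curve for binary `labels` (0 or 1) and numeric
  `scores`, given in the same order. Returns a double, or nil when
  one of the two classes is absent."
  [labels scores]
  (let [pairs (map vector labels scores)
        pos   (for [[l s] pairs :when (= 1 l)] s)
        neg   (for [[l s] pairs :when (= 0 l)] s)]
    (when (and (seq pos) (seq neg))
      ;; Count positive/negative pairs ranked correctly; ties count 0.5.
      (/ (reduce + (for [p pos, n neg]
                     (cond (> p n)  1.0
                           (== p n) 0.5
                           :else    0.0)))
         (* (count pos) (count neg))))))

(roc-auc [0 0 1 1] [0.1 0.4 0.35 0.8])
;; => 0.75
```

A production version would sort once and use ranks instead of comparing every pair, but the interface could stay the same.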
Machine learning
General modeling. The aim should be to have something simple, usable and reliable, with a consistent interface.
Notable examples
Status
Something is moving lately in this area:
Forward
As stated earlier, whether we pursue the R model (many small libraries) or the scikit-learn way (one big framework with batteries included), the important thing is to have a common interface to algorithms and utilities. This would be the opposite of what happens in the R world.
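A common interface could be as small as a protocol that every algorithm implements, regardless of which library it lives in. The sketch below is only an illustration: the `Model` protocol, its `fit`/`predict` names (borrowed from scikit-learn's vocabulary) and the toy `MeanRegressor` are all hypothetical, not part of any existing Clojure library.

```clojure
;; Hypothetical shared interface for all models.
(defprotocol Model
  (fit [model xs ys] "Returns a trained copy of the model.")
  (predict [model xs] "Returns one prediction per row of xs."))

;; Toy implementation: ignores the features and always predicts the
;; mean of the training targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [this _xs ys]
    (assoc this :mean (/ (reduce + ys) (double (count ys)))))
  (predict [_this xs]
    (mapv (constantly mean) xs)))

(-> (->MeanRegressor nil)
    (fit [[1] [2] [3]] [2 4 6])
    (predict [[10] [20]]))
;; => [4.0 4.0]
```

Because `fit` returns a new immutable value, utilities such as cross-validation or pipelines can be written once against the protocol and reused for any model that implements it.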
Deep learning
Important for computer vision, NLP and other problems.
Notable examples
Status
We’re pretty much covered, especially thanks to @Carin_Meier’s work. What can really be improved are docs, examples and tutorials.
Forward
Just build on what’s already there.
Disclaimer
None of the lists are to be considered complete; they are just some examples. Of course these are my opinions, but everything is amendable by the community and I would really love to have a productive discussion about these topics. If you think something is missing, wrong, misplaced or anything else, just let the community know!
Yeah, I know about Incanter. I left it out on purpose, but if someone thinks that it is current and useful we can surely discuss it.