Clojure data science public meeting

daslu · August 2, 2020, 3:47pm

We are organizing a Clojure data science public meeting with lightning talks and chat.

Please fill in your ideas & preferences.

daslu · August 2, 2020, 3:54pm

Any thoughts about this?

Any ideas regarding the content and the format?

didibus · August 2, 2020, 6:02pm

This may not be the best fit for a talk/chat format, and I’m coming at this as someone without much knowledge of data science who wants to dabble in it for fun, not in any serious professional setting.

With that in mind, one thing I’d like to find is tutorials along the lines of “How to build a prediction model for X” or “How to build a data insight visualization for dataset Y”.

In Clojure, obviously: since I only want to do this for fun, I’m not interested in doing it in other languages.

bartuka · August 3, 2020, 3:05am

My two cents on data science and Clojure. I worked at an ML company on both the data science and the engineering sides of the table, and I developed this rationale about the field:

I really don’t see why it would be beneficial for someone to build a “prediction model for X in Clojure” other than out of curiosity and, of course, “because we can!”. The reason is that I believe it adds little to no value or real-life improvement.

What I noticed is that several subjects and goals are conflated inside the term “data science”, which makes every discussion very difficult. I will try to separate out some sub-categories, based only on guesswork:

Data exploration
Problem finding
Descriptive analysis e.g. reports, monitoring, etc towards some purpose
Predictive analysis e.g. ‘encoding’ your problem as a regression/classification problem
Feature engineering e.g. ‘now, what if I create this crazy new combination of information and see what happens?’
Parameter tuning e.g. I found that an LSTM is the best model to encode my problem with; how can I avoid the N problems that might occur (local minima, overfitting, biases, etc.)?
Deployment

There are lots of tools for descriptive analysis, and Python and R are great tools for data exploration, problem finding, and feature engineering (I have seen economics majors with no prior coding experience program in R and create really insightful, relevant features). So my question is: what is the real benefit of bringing predictive analysis and parameter tuning into Clojure? The implementation of a neural network will not be better: C++ already backs all the major Python libraries, and GPU support is already being leveraged all over the place if performance is the issue.

By no means am I saying you should not invest time in this area. Having options is really good, and your own interest may be all the motivation you need to create something really awesome and turn the tide. Please do!

However, where do I see the real value? On the “runtime” and “engineering” side of the job. What do I mean by that? Most ML models (especially at small scale, i.e. not Google/Amazon scale) are based mainly on correlation, not causation. That is interesting in itself, because you now have to keep track of the assumptions you made during model development and check that the world is not changing under you in a way that completely invalidates your production model.

To be more specific: if a model uses 3 features, which characteristics of those features directly affect the model’s performance? How are their values distributed? It would be nice to test whether, if the distribution of a feature changes by X%, your model still performs acceptably.

Wouldn’t it be nice to have tooling around your production environment that keeps track of that and alerts the team when something goes off the rails?
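As a rough illustration of the kind of check described above, here is a minimal sketch of a drift monitor for a single numeric feature. It is written in Python with only the standard library, purely for brevity. It uses the Population Stability Index (PSI), which is one common drift measure and not something named in this thread. The function names, the 10-bin layout, and the 0.2 alert threshold (a widely quoted rule of thumb) are all assumptions for the example.

```python
import math


def psi(reference, live, bins=10):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of the reference (training-time) sample.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so live values outside the reference range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))


def check_drift(reference, live, threshold=0.2):
    """Return the PSI score and whether it crosses the (assumed) alert threshold."""
    score = psi(reference, live)
    return {"psi": score, "alert": score > threshold}


# Hypothetical data: the training-time feature values and a shifted production sample.
training_values = [i / 100 for i in range(1000)]
production_values = [v + 5 for v in training_values]
report = check_drift(training_values, production_values)
```

In a real system the `alert` flag would feed whatever paging or dashboard tooling the team already uses, and the same check would run per feature on a schedule.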

In this regard, I think it would be great to have tools like Uber’s Michelangelo available to small shops, so they can put their models into production with confidence and an understanding of what is going on. Papers such as “Hidden Technical Debt in Machine Learning Systems” seem to point to a lack of “best software development practices” around ML processes. Wouldn’t it be nice to provide tooling for that?

Another example is feature stores. Just last week a Brazilian startup announced that it had open-sourced its internal feature store (Butterfree). This is a problem for everyone working in this area, and there are probably several suboptimal solutions out there.

Another necessity I see: model results and studies need to be reproducible! How can you not think about immutability there?
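One way to picture the link between reproducibility and immutability is an experiment description that cannot be changed after creation and that hashes to a stable fingerprint. The sketch below is in Python for brevity and makes several assumptions: the `ExperimentSpec` fields, the `fingerprint` scheme, and the `train_test_indices` helper are all hypothetical names invented for this example.

```python
import hashlib
import json
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExperimentSpec:
    """Immutable description of everything a run depends on (hypothetical fields)."""
    dataset_sha256: str  # hash of the exact input data
    features: tuple      # feature names, in order
    params: tuple        # (name, value) pairs
    seed: int

    def fingerprint(self):
        # Equal specs always serialize the same way, so they hash the same way.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def train_test_indices(spec, n_rows, test_fraction=0.25):
    """Deterministic split driven only by the spec, with no global RNG state."""
    rng = random.Random(spec.seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * test_fraction)
    return tuple(indices[cut:]), tuple(indices[:cut])


# Placeholder values, for illustration only.
spec_a = ExperimentSpec("abc", ("age", "income"), (("learning_rate", 0.1),), 42)
spec_b = ExperimentSpec("abc", ("age", "income"), (("learning_rate", 0.1),), 42)
spec_c = ExperimentSpec("abc", ("age", "income"), (("learning_rate", 0.1),), 43)
```

Storing results under `spec.fingerprint()` means a result can always be traced back to the exact inputs that produced it, and an attempt to mutate a spec in place raises an error instead of silently changing history.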

I remember watching a talk from Uber where they had models at the country, state, city, and neighborhood levels, competing with each other at runtime to decide how best to serve a request.
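To make the idea concrete, here is a toy sketch of models at different geographic levels competing for a request. It is not Uber’s actual design, which is not described in this thread. The selection rule (lowest recent error among the models that cover the request, with ties going to the most specific level) and every name in the code are assumptions made for illustration, and Python is used only for brevity.

```python
from dataclasses import dataclass
from typing import Callable

# Most specific level first, so ties in error favor the more local model.
LEVELS = ("neighborhood", "city", "state", "country")


@dataclass(frozen=True)
class Candidate:
    level: str
    key: str
    predict: Callable[[dict], float]
    recent_error: float  # e.g. a rolling validation error, tracked elsewhere


def pick_model(request, registry):
    """Return the covering candidate with the lowest recent error, or None."""
    covering = [
        registry[(level, request[level])]
        for level in LEVELS
        if (level, request.get(level)) in registry
    ]
    if not covering:
        return None
    return min(covering, key=lambda c: c.recent_error)


# Hypothetical registry with made-up error values.
registry = {
    ("country", "BR"): Candidate("country", "BR", lambda r: 10.0, 0.30),
    ("city", "Sao Paulo"): Candidate("city", "Sao Paulo", lambda r: 12.0, 0.12),
    ("neighborhood", "Pinheiros"): Candidate("neighborhood", "Pinheiros", lambda r: 15.0, 0.20),
}

request = {"country": "BR", "city": "Sao Paulo", "neighborhood": "Pinheiros"}
winner = pick_model(request, registry)
```

Because the registry is plain immutable data keyed by level and location, swapping in a retrained model is a matter of building a new registry value, which is the kind of workload the next paragraph argues Clojure suits well.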

OK, all of that to say that I see most if not all of these situations as great opportunities for Clojure products: they are heavily data-driven, immutability is sometimes a must-have, they are concurrent, and they require good performance and systems that are not brittle.

Data scientists would not develop those directly (maybe that is the job of ML engineers these days, I don’t know), but they will definitely come up with good ideas around them, and that might be enough to make them willing to learn some Clojure on the side in order to contribute.

Sorry if I took your post down a different path. But a meeting/chat to explore and try to de-complect the main pain points the data science community is facing could help brave Clojure soldiers start working on possible solutions.

daslu · August 3, 2020, 8:19pm

@didibus @bartuka thanks a lot for the thoughtful comments!

Save the date: August 30th, 5pm UTC.
More details will be announced soon.

daslu · August 3, 2020, 8:34pm

We would love to hear any further suggestions for the lightning talks or the discussion topics.
