2021-09 - Plans & Hopes for Clojure Data Science

Since the beginning of 2021, we’ve had a habit of a monthly thread where people could share their hopes for the emerging data science ecosystem. Those threads have lived on the Clojurians Zulip. We decided to move them to Clojureverse: many new friends are getting involved, and it seems important to have this dialogue in a more visible place.

Through periodic updates, we can help each other catch up, think about the bigger picture, and see how our efforts may tie together. It’s also a good way for each of us to remind ourselves of what we have done, and what we would like to do in the near future.

It would be great if you all would consider the following questions and briefly share your views on them. Please skip anything you find irrelevant. Keep in mind, these are only prompts to get you thinking.

  • Are you working on anything related to the Clojure ecosystem for data science / scientific computing / data tooling / data engineering? Let us know about it.
  • Have you been doing anything interesting in the last month?
  • Is there any new realization or change in your hopes and beliefs about the ecosystem’s future?
  • What are you hoping to create/learn/explore in the coming month? … and in the coming 3 months?
  • What developments are you hoping to see in the ecosystem and community in the coming month? … and in the coming 3 months?

Also: if you are interested in seeing what you or others have written in the past few months, here are some links to the previous threads:

Looking forward to hearing about what everyone has been up to and hopes to be up to!

:pray:


I started working on my vision for a feature-processing service. The super high-level concept can be seen here: https://gist.github.com/jcpsantiago/320e3665a9bd749fc25ede0341c6323c. Such a system would compute features for models on demand, and also store those computations in a database (as part of a larger “feature store” system), so that data scientists can train models without rewriting data-transformation code multiple times.
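To make the idea a bit more concrete, here is a minimal Clojure sketch of that on-demand flow. All names here are hypothetical (`compute-features`, `features-for`), and an atom stands in for the real feature-store database:

```clojure
;; Minimal sketch of an on-demand feature service.
;; An atom stands in for the database backing the "feature store".
(def feature-store (atom {}))

(defn compute-features
  "Pure feature computation -- the part data scientists would
  otherwise rewrite for every model. Hypothetical fraud features."
  [order]
  {:amount-log   (Math/log (double (:amount order)))
   :night-order? (>= (:hour order) 22)})

(defn features-for
  "Return features for an order: reuse stored ones if present,
  otherwise compute on demand and persist for later training runs."
  [order]
  (let [id (:id order)]
    (or (get @feature-store id)
        (let [fs (compute-features order)]
          (swap! feature-store assoc id fs)
          fs))))

(features-for {:id 1 :amount 100.0 :hour 23})
```

Serving and training would then read from the same store, which is what removes the duplicated transformation code.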

For me personally, it’s a necessary step toward finally deploying my company’s anti-fraud (XGBoost) model using Clojure. At the moment I’m stuck with R, because the recipes package does all the preprocessing, which is then used by the workflows package during cross-validation.

I’m still surprised nobody has done this, especially the larger companies using Clojure (looking at you, Nubank), instead of using single-threaded Python pipelines, rewriting code in Scala, or dumping everything into complicated pieces of software such as Kafka and Spark.
