Learning data science with Clojure?

data-science

#1

After seeing the big thread about data-science, I got kinda interested to learn some of it for fun. I have a CS background, distributed systems, backend and embedded. Though I work with what I call pseudo NLP, which includes text parsing and cleaning, simple labeling, as well as pseudo AI, such as fuzzy string similarity, TF/IDF, fuzzy search, random forests and simple decision tree models, some simple regressions, use of forward chaining rules a la expert systems, some logic backward chaining rules as well here and there, etc.

When I learn, I like to start with real problems and exercises and work backwards, a bit in the style of the Feynman Technique.

But I haven’t been able to find realistic exercises (think real business problems you’d get on the job as a data scientist) for me to go through and learn data science.

Can anyone list or point to resources that have proper exercises and realistic problem statements which I can go through to learn me some data-science?

P.S.: I’ve tried Kaggle, and while it has data sets to mess with, it doesn’t seem to have exercises. Like it’s very much just, here’s data you can mess around with.


#2

Kaggle has some interesting real life scenarios as part of their free learning program:
https://www.kaggle.com/learn/overview
You can also view their competitions, tackle some by yourself and review the winning solutions.

fast.ai have both best practices and a great tech blog about AI/ML.


#3

Bump!

Can no one give me a single practice problem example?


#4

Hi @didibus!

I am not well informed about learning materials, but I think the suggestions of @silicakes are great.

Here are some more suggestions, from my very biased and partial point of view.

Reading into popular solutions of Kaggle questions like these, breaking them into steps, understanding the rationale of each step, re-implementing it in clojure, playing with some variations, reflecting, then going to the next step – that can be a great learning process imho.

Some of us are going through this process these days, trying to figure out some “best practices” on the clojure side, that can hopefully later become useful teaching materials. If you wish to join us with that, that would be of great help! :slight_smile:

For computational probability, you may like this book by Goodman and Tenenbaum, that does have exercises (should be fun to solve in Clojure with Anglican).

For Bayesian methods, some people recommend this book by Cam Davidson-Pilon, that does have exercises (at least some of them can be handled conveniently in Clojure).

For theoretical exercises about machine learning you may like this book by Hastie, Tibshirani and Friedman (though surely there are more recent ones).

If you run into any specific problem, let us discuss it!
Some of us are more present at the data-science stream at the clojurians zulip nowadays.


#5

Hum, I’ll try out that Kaggle exercise.

I feel I might be confused about what exactly data science is, as it seems to only focus on predictions and forecasting.

If that’s the case, Kaggle makes sense as a source for exercises.

It also seems that it has quite a lot in common with creative art disciplines.

I’ve been thinking, what would it mean to apply data science to fruits? And try that as an exercise for learning.

So, I’m thinking:

  1. Find data about fruits. Things like list of all fruits, their color, size, origin, price, seasons, etc.

  2. Clean and normalize the data, so things are related and formats are consistent.

  3. Start to put together interesting charts and graphs from the data. Here I might try to find patterns and revealing details about fruits that are interesting and maybe non-obvious.

  4. Try to apply automated methods to find even more interesting patterns and correlations which I failed to see. Not sure what techniques exist for this? I’m thinking unsupervised ML models, are there non ML techniques as well?

  5. Now come up with some fun predictive challenges, maybe inspired by the correlations I found. Say, can I predict a fruit’s color? Or size? Or its price? I’m thinking supervised ML models here, but again, is there any non ML technique for this as well? Apart from good old rule-based AI?
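As a rough illustration of steps 2, 4 and 5, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the fruit records, the field names (`size_cm`, `price_usd`) and the helper names (`clean`, `pearson`, `fit_line`) are all made up. It uses plain statistics (a Pearson correlation and a least-squares line) as one possible non-ML answer to the questions above; it is not meant as the canonical way to do any of these steps.

```python
from statistics import mean

# Hypothetical raw records; every name and value is invented for illustration.
RAW = [
    {"name": " Apple ", "color": "Red", "size_cm": "8", "price_usd": "$0.50"},
    {"name": "banana", "color": "YELLOW", "size_cm": "18", "price_usd": "0.25"},
    {"name": "Cherry", "color": "red ", "size_cm": "2", "price_usd": "$0.10"},
    {"name": "Mango", "color": "Orange", "size_cm": "12", "price_usd": "$1.50"},
]


def clean(record):
    """Step 2: trim whitespace, lowercase labels, parse numbers."""
    return {
        "name": record["name"].strip().lower(),
        "color": record["color"].strip().lower(),
        "size_cm": float(record["size_cm"]),
        "price_usd": float(record["price_usd"].lstrip("$")),
    }


def pearson(xs, ys):
    """Step 4: Pearson correlation, a non-ML way to spot linear relations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Step 5: ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


fruits = [clean(r) for r in RAW]
sizes = [f["size_cm"] for f in fruits]
prices = [f["price_usd"] for f in fruits]

r = pearson(sizes, prices)                   # how strongly size and price move together
intercept, slope = fit_line(sizes, prices)   # a very naive price predictor
predicted = intercept + slope * 10.0         # guessed price of a 10 cm fruit
```

With four invented rows the numbers mean nothing; the point is only the shape of the workflow: clean first, look for relations, then try a prediction.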

What do people think of this? Does it seem like a typical data science problem?


#6

@didibus
I think that makes sense. I work with a math/stats guy that I consider to be a data scientist (PhD hydrodynamics). He would just say that Python is a neat tool for doing math.

When he works, he uses visualization as his primary tool for understanding. He’s not limiting himself to “repl-debuggable” functions, but working on larger data sets.


#7

@didibus this looks like a wonderful plan - it would be great if you could share and discuss your journey through it.

As you hinted, data science is not just about prediction. It is about interpretation, discovery, reflection, storytelling and many other aspects. However, having an external criterion is usually something that may provide a sense of direction, as well as a way to figure out what is important. Even if artificial, this may come in handy in the process (for example, to distinguish actually valid discoveries from mere anecdotes).


#8

This is realistic. Unfortunately, I don’t expect many business problems to come well structured. For the most part, either you have a loosely defined issue and then have to work out whether you can solve it with the data you have, or the exact opposite: you’re given a dataset and you can play with it.

I don’t think that Kaggle reflects the real world, because usually you don’t get all your data in a single and clean file. Most likely you’ll start from some data you have available, merge it with another dataset and then enrich everything with other data you might find on the web.

Moreover, after training a decent model don’t stop there! You want to serve that model, right? So you have to turn all the data transformations you did into an ETL pipeline, get the resulting data somewhere, think about a strategy to regularly update the data by applying the same ETL pipeline and finally serve predictions in some way (usually via a REST API).
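To make the "same ETL pipeline for training and serving" point concrete, here is a minimal sketch in plain Python. All names (`transform`, `train`, `predict`, `handle_request`), the `origin` and `price_usd` fields, and the toy "average price per origin" model are assumptions invented for the example. The HTTP layer is deliberately left out: `handle_request` only shows the JSON-in, JSON-out step a REST handler would wrap.

```python
import json


def transform(raw):
    """The single ETL step shared by training and serving."""
    return {"origin": raw["origin"].strip().lower()}


def train(rows):
    """Toy 'model': average price per origin, with a global fallback."""
    prices_by_origin = {}
    for raw in rows:
        features = transform(raw)  # same transform as at serving time
        prices_by_origin.setdefault(features["origin"], []).append(
            float(raw["price_usd"])
        )
    all_prices = [p for prices in prices_by_origin.values() for p in prices]
    return {
        "per_origin": {k: sum(v) / len(v) for k, v in prices_by_origin.items()},
        "default": sum(all_prices) / len(all_prices),
    }


def predict(model, raw):
    """Apply the shared transform, then look up the learned average."""
    features = transform(raw)
    return model["per_origin"].get(features["origin"], model["default"])


def handle_request(model, body):
    """What a REST handler would do with the request body: JSON in, JSON out."""
    return json.dumps({"price_usd": predict(model, json.loads(body))})


# Hypothetical training data with messy origin labels.
training_rows = [
    {"origin": " Spain", "price_usd": "1.0"},
    {"origin": "spain ", "price_usd": "3.0"},
    {"origin": "Chile", "price_usd": "5.0"},
]
model = train(training_rows)
response = handle_request(model, '{"origin": "SPAIN"}')
```

Because `predict` calls the same `transform` as `train`, a request with `"SPAIN"` lands on the same key as the messy training labels. Re-running `train` on refreshed data is then the "regular update" part of the strategy.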

For bonus points you might want to deliver an online learning model: if after serving the prediction you can get feedback on the quality of the prediction, you can make the model predict, then learn incrementally after every new observation.
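A minimal sketch of that predict-then-learn loop, assuming a linear model updated by stochastic gradient descent on squared error. The class name `OnlineLinearModel`, the helper `predict_then_learn`, the learning rate and the synthetic stream (y = 2x + 1) are all made up for the example; a real system would get its feedback from actual outcomes rather than a generated list.

```python
class OnlineLinearModel:
    """Linear regression updated one observation at a time (SGD on squared error)."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def learn(self, x, y):
        # Gradient step on 0.5 * (prediction - y) ** 2 for this single observation.
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def predict_then_learn(model, stream):
    """Serve a prediction first, then update once the true value arrives."""
    errors = []
    for x, y in stream:
        y_hat = model.predict(x)       # what we would have served
        errors.append(abs(y_hat - y))  # feedback on that prediction
        model.learn(x, y)              # incremental update
    return errors


# Synthetic, noise-free stream following y = 2x + 1 (an assumption for the demo).
stream = [([x], 2.0 * x + 1.0) for _ in range(2000) for x in (0.0, 0.5, 1.0)]
model = OnlineLinearModel(n_features=1, learning_rate=0.1)
errors = predict_then_learn(model, stream)
```

Recording the error before calling `learn` matters: it measures the model on data it has not yet seen, which gives a running estimate of how well it would have performed in production.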

By the way, this IS data science :smile: you might be surprised at what people call AI/Machine learning these days (most of the stuff you find in ready-made products is either K-means and/or KNN).


#9

@didibus -

If you can build a thing that makes solid predictions then you can test your predictions by buying futures and then selling them…

I am kidding a little bit. But interesting things come from being able to say something concrete about the future. If you take the time to find data sources on fruits and fruit markets then you may be able to say things like:

  1. This crop will be about X tons this year.
  2. The price of this item will be about X at market Y.
  3. The lack of input X to the system will have Y effect.

Just some thoughts on interesting things to say about fruit. Basically, take a couple of months and become an expert on the fruit market from farming to selling to consumers. During this time do your data exploration and potentially try to answer your own questions. At the end of this period try to make concrete predictions about the fruit market and see if they work or not.

This would be an absolutely realistic way to both become an expert in the fruit market and learn a solid amount of data science. Then the ML is the cherry on top. And you are in a position to either invest or to at least have very interesting conversations with lots of people from fruit vendors on the street to farmers to people playing in the fruit market.