Online meeting: Clojure data science

My feeling is that a dataframe library hasn’t emerged because a vector of maps + clojure.core is a good-enough “dataframe” for most uses. What are we missing in comparison to pandas?
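
For instance, something like a per-group aggregate is already a few lines of clojure.core (a rough sketch with made-up data):

;; flights as a plain vector of maps
(def flights
  [{:carrier "AA" :delay 12}
   {:carrier "AA" :delay -3}
   {:carrier "DL" :delay 7}])

;; mean delay per carrier, clojure.core only
(->> flights
     (group-by :carrier)
     (map (fn [[carrier rows]]
            {:carrier    carrier
             :mean-delay (double (/ (reduce + (map :delay rows))
                                    (count rows)))})))
;; => ({:carrier "AA", :mean-delay 4.5} {:carrier "DL", :mean-delay 7.0})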

2 Likes

I completely agree with you, but I mean what I said: dataframes are used for ETL, analytics, machine learning, and even by some for regular backend development. If you take such a fundamental abstraction away from people doing these things, maybe 70%-80% of them will go back to what they already know.

In data science the “beginner curse” is even worse than in regular software development: in the next couple of years, nearly half of the data scientists/engineers in the workforce (or about to join it) will have 2 years or less of working experience.

This means that if we don’t get beginners on board, we’re simply done. IMHO the real issue to solve is: how do we get beginners on board? And one of the possible answers is to give them well-documented and simple dataframes.

About pandas: there’s a lot of tooling built around it, and doing something like the code below in clojure.core + a vector of maps is not so trivial and would likely have to be split into more parts. Moreover, docs are somewhat lacking on this front, and let’s not talk about examples, tutorials and such.

# assumes `import pandas as pd` and a flights DataFrame `df` with
# 'dep_time', 'unique_carrier' and 'fl_num' columns
(df.dropna(subset=['dep_time', 'unique_carrier'])
   # keep only the 5 most frequent carriers
   .loc[df['unique_carrier']
       .isin(df['unique_carrier'].value_counts().index[:5])]
   .set_index('dep_time')
   # flights per carrier per hour
   # (pd.TimeGrouper is deprecated in newer pandas; pd.Grouper(freq="H") is the equivalent)
   .groupby(['unique_carrier', pd.TimeGrouper("H")])
   .fl_num.count()
   .unstack(0)
   .fillna(0)
   # 24-hour rolling sum, i.e. flights per day
   .rolling(24)
   .sum()
   .rename_axis("Flights per Day", axis=1)
   .plot()
)
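
To be clear about what I mean by “split in more parts”, a rough clojure.core sketch of just the filtering and grouping half might look like this (assuming flights is a vector of maps with :dep-time, :unique-carrier and :fl-num, and truncate-to-hour is a hypothetical helper):

(defn top-carriers [flights n]
  (->> flights
       (map :unique-carrier)
       frequencies
       (sort-by val >)
       (take n)
       (map key)
       set))

(defn hourly-counts [flights]
  (let [flights (remove #(or (nil? (:dep-time %))
                             (nil? (:unique-carrier %)))
                        flights)
        top5    (top-carriers flights 5)]
    (->> flights
         (filter (comp top5 :unique-carrier))
         (group-by (juxt :unique-carrier
                         #(truncate-to-hour (:dep-time %)))) ;; hypothetical helper
         (map (fn [[[carrier hour] rows]]
                {:carrier carrier :hour hour :flights (count rows)})))))

;; ...and the unstack, fill, 24-hour rolling sum and plot are still missing.
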
1 Like

I’m not familiar with pandas, so I can’t read your example fully, but I appreciate your point. My initial reaction would be to see if I can provide functions that achieve the equivalent but using a collection of maps as input. I would then try to see if/how it would be possible to make this beginner-friendly, maybe by introducing some abstraction on top. Maybe the new mechanism of protocol implementation via metadata (Clojure 1.10) would allow us to have our cake and eat it (have plain data but with some extra convenience for beginners/everyone).
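
To make that last idea concrete, here is a minimal sketch of what I mean (the protocol and names are hypothetical), using 1.10’s :extend-via-metadata so the value stays a plain vector of maps:

(defprotocol Dataset
  :extend-via-metadata true
  (column [ds k] "Return column k as a sequence."))

(def ds
  (with-meta [{:a 1 :b 2} {:a 3 :b 4}]
    {`column (fn [rows k] (map k rows))}))

(column ds :a) ;; => (1 3)
(mapv :b ds)   ;; still plain data, so ordinary functions keep working => [2 4]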

3 Likes

In my experience, beginners are not looking for data frames (at least not actively). They are looking for an easy way to solve problems, and in most cases they don’t even know exactly what problems they have. Simply put, they would like to learn to somehow process data and get interesting-looking results. They will opt for whatever you teach them to do, if there is a way to convince them to learn from you, and if you deliver that teaching.

R data frames look nice for the simplest tasks, but quickly get messy and inconsistent. Is the example you’ve posted easy, even for non-beginners? I don’t think so. However, the beginner will have to deal with it, because they will be looking for “learn data science” resources, and everything they find will be countless R or Python resources. In that case, Clojure can offer whatever awesome stack it likes, and no one will look at it, regardless of whether it has data frames or not.

I want to stress the importance of learning resources, especially detailed, high-quality ones that teach concepts rather than just showing the technology. Now it might be too late, or not, but there needs to be a book called “Beginner’s Guide To Data Science” or something like that, and that book needs to teach all the common tasks the prospective reader will have: loading data, cleaning data, calling a selection of common algorithms, displaying results, discussing results. It doesn’t matter whether the code is similar to R and Python, but it absolutely matters that whatever we’re showing stays focused and delivers results. We don’t need to show 10 ways to skin the deer, but we absolutely have to show exactly 1 way to do the common task that we are teaching.

A common example of what I’m talking about is Clojure web development. A huge, huge disadvantage of Clojure in that area (which is otherwise well covered in the Clojure ecosystem) is that the prospective Clojurist is required to make too many choices too early. There is no resource that teaches the concepts using one way. You have to patch together 100 brief resources by yourself. At each step, you’re presented with 10 choices, and you have to pick one. That means you already have to be an experienced web developer to be able to find your way here.

The same goes for data science/data analysis/machine learning etc. You have to already know the R way or the Python way, then there is a 0.001% chance you come to Clojure because you like it specifically, then you start searching for how to do task X, and of course you’ll look for tools that look exactly like the ones in R. By then, your Clojure experience will be on par with R at best, or totally spartan in most cases. To attract even a small percentage of the R or Python community with that approach, you’d have to match their offering, bugs and bad solutions and everything…

On the topic of data frames specifically: is there a way to accomplish the same set of tasks in an easier and more powerful way, with something more elegant such as DataScript and similar approaches? (I don’t know, I’m asking.)
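
For reference, a flights-per-carrier count as a Datalog query over DataScript might look roughly like this (attribute names are made up, and I’m not claiming this is where the elegance would come from):

(require '[datascript.core :as d])

(def conn (d/create-conn {}))

(d/transact! conn [{:flight/carrier "AA" :flight/num 100}
                   {:flight/carrier "AA" :flight/num 101}
                   {:flight/carrier "DL" :flight/num 200}])

;; flights per carrier
(d/q '[:find ?carrier (count ?e)
       :where [?e :flight/carrier ?carrier]]
     @conn)
;; => one [carrier flight-count] tuple per carrier, e.g. ["AA" 2]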

And I think we also miss a huge number of people who are not looking to become data scientists but are programmers interested in working with data. There are millions of Java programmers, or .NET programmers, who would consider the JVM experience over RStudio or Python notebooks, if only:

  • There is straightforward code that they could call (they don’t know R, and don’t know what a data frame is yet)
  • There is a good, detailed book (or a video course or whatever people use these days) that will teach them data analysis etc. without digressing into what a transducer is or other Clojure implementation details.
6 Likes

@draganrocks @stathissideris

I agree with @draganrocks; I made a mistake by talking specifically about a “dataframe” when I should really have said “dataset abstraction” (one could even try to build an abstraction that makes it possible to work with both flat tabular data and recursive data structures using the same set of functions).
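
Just to illustrate that parenthetical (everything below is made up), one set of functions could accept either a key or a path:

(defn column
  "Return the values at key (or nested path) k across rows."
  [rows k]
  (let [path (if (vector? k) k [k])]
    (map #(get-in % path) rows)))

(def people [{:name "Ada"  :address {:city "London"}}
             {:name "Rich" :address {:city "NYC"}}])

(column people :name)            ;; flat      => ("Ada" "Rich")
(column people [:address :city]) ;; recursive => ("London" "NYC")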

As I said, if we end up building something, or just collecting and gluing together what’s already there, the next step is exactly that: start writing tons of examples, tutorials and so on.

Clojure might well keep being something one grows into, but for most of these people Clojure’s syntax will make them feel as if they are starting from scratch (direct personal experience; I saw it as well when teaching the basics to coworkers).

1 Like

I have not read the books below. I don’t know what they cover or how well they do it, but I remembered seeing them, and thought they would be worth mentioning in this thread.

https://www.amazon.com/Clojure-Data-Analysis-Cookbook-Rochester-ebook/dp/B00BECVV9C/ref=sr_1_5?ie=UTF8&qid=1548339186&sr=8-5&keywords=clojure+data+science
https://www.amazon.com/Clojure-Machine-Learning-Akhil-Wali-ebook/dp/B00JXLF78M/ref=sr_1_3?ie=UTF8&qid=1548339186&sr=8-3&keywords=clojure+data+science

All of them are somewhat dated: the first one is a classic cookbook, the second one uses clatrix, which is abandonware (at least as far as I know), and the third one is based mostly on Incanter and Spark (@henrygarner is in this thread, by the way).

All 3 came out before transducers, core.async and spec if I’m not mistaken.

2 Likes

I don’t think the main problem is that they’re dated, but that by now many readers consider self-publishing a better brand than Packt. Of course, maybe some of these are good (I might give the benefit of the doubt to Clojure for Data Science), but overall the competition in this space is fierce, so just average is not good enough.

And, judging by skimming these books, I still think they wrangle with Clojure specifics much more than they teach concepts. The reader is still left with the job of figuring out how to apply these concepts even to something like introductory Kaggle examples. I myself don’t have a problem with this; I am talking specifically about beginners (since that was the topic).

2 Likes

Tutorials are so important!

I hope to share some experience about trying to teach people, and what can go wrong.

If we try to have a work plan for a year, then maybe it would be wise to wait a little bit before putting effort into detailed books and tutorials.
Hopefully, after 3-4 months we will have a better notion of agreed practices and APIs for typical problems, try to expose them to newcomers, and learn a little bit from their reactions. :slight_smile:

By the way, congrats, @draganrocks!

Thanks, @daslu


1 Like

OK, so one idea could be to set up a repo on GitHub or GitLab where we could start translating Kaggle kernels into Clojure.

We might choose 10-15 of them, then people would take one or more, translate them, and share them on the repo. Afterwards, everyone else would review the implementations and try to converge on common libraries/solutions/APIs.

What do you think about this?

4 Likes

That would be a great way to direct ourselves towards writing good solutions.

Thanks @mars0i for the Clojure for Data Science book shout-out!

As @alanmarazzi says, it’s now unfortunately somewhat outdated as far as the technology is concerned.
But this was anticipated, and as a result I tried to focus primarily on teaching fundamental concepts, with library support for non-trivial implementations very much second. I hope the book still has a place for anyone who seeks to learn the absolute basics (which is to say statistical inference, regression, and various canonical machine learning models for classification, recommendation, time series and graph analysis). The intended audience was intermediate to advanced Clojurians, so I tried not to get too bogged down in the language itself, but inevitably some examples ended up more unwieldy than I would have liked and required more technical explanation than I wanted. I understand anyone who makes this criticism. (And yes, I think Packt’s brand is actively harmful.)

The most glaring library omission I recall encountering whilst writing (one that was hard to gloss over or write myself) was not so much to do with dataframes as with visualisation. I mostly used Incanter, since I became aware of thi.ng too late to include it in the book (and in any case what I really wanted was a high-level API closer to what Vega-Lite provides). Exploratory visualisation is still the primary reason I keep returning to R.

Where I miss dataframe-type behaviours in Clojure, it’s mainly also to do with visualisation. For example, a dataframe with semantically ordered factors for categorical data can retain that same ordering when plotted in a variety of different ways. I recently started on a Clojure library to plug this gap (with Vega), but I currently have a 4-month-old daughter, so it’s still very much in the REPL / hammock stage.
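
For the curious, the kind of behaviour I mean looks roughly like this as a plain EDN Vega-Lite spec (field names invented): the :sort vector plays the role of the dataframe’s ordered factor, and any Vega-Lite renderer (Oz, for example) could display it.

(def severity-order ["low" "medium" "high"])

(def spec
  {:data {:values [{:severity "high"   :n 12}
                   {:severity "low"    :n 40}
                   {:severity "medium" :n 25}]}
   :mark "bar"
   :encoding {:x {:field "severity" :type "ordinal" :sort severity-order}
              :y {:field "n" :type "quantitative"}}})

;; e.g. (oz.core/view! spec) with the Oz library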

5 Likes

Yes!! That’s why thi.ng/geom is so great. It is self-contained, with no other JS dependencies… And it’s written in cljc. And it’s very fast.

1 Like

Well, from my limited knowledge of data frames, I believe it’s simply an in-memory table format, similar to columnar stores or row-based stores. I’m not sure if there are data frames of both orientations.

So effectively, you can think of it as a vector of vectors of equal length, with some meta about each index into the vectors as well as about the overall vector.
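
One rough way to sketch that idea (column names invented):

(def df
  (with-meta
    [["AA" "AA" "DL"]   ;; carrier column
     [100  101  200]]   ;; flight-number column
    {:column-names [:carrier :fl-num]}))

;; a "row" is reassembled across the columns
(defn row [df i]
  (zipmap (:column-names (meta df))
          (map #(nth % i) df)))

(row df 1) ;; => {:carrier "AA", :fl-num 101}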

This would have the advantage of saving space, since you don’t need to repeat the keys over and over.

But also, as an abstraction, it can allow more performance, for example by combining the structure with indexes.

Also, if going with, say, Apache Arrow, I think they can even memory-map the columns, which can greatly optimize aggregation performance even in-memory.

In the end though, I feel it is logical to want a highly efficient in-memory tabular data structure, and maybe even to provide a full query and transformation DSL over it. Even though you can easily compose Clojure primitives to mimic one, like a map of vectors or a vector of maps, and use composable Clojure functions as your query and transformation DSL, I’m not sure you can get the same level of optimisation and standardisation that a full-on in-memory tabular data structure could give.

2 Likes

Thanks for explaining, and good point about everything else, but just one clarification: A collection of maps with keyword keys in Clojure does not take extra space because of the “repetition” of the keys, because keywords are interned. This means that in:

[{:a 10} {:a 20}]

:a refers to the same object in memory, and takes no extra space whether it appears once or one million times.

1 Like

Hello everyone. I’m not at all into data science/machine learning (not even interested), but the proposal of @alanmarazzi is interesting. It reminds me of the “working groups” of the Rust community, each focused on a specific problem in the Rust ecosystem (whether it’s the language, documentation, onboarding, or Rust for a specific domain). Here it’s clearly about Clojure for a specific domain, so starting to work on something like this sounds like a nice idea. It would be awesome if the Clojure community could organize like this.

Keep up the discussion, it’s very interesting.

2 Likes

While it’s true that the overhead of an interned keyword is nothing like the overhead of a string of the same length as the keyword’s name, it’s incorrect to say there’s no difference in overhead between different in-memory formats. Here’s the memory use for a single two-element row encoded as a map, a vector, and an array:

(mm/measure [{:a 1 :b 1}])         => "544 B"
(mm/measure [[1 1]])               => "336 B"
(mm/measure (to-array-2d [[1 1]])) => "72 B"

This is obviously not a big deal for small data tables – and I default to vectors of maps myself – but if one is dealing with a very large collection it can add up.

Another option for representing rows would be records.
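
A quick hypothetical sketch of that: a record gives map-like access while storing the declared fields as plain object fields, so the per-row key references go away.

(defrecord FlightRow [carrier fl-num])

(def rows (mapv ->FlightRow ["AA" "AA" "DL"] [100 101 200]))

(:carrier (first rows)) ;; => "AA"
;; (mm/measure rows) would show the difference at a realistic row count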

I think you missed my point! Of course a single “row” will be more expensive as a map with keyword keys; the real benefit shows up with multiple rows:

» clj -J-Djdk.attach.allowAttachSelf -Sdeps '{:deps {com.clojure-goes-fast/clj-memory-meter {:mvn/version "0.1.2"}}}' -O:Djdk.attach.allowAttachSelf

Clojure 1.10.0
user=> (require '[clj-memory-meter.core :as mm])
nil
user=> (mm/measure (vec (repeat 1000 {:a 1 :b 1})))
"5.6 KB"
user=> (mm/measure (vec (repeat 1000 [1 1])))
"5.6 KB"
1 Like