Which is to say that your benchmark is only showing you the memory use for a vector containing 1000 references to the same object, which means basically 1000 pointers, all the same size no matter the type of the object they point to.
Again, you’re right that the first use of a symbol results in it being interned (cost on my machine: 104 bytes), after which all other references are just indices into a table (cost: 4 bytes), but that doesn’t change the fact that every map has twice as many objects in it as the equivalent vector, and thus costs (* 4 number-of-fields) bytes more to store.
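To make the size argument concrete, here is a hypothetical 3-row dataset in both shapes (the field names are made up): every map repeats a reference to every key, while the vector form stores the column names exactly once and you can recover maps on demand with zipmap.

```clojure
;; Sketch: the same tiny dataset as vector-of-maps vs vector-of-vectors.
(def as-maps
  [{:name "ada" :age 36}
   {:name "bob" :age 42}
   {:name "eve" :age 29}])

;; column names stored once, out of band
(def columns [:name :age])
(def as-vectors
  [["ada" 36] ["bob" 42] ["eve" 29]])

;; project back to maps only when convenient:
(map #(zipmap columns %) as-vectors)
;; => ({:name "ada", :age 36} {:name "bob", :age 42} {:name "eve", :age 29})
```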
Your use of the hash-map constructor forces Clojure to use clojure.lang.PersistentHashMap, but with the literal and with zipmap it uses clojure.lang.PersistentArrayMap, which is much more compact and is used for maps of up to 8 keys. So I guess you’ll start to see differences in size once your dataset has more than 8 columns.
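This is easy to check at the REPL (the threshold of 8 entries is an implementation detail of current Clojure, not a guaranteed part of the contract):

```clojure
;; Which concrete map type does Clojure pick?
(type (hash-map :a 1 :b 2))    ;; => clojure.lang.PersistentHashMap
(type {:a 1 :b 2})             ;; => clojure.lang.PersistentArrayMap
(type (zipmap [:a :b] [1 2]))  ;; => clojure.lang.PersistentArrayMap

;; past 8 entries, array maps promote to hash maps anyway:
(type (into {} (map vector (range 9) (range 9))))
;; => clojure.lang.PersistentHashMap
```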
I prefer plain Clojure hash-maps for implementing a model similar to a DataFrame, which, given its characteristics, is more like a super Excel with specs and s-expressions.
Manipulating hash-maps in Clojure is very convenient, and hash-index lookup performance is also very strong.
A DataFrame is similar to a DB table: it doesn’t have to preserve order, so it doesn’t need to use a vector; just sort when you need it.
Again, my preference is to start with a vector of maps as the representation for smaller datasets, but if we’re building a library for “data science” it should probably offer an easy way to swap in a more compact table format. I would personally skip over vector-of-vectors and go directly to something array-based, probably via a library (Apache Arrow?).
I agree with your opinion. I’m a beginner: a newbie programmer and newbie Clojurist who wants to have a play at data science. Getting beginners on board is very important. I’ve realized that some things require so many basic concepts to be understood before you can understand the next concept, and just setting up a workable environment is hard; all of that cost me a lot of time. The feedback loop of learning is so long that it drains my passion. Don’t make things needlessly hard and drive away the beginners who might grow up on this. Honestly.
Yeah, it’s definitely more expensive (and abruptly so!)
About the more compact table format: Sean Corfield mentioned that in the next version of clojure.java.jdbc he does not convert to Clojure data structures at all, but instead he implemented a reducible sequence over the ResultSet instance. That seems like a good way to keep the convenience of working with the regular core functions when wrapping external libs. If they’re all reducible, it won’t matter what the underlying implementation is (and we can probably swap them transparently).
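That rewrite eventually shipped as next.jdbc. For anyone curious, a minimal sketch of the reducible style, assuming next.jdbc’s plan API; the table and column names here are hypothetical:

```clojure
;; Sketch: reducing directly over the ResultSet with next.jdbc's plan,
;; without realizing intermediate Clojure maps. Table/columns are made up.
(require '[next.jdbc :as jdbc])

(def ds (jdbc/get-datasource {:dbtype "h2:mem" :dbname "example"}))

;; jdbc/plan returns a reducible over the ResultSet; rows support
;; keyword access during the reduction only.
(transduce (map :cost) + 0
           (jdbc/plan ds ["select cost from items"]))
```

Because plan just hands you something reducible, the core transducer machinery works unchanged, which is exactly the “swap implementations transparently” property mentioned above.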
That’s why I have added some samples with visualisations to my GitHub repo: I was not able to find good ones online. Maybe I didn’t look hard enough.
I am a Clojure beginner, so they may not be the best ones, but at least people visiting will get the idea that Clojure can be used for data analysis. I have used Incanter.
I also wanted to use @draganrocks 's Neanderthal project for a deep-learning tutorial, but my Mac has an AMD GPU, so there’s no CUDA support. Waiting for OpenCL support for now.
Going to add more in the upcoming months. Need to get a job first, though.
I think your text fills the niche @draganrocks mentioned. It kind of bites that your text is often overlooked, though. I found the format, layout, and coverage of topics excellent. The explanations of theory, made manifest in Clojure source (for pedagogical value) and then followed by directed use of existing solutions, made for a great intro and/or refresher. The only weak criticism to date is the lack of contemporary libraries. At this point, a 2nd edition would likely be helpful (porting examples to updated libraries, some of which you’d mentioned), but I’m uncertain of that happening since you’re raising a new child.
For me, dataframes / datasets, etc. really provide a convenience layer for tabular operations, and a base layer of optimization to keep loading and transforms relatively efficient - particularly for analysts vs. programmers. I haven’t run into the plotting problem - I typically treat the layout and ordering of visual elements as a separate problem, akin to styling.
@draganrocks hit on an interesting point: no one’s really argued for the necessity of dataframes per se. Rather, they are a manifestation of the popularity (stemming from R, then Python) of the properties of convenience and performance, manifesting as singular “solutions” in the dominant data science platforms. Since R + Python = data science (according to the conventional wisdom and mass tutorials and courseware), if you don’t have a familiar dataframe thing, you have no familiarity, and thus no credibility. Sequences of maps (“the People’s choice”) won’t cut the mustard either, particularly when packed representations are needed for storage efficiency and possibly speed. That was one of the early value propositions of Incanter: providing an interface close to base R (without a lot of weirdness) that you could loosely migrate to (excluding R library dependencies). New but not necessarily alien.
I definitely think there’s value in providing access to Datomic / DataScript via a dataframe facade.
The memory use gets nastier if your data includes strings. Memory use can blow up if you’re creating many equal strings (typically by naively reading delimited lines), when you could be sharing references via pooling.
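A minimal sketch of such a pool (the helper here is hypothetical, not a library API): each distinct value is stored once, and subsequent equal strings are swapped for the canonical reference.

```clojure
;; Sketch: a tiny string pool. Naively splitting delimited lines creates a
;; fresh String per field; pooling shares one reference per distinct value.
(require '[clojure.string :as str])

(defn make-pool []
  (let [pool (java.util.HashMap.)]
    (fn [^String s]
      (or (.get pool s)            ; reuse the canonical copy if seen before
          (do (.put pool s s) s))))) ; otherwise this copy becomes canonical

(def rows
  (let [intern! (make-pool)]
    (mapv #(mapv intern! (str/split % #","))
          ["a,b" "a,c" "a,b"])))

;; every "a" is now the same object, not three equal copies:
(identical? (get-in rows [0 0]) (get-in rows [2 0]))  ;; => true
```

A production version would want a concurrent map and some eviction policy, but the sharing idea is the same.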
I’m still trying to wrap my head around arrow, even after looking at the (current) java bindings.
I’m just a CS guy, not a data scientist, but I feel Map of Vectors would be my ideal compromise if I had to use pure Clojure structures for in-memory tabular aggregations.
I just noticed that no one had mentioned map-of-vectors, and was wondering why not?
Generally, yes. The Incanter dataset is actually the core.matrix dataset. It also provides a seq-of-maps abstraction via row-maps (which are flyweight row objects that index into the backing column store). My only gripe with the core.matrix implementation is the semantics of the row implementation: it (currently, perhaps forever…) expects a row not to have associative semantics (e.g. being able to project from a field/column name onto the value at that row, via get or function application). It’s not a huge deal breaker, but it differs from the semantics the original Incanter dataset offered (and other typical seq|vector-of-maps implementations).
I don’t get the negativity on incanter either. I’ve gotten use out of it, although in trying to re-design some stuff (like decoupling the plotting API from jfree chart, dealing with macros and multimethods…) I’ve found some spots that definitely could use a refresh. Still, the bones aren’t bad IMO. 1.9.3 is somewhat of a resuscitation; I was on 1.52 for years due to the incompatibilities and bugs introduced from the core.matrix swap-out, which somehow jumped incanter to 1.9x like overnight. Last year, I started migrating and pushing for unification, to get 1.52 divergent branch merged with 1.92 (along with fixing a bunch of long-standing minor issues). We’re at the point now where 1.93 and beyond are available for simpler contributions, the main landing page / repo is actually the current release, etc.
@didbus
I feel Map of Vectors would be my ideal compromise.
That’s what I’ve used for my table implementation for years. It ended up mapping pretty closely to the eventual core.matrix dataset api (to the point where extending the relevant protocols was pretty straightforward). The primary ambiguity is what’s the row representation end up looking like. In the naive seq|vector-of-maps implementation(s), they’re already maps. With a column store, you have to project the row onto something. I opt for lazy or flyweight associative maps. core.matrix has a flyweight Row type that serves as a sequence-of-values rather than a mapping of column-name->value. Other libraries (like jtablesaw) provide an iterator, with explicit primitive conversions that you’re responsible for calling with the appropriate field name. You’d face a similar issue going with a row-store and providing a column abstraction.
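To illustrate the projection problem, here’s a minimal sketch of pulling a row out of a map-of-vectors column store as an ordinary associative map. The names are made up; a true flyweight would implement clojure.lang.ILookup over the columns rather than copying values out, but the semantics are the same.

```clojure
;; Sketch: project row i of a column store (map of column -> vector)
;; onto an associative map, preserving get/keyword-lookup semantics.
(defn row-at [table i]
  (reduce-kv (fn [m col xs] (assoc m col (nth xs i))) {} table))

(def table {:name ["ada" "bob"] :age [36 42]})

(row-at table 1)         ;; => {:name "bob", :age 42}
(:age (row-at table 0))  ;; => 36
```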
Hello! Chiming in (as a beginner/intermediate) with two things that are really simple with Clojure but weirder with Python:
Dependency handling. In Clojure, I can declare a dependency on a library and use it. No hassle. No breaking globally installed things. In Python, I’ve given up: I installed Anaconda globally and try to avoid depending on anything else.
Server programming. Spinning up a server that listens on a port on a different thread is a breeze. With Oz, for example, you don’t just view an image; you preview the actual dashboard where you could place your application’s results.
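For reference, the Oz workflow is roughly this (assuming Oz’s oz.core/view! API; the plot data is made up). One call starts a background web server and pushes the spec to the browser:

```clojure
;; Sketch: serving a Vega-Lite spec with Oz. view! spins up a local
;; server on a background thread and opens/refreshes the browser view.
(require '[oz.core :as oz])

(def line-plot
  {:data {:values [{:x 1 :y 2} {:x 2 :y 3} {:x 3 :y 5}]}
   :mark "line"
   :encoding {:x {:field "x" :type "quantitative"}
              :y {:field "y" :type "quantitative"}}})

(oz/view! line-plot)
```

Since the view is a live page rather than a static image, the same mechanism scales up from a single plot to a whole dashboard.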
Thanks for putting this together! I work on Nextjournal, a hosted notebook platform that supports Clojure (including MXNet), Julia, R and Python. I’m passionate about bringing the benefits of functional programming to a larger audience by applying immutability to data and environments.
We want to enable people to share their work in a way that others can easily play with it. We also support GraalVM on the platform (see my notebook creating the environment) and are excited about its potential benefits.