Which is to say that your benchmark is only showing you the memory use for a vector containing 1000 references to the same object, which means basically 1000 pointers, all the same size no matter the type of the object they point to.
Again, you’re right that the first use of a symbol results in it being interned (cost on my machine: 104 bytes), after which all other references are just indices into a table (cost: 4 bytes), but that doesn’t change the fact that every map has twice as many objects in it as the equivalent vector, and thus costs (* 4 number-of-fields) bytes more to store.
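To make the size argument concrete, here is a hypothetical 3-row dataset in both shapes (the field names are made up): every map repeats a reference to every key, while the vector form stores the column names exactly once and you can recover maps on demand with zipmap.

```clojure
;; Sketch: the same tiny dataset as vector-of-maps vs vector-of-vectors.
(def as-maps
  [{:name "ada" :age 36}
   {:name "bob" :age 42}
   {:name "eve" :age 29}])

;; column names stored once, out of band
(def columns [:name :age])
(def as-vectors
  [["ada" 36] ["bob" 42] ["eve" 29]])

;; project back to maps only when convenient:
(map #(zipmap columns %) as-vectors)
;; => ({:name "ada", :age 36} {:name "bob", :age 42} {:name "eve", :age 29})
```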
Your use of the hash-map constructor forces Clojure to use clojure.lang.PersistentHashMap, but with the literal and with zipmap it uses clojure.lang.PersistentArrayMap, which is much more compact and is used for maps of up to 8 keys. So I guess you’ll start to see differences in size once your dataset has more than 8 columns.
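This is easy to check at the REPL (the threshold of 8 entries is an implementation detail of current Clojure, not a guaranteed part of the contract):

```clojure
;; Which concrete map type does Clojure pick?
(type (hash-map :a 1 :b 2))    ;; => clojure.lang.PersistentHashMap
(type {:a 1 :b 2})             ;; => clojure.lang.PersistentArrayMap
(type (zipmap [:a :b] [1 2]))  ;; => clojure.lang.PersistentArrayMap

;; past 8 entries, array maps promote to hash maps anyway:
(type (into {} (map vector (range 9) (range 9))))
;; => clojure.lang.PersistentHashMap
```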
I prefer plain Clojure hash-maps for implementing a model similar to a DataFrame, which, given its characteristics, is more like a super Excel with specs and s-expressions.
Manipulating hash-maps in Clojure is very convenient, and hash-index lookup performance is also very strong.
A DataFrame is similar to a DB table: it doesn’t have to preserve order, so it doesn’t need to use a vector; just sort when you need it.
Again, my preference is to start with a vector of maps as the representation for smaller datasets, but if we’re building a library for “data science” it should probably offer an easy way to swap in a more compact table format. I would personally skip over vector-of-vectors and go directly to something array-based, probably via a library (Apache Arrow?).
I agree with your opinion. I’m a beginner: a newbie programmer and newbie Clojurist who wants to have a play at data science. Getting beginners on board is very important. I’ve realized that some things require so many basic concepts to be understood before you can understand the next concept, and just setting up a workable environment is hard; all of that cost me a lot of time. The feedback loop of learning is so long that it drains my passion. Don’t make things needlessly hard and drive away the beginners who might grow up on this. Honestly.
Yeah, it’s definitely more expensive (and abruptly so!)
About the more compact table format: Sean Corfield mentioned that in the next version of clojure.java.jdbc he does not convert to Clojure data structures at all, but instead he implemented a reducible sequence over the ResultSet instance. That seems like a good way to keep the convenience of working with the regular core functions when wrapping external libs. If they’re all reducible, it won’t matter what the underlying implementation is (and we can probably swap them transparently).
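That rewrite eventually shipped as next.jdbc. For anyone curious, a minimal sketch of the reducible style, assuming next.jdbc’s plan API; the table and column names here are hypothetical:

```clojure
;; Sketch: reducing directly over the ResultSet with next.jdbc's plan,
;; without realizing intermediate Clojure maps. Table/columns are made up.
(require '[next.jdbc :as jdbc])

(def ds (jdbc/get-datasource {:dbtype "h2:mem" :dbname "example"}))

;; jdbc/plan returns a reducible over the ResultSet; rows support
;; keyword access during the reduction only.
(transduce (map :cost) + 0
           (jdbc/plan ds ["select cost from items"]))
```

Because plan just hands you something reducible, the core transducer machinery works unchanged, which is exactly the “swap implementations transparently” property mentioned above.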
That’s why I have added some samples with visualisations to my GitHub repo: I was not able to find good ones online. Maybe I didn’t look hard enough.
I am a Clojure beginner, so they may not be the best ones, but at least people visiting will get the idea that Clojure can be used for data analysis. I have used Incanter.
I also wanted to use @draganrocks 's Neanderthal project for a deep-learning tutorial, but my Mac has an AMD GPU, so there’s no CUDA support. Waiting for OpenCL support for now.
Going to add more in the upcoming months. Need to get a job first, though.
I think your text fills the niche @draganrocks mentioned. It kind of bites that your text is often overlooked, though. I found the format, layout, and coverage of topics excellent. The explanations of theory, made manifest in Clojure source (for pedagogical value) and then followed by directed use of existing solutions, made for a great intro and/or refresher. The only weak criticism to date is the lack of contemporary libraries. At this point, a 2nd edition would likely be helpful (porting examples to updated libraries, some of which you’d mentioned), but I’m uncertain of that happening since you’re raising a new child.
For me, dataframes / datasets, etc. really provide a convenience layer for tabular operations, and a base layer of optimization to keep loading and transforms relatively efficient - particularly for analysts vs. programmers. I haven’t run into the plotting problem - I typically treat the layout and ordering of visual elements as a separate problem, akin to styling.
@draganrocks hit on an interesting point: no one’s really argued for the necessity of dataframes per se. Rather, they are a manifestation of the popularity (stemming from R, then Python) of the properties of convenience and performance, manifesting as singular “solutions” in the dominant data science platforms. Since R + Python = data science (according to the conventional wisdom and mass tutorials and courseware), if you don’t have a familiar dataframe thing, you have no familiarity, and thus no credibility. Sequences of maps (“the People’s choice”) won’t cut the mustard either, particularly when packed representations are needed for storage efficiency and possibly speed. That was one of the early value propositions of Incanter: providing an interface close to base R (without a lot of weirdness) that you could loosely migrate to (excluding R library dependencies). New but not necessarily alien.
I definitely think there’s value in providing access to Datomic / DataScript via a dataframe facade.
The memory use gets nastier if your data includes strings. Memory use can blow up if you’re creating many equal strings (typically by naively reading delimited lines), when you could be sharing references via pooling.
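A minimal sketch of such a pool (the helper here is hypothetical, not a library API): each distinct value is stored once, and subsequent equal strings are swapped for the canonical reference.

```clojure
;; Sketch: a tiny string pool. Naively splitting delimited lines creates a
;; fresh String per field; pooling shares one reference per distinct value.
(require '[clojure.string :as str])

(defn make-pool []
  (let [pool (java.util.HashMap.)]
    (fn [^String s]
      (or (.get pool s)            ; reuse the canonical copy if seen before
          (do (.put pool s s) s))))) ; otherwise this copy becomes canonical

(def rows
  (let [intern! (make-pool)]
    (mapv #(mapv intern! (str/split % #","))
          ["a,b" "a,c" "a,b"])))

;; every "a" is now the same object, not three equal copies:
(identical? (get-in rows [0 0]) (get-in rows [2 0]))  ;; => true
```

A production version would want a concurrent map and some eviction policy, but the sharing idea is the same.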
I’m still trying to wrap my head around arrow, even after looking at the (current) java bindings.
I’m just a CS guy, not a data scientist, but I feel Map of Vectors would be my ideal compromise if I had to use pure Clojure structures for in-memory tabular aggregations.
I just noticed that no one had mentioned map-of-vectors, and was wondering why not?
Generally, yes. The Incanter dataset is actually the core.matrix dataset. It also provides a seq-of-maps abstraction via row-maps (which are flyweight row objects that index into the backing column store). My only gripe with the core.matrix implementation is the semantics of the row implementation: it (currently, perhaps forever…) expects a row not to have associative semantics (e.g. being able to project from a field/column name onto the value at that row, via get or function application). It’s not a huge deal breaker, but it differs from the semantics the original Incanter dataset offered (and other typical seq|vector-of-maps implementations).
I don’t get the negativity on incanter either. I’ve gotten use out of it, although in trying to re-design some stuff (like decoupling the plotting API from jfree chart, dealing with macros and multimethods…) I’ve found some spots that definitely could use a refresh. Still, the bones aren’t bad IMO. 1.9.3 is somewhat of a resuscitation; I was on 1.52 for years due to the incompatibilities and bugs introduced from the core.matrix swap-out, which somehow jumped incanter to 1.9x like overnight. Last year, I started migrating and pushing for unification, to get 1.52 divergent branch merged with 1.92 (along with fixing a bunch of long-standing minor issues). We’re at the point now where 1.93 and beyond are available for simpler contributions, the main landing page / repo is actually the current release, etc.
@didbus
I feel Map of Vectors would be my ideal compromise.
That’s what I’ve used for my table implementation for years. It ended up mapping pretty closely to the eventual core.matrix dataset api (to the point where extending the relevant protocols was pretty straightforward). The primary ambiguity is what’s the row representation end up looking like. In the naive seq|vector-of-maps implementation(s), they’re already maps. With a column store, you have to project the row onto something. I opt for lazy or flyweight associative maps. core.matrix has a flyweight Row type that serves as a sequence-of-values rather than a mapping of column-name->value. Other libraries (like jtablesaw) provide an iterator, with explicit primitive conversions that you’re responsible for calling with the appropriate field name. You’d face a similar issue going with a row-store and providing a column abstraction.
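To illustrate the projection problem, here’s a minimal sketch of pulling a row out of a map-of-vectors column store as an ordinary associative map. The names are made up; a true flyweight would implement clojure.lang.ILookup over the columns rather than copying values out, but the semantics are the same.

```clojure
;; Sketch: project row i of a column store (map of column -> vector)
;; onto an associative map, preserving get/keyword-lookup semantics.
(defn row-at [table i]
  (reduce-kv (fn [m col xs] (assoc m col (nth xs i))) {} table))

(def table {:name ["ada" "bob"] :age [36 42]})

(row-at table 1)         ;; => {:name "bob", :age 42}
(:age (row-at table 0))  ;; => 36
```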
Hello! Chiming in (as a beginner/intermediate) with two things that are really simple with Clojure but weirder with Python:
Dependency handling. In Clojure, I can declare a dependency on a library and use it. No hassle. No breaking globally installed things. In Python, I’ve given up: I installed Anaconda globally and try to avoid depending on anything else.
Server programming. Spinning up a server that listens on a port on a different thread is a breeze. With Oz, for example, you don’t just view an image; you preview the actual dashboard where you could place your application’s results.
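For reference, the Oz workflow is roughly this (assuming Oz’s oz.core/view! API; the plot data is made up). One call starts a background web server and pushes the spec to the browser:

```clojure
;; Sketch: serving a Vega-Lite spec with Oz. view! spins up a local
;; server on a background thread and opens/refreshes the browser view.
(require '[oz.core :as oz])

(def line-plot
  {:data {:values [{:x 1 :y 2} {:x 2 :y 3} {:x 3 :y 5}]}
   :mark "line"
   :encoding {:x {:field "x" :type "quantitative"}
              :y {:field "y" :type "quantitative"}}})

(oz/view! line-plot)
```

Since the view is a live page rather than a static image, the same mechanism scales up from a single plot to a whole dashboard.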
Thanks for putting this together! I work on Nextjournal, a hosted notebook platform that supports Clojure (including MXNet), Julia, R and Python. I’m passionate about bringing the benefits of functional programming to a larger audience by applying immutability to data and environments.
We want to enable people to share their work in a way that others can easily play with it. We also support GraalVM on the platform (see my notebook creating the environment) and are excited about its potential benefits.