Online meeting: clojure data science

On the subject of bridges and better python support, we should note that the javacpp-presets package has a lot of bridges.

  • mxnet alternative
  • tensorflow
  • cpython
  • tensorrt
  • mkl-dnn
  • cuda, cudnn

Note that saudet is also a regular contributor to nd4j.

As an aside, I wasn’t able to link to that, nor to our library showing how to use a javacpp library, nor to our post explaining the details in this area. My apologies; at least check out the javacpp-presets GitHub project and you will see. I listed maybe 10% of the libraries.


Also, for a nice background with lispy examples, please see the free book “An Introduction to Probabilistic Programming” https://arxiv.org/pdf/1809.10756.pdf written by Frank Wood and team.


If anyone’s interested in discussing agent-based/individual-based modeling with Clojure, it probably doesn’t belong in a data science thread, but please contact me for a side conversation. I’ve been involved with agent-based modeling for a while, and since Clojure is one of my favorite languages, I like doing ABM work in it when that makes sense.

I’m in Operations Research, primarily focusing on discrete event simulation and various forms of optimization (oftentimes via mathematical programming). Simulating complex processes and plans is the bulk of my work though. Distributed simulation is a current topic of interest. Basic stats and analysis also pop up regularly.

[general data science topics]

Over the last ~7 years or so, I’ve written gobs of stuff for internal use, although large chunks are publicly available in a monolith. My primary intersection with “data science” has been via Incanter, of which I maintain a fork, trying to revamp the extant design (case in point, plotting functions) and porting internally developed fixes and extensions. There seems to be little love for Incanter on this thread so far :slight_smile:

Regarding dataframe-ish stuff, I built a typed columnar table years ago, with a little SQL-like edsl ported from Practical Common Lisp. After core.matrix came out with its dataset implementation, I extended the relevant protocols to spork.util.table to enable use in incanter. I haven’t really needed or missed any dataframe stuff as a consequence. The table implementation provides typed schemas for all the fields, as well as efficient mutable construction, string canonicalization (this is pretty huge in practice for me), and some other goodies.
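To illustrate why string canonicalization matters for a columnar table, here is a minimal sketch in plain Clojure. It is not the spork.util.table implementation; `canonicalizer` and `canonical-column` are hypothetical names made up for this example. The idea is simply that equal strings in a column end up sharing one object instead of each row holding its own copy.

```clojure
(defn canonicalizer
  "Returns a function that maps every string to one shared, canonical
  instance. Hypothetical helper for illustration only."
  []
  (let [pool (java.util.HashMap.)]
    (fn [^String s]
      (or (.get pool s)            ; already seen: reuse the pooled instance
          (do (.put pool s s)      ; first sighting: remember this instance
              s)))))

(defn canonical-column
  "Builds a column (vector) in which equal strings share a single object."
  [strings]
  (mapv (canonicalizer) strings))
```

With a low-cardinality column (say, a few hundred distinct codes across millions of rows), this keeps only one copy of each distinct string alive, which is where the memory savings come from.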

I’ve only recently begun pushing on the applied ML front; I’m definitely interested in the opinions of practitioners here.

I’ve messed with Vega (via Oz and a fork I wrote to use a JavaFX WebView as the canvas). I think the grammar of graphics approach is powerful, but also a bit static. You can encode all sorts of higher-minded things, which are then compiled into something Vega understands to create slick graphics, but the lack of low-level access to the actual plotting and rendering (via the resulting scene) is, IMO, a bit of a downside. I’ve been looking at a solution that affords the data-driven specification of Vega, but allows better control of the resultant product (via something like Processing/Quil or another renderer). In my ideal world, I should be allowed to mess with the plot however I want; Vega/ggplot would provide a nice porcelain layer to get up and running fast.


Beautiful.

So many wonderful surprises here that I guess many of us were not aware of.

We will soon create a separate clojureverse topic to discuss meta questions such as the meeting format, timing, better platforms for discussion, etc.

I think this would be really good. Clojure would be an ideal language for data science. I have a role between data scientist and analyst and I struggle with the following points in Clojure.

Even though we prefer sequences of maps, we desperately need a dead simple replacement for Python’s pandas library (and a concept for time series). For most beginners, the struggle is to read data, manipulate it and dump it. Moreover, I think the best introduction would be with cljs, thanks to the connection with interactive visualizations and all the libraries available in JS. Just having a data frame structure would allow us to attract many data analysts using R and Python (we need more for ML, but that would be a great start).
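For readers coming from pandas, this is roughly what the "sequence of maps" style looks like for a group-and-aggregate step, using only clojure.core. It is a minimal sketch: the `readings` data is made up for illustration and `mean-by` is a hypothetical helper, not a function from any library mentioned in this thread.

```clojure
;; Made-up example rows; in practice these would come from a CSV or a database.
(def readings
  [{:city "Oslo" :temp 3.0}
   {:city "Oslo" :temp 5.0}
   {:city "Lima" :temp 19.0}])

(defn mean-by
  "Groups rows by group-key and returns one map per group holding the
  mean of value-key, sorted by the group key."
  [group-key value-key rows]
  (->> rows
       (group-by group-key)
       (map (fn [[k rs]]
              {group-key k
               :mean (/ (reduce + (map value-key rs)) (count rs))}))
       (sort-by group-key)))

(mean-by :city :temp readings)
```

This works, but nothing here knows about column types, missing values or time indexes, which is the gap a dedicated data frame structure would fill.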

That being said, I think we need a way to interact with Python seamlessly, so that we can use the functionality of several existing packages (tensorflow and scikit). I understand Clojure’s bias towards MXNet, but many beginners still start with TF, and it could be easier to make the move if the concepts were similar.

Finally, I think we need one framework, and we need to make choices for beginners. As a community we need to decide on a single way of doing things first; otherwise we will end up with many libraries sharing the same responsibilities and users struggling to know which library to choose.

I am thrilled by this initiative and I hope I can contribute and participate.


Hello, everybody.

Not wanting to disrupt this fruitful discussion, we opened a separate topic to discuss meta questions such as the format of the meeting.

Please look inside and comment.

As @jsa-aerial mentioned, maybe this platform (clojureverse) has some limits, and another place (Zulip? Reddit?) could be useful for focused discussions.
Let us talk about that, too, in the meta discussion that we just opened.

Slightly off-topic:

I’m currently developing cljfx, a data-driven wrapper for JavaFX that includes charts. Once I release it, it will be really easy to build this bar function on top of it. I created a chart example as an illustration:

;; Assumes the cljfx.api namespace is aliased as fx.
(require '[cljfx.api :as fx])

(fx/on-fx-thread
  (fx/create-component
    {:fx/type :stage
     :showing true
     :scene {:fx/type :scene
             :root {:fx/type :bar-chart
                    :title "Top headline phrases"
                    :legend-visible false
                    :x-axis {:fx/type :number-axis}
                    :y-axis {:fx/type :category-axis}
                    :data [{:fx/type :xy-chart-series
                            :data [{:fx/type :xy-chart-data :x-value 8961 :y-value "will make you"}
                                   {:fx/type :xy-chart-data :x-value 4099 :y-value "this is why"}
                                   {:fx/type :xy-chart-data :x-value 3199 :y-value "can we guess"}
                                   {:fx/type :xy-chart-data :x-value 2398 :y-value "only X in"}
                                   {:fx/type :xy-chart-data :x-value 1610 :y-value "the reason is"}
                                   {:fx/type :xy-chart-data :x-value 1560 :y-value "are freaking out"}
                                   {:fx/type :xy-chart-data :x-value 1425 :y-value "X stunning photos"}
                                   {:fx/type :xy-chart-data :x-value 1388 :y-value "tears of joy"}
                                   {:fx/type :xy-chart-data :x-value 1337 :y-value "is what happens"}
                                   {:fx/type :xy-chart-data :x-value 1287 :y-value "make you cry"}]}]}}}))

results in:


I’ve very much enjoyed using Vega via Oz for visualizations recently. Not only does it offer a very low-boilerplate way to quickly produce graphics, it is also very easy to export the specifications used by Oz as JSON that can be used in building front-end tools and dashboards, even by people whose preferred tool is not Clojure.

Also, Vega works well with TopoJSON cartographic data, making it a good choice for visualizing geographic data.


General note: while I agree that NNs and probabilistic programming languages are not a core part of the expected statistical toolkit, they do each have a place. NNs are receiving enough hype at the moment, but for those of you who might be interested in PPLs, I recommend this video from Stuart Russell to get a sense of what they are and how well they can perform when used properly.


I recently started a text classification project and did it in parallel in Python and Clojure (in my free time at home).

I don’t have the impression that the combined Clojure/Java world is missing much for any type of data analysis.
As a consequence, though, I need to embrace Java rather than try to avoid it.

After some experience I came to the conclusion that Java interop is not a problem as such, but the tooling for Java interop could improve.
Even the most complete Clojure IDEs (IntelliJ, Emacs + CIDER) don’t offer “perfect” autocompletion or help on Java methods in every case where it would be possible. (It can never work in all cases, because Clojure is a dynamic language.)
But I think we will get there, as the tooling keeps improving.

I also got the impression that “wrapping” existing Java libraries is mostly a waste of time and should not be done. I saw a lot of abandoned wrapper projects. They were probably abandoned because it is simply too much work to keep wrapping an ever-changing Java library.

In any given project, I need only a very small subset of the big Java data science libraries, so it is easier to write a small wrapper myself, covering only the code I will call.
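As a sketch of what such a project-local wrapper can look like, the function below exposes only the handful of calls it needs from one Java class and returns plain Clojure data. To stay self-contained it uses the JDK’s `java.util.DoubleSummaryStatistics` as a stand-in for a larger data science library; `summarize` is a hypothetical name for this example.

```clojure
(import '(java.util DoubleSummaryStatistics))

(defn summarize
  "Thin wrapper: feeds numbers into a DoubleSummaryStatistics and returns
  a plain map, hiding the Java object from the rest of the code."
  [xs]
  (let [stats (DoubleSummaryStatistics.)]
    (doseq [x xs]
      (.accept stats (double x)))   ; the only mutating interop call we need
    {:n    (.getCount stats)
     :min  (.getMin stats)
     :max  (.getMax stats)
     :mean (.getAverage stats)}))

(summarize [1 2 3])
```

Because only four getters and one method are touched, a change elsewhere in the wrapped class does not affect this code, which is the maintenance argument made above.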

The huge difference I saw between Python and Clojure was in the ease of searching for solutions on Stack Overflow or Google.

For every problem I had, I found the Python solution in a few seconds, while for Clojure it took far longer.

So we should invest more time in blogging about our Clojure-based data science projects than in wrapping Java code.


I’ve been given a new ‘status’ and can reply again - whoohoo!

In Saite, this is just:

(->>
 (hc/xform ht/bar-chart
  :TITLE "Top headline phrases"
  :X :x-value :Y :y-value :YTYPE "nominal"
  :DATA
  [{:x-value 8961 :y-value "will make you"}
   {:x-value 4099 :y-value "this is why"}
   {:x-value 3199 :y-value "can we guess"}
   {:x-value 2398 :y-value "only X in"}
   {:x-value 1610 :y-value "the reason is"}
   {:x-value 1560 :y-value "are freaking out"}
   {:x-value 1425 :y-value "X stunning photos"}
   {:x-value 1388 :y-value "tears of joy"}
   {:x-value 1337 :y-value "is what happens"}
   {:x-value 1287 :y-value "make you cry"}])
 hmi/sv!)

[Image: barchart-example]

Actually, for something this simple we don’t even need the server. We can just use the client editor and popup render to check if it looks like what we want:

Sigh, more platform limitations: you cannot have more than 3 consecutive comments/replies. So I have to edit this original post in order to add the following, which probably should be separate.

It just occurred to me that it is probably worth noting: if you work with many datasets that share the same x/y channel names and types, you can simply change the defaults for those channels so that you do not need to keep specifying them in your explorations:

(hc/update-defaults :X :x-value, :Y :y-value, :YTYPE "nominal")
;;; Now the above example could be
(-> ht/bar-chart (hc/xform :TITLE "Top headline phrases") hmi/sv!)

Same with the client. The editor content would just be:

[ht/bar-chart :TITLE "Top headline phrases"]

And the renderings would be the same as before.


The ergonomics of this platform are a bit difficult to navigate correctly. This should have been a reply, not a new comment:

Sure, but I believe Saite (by way of Hanami) is simpler, with an even lower “boilerplate”, and much more capable. You can certainly use it in exactly the same way as Oz (a Vega/Vega-Lite spec is a legal template), but you can do a lot more with very little extra effort.

This example worksheet shows how one may use Renjin (a pure-JVM port of R) from Clojure.

It will be generalized as a library we are building – would be thankful for comments!

Renjin is possibly the largest collection of statistical functions in the JVM (but many R libraries are not supported yet).


My comment wasn’t meant to express a preference between Oz and Saite/Hanami (I haven’t used the latter, and thus have no opinion), but rather to suggest that there are advantages to targeting a browser via Vega rather than, for example, building a Grammar of Graphics implementation in Clojure or wrapping R’s ggplot.


What I like about the Python community is that although there are competing libraries and a lot of overlap, every single library has a very strong statement of intent as well as a defined set of features.

I find this lacking in Clojure, to the point where, just by going through the thread, I’m reading about what everyone’s done but I’m still confused as to how the different libraries can be combined to do something greater.

Tutorials are a great start, but my preference would be that a standard be chosen for me and I’ll just go with it. I don’t need 5 ways of doing the same thing. I need 5 things that, when combined, give me much more flexibility than before. It’s the same with libraries as with functions. Right now, as a beginner, I do need some help with this.


I totally second this. Although Python has a ton of libraries, for these tasks people will fire up a Jupyter Notebook, import pandas, numpy and scikit-learn, and that’s it. You may or may not like these packages/libraries (for instance, I dislike notebooks), but at least it’s clear to beginners how to start.

After some time, when they are more independent and able to evaluate code and packages, they might choose something else, but by then who cares.

I guess the point is similar to what Cognitect has said many times: they chose to build stuff for experienced programmers. We will have to choose whether we want to do the same or improve support for beginners.


This reminds me: while I think everyone should use what they like, I find notebooks worse than a good editor-connected REPL for most work. It would be nice if we tried to center our work around libraries that work well in a normal Clojure REPL + editor workflow while leaving the option open for Jupyter (or whatever) on top.


I’d like to see many existing projects merged into one very good set of libraries, as I agree with the point made above that the Clojure community is too small to support as much fragmentation as one sees in the Python world.


I think you should aim for what the R language has: RStudio, tidyverse, rstan, R Markdown, etc.


We use kixi.stats for an MCMC-like model for special education projections. We’re doing some discrete event simulation work at the moment, but that is in Python using simpy (code not released yet).

I’ve found two other discrete event simulation libraries:

I think these kinds of simulations are important parts of data science too and I’d like to see good libraries for them.
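For anyone unfamiliar with the term, the core of a discrete event simulation is a time-ordered event queue that is drained one event at a time, with each event allowed to schedule future ones. The toy sketch below shows that loop in plain Clojure. It is not taken from simpy or any library mentioned here; `simulate` and `handle` are hypothetical names, and it assumes at most one event per timestamp so that a sorted map can serve as the queue.

```clojure
(defn simulate
  "Runs a toy discrete event simulation. `init-events` maps time -> event.
  `handle` takes [state time event] and returns [new-state new-events],
  where new-events is a map of future time -> event."
  [handle init-state init-events]
  (loop [queue (into (sorted-map) init-events)
         state init-state
         log   []]
    (if-let [[t ev] (first queue)]          ; earliest pending event, if any
      (let [[state' new-events] (handle state t ev)]
        (recur (into (dissoc queue t) new-events)
               state'
               (conj log [t ev])))
      {:state state :log log})))

(defn handle
  "Example handler: arrivals every 2 time units up to t = 6, and each
  arrival leaves 3 time units later. The numbers are arbitrary."
  [state t ev]
  (case ev
    :arrive [(update state :in-system inc)
             (cond-> {(+ t 3) :depart}
               (< t 6) (assoc (+ t 2) :arrive))]
    :depart [(update state :in-system dec) {}]))

(simulate handle {:in-system 0} {0 :arrive})
```

A real library would add a proper priority queue that tolerates simultaneous events, random variates, resources and statistics collection on top of this loop.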
