Online meeting: clojure data science

Actually, I think the plotting/display capabilities are well underway, thanks largely to the amazing work of the IDL folks who have given us Vega and Vega-Lite.

For plotting and charting with zero web dev knowledge, have a look at Saite, a general exploratory visualization app built on Hanami. You can work in the client (browser) and/or from an IDE/editor of your choice on the server side. Saite is evolving toward a new notebook capability. There are a number of issues about this on GitHub, and if you think that sounds like an idea worth pursuing, I'd appreciate input on those issues (or others of your own).

If you want to create your own domain specific (or other general purpose) visualization app, look at Hanami. Of course that would involve web dev knowledge. Some new features supporting client only apps are coming. A lot of documentation now, but a lot more coming.

2 Likes

One thing I would like to point out. This

make the plotting experience seamless: …
result would be a bar chart with reasonable defaults.

is totally on target.

This, however:

(bar my-data)

is a bad idea. Trying to cover all the various cases of this via a traditional functional or class/method-based API is far too constraining and limited. Even worse would be resorting to macros. A much better approach is to just use data and, since we are using a great functional language, data transformations. That's the route Hanami has taken.
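To make the "just use data" point concrete, here is a minimal sketch using plain Clojure maps shaped like a Vega-Lite spec. This is not Hanami's actual API; the names `base-bar`, `with-data`, and `horizontal`, and the field names `category` and `count`, are made up for illustration.

```clojure
;; A bar chart is just a map (keys follow the Vega-Lite spec format).
(def base-bar
  {:mark "bar"
   :encoding {:x {:field "category" :type "nominal"}
              :y {:field "count" :type "quantitative"}}})

;; Hypothetical helper: attach inline data rows to any spec.
(defn with-data [spec rows]
  (assoc spec :data {:values rows}))

;; Hypothetical helper: swap the x and y encodings to get horizontal bars.
(defn horizontal [spec]
  (update spec :encoding
          (fn [enc] (assoc enc :x (:y enc) :y (:x enc)))))

;; Variations are ordinary data transformations, not new API functions.
(def chart
  (-> base-bar
      (with-data [{:category "a" :count 3}
                  {:category "b" :count 7}])
      horizontal
      (assoc :title "Counts per category")))
```

Every special case (orientation, title, data source) is one more `assoc` or `update` on the map, so there is no need for a growing family of `bar`-style functions.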

1 Like

@alanmarazzi: Thanks for the kind comments and seeing good discussion early is heartening.

I only have one small thing to add: I took some time this morning to clearly explain the points behind the tech.ml-base system in the README.

There is more to those systems than serialization; it comes down to being able to represent various parts of the problem as data (and datasets) that you can visualize in the REPL, plus some details around how models are represented generically.

Anyway, again, great thread!

3 Likes

Haha :slight_smile: (Tomasz here). Yes, I'm really happy. I also didn't know about the existence of tech.smile, so we have another island. However, fastmath contains almost raw bindings to certain elements of Smile and Apache Commons (besides classification, I also added a bunch of clustering algorithms). I definitely want to treat it as a step toward building a richer environment. Let's start thinking about bridges now.

2 Likes

Thanks, everybody, for your kind comments. It seems good to continue this discussion a little bit further, before going back to discuss the format of a meeting, etc.

@alanmarazzi, @cnuernber - that was really enlightening, and I like the visions that you both suggested.

Here are some more suggested directions, that can possibly fit in.

grammar of graphics
R's ggplot2 can work on pure JVM (through Renjin), and is easy to access from Clojure. I'll write some examples soon.

interactive visualizations
Vega and Vega-Lite, Oz and Saite are wonderful.
For a richer collection of visual elements, personally I use some combination of hiccup, rojure, rmarkdown, htmlwidgets and crosstalk. I'll write some examples soon. This is quite handy, but complicated and has some limitations. Here ClojureScript could really shine with much simpler solutions.

calling python
It should be feasible to create a solution with, e.g., HyREPL for sending commands, and Apache Arrow for sharing memory. Thus, we can talk with live python sessions (which is more efficient when one needs to repeatedly access a large data structure).

probabilistic programming
A complete probabilistic programming library is arguably one of the important missing pieces. Anglican is useful and very expressive, but does not support the fastest sampling methods (that require differentiation). About Bayadera, probably @draganrocks can comment.
In the short run, maybe wrapping a non-clojure library would be the easiest. There are several very interesting candidates here.
In the long run, let us consider writing something more complete in Clojure. We have been discussing this with David Tolpin, who coded Anglican, and has some interesting new ideas.

1 Like

Bayadera is a (Bayesian-opinionated) Probabilistic Data Analysis environment (that implements GPU MCMC engines as an important implementation detail). Arguably, Bayesian data analysis is the future of small structured data analysis, but whether it will be mainstream is yet to be seen. Currently, the major issue with (C++) tools is that they are super slow. Bayadera's engine brings that to a lower, acceptable level, but, as probabilistic data analysis is still niche, it's a lot of work to convince other people to try it. I wish I had more time to document it and demonstrate how I use it, but I can't commit infinite resources to open-source work, and this has to compete with other things on my TODO. Additionally, the target audience usually doesn't have GPUs, and creates script-based R analyses… Still a lot of steps for that community to become interested in scaling server-based software… Lots of interesting things there, since I have use for it, am very interested in the area, and I think this might be expanding. It is clearly useful in practice, especially for decision analysis and social sciences.

Probabilistic programming is something else and, in my opinion, is something that has yet to be shown practically useful outside programming(?) language research. I get the idea of PP, but I think that it does not scale. It's super, super slow even for basic models. It might be useful for some things such as hyperparameter search etc. (even that is still research, though), but I cannot see how this can work as a data analysis tool. The common confusion of PP and Bayesian data analysis is due to PP often using methods such as MCMC and creating prior/posterior/evidence models etc., which is also done in BDA. However, I can see how this can work in practice when searching numerical distributions. I'm skeptical this will work acceptably in practice for distributions of random program code output, especially if that is then used for numerical data analysis. Anyway, PP might be interesting or not for different people, but I'd say that this is not a pressing issue for Clojure as a "data science" or ML platform.

PS. It seems to me now that this might sound a bit dismissive of PP in general and Anglican in particular. That was not my intention. Anglican is quite an interesting thing to try, and everyone can get a lot of intuition and fun from it. In addition, there is the Probabilistic Programming in Action book that can make it easier to learn (it's in Scala, but I believe it can be used to get many ideas). I'm just skeptical about the usefulness of the whole PP concept as a general data analysis tool applied in practice vs other methods. But for learning and fun, I can even recommend it!

3 Likes

Thanks, @draganrocks, this explanation helps a lot. You are right, the distinction is important.

The way I see it, PP is just about being very expressive in describing probabilistic situations. At least for small problems, I find it quite useful. For anyone here curious, probably this interactive book tutorial by Goodman and Tenenbaum is one of the accessible ways to see what it is about. Anglican feels similar from the user perspective.

Anyway, personally, I really hope to be able to look into Bayadera and understand it better.

1 Like

Vega and Vega-Lite are entirely based on the grammar of graphics. The main practical difference between Vega/Vega-Lite and ggplot2 is that the former are dynamic, interactive and run on JS (browser-based being the most important). You could make a reasonable argument that this means they supersede ggplot2 and that any future work should target them. Indeed, the R folks are working on supporting Vega and Vega-Lite at a level comparable to Hanami/Saite/Altair (Python). They are not there yet but realize this is the future.

WRT 'other visual elements', Saite, Hanami, and Oz support hiccup - Saite and Hanami directly support this on the browser side as well. And Saite/Hanami also support full re-com.

OK, it appears I have reached some crazy "you can only reply a maximum of 3 times" limit in this silly platform. But I wanted to say that WRT PP, I basically agree with Dragan's comments on this.

Since it appears I no longer can add any more to the conversation, I guess I may have to just abandon this effort after all…

4 Likes

Hi all, long time lurker here. I feel like I can contribute to this thread. I'm somewhere in the middle between data eng and data sci, been doing Python for a while. I've been meaning to get into Clojure, but always failed to learn by building apps, so decided to learn it by doing data science and documenting my journey.

Following the "building bridges" analogy, I feel something like Chris Albon's Python data sci how-tos (scroll down to Data Wrangling) would help a lot of people to get started. I'm sure there are a ton of people landing on that page while getting started in the field, plus using it for reference. Also, see his ML cards.

In general, the whole thread sounds great - I'm definitely in, however much I can help.

2 Likes

@cnuernber. I've been watching the 'tech' libraries for a while, though a lot of it is beyond me at the moment. Looking forward to the meeting.

2 Likes

I'm very happy to see this initiative! My involvement with data science is probably more peripheral than for some folks here, but I'm still very interested. I'm a philosopher of science who sometimes works with real data, but more often works with data that was generated from simulations. I also use OCaml and NetLogo quite a bit, and a lot of my work doesn't involve coding or data at all, so I come and go from Clojure depending on the projects I'm working on.

(If anyone's interested in discussing agent-based/individual-based modeling with Clojure, it probably doesn't belong in a data science thread, but please contact me for a side conversation. I've been involved with agent-based modeling for a while, and since Clojure is one of my favorite languages, I like doing ABM work in it when that makes sense.)

2 Likes

Great initiative!

Please consider adding clj-stan (https://github.com/thomasathorne/clj-stan) and clj-vw (https://github.com/engagor/clj-vw).

-A

1 Like

On the subject of bridges and better python support, we should note that the javacpp-presets package has a lot of bridges.

  • mxnet alternative
  • tensorflow
  • cpython
  • tensorrt
  • mkl-dnn
  • cuda, cudnn

Note that saudet is also a regular contributor to nd4j.

As an aside, I wasn't able to link to that, nor was I able to link to our library showing how to use a javacpp library and our post explaining details around this area. So I apologize; at least check out the javacpp-presets GitHub project and you will see. I listed maybe 10% of the libraries.

1 Like

Also, for a nice background with lispy examples, please see the free book "An Introduction to Probabilistic Programming" https://arxiv.org/pdf/1809.10756.pdf written by Frank Wood and team.

2 Likes

If anyone's interested in discussing agent-based/individual-based modeling with Clojure, it probably doesn't belong in a data science thread, but please contact me for a side conversation. I've been involved with agent-based modeling for a while, and since Clojure is one of my favorite languages, I like doing ABM work in it when that makes sense.

I'm in Operations Research, primarily focusing on discrete event simulation and various forms of optimization (oftentimes via mathematical programming). Simulating complex processes and plans is the bulk of my work, though. Distributed simulation is a current topic of interest. Basic stats and analysis also pop up regularly.

[general data science topics]

Over the last ~7 years, I've written gobs of stuff for internal use, although large chunks are publicly available in a monolith. My primary intersection with "data science" has been via Incanter, of which I maintain a fork, trying to revamp the extant design (case in point, plotting functions) and porting internally developed fixes and extensions. There seems to be little love for Incanter on this thread so far :slight_smile:

Regarding dataframe-ish stuff, I built a typed columnar table years ago, with a little SQL-like EDSL ported from Practical Common Lisp. After core.matrix came out with its dataset implementation, I extended the relevant protocols to spork.util.table to enable use in Incanter. I haven't really needed or missed any dataframe stuff as a consequence. The table implementation provides typed schemas for all the fields, as well as efficient mutable construction, string canonicalization (this is pretty huge in practice for me), and some other goodies.

I've only recently begun pushing on the applied ML front; definitely interested in the opinions of practitioners here.

I've messed with Vega (via Oz and a fork I wrote to use the JavaFX WebView as the canvas). I think the grammar of graphics approach is powerful, but also a bit static… You can encode all sorts of higher-minded things, which are then compiled into something Vega understands to create slick graphics, but the lack of low-level access to the actual plotting and rendering (via the resulting scene) is, IMO, a bit of a downside. I've been looking at a solution that affords the data-driven specification of Vega, but allows better control of the resultant product (via something like Processing/Quil or another renderer). In my ideal world, I should be allowed to mess with the plot however I want; Vega/ggplot would provide a nice porcelain layer to get up and running fast.

3 Likes

Beautiful.

So many wonderful surprises here that I guess many of us were not aware of.

We will soon create a separate clojureverse topic to discuss meta questions such as the meeting format, timing, better platforms for discussion, etc.

I think this would be really good. Clojure would be an ideal language for data science. I have a role between data scientist and analyst and I struggle with the following points in Clojure.

Even though we prefer sequences of maps, we desperately need a dead simple replacement for Python's pandas library (and a concept for time series). For most beginners, the struggle is to read data, manipulate it and dump it. Moreover, I think the best introduction would be with cljs, thanks to the connection with interactive visualizations and all the libraries available in JS. Just having a data frame structure would allow us to attract many data analysts using R and Python (we need more for ML, but that would be a great start).
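As a rough sketch of that read/manipulate/dump loop using nothing but sequences of maps and `clojure.string` (no dataframe library): this assumes a tiny, well-formed CSV with no quoted fields, and `parse-csv`, `to-csv`, and the sample data are hypothetical names made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Read: turn CSV text into a sequence of maps keyed by the header row.
(defn parse-csv [text]
  (let [[header & rows] (map #(str/split % #",") (str/split-lines text))
        ks (map keyword header)]
    (map #(zipmap ks %) rows)))

(def raw "city,year,sales\nOslo,2018,10\nOslo,2019,14\nRome,2018,7\nRome,2019,9")

;; Manipulate: coerce the sales column to numbers, then total it per city.
(def rows
  (map (fn [row] (update row :sales #(Long/parseLong %)))
       (parse-csv raw)))

(def sales-by-city
  (->> rows
       (group-by :city)
       (map (fn [[city rs]] {:city city :total (reduce + (map :sales rs))}))
       (sort-by :city)))

;; Dump: write selected columns back out as CSV text.
(defn to-csv [ks rows]
  (str/join "\n"
            (cons (str/join "," (map name ks))
                  (map (fn [row] (str/join "," (map row ks))) rows))))

(println (to-csv [:city :total] sales-by-city))
```

This works for small cases, but the column typing, missing values, and time series handling that pandas provides all have to be done by hand, which is the gap described above.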

That being said, I think we need a way to interact with Python seamlessly in order to use the functionality of several existing packages (TensorFlow and scikit). I understand Clojure's bias towards MXNet, but many beginners still start with TF, and it could be easier to make the move if the concepts were similar.

Finally, I think we need one framework, and we need to make choices for beginners. As a community we need to decide on a single way of doing things first; otherwise we will end up with many libraries sharing the same responsibilities and users struggling to know which library to choose.

I am thrilled by this initiative and I hope I can contribute and participate.

1 Like

Hello, everybody.

Not wanting to disrupt this fruitful discussion, we opened a separate topic to discuss meta questions like the format of the meeting.

Please look inside and comment.

As @jsa-aerial mentioned, maybe this platform (clojureverse) has some limits, and another place (Zulip? Reddit?) could be useful for focused discussions.
Let us talk about that, too, at that meta discussion that we just opened.

Slightly off-topic:

I'm currently developing cljfx, a data-driven wrapper for JavaFX that has charts in it. Once I release it, it will be really easy to build this bar function on top of it. I created a charts example as an illustration:

(fx/on-fx-thread
  (fx/create-component
    {:fx/type :stage
     :showing true
     :scene {:fx/type :scene
             :root {:fx/type :bar-chart
                    :title "Top headline phrases"
                    :legend-visible false
                    :x-axis {:fx/type :number-axis}
                    :y-axis {:fx/type :category-axis}
                    :data [{:fx/type :xy-chart-series
                            :data [{:fx/type :xy-chart-data :x-value 8961 :y-value "will make you"}
                                   {:fx/type :xy-chart-data :x-value 4099 :y-value "this is why"}
                                   {:fx/type :xy-chart-data :x-value 3199 :y-value "can we guess"}
                                   {:fx/type :xy-chart-data :x-value 2398 :y-value "only X in"}
                                   {:fx/type :xy-chart-data :x-value 1610 :y-value "the reason is"}
                                   {:fx/type :xy-chart-data :x-value 1560 :y-value "are freaking out"}
                                   {:fx/type :xy-chart-data :x-value 1425 :y-value "X stunning photos"}
                                   {:fx/type :xy-chart-data :x-value 1388 :y-value "tears of joy"}
                                   {:fx/type :xy-chart-data :x-value 1337 :y-value "is what happens"}
                                   {:fx/type :xy-chart-data :x-value 1287 :y-value "make you cry"}]}]}}}))

results in:

1 Like

I've very much enjoyed using Vega via Oz for visualizations recently. Not only does it offer a very low-boilerplate way to quickly produce graphics, it is also very easy to export the specifications used by Oz as JSON that can be used in building front-end tools and dashboards, even by people whose preferred tool is not Clojure.

Also, Vega works well with TopoJSON cartographic data, making it a good choice for visualizing geographic data.

2 Likes