2021-08 - Plans & Hopes for Clojure Data Science

Since the beginning of 2021, we’ve had a habit of a monthly thread where people could share their hopes for the emerging data science ecosystem. The place for those threads has been the Clojurians Zulip. We decided to move that to Clojureverse. There are many new friends getting involved, and it seems important to have this dialogue in a more visible place.

Through periodical updates, we may help each other catch up and think about the bigger picture, and the way our efforts may tie together. It’s also a good way for each of us to remind ourselves individually of what we have done, and what we would like to do in the near future.

It would be great if you all would consider the following questions and briefly mention your views towards them. Please skip anything that you find irrelevant. Keep in mind, these are only prompts to get you thinking.

  • Are you working on anything related to the Clojure ecosystem for data science / scientific computing / data tooling / data engineering? Let us know about it.
  • Have you been doing anything interesting in the last month?
  • Is there any new realization or change in your hopes and beliefs about the ecosystem’s future?
  • What are you hoping to create/learn/explore in the coming month? … and in the coming 3 months?
  • What developments are you hoping to see in the ecosystem and community in the coming month? … and in the coming 3 months?

Also: if you are interested to see what you or others have written in the past few months here are some links to the previous threads:

Looking forward to hearing about what everyone has been up to and hopes to be up to!

:pray:

6 Likes

Are you working on anything related to the Clojure ecosystem for data science / scientific computing / data tooling / data engineering?
I’ve been much slower than I wanted/anticipated, but I’m still working on creating a module for sicmutils (GitHub - sicmutils/sicmutils: Scmutils in Clojure) that deals with quantum algebra. If I can get it off the ground, I would like to develop it into a serious framework for working with quantum physics.

Have you been doing anything interesting in the last month?
Related to Clojure, I’m now fairly deep into developing a language-learning app that helps you learn from real-world video and subtitle content. I’m considering setting up a small side-company around it, so if anyone’s interested in joining then please get in touch :).

What are you hoping to create/learn/explore in the coming month? … and in the coming 3 months?
Related to the above, I’m trying to learn as much as possible about Clojure web development in my spare time.

What developments are you hoping to see in the ecosystem and community in the coming 3 months?
I’m excited by the prospect of real-world examples and tutorials for learning Clojure data science tools. I’m particularly hoping that this will eventually draw a critical mass of people who will push Clojure more towards the spotlight.

4 Likes
  • related work - video encode/decode for clojure, graal native and further work on tmd and tmdjs.
  • anything interesting? - tmdjs and decoding video. I played with tensorflow-js a short bit and got some face detection models working which was pretty fun.
  • Any realizations/hopes - Not so much, just watching the wheels of community involvement turn. The consistent effort from Daniel and others building the ecosystem and community in a careful, documented way is and will continue to be the biggest shining star for me. Also having people ask questions on zulip about some mathematical subject and having @generateme answer them quickly with fastmath functions is great. We are now at the point where there is a fairly substantial base foundation to build upon.
  • Exciting prospects in coming months - I would like to see/hear about some nontrivial deployments of tmdjs. In a more general sense John of Practicalli fame has been working closer with scicloj and that seems like it may result in some great content.
7 Likes

Quick and dirty status: organizing further tablecloth (TC) development. We plan to make dtype-next (columnar) operations work similarly to the rest of the TC API, and we want to compare TC with Pandas to find gaps.
I am also revisiting inferme, a Bayesian inference library I wrote some time ago.

My main hope is to use notespace as the main documentation tool (with easy data viz).

4 Likes

I am new here, so I don’t have many hopes or plans yet.

Anyway, I came under the wing of @generateme, and I believe that I will be more or less useful in the development of tablecloth. We have a similar vision and we agree that the most enjoyable part of developing software is writing it.

4 Likes

My own work on the fabricate static site generator continues. I’m trying to balance actually using it with work on extending it, mostly through some basic SVG generative art experiments.

Concurrently with that, I’m thinking about how datahike can enhance the authoring experience by making queries a part of the writing process, inspired in part by John McPhee’s Kedit-based writing process. I wrote up an initial design brief that I hope to refine as I think about the “minimum viable database” for my own website before adding it to fabricate.

https://respatialized.github.io/design-doc-database.html

I’ve also been working through Think Bayes by translating all the examples and exercises to Clojure/tablecloth and it’s been a very interesting and challenging experience. I much prefer functions on data to the “here’s the magic ‘distribution’ button” approach that invariably asserts itself in the Python world, but it makes going through the book slower! I intend to finish the book by month’s end.

3 Likes

Thought I’d share here: I’ve sort of ironed out a literate Clojure workflow

https://geokon-gh.github.io/literate-clojure.html

It allows me to have a single-file clojure “project” that loads libraries dynamically and generates plots with inline SVG. (I’ve got other documents and projects in the pipeline, but they’re not ready to share yet unfortunately)

It’s made me really strongly believe in the thi-ng/geom architecture - with SVG hiccup being the correct “interchange” format. I think it’s a shame geom never took off, but I get that it’s a bit harder to get into b/c instead of being a monolithic plotting library with one entry point (like some central plot function with a million options) it’s a composable set of namespaced mini libraries with SVG being the ultimate output. Some of the minilibraries are lower level (matrix/transform/color/math) and some are higher level and built on svg (viz for plotting, other more complex ones for generating meshes and 3D images)

It’s all highly composable and the resulting SVGs are very flexible. This has two primary benefits:

The first is that it’s naturally very easy to extend the existing functionality. It’s very easy to write your own custom visualizations/plotting functions and to tweak/dial in graphs. The existing plotting functions are quite capable and complex (geom/core.org at master · thi-ng/geom · GitHub) and are a great starting point for making your own, which you will inevitably need to do. There are tons of options already, but many things are missing. If you, say, want some error bars on your scatter plot then you’ll prolly need to implement it yourself. All the code (so far) has been very digestible, and I’m not a Clojure guru by any stretch. I’ve never felt so comfortable looking at and extending someone’s codebase.

The second is the flexibility of the resulting SVG hiccup. You can manipulate the hiccup directly and modify it in any way you’d like. SVG is also pretty pleasant to work with and seems pretty well designed. It’s very modular and you can embed SVG in other SVG. If you want to add bar plots to your scatter plot, you just make the two plots and svg/group them together. Boom, done. So you can generate different graphs, apply transformations, etc., and compose them to generate multi-plot visualizations very easily. There is a bit of a learning curve, but once you get the hang of it, it starts to feel like you can plot anything with a bit of effort. There is not a lot of “meta” functionality though, but it’s very easy to write your own. I wanted to be able to arrange plots in a grid, but you don’t get a MATLAB-y figure(i),subplot(i,n,m,"blah") type of functionality. So I wrote a thing to do that in a few hours on a weekend. It felt very accessible and I guess I feel like I’m in control and not just subject to what the library provides (if that makes sense?). I’m rarely fighting the system and trying to massage it to do what is typical in MATLAB/R/etc.
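As a rough illustration of the composition idea described above, here is a minimal sketch in plain Clojure with no library at all. It relies only on the fact that SVG hiccup is nested vectors. The names `place-at` and `grid` and the two stand-in "plots" are hypothetical, invented for this example; they are not the poster's grid helper and not part of thi-ng/geom, whose `svg/group` would normally do the grouping.

```clojure
;; Plain-Clojure sketch: SVG hiccup is just nested vectors, so composing
;; plots is ordinary data manipulation. All names here are hypothetical.

(defn place-at
  "Wraps any hiccup fragment in a <g> translated to (x, y)."
  [x y fragment]
  [:g {:transform (str "translate(" x "," y ")")} fragment])

(defn grid
  "Lays out `plots` row by row in `cols` columns, each cell being
  `cell-w` by `cell-h` user units. Returns a single <g> element."
  [cols cell-w cell-h plots]
  (into [:g {}]
        (map-indexed
         (fn [i plot]
           (place-at (* cell-w (mod i cols))
                     (* cell-h (quot i cols))
                     plot))
         plots)))

;; Stand-ins for real plots; a real one would come from a plotting
;; function that returns hiccup.
(def scatter [:circle {:cx 50 :cy 50 :r 4}])
(def bars    [:rect {:x 10 :y 20 :width 15 :height 60}])

;; Two plots side by side in one SVG document.
(def figure
  [:svg {:xmlns "http://www.w3.org/2000/svg" :width 400 :height 200}
   (grid 2 200 200 [scatter bars])])
```

Because the result is still plain hiccup, `figure` can itself be passed to `place-at` or `grid` again, which is what makes nesting multi-plot layouts cheap.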

And then in the end you can display/render the SVG in a myriad of ways. You are not platform-constrained in any way. You can open a webview, you can use Batik/SVG Salamander, you can just serialize the hiccup and spit it to a file. I even wrote a quick svg-hiccup to JavaFX renderer (using JavaFX graphics primitives, not a webview) for a larger GUI data processing application that has some simple in-window plots: corascope/svg.clj at master · geokon-gh/corascope · GitHub
It was very easy: a day or two of fiddling.
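To make the "serialize the hiccup and spit to a file" option concrete, here is a deliberately tiny serializer sketch in plain Clojure. The function name `render` is hypothetical, and this is not how thi-ng/geom or the hiccup library serialize: it does no escaping of text or attribute values and always emits explicit closing tags, so treat it as an illustration of the idea only.

```clojure
;; Minimal hiccup -> SVG/XML string sketch (hypothetical helper).
;; Limitations: no escaping, no self-closing tags.

(defn render
  "Turns a hiccup node (vector, seq of nodes, nil, or scalar) into a string."
  [node]
  (cond
    (vector? node)
    (let [[tag & more] node
          ;; An optional attribute map may follow the tag.
          [attrs children] (if (map? (first more))
                             [(first more) (rest more)]
                             [{} more])
          attr-str (apply str (for [[k v] attrs]
                                (str " " (name k) "=\"" v "\"")))]
      (str "<" (name tag) attr-str ">"
           (apply str (map render children))
           "</" (name tag) ">"))

    ;; Sequences of nodes (e.g. produced by `map`) are spliced in.
    (seq? node) (apply str (map render node))
    (nil? node) ""
    :else (str node)))

(render [:svg {:width 10} [:circle {:r 2}]])
;; => "<svg width=\"10\"><circle r=\"2\"></circle></svg>"

;; Writing to disk is then one call (the file name is arbitrary):
;; (spit "plot.svg" (render [:svg {:width 10} [:circle {:r 2}]]))
```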

Anyway, I’m just throwing it out there if people are looking for some alternatives. Last I poked around, Scicloj and company were all very Vega JS webstack focused. Which is a pragmatic solution, but felt very not Clojure-y and non-extensible. (and of course everyone loves to rewrite functionality in existing libraries :stuck_out_tongue: ) So here is a less capable but more pure Clojure solution. Hope it’s useful for someone.

4 Likes

I’m not much into the data science aspect of Clojure (or any other language really), but I have a certain fascination with data visualisation. Having previously looked at thi-ng/geom and failed to use it for anything, I really appreciate your examples of what it can do and how.

1 Like

I’m really glad to hear it was useful, but just to be clear, the examples were taken straight from the thi-ng/geom examples. I only added the add-libs wrappers to make it all a one-file executable. There is no real central place in the thi-ng/geom org files that has a high-level description of how things work. You just need to look at how Karsten Schmidt puts together his visualizations and learn the tricks.

Like this is a more complex example: geom/core.org at master · thi-ng/geom · GitHub

There are subtle tricks he’s doing: (apply svg/defs (map make-gradient item-type-colors)) is adding SVG defs for the gradients. And then he’s referencing those in the “:fill” key of all the intervals, so they end up with the same styling: :fill #(str "url(#" (name (:type %)) ")"). Unless you already know about SVG defs, stuff like that is a bit non-obvious.
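For readers who have not met SVG defs before, the trick above can be sketched in plain Clojure hiccup without any library. Everything below is an assumption made for illustration: the palette, the shape of the items, and this `make-gradient` are invented here and are not the definitions from the thi-ng/geom example. Only the `url(#id)` fill reference mirrors the snippet quoted above.

```clojure
;; Sketch of "define once in <defs>, reference by id in :fill".
;; Data and helper names are hypothetical.

(def item-type-colors
  ;; type keyword -> [top-color bottom-color]
  {:project ["#0af" "#05a"]
   :talk    ["#fa0" "#a50"]})

(defn make-gradient
  "Builds a vertical <linearGradient> whose id is the type's name."
  [[item-type [from to]]]
  [:linearGradient {:id (name item-type) :x1 0 :y1 0 :x2 0 :y2 1}
   [:stop {:offset "0%" :stop-color from}]
   [:stop {:offset "100%" :stop-color to}]])

;; One <defs> element holding a gradient per item type.
(def gradient-defs
  (into [:defs] (map make-gradient item-type-colors)))

(defn fill-for
  "Paint reference to the gradient that shares the item's type name."
  [item]
  (str "url(#" (name (:type item)) ")"))

(defn interval-rect
  "Draws one interval as a bar filled with its type's gradient."
  [{:keys [start end] :as item}]
  [:rect {:x start :y 0 :width (- end start) :height 10
          :fill (fill-for item)}])

(interval-rect {:type :talk :start 5 :end 25})
;; => [:rect {:x 5, :y 0, :width 20, :height 10, :fill "url(#talk)"}]
```

Every bar of a given type points at the same gradient id, so changing the palette in one place restyles all of them.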

Or he does this thing to make the labels rounded to a year with :label (viz/default-svg-label round-to-year), but you can jump to default-svg-label further down on the page and see that it’s nothing magical. It’s just generating a function that makes the labels. You can easily make your own labeler based on the example.
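The "function that makes a labeling function" pattern can be sketched in a few lines of plain Clojure. This is not the actual signature or behavior of viz/default-svg-label; `make-labeler`, its argument shape, and this `round-to-year` are hypothetical stand-ins chosen to show the idea.

```clojure
;; Sketch of a label-function factory (hypothetical names and signature).

(defn make-labeler
  "Takes a formatting function and returns a function that turns a
  position [x y] and a raw value into a <text> hiccup element."
  [fmt]
  (fn [[x y] value]
    [:text {:x x :y y :text-anchor "middle"} (fmt value)]))

(defn round-to-year
  "Formats a fractional year as the nearest whole year."
  [v]
  (str (Math/round (double v))))

(def year-label (make-labeler round-to-year))

(year-label [100 20] 2021.6)
;; => [:text {:x 100, :y 20, :text-anchor "middle"} "2022"]
```

Swapping in a different formatter, for example one that prints a month name, gives a new labeler without touching the drawing code.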

I remember when I wanted to group charts I was also a bit confused about how to do that, but then I found another example that grouped a couple of images and worked off that: geom/demos.org at feature/no-org · thi-ng/geom · GitHub (this example also uses defs for reusing elements). Again, if you know about SVG transforms this is also probably obvious.

So maybe the moral of the story is that you need to poke around the examples to put together new visualizations - but it’s never too intimidating. And … well maybe it wouldn’t hurt to know how SVG works :slight_smile:

If you made some hard choices about sane defaults you could prolly make easy-to-get started wrapper functions. (and I’ve been doing that on a case by case basis for my own projects) But I can see why that wasn’t done. It’d really hide the composability of the architecture

And I haven’t even touched the other stuff… with meshes and webgl etc. etc.

This guy does some of his art with thi-ng/geom (I think all the 3D stuff)

https://twitter.com/jackrusher

3 Likes

My focus this past month has been predominantly on the “Phase Transition” project. This is a group of people who meet once a month to help the SciCloj community define priorities, with the aim of bringing our ecosystem to a point of “readiness”, where readiness is measured not only by the speed and completeness of our libraries, but also by their ease of use.

Because this is an ambitious goal, we have decided to work in phases, focusing on certain “domains” or areas of a typical data science workflow at a time. Currently, we are on Phase 1 and that phase has us focusing on the following areas:

  • Tooling - The tools we actually use to do data science work in Clojure
  • Data Wrangling - Basic data processing at the column and column-based table levels
  • Visualization - Easy-to-use visualization tools that work with our tooling

We are also focusing on three other domains that may cut across all phases and which are important for education and communication:

  • Website - Building a website that can be used as an entry point for new users.
  • Tutorials - Building a standard workflow and repository for tutorials that illustrate usage related to the domains listed above
  • Study Groups - Reviving study group sessions that help us both teach people how to use our existing tools and also allow us to learn where there are gaps and other usability problems

Finally, we have tried to build an awareness of what our target user or “user persona” is for this phase of development. A guiding idea here:

  • We think our target user for this phase is someone who may be a beginner to data science but is somewhat familiar with Clojure.

We have been having a discussion about more detailed personas for this phase, and some people have contributed more fleshed-out descriptions. For example, here is one contributed by @practicalli:

Jane is a Clojure developer with a couple of years commercial experience and a year prior to that as a hobby
Jane has no data science experience but has seen TED talks and tutorials on visualising data
Jane has a project to create a dashboard to visualise covid19 data for her company
And has to find useful tools and data sources to build the dashboard
Jane visits the Government website for the country she resides in and finds various data sets in JSON and CSV formats
Jane chooses to use Clojure CLI tools for the Clojure project to make use of the community tooling that helps visualise data as she is transforming it (Portal, Reveal)
Jane uses Spacemacs (Emacs / Cider) as it is a tool she is familiar with and Cider has simple-to-use debug and data inspector tools. Jane started with LightTable but that project is no longer active. She has also tried VS Code and Calva, but Cider is more stable and mature, has more features and better structural editing support.
The data wrangling libraries Jane is considering are clojure.data.json and clojure.data.csv as they are already included in Clojure
Jane has heard of other libraries for manipulating data, but doesn’t know if they are relevant or easy to use
Jane will try using Oz for visualisations as she likes the idea of Vega & Vega lite as a data language, because it reminds her of the approach taken with Clojure itself.

What has happened so far?

  • We have met twice (monthly). (Our next meeting will be Sunday 8/22 @ 14:00 UTC). Please contact me if you are interested in participating in this effort.
  • We have established key plans/projects that align with the priorities for Phase 1 in each of our domains and have begun to work on these projects.
  • Some example projects that align with these goals are:
    • @generateme and @ribelo have (as noted above) begun work on bringing dtype-next column operations into TC in a way that is consistent with TC’s easy-to-use API
    • @ashima_panjwani has begun work on a library that will wrap @jsa-aerial 's hanami templating library for vega/vega-lite specs, making visualization even easier.
    • I will soon begin work on adding API support within tablecloth for the numerical (array) structures that dtype-next offers.
    • @practicalli has renewed work on building a strong community and documentation website that can serve as a platform for information about the emerging Clojure stack.
    • @skallinen and @daslu have started work on an initiative to rework notespace to provide a minimal and easy-to-use baseline for tooling, facilitating both an easy pathway to getting started with Clojure data science and our need for a way to create, document, and share data work & exploration with our tools.
  • Based on some of these priorities, we have been able to help a few people submit applications for related work in the latest round of Clojurists Together funding, which closed on August 6th.
  • Our two main forums for in-person meetings – #sci-fu an open forum for discussion of development issues/problems/etc, and #ml-study an open forum for group learning/teaching on the Clojure data science stack – have become useful as places to think through problems related to our phased work:
    • #ml-study: @daslu has hosted two weekends of learning focused on both tablecloth, and the visualization tools.
    • #sci-fu: For several weekends we’ve been able to host important discussions on ongoing development work on visualization, tablecloth support for numerical array-processing, how to connect more people to projects, etc.

Our ongoing challenges?

  • As mentioned above, we have begun to identify tasks that line up with the priorities of Phase 1, but we still need to find more people who are interested in working on these problems. @daslu recently made a successful announcement asking for people who may be interested in contributing. About 25 people expressed interest, and more than half of them have already begun getting involved and picking projects. We need to continue helping people connect to these exciting problems.
  • Our goal is to reach a moment in the next month or two where we feel we can be finished with Phase 1 so that we can launch it to the public. Launching means inviting new users to come and use these tools! In our next meetings we will take stock of how our Phase 1 work is going, and see where we are and how this goal can be reached.
2 Likes

Are you working on anything related …

  • My recent focus has been on community building: reaching out to new people, talking at meetup groups, renewing the scicloj ml-study weekend sessions, and mainly holding many, many meetings with newcomers, helping them get comfortable and initiate new small projects.
  • This made me put less attention into tooling. This is a bottleneck and needs more care now (see below).

Have you been doing anything interesting in the last month?

  • I have been meeting wonderful people who wish to get involved in what we are building.

Is there any new realization or change in your hopes and beliefs about the ecosystem’s future?

  • One-on-one meetings are an important tool in building a group that works together. We need to do it much more, but we also need to do it well, making sure to provide everybody with a clear path to connecting fruitfully with the project.
  • The Clojure data science stack is interesting to many Clojurians even if data science is not their main need. It is a challenging problem, of diverse technical/conceptual aspects. In community discussions, we are creating a clear idea of what needs to be done, and we are providing newcomers with a path to being part of that. This is very much attractive for Clojurians who wish to create some open-source in a supportive group, wish to enjoy the emerging stack for their explorations, or are simply curious to see something really cool.

What are you hoping to create/learn/explore in the coming month?

  • Help community newcomers in getting involved and picking projects which are relevant to the core needs of this phase.
  • Create an easy pathway where anybody can experiment with the emerging stack, document the experiments, and share them in a place that is accessible to others.
  • Tooling: adapt Notespace to support the above needs (experiment, document, share).
  • For future compatibility with related tools such as Oz, Goldly, and Clerk, extract relevant parts of Notespace to serve as drafts of “compatibility layers” that can compose with other tools in an extensible way. Create a dialogue with other tool authors about those aspects.

… and in the coming 3 months?

  • Take part in building the data-modeling domain of the emerging data science stack.
  • Adapt to new developments in the tooling domain (lots of potential around the upcoming Clerk and Goldly).
  • Help in organizing re:Clojure 2021 and the workshops preceding it (this year there will be a lot of data science there).

What developments are you hoping to see in the ecosystem and community in the coming month?

  • Reach decent drafts of the current projects going on in data wrangling (array API around dtype-next, better printing, Tablecloth developments), data visualization (e.g., viz.clj), tooling.
  • Start a flow of individual and group experiments with the emerging stack, documented and shared in a central place.

… and in the coming 3 months?

  • Reach a decent draft for a stack of libraries which allow anybody who knows Clojure to perform almost everything in the Python Data Science Handbook and in R4DS, and feel comfortable about it.
  • Prepare good tutorials and workshops to share that with the broader Clojure community at re:Clojure.
3 Likes