So how is it going to develop? Are we creating a subsection of the forums so that we can discuss each part separately?
I think the current master branch has moved away from org-mode(?). I have considered starting to use org-mode myself lately, but this is the reason why we opted against it. It is kind of the best literate programming environment though. Do you still have plans to evolve hydrox?
Really nice to see all the shared interests here. I am maintaining Anglican and I am curious to hear people’s needs and interests. Besides my research, I also work with Clojure in backend and frontend settings and think it is a nice fit. The recent Anglican version, for example, also runs in ClojureScript.
thi.ng creates SVG or WebGL output itself.
As a user and sponsor of kixi.stats I’d be into that. I worry a bit about carrying on with incanter as it is just so big.
As a company we definitely do cohort component, DES and MCMC so we’d like to see more stuff there. Not everything is a neural network. (I kid! I kid! pls don’t hit me actual data scientists!)
@joinr Do you think it is worth factoring out your DES stuff from spork into its own library?
There are several options to consider – let us discuss this under that separate topic:
@whilo I think I’d like to understand when I should be using Anglican and when I should be using other Bayesian and probabilistic tools. With all of these, there are trade-offs around speed, expressiveness, or fitting in with particular frameworks that we need to think about. Knowing where each of our options fits would be great.
hydrox has evolved (twice), though I’m using more of a repl-based bulk namespace approach as opposed to a file-watch approach (I found it inconsistent and needing a reset once in a while). The current landscape for docs isn’t too bad with cljdoc.
Are there any crossovers between Anglican and Bayadera? It’d be really helpful to have some sort of comparison matrix covering not just the probabilistic programming libraries but the ML ones as well. I understand ‘conj’ and ‘assoc’ really well. It’d be great to have some of the more difficult concepts simplified, explained and compared.
I really like the API of kixi.stats, for what it’s worth.
I also love Karsten’s thi.ng libraries and have long championed them to the Clojure community, but would caution us to look carefully at the feature set of VEGA.
geom is a great foundation that certainly could serve this purpose, but we should be clear on how much work is involved to bring it to feature parity with VEGA.
Ah, the implementation of kixi.stats (api and guts) is all @henrygarner so he should get the kudos.
I agree that VEGA-based things would be a very good place to concentrate our efforts, especially given the grammar of graphics and grammar of interaction that they’ve thought about, which appears very successful.
If we want to spend a lot of time having fun we could make that grammar also work with thi.ng (which I’m a fan of as well). I don’t think we should be betting on thi.ng directly at the moment.
We continue the discussion with better granularity on Zulip, under the data-science stream.
For background on Zulip, see this discussion.
It is recommended to know a bit about Zulip’s concepts of streams and topics.
See you there!
Do you think it is worth factoring out your DES stuff from spork into its own library?
For various values of “worth”: spork is an intentionally accreted monolith, the bits of which are in various states of use (and active development). It’s definitely designed for modularity though, and I’ve looked at breaking out bits into separate libraries (but lack the external motivation), either manually or automatically. In practice, the DES stuff uses/exploits bits of the ecosystem, like an entity component system, behavior trees, and some minor stats libraries. In the mid-to-long run, I’d shift towards using something to extract the dependencies during publication. I also need to port some rudimentary examples (currently half-baked). I’m more concerned with production than adoption at the moment though (although happy to answer questions).
Personally, I’m more interested in refactoring and applying fixes to Incanter since we continue to leverage it in internal processes. I got bogged down in refactoring the plotting implementation (all multimethods and macros, heavily tied to JFreeChart too, which involved munging through the JFreeChart docs). I’m looking hard at adding a vega backend with a porcelain API compatible with incanter.charts (rendering to browser/html, or a JavaFX webview to eschew spinning up a server). Incanter is big, but it’s also modular (via lein-modules), so it’s manageable IMO.
I’m an Emacs Org Mode user, and I usually do some simple statistics with Org Mode “Literate Programming”. I have a suggestion: we might create a GitHub README/wiki page for Clojure data science, with links to libraries, detailed descriptions of each, and maybe some generally useful data science knowledge. That way people can find information easily. Putting all the info in one place would help newbies find what they want. WDYT? Hunting around for information is difficult for me.
Thanks @stardiviner! Emacs Org Mode is beautiful.
That is a great idea, we are working on a website – will share in a few days to get some feedback.
We made this questionnaire to prepare for the Clojure data science online meeting.
Your response here will be extremely helpful!
Please note that no question is mandatory – e.g., you do not have to say even your name.
Well, my compliments to the chef
My feeling is that a dataframe library hasn’t emerged because a vector of maps+clojure.core is a good-enough “dataframe” for most uses. What are we missing in comparison to pandas?
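To make that claim concrete, here’s a minimal sketch (with made-up flight data; the keys are hypothetical, not from any library) of treating a vector of maps as a “dataframe” with nothing but clojure.core:

```clojure
;; Hypothetical flight data: one map per row.
(def flights
  [{:carrier "AA" :dep-time "09:00" :fl-num 11}
   {:carrier "UA" :dep-time nil     :fl-num 22}
   {:carrier "AA" :dep-time "10:30" :fl-num 33}])

;; Rough equivalents of pandas' dropna + value_counts:
(->> flights
     (filter :dep-time)   ; drop rows with a missing :dep-time
     (map :carrier)
     frequencies)
;; => {"AA" 2}
```

Nothing here is a special dataframe type; it’s just sequence functions over plain data.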
I completely agree with you, but I mean what I said: dataframes are used for ETL, analytics, machine learning and even, by some, for regular backend development. If you take such a fundamental abstraction away from people doing these things, maybe 70%–80% of them will go back to what they already know.
In data science the “beginner curse” is even worse than in regular software development: in the next couple of years, nearly half of the data scientists/engineers in the workforce (or about to join it) will have two years or less of working experience.
This means that if we don’t get beginners on board we’re simply done. IMHO the real issue to solve is: how do we get beginners on board? And one of the possible answers is give well documented and simple dataframes to them.
About pandas, there’s a lot of tooling built around it, and doing something like the code below in clojure.core + vector of maps is not so trivial and would likely have to be split into more parts. Moreover, docs are somewhat lacking on this front, and let’s not talk about examples, tutorials and such.
(df.dropna(subset=['dep_time', 'unique_carrier'])
   .loc[df['unique_carrier']
        .isin(df['unique_carrier'].value_counts().index[:5])]
   .set_index('dep_time')
   .groupby(['unique_carrier', pd.TimeGrouper("H")])
   .fl_num.count()
   .unstack(0)
   .fillna(0)
   .rolling(24)
   .sum()
   .rename_axis("Flights per Day", axis=1)
   .plot())
I’m not familiar with pandas, so I can’t read your example fully, but I appreciate your point. My initial reaction would be to see if I can provide functions that achieve the equivalent but using a collection of maps as input. I would then try to see if/how it would be possible to make this beginner-friendly, maybe by introducing some abstraction on top. Maybe the new mechanism of protocol implementation via metadata (Clojure 1.10) would allow us to have our cake and eat it (have plain data but with some extra convenience for beginners/everyone).
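As a sketch of what I mean (assuming the flights data is a vector of maps; `dropna` and `top-n-by` are hypothetical helper names, not an existing library), the first few steps of the pandas pipeline might look like:

```clojure
(defn dropna
  "Keep only rows where every key in ks has a non-nil value."
  [ks rows]
  (filter (fn [row] (every? #(some? (row %)) ks)) rows))

(defn top-n-by
  "Return the set of the n most frequent values of key k."
  [k n rows]
  (->> rows (map k) frequencies (sort-by val >) (take n) (map key) set))

;; Roughly df.dropna(...).loc[df['unique_carrier'].isin(top 5)]:
(defn busiest-carrier-rows [flights]
  (let [rows (dropna [:dep-time :unique-carrier] flights)
        top5 (top-n-by :unique-carrier 5 rows)]
    (filter (comp top5 :unique-carrier) rows)))
```

The beginner-friendly layer would then be about naming and documenting helpers like these, rather than inventing a new data type.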
In my experience, beginners are not looking for data frames (at least not actively). They are looking for an easy way to solve problems, and in most cases they don’t even know exactly what problems they have. Simply put, they would like to learn to somehow process data and get interesting-looking results. They will opt for whatever you teach them, if there is a way to convince them to learn from you, and if you deliver that teaching.
R data frames look nice for the simplest tasks, but quickly get messy and inconsistent. Is the example you’ve posted easy, even for non-beginners? I don’t think so. However, the beginner will have to deal with it, because they will be looking for “learn data science” resources, and everything they find will be countless R or Python resources. In that case, Clojure can offer whatever awesome stack, and no one will look at it, regardless of whether it has data frames or not.
I want to stress the importance of learning resources, especially detailed high-quality ones that teach concepts rather than showing off the technology. It might be too late now, or not, but there needs to be a book called “Beginner’s Guide To Data Science” or something like that, and that book needs to teach all the common tasks the prospective reader will have, such as loading data, cleaning data, calling a selection of common algorithms, displaying results, and discussing results. It doesn’t matter whether the code looks like R or Python, but it absolutely matters that whatever we’re showing stays focused and delivers results. We don’t need to show 10 ways to skin the deer, but we absolutely have to show exactly one way to do the common task we are teaching.
The common example of what I’m talking about is Clojure web development. A huge, huge, disadvantage of Clojure in that area (that is, otherwise, well covered in Clojure ecosystem) is that the prospective Clojurist is required to make too many choices too early. There is no resource to learn the concepts using one way. You have to patch 100 brief resources up by yourself. At each step, you’re presented with 10 choices, and you have to pick one. That means that you already have to be an experienced web developer to be able to find your way here.
The same goes for data science/data analysis/machine learning etc. You have to already know the R way or the Python way; then there is a 0.001% chance you come to Clojure because you like it specifically, then you start searching for how to do task X, and of course you’ll look for tools that look exactly like the ones in R. By then, your Clojure experience will be the same as in R at best, or totally spartan in most cases. To attract even a small percentage of the R or Python community with that approach, you’d have to match their offering, bugs and bad solutions and everything…
On the topic of data frames specifically, is there a way to offer an easier and more powerful way to accomplish the same set of tasks, but with something more elegant, such as DataScript and similar approaches? (I don’t know, I’m asking)
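To make that question concrete, here is roughly what it could look like (a hypothetical sketch using DataScript’s `d/db-with` and `d/q`; the `:flight/*` attributes are made up):

```clojure
(require '[datascript.core :as d])

;; Load flight rows as entities into an in-memory DataScript DB.
(def db
  (d/db-with (d/empty-db)
             [{:flight/carrier "AA" :flight/dep-time "09:00"}
              {:flight/carrier "UA" :flight/dep-time "10:15"}]))

;; Datalog query: count flights per carrier.
(d/q '[:find ?carrier (count ?e)
       :where [?e :flight/carrier ?carrier]]
     db)
```

Whether Datalog queries are genuinely easier than dataframe operations for beginners is exactly the open question here.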
And, I think we also miss a huge number of people that are not looking to become data scientists but are programmers that are interested in working with data. There are millions of Java programmers, or .Net programmers, who would consider the JVM experience over RStudio or Python notebooks, only if:
- There is straightforward code that they can call (they don’t know R, and don’t know what a data frame is yet)
- There is a good detailed book (or a video course or whatever people use these days) that will teach them data analysis etc. without digressing into what a transducer is or whatever Clojure implementation details are.