Spark alternatives in Clojure

cesarmarinhorj · April 6, 2023, 1:24pm

After all this time learning (and enjoying) functional programming with Clojure, and after finding out that Spark exists for “data mining”, I wonder: are there Clojure alternatives? Since Spark is written in Scala, it seems something similar could exist in Clojure.

Is anything being used as a Spark alternative?

Thanks!

maxweber · April 6, 2023, 7:19pm

is one option, a Clojure library for Google Cloud Dataflow / Apache Beam.

Like Spark, Google Cloud Dataflow is rather complex to use. Therefore I often just use one big cloud instance to do all the data processing in plain Clojure. Recently, more and more Clojure libs for data science are being published, like this one:

Here is a talk with a tutorial: re:Clojure 2021 workshop: Wrangling datasets with Tablecloth by Mey Beisaron - YouTube
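
To make the "one big instance, plain Clojure" idea a bit more concrete, here is a minimal sketch using only the standard library. It streams a CSV file line by line and sums an amount per key, so the whole file never has to fit in memory. The namespace, the function name `sum-by-key`, and the `key,amount` file layout are made up for illustration.

```clojure
(ns example.plain-aggregate
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

(defn sum-by-key
  "Reads a CSV file of `key,amount` lines (hypothetical layout, no header)
  and returns a map of key -> total amount.
  The file is consumed lazily inside `with-open`, and `transduce` is eager,
  so the reader is still open while the lines are being reduced."
  [path]
  (with-open [rdr (io/reader path)]
    (transduce
      ;; skip empty lines, then split each line into [key amount]
      (comp (remove str/blank?)
            (map #(str/split % #",")))
      ;; accumulate a running total per key
      (completing
        (fn [acc [k amount]]
          (update acc k (fnil + 0) (Long/parseLong (str/trim amount)))))
      {}
      (line-seq rdr))))
```

Calling `(sum-by-key "sales.csv")` on such a file returns a plain map that can be processed further with ordinary Clojure functions. For real CSV data with quoting or headers, a proper CSV parser would be a better fit than `str/split`.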

daslu · April 6, 2023, 9:22pm

Also worth mentioning, in addition to @maxweber’s great recommendations:

Geni - dataframes in Clojure based on wrapping Apache Spark
Clojask - a dataframe library for larger-than-memory datasets, implemented in Clojure

zcaudate1 · April 7, 2023, 12:54am

There’s https://www.dask.org. The library can be used from Clojure via libpython-clj.

gregoltsov · April 8, 2023, 11:03am

I think one way to look at this is to ask yourself what you want a “Spark-like” tool for. From my experience, people tend to consider Spark when they have one of these needs:

1. Processing data that’s larger than the available RAM, aka medium-sized to “big data”.
2. Streaming processing.
3. Working with Data Lake files, like Parquet on S3.
4. Decent DataFrame-based processing.

So, depending on which particular use case you care about, several options already exist. 2, 3 and 4 have been addressed by several people above.

And, personally, for 1 and 3 I would urge you to reconsider needing Spark at all. It is a very heavy tool that in most cases is not needed when the data is medium-sized. Modern SQL tools like DuckDB allow working with serious data sizes (gigabytes+), and all you need on the Clojure side is an existing tool like HoneySQL or next.jdbc.
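
As a rough sketch of what that can look like: the snippet below opens an in-memory DuckDB database through its JDBC driver, exposes a Parquet file as a view, and runs an aggregate query built with HoneySQL. The file name `events.parquet`, its `category` and `amount` columns, and the namespace and function names are assumptions for illustration only. Dependency versions are left out on purpose.

```clojure
;; deps.edn needs (versions omitted; pick current releases):
;;   com.github.seancorfield/next.jdbc
;;   com.github.seancorfield/honeysql
;;   org.duckdb/duckdb_jdbc
(ns example.duckdb-query
  (:require [honey.sql :as sql]
            [next.jdbc :as jdbc]))

;; "jdbc:duckdb:" with no file path opens an in-memory DuckDB database.
(def ds (jdbc/get-datasource {:jdbcUrl "jdbc:duckdb:"}))

;; HoneySQL query as plain data: total amount per category, largest first.
(def totals-query
  {:select   [:category [[:sum :amount] :total]]
   :from     [:events]
   :group-by [:category]
   :order-by [[:total :desc]]})

(defn category-totals
  "Queries the hypothetical file events.parquet via DuckDB.
  A single connection is used because the view lives in that
  connection's in-memory database."
  []
  (with-open [conn (jdbc/get-connection ds)]
    ;; DuckDB reads the Parquet file directly; no import step is needed.
    (jdbc/execute! conn ["CREATE VIEW events AS SELECT * FROM read_parquet('events.parquet')"])
    (jdbc/execute! conn (sql/format totals-query))))
```

`(category-totals)` returns a vector of maps, one per category. Pointing `read_parquet` at a glob of files or at S3 is possible too, but needs extra DuckDB setup that is not shown here.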

cesarmarinhorj · April 8, 2023, 1:43pm

Thanks! DuckDB seems amazing.
I agree that Spark may be an oversized tool for the job. Thanks!

daslu · April 9, 2023, 6:18am

BTW, DuckDB has an integration layer with tech.ml.dataset called tmducken, so it should probably also work nicely with Tablecloth, mentioned above.
