Spark alternatives in Clojure

After all this time learning (and enjoying) functional programming with Clojure, and having learned that Spark exists for “data mining”, I wonder: are there Clojure alternatives? Since Spark is written in Scala, it seems the same thing could be built in Clojure.

Is anything being used as a Spark alternative?

Thanks!


is one option, a Clojure library for Google Cloud Dataflow / Apache Beam.

Like Spark, Google Cloud Dataflow is rather complex to use. Therefore I often just use one big cloud instance to do all the data processing in plain Clojure. Recently more and more Clojure libs for data science are published, like this one:

Here is a talk with a tutorial: re:Clojure 2021 workshop: Wrangling datasets with Tablecloth by Mey Beisaron - YouTube
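
To give a rough idea of the "one big instance, plain Clojure" approach: the sketch below streams a file line by line with a transducer, so only the current line and the running totals are held in memory. The function name `sum-by-key` and the two-column `key,amount` layout are assumptions made for illustration, not something from a particular library.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn sum-by-key
  "Sums the second column per distinct value of the first column.
  Assumes a header row followed by lines shaped like `key,amount`
  (hypothetical layout, no quoting or escaping handled)."
  [path]
  (with-open [rdr (io/reader path)]
    ;; transduce is eager, so the whole file is consumed before the reader closes
    (transduce
      (comp (drop 1)                                   ; skip the header row
            (map #(str/split % #","))                  ; naive CSV split
            (map (fn [[k amount]] [k (Double/parseDouble amount)])))
      (completing (fn [acc [k v]] (update acc k (fnil + 0.0) v)))
      {}
      (line-seq rdr))))
```

Calling `(sum-by-key "sales.csv")` returns a map from key to total. The same shape works for files much larger than RAM, as long as the number of distinct keys fits in memory.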


Also worth mentioning, in addition to @maxweber’s great recommendations:

  • Geni - dataframes in Clojure based on wrapping Apache Spark

  • Clojask - a dataframe library for larger-than-memory datasets, implemented in Clojure


There’s also https://www.dask.org. It is a Python library, but it can be used from Clojure through libpython-clj.


I think one way to look at this is to ask yourself what you want a “Spark-like” tool for. In my experience, people tend to consider Spark when they have one of these needs:

  1. Processing data that’s larger than available RAM - aka medium-sized to “big data”.
  2. Stream processing.
  3. Working with data lake files - like Parquet on S3.
  4. Decent DataFrame-based processing.

So, depending on which particular use case you care about, there are already several options available. 2, 3 and 4 have been addressed by several people above.

And, personally, for 1 and 3 I would urge you to reconsider needing Spark at all. It is a very heavy tool that in most cases is not needed when the data is medium-sized. Modern SQL tools like DuckDB allow working with serious data sizes (gigabytes+) and all you need is an existing tool like HoneySQL or next-jdbc.
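
A minimal sketch of what that can look like, assuming next.jdbc, HoneySQL 2 and the DuckDB JDBC driver (`org.duckdb/duckdb_jdbc`) are on the classpath. The function name `total-by-region`, the view name `sales`, and the `region`/`amount` columns are made up for illustration. A single connection is held open because an in-memory DuckDB database lives only as long as its connection.

```clojure
(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs]
         '[honey.sql :as sql])

(defn total-by-region
  "Aggregates Parquet files matching `parquet-glob` with in-memory DuckDB.
  `region` and `amount` are hypothetical column names."
  [parquet-glob]
  (with-open [conn (jdbc/get-connection
                     (jdbc/get-datasource {:jdbcUrl "jdbc:duckdb:"}))]
    ;; Expose the files as a view. The path is spliced into the SQL text,
    ;; so only pass trusted paths here.
    (jdbc/execute! conn [(str "CREATE VIEW sales AS SELECT * FROM read_parquet('"
                              parquet-glob "')")])
    ;; Build the query as data with HoneySQL, then run it through next.jdbc.
    (jdbc/execute! conn
                   (sql/format {:select   [:region [[:sum :amount] :total]]
                                :from     [:sales]
                                :group-by [:region]
                                :order-by [:region]})
                   {:builder-fn rs/as-unqualified-lower-maps})))
```

Something like `(total-by-region "data/sales/*.parquet")` would then return a vector of maps with `:region` and `:total` keys. DuckDB does the scanning and aggregation itself, so the JVM only ever sees the (small) result.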


Thanks! DuckDB seems amazing.
I agree that Spark may be an oversized tool for the job. Thanks!

BTW, DuckDB has an integration layer with tech.ml.dataset called tmducken, so it should probably also work nicely with the Tablecloth library mentioned above.

