Faster csv reading/processing, how I got there with core.async

I guess what’s interesting is that Java doesn’t have such a ready-made library. But probably because of the nature of the language, people don’t expect to use Java for quick ETL jobs, data exploration, and the like, so they don’t go looking for one in the first place.

It does. Just not part of the core JDK. There are plenty of libraries/frameworks that handle each task you might ever need.

I’d expect that plenty of people do this with plenty of success. I have done it many times successfully in my Java time and in my CLJ time; it was effortless.

Although I agree with the overall argument that libraries like tech.ml.dataset are extremely useful and should be used, there is also a case to be made for the custom threadpool/F1 solution. The only thing “fragile” about that solution is the CSV parsing. str/split was quick and dirty, but writing a simple pure function to extract what you need is trivial, and it would likely end up faster too.
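To illustrate, here’s a minimal sketch of the kind of pure extraction function I mean, scanning with indexOf instead of splitting. The class and method names are mine, and it assumes simple CSV with no quoted fields containing embedded commas (the same assumption the str/split version made):

```java
public class CsvField {
    // Returns the nth (0-based) comma-delimited field, or null if the
    // line has too few fields. No regex, no String[] allocation per line.
    static String nthField(String line, int n) {
        int start = 0;
        for (int i = 0; i < n; i++) {
            int comma = line.indexOf(',', start);
            if (comma < 0) return null; // fewer than n+1 fields
            start = comma + 1;
        }
        int end = line.indexOf(',', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        // e.g. pull the third field out of a taxi-trip-style row
        System.out.println(nthField("1.5,2.0,0.35", 2)); // prints 0.35
    }
}
```

The only allocation per call is the single substring for the field you actually want, which is why I’d expect it to beat a regex-based split on hot loops.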

Otherwise, that code gives you very fine-grained control over pretty much every aspect (e.g. core count, memory use, timeouts), almost all of which is abstracted away and out of your control when you use a library.
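As a sketch of what “control over every aspect” looks like in practice (names and numbers here are illustrative, not from the original solution): core count, queue depth as a memory ceiling, backpressure policy, and shutdown timeout are all explicit knobs.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ManualPool {
    // Process n dummy "lines" on an explicitly configured pool,
    // returning how many completed.
    static int processAll(int n) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = new ThreadPoolExecutor(
                cores, cores,                                // explicit core count
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),             // bounded queue = memory ceiling
                new ThreadPoolExecutor.CallerRunsPolicy());  // backpressure when full
        for (int i = 0; i < n; i++) {
            pool.submit(done::incrementAndGet);              // stand-in for per-line parse work
        }
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {  // explicit timeout
            pool.shutdownNow();
        }
        return done.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processAll(100)); // expect 100
    }
}
```

Every one of those parameters is a deliberate choice here, whereas a library typically picks them for you.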

I think going with a library first and then maybe tuning later is a good path to take. Most of the time the library approach is good enough and you can just move on to solving your actual issues. If you really need the performance, it is good to have the option to go deeper. I also kinda like not adding dependencies to my projects. That is not always a good thing though, and I have definitely wasted a lot of time on DIY solutions that were complete overkill. :wink:


This is absolutely a cheat, but since you know you want a specific field out of a line, consider java.util.StringTokenizer. You can write a simple imperative loop to find the nth field by delimiter: no regular expressions, no array allocation.
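A minimal sketch of that loop (class and method names are mine):

```java
import java.util.StringTokenizer;

public class TokenizerField {
    // Walk tokens imperatively to the nth (0-based) field.
    // No regex compilation, no String[] allocation per line.
    static String nthField(String line, int n) {
        StringTokenizer st = new StringTokenizer(line, ",");
        for (int i = 0; i < n && st.hasMoreTokens(); i++) {
            st.nextToken(); // skip fields before the one we want
        }
        return st.hasMoreTokens() ? st.nextToken() : null;
    }

    public static void main(String[] args) {
        System.out.println(nthField("2.5,1,0.5,credit", 2)); // prints 0.5
    }
}
```

One caveat worth knowing: StringTokenizer collapses adjacent delimiters, so empty CSV fields (`a,,c`) shift the indices. For data where fields are always populated that’s fine; otherwise the indexOf-scanning approach is safer.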

Care to share?

Google will be a better help. I’ve been out of that world for almost two decades. I read about Apache Spark and Camel a while ago but never used them for anything. I’m sure there are plenty more.

Ok, well for this particular usage I haven’t found anything for Java. The parser used by tech.ml.dataset is actually the closest I’ve found.

I was wondering why, and I think it’s because of the use cases Java gets for these things. It is much more common to use Java for an operational data pipeline than an exploratory one. So there aren’t as many easy, expressive libraries for fast, single-machine CSV processing.

I wouldn’t be surprised if most people do just what you described instead, and so a library never emerged.

But I haven’t tried every single thing out there, just didn’t seem like there was one major one at the least.

Java actually has tablesaw, which informed a lot of the early tech.ml.dataset work (specifically benchmarking and ideas for efficiency). Early tech.ml.dataset actually just wrapped tablesaw columns, until that became undesirable for performance and other design reasons, at which point the column implementation was moved into tech.datatype, with protocols in tech.ml.dataset.

Interesting; I’d never delved into that realm of the standard libs; it seems very useful. I think that’s probably what the univocity parser (which t.m.d. uses) is doing under the hood for sparse field selection.

Interesting, that does look like the closest. Could a tablesaw version be added then, to see how it compares?

It’s probably comparable for this case: same CSV parser, and similar techniques for storing data (t.m.d. has some novelties and focuses on an immutable copy-on-write approach when possible, but both compress strings and use RoaringBitmaps to capture missing values). Both use fastutil for primitive collections under the hood. I think t.m.d. does some extra work with a promotable type that can widen as it parses, as opposed to the approach others take of sampling a bit and then inferring a type.

FYI, on my old machine Pandas took 43s while tech.ml.dataset (processing the CSV directly) took 18s. Without the :column-whitelist ["tip_amount"] optimization, the time jumps to 1.5 min.

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.