Accumulating with atoms?

#1

I am building a utility designed to act as a streaming JSON aggregator. Basically, it analyzes the data structure, builds general watch events for conditions, and quickly aggregates interesting totals: max, min, sum, count, and empty count.

Anyway, the general pattern I am using is an atom holding a map that is keyed on fields and stores a map of values for each field, plus a lot of functions that are each a doseq which makes several swap! calls with update-in or assoc-in on the atom.
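For readers following along, here is a minimal sketch of that pattern. The names `stats`, `record-value!` and `ingest!` are hypothetical, and the records are assumed to be flat maps of numeric values; the real utility surely differs.

```clojure
;; Hypothetical names throughout: stats, record-value!, ingest!.
;; One atom holds {field {:count n :sum s}}.
(def stats (atom {}))

(defn record-value!
  "Update the count and sum stored under `field` for one numeric value."
  [field v]
  (swap! stats update-in [field :count] (fnil inc 0))
  (swap! stats update-in [field :sum] (fnil + 0) v))

(defn ingest!
  "Walk every record and push each field's value into the atom."
  [records]
  (doseq [record records
          [field v] record]
    (record-value! field v)))

(ingest! [{:price 10 :qty 2} {:price 30 :qty 5}])
@stats
;; => {:price {:count 2, :sum 40}, :qty {:count 2, :sum 7}}
```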

My question: is this a practical or idiomatic structure? In short, is there a way to do it more lazily and with less reliance on atoms to store state? Performance-wise, a single thread can handle 5000 records a second, which is just fine. But as I get deeper into Clojure, and perhaps create examples for others on my team, am I really doing things the right Clojure way? Not that there is one right way; I am just trying to make sure I am thinking in the most sensible way as I approach problem domains, and not letting my background in more imperative languages taint my design choices.

Ken

#2

I think for watching whether certain thresholds have been reached, watches on an atom can make sense. So what you’re doing sounds logical.
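A minimal sketch of a threshold watch, assuming a hypothetical `totals` atom, an `alerts` atom to collect events, and an arbitrary threshold of 100:

```clojure
;; Hypothetical names: totals, alerts, and the :sum-threshold watch key.
(def totals (atom {:sum 0}))
(def alerts (atom []))

;; A watch function receives the key, the reference, the old state and the new state.
(add-watch totals :sum-threshold
           (fn [_key _ref old-state new-state]
             ;; Fire only on the update that crosses the threshold.
             (when (and (< (:sum old-state) 100)
                        (>= (:sum new-state) 100))
               (swap! alerts conj (:sum new-state)))))

(doseq [v [40 50 30 20]]
  (swap! totals update :sum + v))

@alerts
;; => [120]
```

Comparing old and new state keeps the event from firing again on every later update.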

It would probably also be easy for you to just aggregate things recursively, by flowing the data and its aggregates through more and more functions; mostly a data pipeline. But for watching conditions, you’d need to pipe it through a check in between every step.
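One way to read "a check in between every step" is to look at every intermediate aggregate. A rough sketch using `reductions`, with hypothetical names `step`, `running` and `first-over`, and the same arbitrary threshold of 100:

```clojure
;; Hypothetical names: step, running, first-over.
(defn step
  "Fold one numeric value into the running aggregate map."
  [agg v]
  (-> agg
      (update :count inc)
      (update :sum + v)))

;; reductions returns every intermediate aggregate as a lazy sequence.
(def running (reductions step {:count 0 :sum 0} [40 50 30 20]))

;; The check sits between steps: find the first state whose sum reaches 100.
(def first-over (first (filter #(>= (:sum %) 100) running)))

first-over
;; => {:count 3, :sum 120}
```

No atom is involved here; the state is just the value passed from one step to the next.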

Performance-wise, they’d probably even out. If you had a recursive pipeline, though, it would be easier for you to parallelize it, for example using the fold function from clojure.core.reducers. That said, parallelizing doesn’t always yield better performance; there’s overhead, so you need to experiment with it.
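A minimal sketch of what that could look like with `r/fold`. The names `merge-stats`, `add-value` and `summarize` are hypothetical. Note that `fold` only runs in parallel on foldable collections such as vectors; on other sequences it falls back to a plain reduce.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical names: merge-stats, add-value, summarize.
(defn merge-stats
  "Combine two partial aggregates; with no arguments, return the identity."
  ([] {:count 0 :sum 0})
  ([a b] {:count (+ (:count a) (:count b))
          :sum   (+ (:sum a) (:sum b))}))

(defn add-value
  "Fold one numeric value into a partial aggregate."
  [agg v]
  (-> agg
      (update :count inc)
      (update :sum + v)))

(defn summarize
  "Aggregate a vector of numbers, in parallel chunks where possible."
  [values]
  (r/fold merge-stats add-value values))

(summarize (vec (range 1 1001)))
;; => {:count 1000, :sum 500500}
```

The combining function has to be associative, which holds for count, sum, min and max.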

#3

I think in the end my concern is this: now that I have it all working, when I sit back and look at my code, I am starting to feel like I just wrote imperative Clojure that maintains state. And although it is correct in the sense that it does what it should, I am already thinking about optimizing.

I think what I can do better is pass a map of all the aggregates and then map it through a function that does all the checks (is-max, is-empty, etc.), and then pass the result to a reduce for sums and counts. Is it better to map over one function that does all the work (dispatching to several other functions), or to map from the “main loop” to each function as a separate map call?
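To make the first option concrete, here is a rough sketch of a single reduce whose step function does all the per-field work. The names `update-field` and `aggregate` are hypothetical, and values are assumed to be numbers or nil (nil counting as empty).

```clojure
;; Hypothetical names: update-field, aggregate.
;; Values are assumed to be numbers or nil; nil counts as "empty".
(defn update-field
  "Fold one value into the aggregate map for a single field."
  [agg v]
  (if (nil? v)
    (update agg :empty (fnil inc 0))
    (-> agg
        (update :count (fnil inc 0))
        (update :sum (fnil + 0) v)
        (update :min (fnil min v) v)
        (update :max (fnil max v) v))))

(defn aggregate
  "Reduce record maps into {field {:count .. :sum .. :min .. :max .. :empty ..}}."
  [records]
  (reduce (fn [acc record]
            (reduce-kv (fn [acc field v]
                         (update acc field update-field v))
                       acc
                       record))
          {}
          records))

(aggregate [{:price 10 :qty 2} {:price 30 :qty nil} {:price 20 :qty 5}])
;; => {:price {:count 3, :sum 60, :min 10, :max 30},
;;     :qty   {:count 2, :sum 7, :min 2, :max 5, :empty 1}}
```

This makes one pass over the data and keeps no mutable state; whether it beats several separate map calls is something to measure rather than assume.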

I might not even change the code, because working code is more important than fully resolving all the academic questions that the code and problem domain can raise. However, if there is one thing I have learned in my IT career, it is that all code (good or bad) is a learning opportunity.

Ken

#4

If you have a long-running server application that needs to store some totals in response to incoming events, using something like an atom is perfectly reasonable. OTOH, if you’re writing a data processing tool that will, say, run over a pile of JSON and emit some summary statistics, you probably want to do that with reduce.
