Help understanding eductions

Hello!

I’m reading through Programming Clojure and at the end of chapter four it says that using an eduction to process input from a file will help avoid an out of memory error that would occur if the file’s data were read into a lazy sequence that was immediately realized. I believe I understand the explanation that the eduction avoids the memory error because it doesn’t cache anything as it processes the input; however, I’m having trouble wrapping my brain around how this is any different from fully realizing the lazy sequence. My thinking is that, if the data from the file is too big to fit into available memory as a realized sequence, then it’s not going to fit when it comes “out the other end” of processing the eduction either.

I’m obviously missing some piece of this puzzle. Could anyone enlighten me?

With the “lazy sequence that was immediately realized”, you are storing everything you read because the entire sequence is realized and then processed. So if you read a file with a million lines, you first have to have memory to store all million lines.

With “eduction” you only store the final result, which will likely need much less memory.

For instance, in the book's example, the only thing stored after processing with an eduction is the line count. A realized sequence has to store all the intermediate values as well, which requires much more memory.

Example:

(ns mjmeintjes.memory-example)

(defn get-realized []
  ;; requires storing all the even numbers in memory, even though we only need them for the count
  (->> (range 1e7)
       (filter even?)
       vec))

(defn get-eduction []
  ;; combines with the count to form one operation
  (->> (range 1e7)
       (eduction (filter even?))))

(comment
  ;; throws OutOfMemoryError with the small heap set by the :dev alias in deps.edn
  (->> (get-realized)
       (reduce (fn [acc i]
                 (inc acc))
               0))

  ;; returns 5000000
  (->> (get-eduction)
       (reduce (fn [acc i]
                 (inc acc))
               0)))

deps.edn

{:paths ["src"]
 :deps {org.clojure/clojure {:mvn/version "1.10.3"}}
 :aliases {:dev {:jvm-opts ["-Xmx256m"]}}}

Thank you for clarifying. I think I understand it now. So an eduction, in the context of reading in the contents of a file, would still keep an open handle to the file until the eduction was no longer needed?

Because of this line just before eductions are introduced:

The problem here is that the reader is not being closed. This resource is being left open and stranded in this code.

it seemed to me that the book was saying that using an eduction was a way to read all the lines of a file into memory and free the reference to the file without causing memory issues. In actuality the eduction is only solving the memory issue and doesn’t have much of an impact on when the file reader can be closed. Do I have that right?

When you realize a lazy sequence, you don’t have to retain its head. Even though the sequence caches realized elements, anything you no longer hold a reference to gets garbage collected as you consume it. You can use dorun or doseq for that, for example. You can also loop/recur over it with first and rest or next, or use reduce on it. None of these retain prior elements, so they won’t run out of memory even when used on a lazy-seq.

This will retain the head and return the full sequence causing memory issues if you run with a small heap size:

(def a (doall (range 1e7)))
java.lang.OutOfMemoryError: Java heap space
clojure.lang.Compiler$CompilerException: Syntax error macroexpanding at (NO_SOURCE_FILE:1:8).

But these will all be fine, since they don’t retain the head:

(def a (dorun (range 1e7)))
#'user/a

(reduce + (range 1e7))
49999995000000

(loop [xs (range 1e7) sum 0]
  ;; `more` rather than `ns`, to avoid shadowing clojure.core/ns
  (if-let [more (next xs)]
    (recur more (+ sum (first xs)))
    (+ sum (first xs))))
49999995000000

This is actually one of the great things about lazy sequences compared to eager ones: they let you do computations like that, where the whole sequence wouldn’t fit in memory.

I wouldn’t say eduction solves the memory issue, since lazy-seqs don’t suffer from one to be solved in the first place.

As for when the reader can be closed, that just depends on the code you have. Neither eduction nor lazy-seq does any work when it is created; the computation is delayed until it is needed, so both will fail if you close the reader passed to line-seq before running the computation.

(import '(java.io BufferedReader StringReader))

;; parse-long requires Clojure 1.11 or later
(def a (with-open [rdr (BufferedReader. (StringReader. "1\n2\n3\n4"))]
         (eduction (map parse-long) (map inc) (line-seq rdr))))

a

java.io.IOException: Stream closed
clojure.lang.ExceptionInfo:

Same as for lazy-seq:

(def a (with-open [rdr (BufferedReader. (StringReader. "1\n2\n3\n4"))]
         (->> (line-seq rdr) (map parse-long) (map inc))))

a

java.io.IOException: Stream closed
clojure.lang.ExceptionInfo:
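
For contrast, here is a minimal sketch (my own illustration, not from the book) where the reduction runs inside with-open, so the reader is still open while the lines are consumed and only the final sum escapes. The function name sum-incremented-lines is hypothetical, and Long/parseLong stands in for parse-long so that it also runs on Clojure versions before 1.11.

```clojure
(import '(java.io BufferedReader StringReader))

(defn sum-incremented-lines
  "Hypothetical helper: parses each line of `text` as a long,
  increments it, and returns the sum."
  [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    ;; reduce forces the eduction while rdr is still open;
    ;; nothing lazy leaves the with-open body.
    (reduce + 0
            (eduction (map (fn [^String s] (Long/parseLong s)))
                      (map inc)
                      (line-seq rdr)))))

(sum-incremented-lines "1\n2\n3\n4")
;; => 14
```

The same shape works with a lazy sequence too, as long as it is fully consumed before with-open closes the reader.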

So I have no idea what the book is trying to say without seeing the full page, which might make it clearer what they are alluding to.

Edit:

Oh it might be alluding to this:

(def a (eduction (map identity) (range 1e7)))
(reduce + a)
49999995000000

(def a (map identity (range 1e7)))
(reduce + a)

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"

Here again, since a retains the head of the sequence, it can run out of memory when we later reduce over it. I guess eduction does solve this kind of scenario, because reducing it won’t leave a realized sequence cached behind a. With the lazy sequence, you would have to drop the var’s reference to the head before reducing:

(def a (map identity (range 1e7)))

(let [a' a]
  (def a nil)
  (reduce + a'))
49999995000000

So, with eduction you’re less worried about accidentally retaining the head I guess.
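
A small sketch of why that is, under my own assumptions: calls and counting-inc are hypothetical names used only to count how often the mapping function runs. An eduction re-runs the transformation on every reduce and keeps no realized elements, whereas a lazy sequence computes each element once and caches it for as long as the head is reachable.

```clojure
;; Illustrative only: `calls` and `counting-inc` are hypothetical names.
(def calls (atom 0))

(defn counting-inc
  "Increments x and records that it was called."
  [x]
  (swap! calls inc)
  (inc x))

;; Eduction: the transformation is re-applied on every reduce.
(def ed (eduction (map counting-inc) (range 5)))
(def ed-sums [(reduce + 0 ed) (reduce + 0 ed)])
(def ed-calls @calls)   ; 10, five calls per reduce

;; Lazy seq: elements are computed once, then served from the cache.
(reset! calls 0)
(def lz (map counting-inc (range 5)))
(def lz-sums [(reduce + 0 lz) (reduce + 0 lz)])
(def lz-calls @calls)   ; 5, the second reduce reuses cached elements
```

The flip side is that an eduction repeats the work each time it is reduced, so it trades recomputation for not holding on to results.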

And for the curious who wonder: doesn’t a' point to the head and thus retain the whole sequence in memory while it is being reduced? The answer is an optimization Clojure performs called locals clearing. Since a' isn’t used in the let after the reduce, its reference is cleared at the point of the reduce call (before reduce is run), which is why the head isn’t retained in that let and why the GC can collect elements of the lazy sequence as they are being reduced over.

A fun fact is that ClojureScript doesn’t have this optimization, so doing the same in ClojureScript will consume a lot of memory, and eduction would be even more helpful there.

Thank you for taking the time to write such a thorough reply. I definitely still have a few gaps in my knowledge of lazy sequences, but now know where to start filling them in.