So I went down a bit of a rabbit hole today regarding your question.
If you wanted #1, I think you could more cleanly implement it in terms of lazy-seq, like so:
(defn fetch
  "A stub fetch that pretends to get data from a url, but instead returns a random char from the url."
  [url]
  (println "fetching " url)
  (rand-nth url))
(defn lazy-seq-fetch
  "Returns a lazy-seq that, each time the next element is requested, fetches it by calling
   page-fetching-fn. Stops once page-fetching-fn returns an element that has already been fetched."
  ([page-fetching-fn] (lazy-seq-fetch page-fetching-fn #{}))
  ([page-fetching-fn pages-fetched]
   (lazy-seq
    (let [page (page-fetching-fn)]
      (if (pages-fetched page)
        nil
        (cons page (lazy-seq-fetch page-fetching-fn (conj pages-fetched page))))))))
(require '[clojure.string :refer [upper-case]])

(def data (lazy-seq-fetch #(fetch "http://www.google.com")))
(mapv upper-case data)
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;;=> ["G" "W" "." "H" "O" "T"]
Adding rate limiting inside the lazy-seq is, in my opinion, not a good idea; you’re better off having the caller do that. Like, say:
(run! #(do (println %) (Thread/sleep 1000)) (lazy-seq-fetch #(fetch "http://www.google.com")))
;> fetching http://www.google.com
;> t
;> fetching http://www.google.com
;> .
;> fetching http://www.google.com
;> g
;> fetching http://www.google.com
;> /
;> fetching http://www.google.com
;;=> nil
Though I think you could do something fancy, where you keep track of the time of the last fetch, and on the next fetch, if it hasn’t been X milliseconds since then, Thread/sleep for the remaining amount. I’ll leave it up to you to add that to the above code if you care; something like the sketch below.
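A rough sketch of that idea (the names here are mine, not from the code above): wrap the page-fetching-fn so it sleeps off whatever is left of the interval before fetching again.

(defn rate-limited
  "Wraps page-fetching-fn so consecutive calls are at least interval-ms apart."
  [page-fetching-fn interval-ms]
  (let [last-fetch (atom 0)]
    (fn []
      (let [elapsed (- (System/currentTimeMillis) @last-fetch)]
        (when (< elapsed interval-ms)
          (Thread/sleep (- interval-ms elapsed)))
        (reset! last-fetch (System/currentTimeMillis))
        (page-fetching-fn)))))

;; (def data (lazy-seq-fetch (rate-limited #(fetch "http://www.google.com") 1000)))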
The rabbit hole I was talking about happened on the Clojure Slack today. At first, I thought of using iterate instead of lazy-seq. I’m just a bit more used to it, so I was going to reach for it first, but then I saw its doc says it shouldn’t be used for side effects. So I went down a rabbit hole as to why. Turns out iterate returns a special kind of seq, which is non-caching when reduced over, but caching when used as a seq otherwise. That means that if you reduce over it twice back to back, you make the calls to fetch the data all over again, and if your server is not idempotent, it could return different results each time.
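Here is a small demo of that behavior (my own, not from the Slack thread), counting how many times the step function runs:

(def calls (atom 0))
(def it (iterate (fn [x] (swap! calls inc) (inc x)) 0))

;; Reducing does not cache, so reducing twice re-runs the step fn (and any
;; side effects) for the same elements:
(transduce (take 4) + it)   ;=> 6
@calls                      ;=> 3
(transduce (take 4) + it)   ;=> 6
@calls                      ;=> 6

;; But consuming it as a seq caches, so a second pass costs nothing extra:
(doall (take 4 it))         ;=> (0 1 2 3)
@calls                      ;=> 9
(doall (take 4 it))         ;=> (0 1 2 3)
@calls                      ;=> 9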
Now, apparently in a future Clojure version, due to patch CLJ-2555, we might see a new function called iteration. It is specifically designed to handle paging APIs. Your question is a bit more challenging, because you want to fetch until you see duplicate data returned, whereas iteration is designed more for APIs that take a key and return a key to the next page, which you pass back on the next call to get the rest. Now, that function is intended for side effects, unlike iterate. Though, similar to iterate in some ways, iteration will call your API over and over if you keep accessing the same elements over and over. That said, if you call seq on it, it returns a stable, caching sequence, and you can then safely consume that seq over and over, even reduce over it or over its rest or next, and it will not call your API more than once. This is unlike iterate, where calling seq on it does not return a stable sequence like that; if you reduce over the rest of it, it will call your API again. Here’s an example of how you’d use it for your problem:
(defn lazy-iteration-fetch [fetching-fn]
  (->> (iteration
        (fn [pages-fetched]
          (let [page (fetching-fn)]
            (if (pages-fetched page)
              nil
              [page (conj pages-fetched page)])))
        :vf first
        :kf peek
        :initk #{})
       lazy-seq))
(def data (lazy-iteration-fetch #(fetch "http://www.google.com")))
(mapv upper-case data)
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;> fetching http://www.google.com
;;=> ["G" "." "O" "W" ":" "H"]
You don’t need to call lazy-seq on its return like I did, but if you don’t, then accessing data over and over will call your API over and over, and if your API is not idempotent, it could return different results. So it depends what behavior you want. The patch for it is here: https://clojure.atlassian.net/browse/CLJ-2555
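For reference, this is the kind of shape iteration is really designed for. A sketch with a made-up paged API (assuming you have the new function from CLJ-2555 available): it takes a page token and returns a map with the page’s items and a token for the next page, nil when there are no more pages.

(defn fetch-page
  "Hypothetical paged API: takes a page token, returns the items for that page
   plus the token to pass back for the next page (nil when done)."
  [token]
  (let [page (or token 0)]
    (println "fetching page " page)
    {:items      (range (* 10 page) (* 10 (inc page)))
     :next-token (when (< page 2) (inc page))}))

(->> (iteration fetch-page
                :vf :items
                :kf :next-token
                :initk nil)
     (mapcat identity))
;; => the numbers 0 through 29, fetching three pages along the way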
It might be preferable to use an atom to keep track of pages-fetched instead of abusing k as I did. That’s especially true if your API does actually take a key and returns some kind of next-page identifier for you to make the next call with:
(defn lazy-iteration-fetch [fetching-fn url]
  (->> (let [pages-fetched (atom #{})]
         (iteration
          (fn [k]
            (let [[page next-k] (fetching-fn k)]
              (when-not (@pages-fetched page)
                (swap! pages-fetched conj page)
                [page next-k])))
          :vf first
          :kf peek
          :initk url))
       lazy-seq))
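For example, with a stub of mine that follows the new contract, taking the current key (here just the url itself) and returning the page plus the key to use on the next call:

(defn fetch-with-key
  "Stub fetch with a key: takes the current url, returns [page next-url]."
  [url]
  (println "fetching " url)
  [(rand-nth url) url])

(def data (lazy-iteration-fetch fetch-with-key "http://www.google.com"))
(mapv upper-case data)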
You can also trick iterate into returning a stable sequence, by encapsulating it under a call to map, but it is a bit of a hack, and to be fair, there might be some edge cases I’m not thinking of:
(defn lazy-fetch-until-data-not-seen [fetching-fn]
  (->> (iterate
        (fn [[_ page-fetching-fn pages-fetched]]
          (let [page (page-fetching-fn)]
            (if (pages-fetched page)
              [:done nil nil]
              [page page-fetching-fn (conj pages-fetched page)])))
        [nil fetching-fn #{}])
       (map first)
       (drop 1)
       (take-while #(not= :done %))))
It also turns out to be the least readable implementation, so I now advise against it, both for being more hacky (the doc-string even warns not to do this) and for being more complicated. That said, you can see how I call map on the return of iterate; even though iterate is unstable in its sequence, map will be stable, and map will never ask iterate for the same elements more than once, meaning we won’t call the API over and over either.
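For example, reusing the fetch stub from the top, consuming the same seq twice should only hit the API during the first pass:

(def data (lazy-fetch-until-data-not-seen #(fetch "http://www.google.com")))
(mapv upper-case data)  ; prints "fetching ..." once per element realized
(mapv upper-case data)  ; no additional "fetching ..." output, same result back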
Having said all that, in your case you might want to explore #2. If so, it would appear to be a good case either for a callback-style function, where in a loop you fetch the data from the URL until it’s done, calling the passed-in callback with each result to be handled, or for core.async, where you return a channel with the fetched data as it comes in. Rough sketches of both below.
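These are only sketches under my own names, not a definitive design. Callback style first:

(defn fetch-with-callback
  "Fetches pages in a loop, handing each new page to callback, stopping on the first duplicate."
  [page-fetching-fn callback]
  (loop [pages-fetched #{}]
    (let [page (page-fetching-fn)]
      (when-not (pages-fetched page)
        (callback page)
        (recur (conj pages-fetched page))))))

And with core.async, returning a channel the caller consumes as pages come in. (For real blocking HTTP you’d want async/thread or a non-blocking client rather than doing the I/O inside a go block.)

(require '[clojure.core.async :as async])

(defn fetch-chan
  "Returns a channel onto which each newly fetched page is put; closes it on the first duplicate."
  [page-fetching-fn]
  (let [out (async/chan)]
    (async/go-loop [pages-fetched #{}]
      (let [page (page-fetching-fn)]
        (if (pages-fetched page)
          (async/close! out)
          (do (async/>! out page)
              (recur (conj pages-fetched page))))))
    out))

;; e.g. (async/<!! (async/into [] (fetch-chan #(fetch "http://www.google.com"))))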