Reducers: sorting while folding

Dear all,

I have the following scenario:

I have a function FuzzySearch/weightedRatio
that, given two strings, computes a similarity score.

I have my search input and about 3 million entries in all-options.
The output should be [option score] pairs sorted by score.
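For context, this is the kind of call involved; I am assuming the me.xdrop fuzzywuzzy JVM port here:

;; Assuming the me.xdrop fuzzywuzzy port; weightedRatio returns an int score 0-100.
(import 'me.xdrop.fuzzywuzzy.FuzzySearch)

(FuzzySearch/weightedRatio "tomato soup" "tomato")
;; => an integer score between 0 and 100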

I implemented the parallel ranking as follows:

(require '[clojure.core.reducers :as r])

(defn pfuzzy-search-3 [all-options input]
  (->>
   all-options
   ;; pair each option with its score; the Java static method is wrapped
   ;; in a fn since static methods aren't first-class values (pre Clojure 1.12)
   (r/map (juxt identity #(FuzzySearch/weightedRatio input %)))
   (r/foldcat)))

My question here is: sure, I can (sort-by second …) afterwards,
but is there an efficient way to sort while folding?

Maybe my question is conceptually wrong and there would be no additional gain in “sorting while folding”; maybe I would actually get worse performance?

My naive implementation atm is:

(defn search-combinef
  ;; identity value for each chunk
  ([] [])
  ;; merge two reduced chunks by concatenating and re-sorting
  ([x y] (sort-by second (concat x y))))


(defn pfuzzy-search-6 [all-options input]
  (->>
   all-options
   (r/map (juxt identity #(FuzzySearch/weightedRatio input %)))
   (r/fold search-combinef conj)))

But I get the feeling that my combinef is defeating the purpose of not creating intermediate collections, right?

Any help/opinion welcome

Thanks

In this case, I don’t think you’re likely to get any performance benefit, as FuzzySearch/weightedRatio probably dominates the run time. You can profile to confirm.
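For example, with criterium (assuming it is on the classpath, and with all-options bound as in your post), something like:

(require '[criterium.core :as crit])

;; Time the scoring call on its own...
(crit/quick-bench (FuzzySearch/weightedRatio "some input" "some option"))

;; ...and the full parallel fold; if the first dominates, sorting cost is noise.
(crit/quick-bench (pfuzzy-search-3 all-options "some input"))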

Kind of a tangent, but if you really want maximum performance when comparing a single string to a large set of strings, you might want to reconsider FuzzySearch/weightedRatio. Apart from it not being suitable for optimizations over large input sets, I also can’t find any docs on what its weighted-ratio algorithm aims to achieve and how it achieves that.
E.g. if you’re fine with plain Levenshtein distance, it should be possible to construct a trie from all the strings in the set and then use that trie to quickly measure the distance to the target word. Perhaps there are libraries for it; I haven’t checked.
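For what it’s worth, here is a rough, hand-rolled sketch of that idea (not from any library; trie-insert, next-row, and trie-search are made-up names). It collects every word in the trie within max-dist plain Levenshtein distance of the target, pruning a whole branch once every entry in its DP row exceeds the limit:

;; Insert one word into a nested-map trie: the path alternates :children/char,
;; and the terminal node stores the full word under :word.
(defn trie-insert [trie word]
  (assoc-in trie (concat (interleave (repeat :children) word) [:word]) word))

;; Compute the next row of the Levenshtein DP table for edge character ch.
(defn- next-row [prev-row ch word]
  (reduce (fn [row i]
            (let [ins (inc (peek row))
                  del (inc (nth prev-row i))
                  sub (+ (nth prev-row (dec i))
                         (if (= ch (nth word (dec i))) 0 1))]
              (conj row (min ins del sub))))
          [(inc (first prev-row))]
          (range 1 (inc (count word)))))

;; Walk the trie, returning [word distance] pairs within max-dist of target
;; and pruning branches whose entire DP row already exceeds max-dist.
(defn trie-search [trie target max-dist]
  (letfn [(walk [node row]
            (concat
             (when (and (:word node) (<= (peek row) max-dist))
               [[(:word node) (peek row)]])
             (when (<= (reduce min row) max-dist)
               (mapcat (fn [[ch child]] (walk child (next-row row ch target)))
                       (:children node)))))]
    (walk trie (vec (range (inc (count target)))))))

(def trie (reduce trie-insert {} ["tomat" "tomater" "potatis"]))
(trie-search trie "tmoat" 2)
;; => (["tomat" 2])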

Thanks a lot for the fast input :slight_smile:

Indeed, you are probably right. A quick search turned up GitHub - carocad/clemence (fast and incremental Levenshtein and LCS computation); I will have a look.

So it would probably not make a difference in my current application, but the technical curiosity stands: has anybody ever implemented a reducing function that cleverly sorts while reducing, or is this nonsensical?

I’ve been using transducers since before I learned about the existence of reducers, so this is how I’d write it:

(defn word-ratio-comparator [[w1 r1] [w2 r2]]
  ;; order by ratio first, then by word so distinct pairs stay distinct
  (let [result (compare r1 r2)]
    (if (zero? result)
      (compare w1 w2)
      result)))

;; sorted-set-by expects a two-argument comparator, not a key function
(defn pfuzzy-search [all-options input]
  (into (sorted-set-by word-ratio-comparator)
        (map (fn [opt]
               [opt (FuzzySearch/weightedRatio input opt)]))
        all-options))

Although I would actually swap the order of all-options and input to stick to Clojure’s convention for such things, making it consistent with how -> and ->> work.
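For example, the same body with the arguments flipped so the collection comes last and threads through ->> (pfuzzy-search' is just an illustrative name):

(defn pfuzzy-search' [input all-options]
  (into (sorted-set-by word-ratio-comparator)
        (map (fn [opt] [opt (FuzzySearch/weightedRatio input opt)]))
        all-options))

(->> all-options
     (pfuzzy-search' "tomat"))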

As for reducers, seems like this is the same thing:

(defn pfuzzy-search [all-options input]
  (->> all-options
       (r/map (fn [opt]
                [opt (FuzzySearch/weightedRatio input opt)]))
       (r/fold (fn sorted-set-cat
                 ;; identity: an empty sorted set per chunk
                 ([] (sorted-set-by word-ratio-comparator))
                 ;; combine: pour one sorted set into the other
                 ([s1 s2] (into s1 s2)))
               conj)))

And both variants could be made a bit simpler if you don’t care whether the result is a collection of [word ratio] or [ratio word] pairs, since the latter would allow using the default comparator.
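For instance, a sketch of that simpler shape (pfuzzy-search* is a made-up name): equal-length vectors compare element by element, so with [ratio word] pairs the default comparator already orders by ratio first.

(defn pfuzzy-search* [input all-options]
  (into (sorted-set)
        (map (fn [opt] [(FuzzySearch/weightedRatio input opt) opt]))
        all-options))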


This is an interesting question.

A follow-up: What are you going to do with the sorted result?

If you actually need all three million entries, that’s one thing, but if you only need the best match, or perhaps the n-best matches (for some n less than 3 million), that could open up some other options.
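For example, if only the n best matches matter, the reduction can keep a bounded sorted set as it goes; a sketch along those lines, reusing the [ratio word] pair idea from above (take-top and pfuzzy-search-top are hypothetical names):

;; Keep at most n entries: conj the new pair, drop the current minimum if over.
(defn take-top [n sset x]
  (let [s (conj sset x)]
    (if (> (count s) n)
      (disj s (first s))   ; first of an ascending sorted set is the smallest
      s)))

(defn pfuzzy-search-top [n input all-options]
  (reduce (fn [acc opt]
            (take-top n acc [(FuzzySearch/weightedRatio input opt) opt]))
          (sorted-set)
          all-options))

The same trimming could happen in a fold’s combinef by pouring one set into the other and dropping the smallest entries again.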

You could also look at datalevin for doing the searching. I started looking at it last week, and the documentation did leave some holes to fill in. These are my notes:

To do fuzzy searching with datalevin, use the :search-opts and :search-engine keys on the database options map. I am not entirely certain what each option does, and if they are both needed, but it looks like :search-engine alone is not enough.

(require '[datalevin.core :as d])
(require '[datalevin.search-utils :as su])

(let [analyzer (su/create-analyzer
                {:token-filters [(su/create-ngram-token-filter 2)]})
      db (-> (d/empty-db "/tmp/mydb9"
                         {:text {:db/valueType :db.type/string
                                 :db/fulltext true}}
                         {:search-opts {:analyzer analyzer}
                          :search-engine {:analyzer analyzer}})
             (d/db-with
              [{:db/id 1
                :text "1 kg tomater"}
               {:db/id 2
                :text "1 tomat"}
               {:db/id 3
                :text "tomat"}]))]
  (d/fulltext-datoms db "2 tmoat"))

The above example yields all three entries, i.e. it handles spelling mistakes the way you want (“tmoat” instead of “tomat”).

Now, I don’t have much experience with datalevin yet, but it does seem like a lot of thought and work has gone into making it fast; see e.g. the search documentation.

My plan is to use datalevin as an SQLite replacement and use the search functionality in-database. It looks like it might be possible to use it as a standalone search engine as well.

Great, I will mark this as the solution, because I think the original question finds its answer in the transient sorted-set-by combo.

Many thanks!!

I will benchmark and post an update.

This is also very interesting for solving the underlying problem.

I will investigate:
a) whether it’s possible to obtain a score with datalevin
b) how long it takes to index the 3 million documents with the n-gram analyzer

I made a mistake above: sorted sets don’t support transients, so I edited the comment to remove (transient ...).
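For reference, this is easy to confirm at a REPL:

(transient (sorted-set 1 2 3))
;; => ClassCastException: PersistentTreeSet cannot be cast to IEditableCollection

(transient #{1 2 3})
;; => works; hash sets implement IEditableCollection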