My question here is: sure, I can (sort-by second …) afterwards,
but is there an efficient way to sort while folding?
Maybe my question is conceptually wrong and there would be no additional gain in “sorting while folding”; maybe I would actually get worse performance?
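For concreteness, the kind of code I have in mind looks roughly like this — a sketch, not my exact code: fuzzy-sort and all-options are stand-in names, and I’m assuming FuzzySearch here is me.xdrop.fuzzywuzzy.FuzzySearch (the JVM fuzzywuzzy port):

```clojure
(import '(me.xdrop.fuzzywuzzy FuzzySearch))

(defn fuzzy-sort
  "Score every option against input while reducing, then sort afterwards."
  [all-options input]
  (sort-by second >
           (reduce (fn [acc option]
                     (conj acc [option (FuzzySearch/weightedRatio input option)]))
                   []
                   all-options)))

;; (fuzzy-sort ["tomat" "tomato" "potato"] "tmoat")
;; => ([word ratio] pairs, best matches first)
```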
In this case, I don’t think you’re likely to get any performance benefit, as FuzzySearch/weightedRatio probably dominates the run time. You can profile to confirm.
Kind of a tangent, but if you really want maximum performance when comparing a single string to a large set of strings, you might want to reconsider FuzzySearch/weightedRatio. Apart from it not lending itself to optimizations over large input sets, I also can’t find any docs on what its weighted-ratio algorithm is meant to achieve and how it achieves that.
E.g. if you’re fine with Levenshtein distance, it should be possible to construct a trie from all the strings in the set and then use that trie to quickly measure the distance to the target word. Perhaps there are libraries for it, I haven’t checked.
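If it helps, here is a minimal sketch of that idea — build-trie and trie-search are hypothetical names, not from any library, and the algorithm is the well-known “walk the trie while updating a Levenshtein DP row, pruning branches that can no longer match” approach:

```clojure
(defn add-word
  "Insert word into a nested-map trie, marking the terminal node with :word."
  [trie word]
  (assoc-in trie (concat (seq word) [:word]) word))

(defn build-trie [words]
  (reduce add-word {} words))

(defn- next-row
  "Given the previous Levenshtein row, extend it with trie edge ch against target."
  [prev-row ch target]
  (reduce
    (fn [row [i t-ch]]
      (conj row (min (inc (peek row))               ; insertion
                     (inc (nth prev-row (inc i)))   ; deletion
                     (+ (nth prev-row i)            ; substitution
                        (if (= ch t-ch) 0 1)))))
    [(inc (first prev-row))]
    (map-indexed vector target)))

(defn trie-search
  "Return [word distance] pairs for every word in trie within max-dist of target."
  [trie target max-dist]
  (letfn [(walk [node row]
            (concat
              (when-let [w (:word node)]
                (when (<= (peek row) max-dist)
                  [[w (peek row)]]))
              (mapcat (fn [[k child]]
                        (when (char? k)
                          (let [row' (next-row row k target)]
                            ;; prune: this branch can never get below (min row')
                            (when (<= (apply min row') max-dist)
                              (walk child row')))))
                      node)))]
    (walk trie (vec (range (inc (count target)))))))

;; (trie-search (build-trie ["tomat" "tomato" "potato"]) "tmoat" 2)
;; => (["tomat" 2])
```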
So it probably would not make a difference in my current application, but the technical curiosity stands: has anybody ever implemented a reducing function that cleverly sorts while reducing, or is this nonsensical?
Although I would actually swap the order of all-options and input to stick to Clojure’s convention for such things, making it more consistent with how -> and ->> work.
As for reducers, it seems like this amounts to the same thing:
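(A sketch of what I mean — same stand-in names and FuzzySearch assumption as above, and note the swapped argument order from the previous point.)

```clojure
(require '[clojure.core.reducers :as r])
(import '(me.xdrop.fuzzywuzzy FuzzySearch))

(defn fuzzy-sort-fold
  "Score the options with a (potentially parallel) fold, then sort afterwards.
  r/fold only parallelises when all-options is a vector or a map."
  [input all-options]
  (->> all-options
       (r/map (fn [option] [option (FuzzySearch/weightedRatio input option)]))
       (r/foldcat)
       (into [])
       (sort-by second >)))
```

Either way, the sorting itself still happens after the fold rather than during it.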
And both variants could be made a bit simpler if you don’t care whether the result is a collection of [word ratio] or [ratio word] pairs, since the latter would allow you to use the default comparator.
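To illustrate, with [ratio word] pairs you can even accumulate straight into a sorted set while reducing — a sketch under the same assumptions as above. Note that a sorted set deduplicates identical pairs, and each insertion is still O(log n), so there’s no asymptotic win over sorting afterwards:

```clojure
(import '(me.xdrop.fuzzywuzzy FuzzySearch))

(defn fuzzy-sort-while-reducing
  "Accumulate [ratio word] pairs into a sorted set, so the result is
  already ordered (ascending by ratio, then by word) when the reduce ends."
  [input all-options]
  (reduce (fn [acc option]
            (conj acc [(FuzzySearch/weightedRatio input option) option]))
          (sorted-set)
          all-options))
```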
A follow-up: what are you going to do with the sorted result?
If you actually need all three million entries, that’s one thing, but if you only need the best match, or perhaps the n-best matches (for some n less than 3 million), that could open up some other options.
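For example, if only the n best matches are needed, the sorted-set idea above can be capped so that at most n pairs are ever kept around — again just a sketch under the same assumptions:

```clojure
(import '(me.xdrop.fuzzywuzzy FuzzySearch))

(defn top-n-matches
  "Keep only the n highest-scoring [ratio word] pairs while reducing."
  [n input all-options]
  (reduce (fn [acc option]
            (let [acc' (conj acc [(FuzzySearch/weightedRatio input option) option])]
              (if (> (count acc') n)
                (disj acc' (first acc'))   ; drop the current worst match
                acc')))
          (sorted-set)
          all-options))

;; (top-n-matches 1 "tmoat" ["tomat" "tomato" "potato"])
;; => the single best [ratio word] pair
```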
You could also look at datalevin for doing the searching. I started looking at it last week, and the documentation did leave some holes to fill in. These are my notes:
To do fuzzy searching with datalevin, use the :search-opts and :search-engine keys on the database options map. I am not entirely certain what each option does, or whether both are needed, but it looks like :search-engine alone is not enough.
The above example will yield all three entries, i.e. it handles spelling mistakes as you want (“tmoat” instead of “tomat”).
Now, I don’t have much experience with datalevin yet, but it does seem like a lot of thought and work has gone into making it fast; see e.g. the search documentation.
My plan is to use datalevin as a SQLite replacement and use the search functionality in-database. It looks like it might also be possible to use it as just a search engine.