Hi all.
My company ingests thousands of documents a day from a number of chaotic sources. Some of this involves web scraping. We also have a team in the Philippines who do manual web scraping, mostly by copying and pasting news articles.
We get a lot of duplicates, often because the team in the Philippines cannot easily coordinate with our automated scripts.
I need to build a system that can catch duplicate documents.
The approach I am familiar with is pushing documents into a vector database and doing cosine similarity searches.
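To be concrete, this is roughly what I have in mind. It is just a sketch: the model name, the sample texts, and the 0.9 threshold are placeholders, not things we have settled on.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

existing_docs = [
    "Central bank raises interest rates by a quarter point.",
    "Local elections see record turnout across the region.",
]
new_doc = "The central bank has raised interest rates by 0.25 percentage points."

# Normalised embeddings so a plain dot product is cosine similarity.
existing_emb = model.encode(existing_docs, normalize_embeddings=True)
new_emb = model.encode([new_doc], normalize_embeddings=True)[0]

scores = existing_emb @ new_emb
best = int(np.argmax(scores))
if scores[best] > 0.9:  # threshold would need tuning on our own data
    print(f"Likely duplicate of doc {best} (cosine {scores[best]:.2f})")
```

In production this lookup would go through a vector database rather than a numpy array, but the idea is the same: embed each incoming document and flag it if its nearest neighbour is above some similarity threshold.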
However, I’ve been told that using partial hashes can be more efficient.
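I assume "partial hashes" means something like MinHash over word shingles, but I may be misunderstanding the term. This is my rough mental model of it in pure Python (shingle size and signature length are arbitrary):

```python
import hashlib

NUM_HASHES = 64  # length of the MinHash signature

def shingles(text, k=5):
    """Overlapping k-word shingles of the document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "The central bank raised interest rates by a quarter point on Tuesday."
doc_b = "On Tuesday the central bank raised interest rates by a quarter point."
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

My understanding is that near-duplicates score close to 1.0, and that at scale you would bucket these signatures with LSH instead of comparing every pair. Please correct me if that is not what people mean by partial hashing.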
Can anyone point me to articles or videos that are especially good on this topic?