Partial hashes for document de-duplication?

Hi all.

My company ingests thousands of documents a day from fairly chaotic sources. Some of this comes from automated web scraping, and we also have a team in the Philippines doing manual scraping, mostly by copying and pasting news articles.

We get a lot of duplicates, often because the team in the Philippines cannot easily coordinate with our automated scripts.

I need to build a system that can catch duplicate documents.

The approach I am familiar with is embedding the documents, pushing the vectors into a vector database, and doing cosine similarity searches against new arrivals.
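
To make the question concrete, here is roughly what I have in mind, just a minimal sketch with random placeholder vectors standing in for real embeddings (the embedding model and the vector DB lookup are left out):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two document embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in practice these would come from an embedding
# model, and candidate pairs would come from a vector DB nearest-neighbour query.
doc_a = np.random.rand(384)
doc_b = np.random.rand(384)

# Flag as a duplicate if similarity clears some threshold (0.95 is a guess).
if cosine_similarity(doc_a, doc_b) > 0.95:
    print("likely duplicate")
```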

However, I’ve been told that using partial hashes can be more efficient.
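
I am guessing "partial hashes" means something in the locality-sensitive hashing family, e.g. SimHash, where near-duplicate documents end up with fingerprints that differ in only a few bits. My rough mental model is the sketch below (not production code, and the tokenisation is deliberately naive):

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Build a SimHash fingerprint from word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash("Central bank raises rates by 25 basis points on Tuesday")
b = simhash("Central bank raises rates by 25 basis points on Tuesday morning")
print(hamming_distance(a, b))  # small distance suggests near-duplicates
```

Is that roughly the right idea, or is "partial hashes" referring to something else entirely?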

Can anyone point me to articles or videos that are especially good on this topic?