Duplicate Detection Explained
What is a duplicate? Good question: there is no single answer. For some applications a duplicate means one item's content is the same (or close to the same) as another item's. For others it means the ID is the same, regardless of the content. Others still use multiple fields as a composite key, or some combination of the above. Whatever you need, we have you covered. Below are some quick examples of the different types of duplicate detection Sajari provides, along with sample configuration settings.
Using a pure duplicate key
For many applications this will be the standard setting. If a single field matches, for example the "id", then the item is immediately classed as a duplicate. This is the standard mode of operation for auto-crawled websites on Sajari, where the key is the "url" field.
Sample config setting
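As a rough sketch only (this is not Sajari's exact config syntax, and the duplicate-key setting name is a placeholder; only index.dupethreshold is taken from this article), the settings for this mode amount to the following:

    # Illustrative settings only. "duplicate_key" is a placeholder name,
    # not a documented Sajari setting.
    pure_key_config = {
        "duplicate_key": "id",        # for auto-crawled sites this would be "url"
        "index.dupethreshold": 0.0,   # 0.0: contents are ignored, a key match alone is a duplicate
    }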
Note: The index.dupethreshold value is set to zero, which means the contents of the item are irrelevant; if the key matches, the item is classed as a duplicate. This differs from the other methods of duplicate detection described below.
Using a composite duplicate key
A composite key is simply made up of multiple meta fields, which are appended together and hashed. Effectively this means an item must share all of the key fields to be classed as a duplicate: a single matching field is not enough, all of them must match.
Sample config setting
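Again as an illustrative sketch rather than exact syntax (the composite-key setting name is a placeholder):

    # Illustrative settings only. "duplicate_keys" is a placeholder name,
    # not a documented Sajari setting.
    job_dedupe_config = {
        "duplicate_keys": ["title", "lat", "lng"],  # all three fields must match to flag a duplicate
        "index.dupethreshold": 0.0,                 # contents are still ignored at this point
    }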
Above is a sample config typically used for finding duplicate job postings in a job search engine. If the "title", "lat" and "lng" (latitude, longitude) fields are all the same, this immediately indicates a duplicate.
The main thing to watch out for here is empty values, as these all form the same empty key and hash to the same value! This is a common mistake: if you're seeing "409 Duplicate" responses constantly and you're not sure why, this is a good place to look!
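A minimal sketch of the append-and-hash idea described above (the exact delimiter and hash function used internally may differ) shows why empty values bite:

    import hashlib

    def composite_key(item, fields):
        # Append the configured meta fields together and hash the result.
        joined = "|".join(str(item.get(field, "")) for field in fields)
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()

    # Two completely unrelated postings whose key fields are all empty
    # still produce identical composite keys, so every such item collides
    # and comes back as "409 Duplicate".
    a = {"title": "", "lat": "", "lng": "", "body": "Senior engineer wanted"}
    b = {"title": "", "lat": "", "lng": "", "body": "Barista, weekend shifts"}
    assert composite_key(a, ["title", "lat", "lng"]) == composite_key(b, ["title", "lat", "lng"])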
This will work well for many applications, but this particular sample use case has a problem! What if two separate companies at the same lat, lng have two very different job postings that share the same title? These will be incorrectly considered duplicates. To solve that we use the index.dupethreshold setting to measure the similarity of the two documents, which is covered in the following section.
Combining keys with document similarity
As the above example shows, sometimes duplicates are not so obvious and keys alone may not be enough. In these cases we use the index.dupethreshold setting to indicate how similar two items must be before they're considered duplicates. The default for this setting is 0.0, which essentially means it is ignored. If the value is greater than zero but less than one, it is treated as a percentage similarity threshold for deciding whether two items are duplicates.
Similarity % is calculated with a "bag of words" style cosine similarity that uses stemmed unigrams, bigrams and trigrams.
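As a rough illustration of that style of measure (not Sajari's internal implementation; stemming is skipped and a plain lowercased whitespace split stands in for real tokenisation):

    import math
    from collections import Counter

    def ngrams(tokens, n):
        # All contiguous n-grams of the token list, joined into single strings.
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def similarity(text_a, text_b):
        # Bag-of-words cosine similarity over unigrams, bigrams and trigrams.
        vectors = []
        for text in (text_a, text_b):
            tokens = text.lower().split()
            vectors.append(Counter(g for n in (1, 2, 3) for g in ngrams(tokens, n)))
        a, b = vectors
        dot = sum(count * b[gram] for gram, count in a.items())
        norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
        return dot / norm if norm else 0.0

Two postings scoring above the configured index.dupethreshold (and sharing the composite key) would then be collapsed as duplicates.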
Sample config setting
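Sketching the combined settings in the same illustrative form as before (the composite-key setting name is again a placeholder):

    # Illustrative settings only. "duplicate_keys" is a placeholder name.
    job_dedupe_config = {
        "duplicate_keys": ["title", "lat", "lng"],  # the key fields must match first...
        "index.dupethreshold": 0.8,                 # ...and the contents must be at least 80% similar
    }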
In the above example, if two items share the "lat", "lng" and "title" fields AND are also at least 80% similar in content, then they are considered duplicates of each other. This is incredibly useful for applications such as job search, where job advertisements are posted to multiple sources, and are frequently cut, pasted and edited along the way.