Google patent summary
Detecting duplicate and near duplicate documents [US 6658423 B1]
This Google patent describes methods of identifying duplicate and near duplicate
documents by generating document fingerprints and matching these fingerprints
to the fingerprints of other documents. This action can be performed
either during or after a web crawl.
The document fingerprints are generated by extracting parts (words,
snippets, sentences, paragraphs etc.) of a document, creating lists
of extracted parts and creating a fingerprint from each list of parts.
Each document yields a predetermined number of fingerprints, e.g. 4.
Documents found to be duplicates, because all of their fingerprints
match, are eliminated from the Google searchable data set and recorded
as duplicates so that future web crawls can skip them, thus reducing
the resources required for those crawls. When duplicates are identified,
Google assesses which document should remain as the active document by
selecting the most recent document with the highest Google PageRank.
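The fingerprinting step described above can be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not the
patented implementation: the parts are plain lowercase words, each word is
assigned to one of four lists by a hash, SHA-1 is used for the
fingerprints, and the names fingerprints and is_exact_duplicate are
hypothetical.

```python
import hashlib

NUM_LISTS = 4  # predetermined number of fingerprints per document (assumed value)


def fingerprints(text: str, num_lists: int = NUM_LISTS) -> list[str]:
    """Return one fingerprint per list of extracted parts."""
    # Extract parts: here simply lowercase whitespace-separated words (assumption).
    parts = text.lower().split()
    lists = [[] for _ in range(num_lists)]
    for part in parts:
        # Assign each part to a list using a stable hash of the part (assumption).
        digest = hashlib.sha1(part.encode("utf-8")).digest()
        lists[digest[0] % num_lists].append(part)
    # Fingerprint each list by hashing its contents in document order.
    return [
        hashlib.sha1(" ".join(lst).encode("utf-8")).hexdigest()
        for lst in lists
    ]


def is_exact_duplicate(a: str, b: str) -> bool:
    """Treat two documents as duplicates only when every fingerprint matches."""
    return fingerprints(a) == fingerprints(b)
```

Because each word always lands in the same list, a small change to one
document alters only the fingerprints of the lists that contain the changed
words, which is what lets partially matching documents be recognised
as near duplicates rather than as unrelated.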
The detection of near duplicate documents is handled in a similar way to
the detection of exact duplicates. However, near duplicate documents,
which share at least one fingerprint but not all, are not discarded from
the Google search results set. Instead, they are assigned an
identification element which organises near duplicate documents into
clusters. When Google responds to a user query, it first identifies
the search results by the normal means of identification, then proceeds
to eliminate documents which belong to the same cluster, as these would
only repeat near duplicate content. The factors used to determine which
document from a cluster is presented to the user are the same factors
used for eliminating pure duplicate content.
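The query-time filtering described above might look something like the
sketch below. Everything here is assumed for illustration: the Result
class, its fields (cluster_id, page_rank, crawled) and the function
drop_near_duplicates are hypothetical names, and ordering by PageRank
first and recency second is one reading of the selection factors.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Result:
    url: str
    cluster_id: Optional[int]  # None when the document has no near duplicates
    page_rank: float
    crawled: date


def drop_near_duplicates(results):
    """Keep one result per cluster, preserving the original ranking order."""
    best = {}
    for r in results:
        if r.cluster_id is None:
            continue
        current = best.get(r.cluster_id)
        # Prefer higher PageRank, then the more recent document (assumed order).
        if current is None or (r.page_rank, r.crawled) > (current.page_rank, current.crawled):
            best[r.cluster_id] = r
    # Unclustered results always survive; clustered ones only if they won.
    return [r for r in results if r.cluster_id is None or best[r.cluster_id] is r]
```

For example, given two results in cluster 1 with PageRank 0.5 and 0.7, only
the 0.7 result is returned, in the position it already held in the ranking.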
Google's method of identifying near duplicate documents is transitive,
i.e. if document A is considered a near duplicate of document B and
document C is considered a near duplicate of document B, then documents
A and C are also considered near duplicates by transitive association
via document B.
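One common way to obtain transitive clusters of this kind is a union-find
(disjoint set) structure. The sketch below is an assumption about how such
clustering could be done, not a description of what the patent specifies;
cluster_near_duplicates is a hypothetical name and the fingerprints are
treated as opaque strings.

```python
def cluster_near_duplicates(doc_fingerprints):
    """Map each document id to a cluster id, merging transitively.

    doc_fingerprints: dict of document id -> list of fingerprint strings.
    """
    parent = {doc: doc for doc in doc_fingerprints}

    def find(doc):
        # Follow parent links to the cluster representative (path halving).
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    first_seen = {}  # fingerprint -> first document that produced it
    for doc, fps in doc_fingerprints.items():
        for fp in fps:
            if fp in first_seen:
                # Shared fingerprint: join the two documents' clusters.
                parent[find(doc)] = find(first_seen[fp])
            else:
                first_seen[fp] = doc
    return {doc: find(doc) for doc in doc_fingerprints}
```

With A sharing a fingerprint with B, and B sharing a different one with C,
all three end up with the same cluster id even though A and C have no
fingerprint in common.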
Conclusion
Repeating content, intentionally or accidentally, is something that should
be avoided. Although there is no way of identifying the exact level of
duplication to be avoided, a rule of thumb is not to repeat snippets of
five words more than three times from one document to another, and not
to repeat a run of more than eight words at all. I.e. if your document
has five words which match five words of another document, ensure that
there are no more than three other five word snippets which are also in
your document. To make this more complicated from an SEO point of view,
but more accurate from a search engine perspective, Google removes stop
words from the equation, and you may also need to account for orthographic
variation and inflections [Google System for improving search quality].
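To check a pair of documents against this rule of thumb, something like
the following sketch could be used. It is only an illustration: the stop
word list is a tiny made-up sample, the function names are hypothetical,
overlapping five word snippets are counted separately (which makes the
check stricter), and orthographic variation and inflections are not
handled.

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}  # illustrative only


def content_words(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def shingles(words, n):
    """Return the set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_shared_run(a, b):
    """Length of the longest run of consecutive words common to both lists."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous word in both lists.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def duplication_report(mine, other, snippet_len=5):
    """Return (number of shared snippets, longest shared run in words)."""
    a, b = content_words(mine), content_words(other)
    shared = shingles(a, snippet_len) & shingles(b, snippet_len)
    return len(shared), longest_shared_run(a, b)
```

A result such as (3, 7) would sit at the edge of the rule of thumb: three
shared five word snippets and a longest shared run of seven words.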