Search engine optimisation

 
 


Google patent summary

Detecting duplicate and near-duplicate documents [US 6658423 B1]

This Google patent describes methods of identifying duplicate and near-duplicate documents by generating fingerprints for each document and matching them against the fingerprints of other documents. The matching can be performed either during or after a web crawl.

The document fingerprints are generated by extracting parts of a document (words, snippets, sentences, paragraphs, etc.), creating lists of the extracted parts, and creating a fingerprint from each list. Each document yields a predetermined number of fingerprints, e.g. four. Documents found to be duplicates because all of their fingerprints match are eliminated from the Google searchable data set and recorded as duplicates, so that future web crawls can skip them, reducing the resources those crawls require. When duplicates are identified, Google assesses which document should remain active by selecting the document with the highest Google PageRank that is also the most recent.
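The fingerprinting step described above can be sketched roughly as follows. This is only an illustration, not the patent's actual procedure: the names `stable_hash`, `fingerprints` and `classify` are hypothetical, single words are used as the extracted parts, and assigning each part to a list by hashing it is an assumption made for the example.

```python
import hashlib

NUM_LISTS = 4  # the "predetermined number" of fingerprints; four is the example value


def stable_hash(text):
    """Stable 64-bit hash of a string (Python's built-in hash() varies per process)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprints(document, num_lists=NUM_LISTS):
    """Return one fingerprint per list of extracted parts (None for an empty list)."""
    lists = [[] for _ in range(num_lists)]
    for part in document.lower().split():  # assumption: parts are single words
        # Assumption: a part is assigned to a list by hashing the part itself.
        lists[stable_hash(part) % num_lists].append(part)
    return tuple(stable_hash(" ".join(lst)) if lst else None for lst in lists)


def classify(doc_a, doc_b):
    """Compare two documents by their fingerprints."""
    fp_a, fp_b = fingerprints(doc_a), fingerprints(doc_b)
    if fp_a == fp_b:
        return "duplicate"  # all fingerprints match
    if any(a is not None and a == b for a, b in zip(fp_a, fp_b)):
        return "near duplicate"  # at least one, but not all, fingerprints match
    return "distinct"
```

Because each word only affects the list it is assigned to, a small edit to a long document changes only some of its fingerprints, which is why such a pair would be reported as near duplicates rather than as distinct documents.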

The detection of near-duplicate documents is handled in a similar way to the detection of exact duplicates. However, near-duplicate documents, which share at least one fingerprint but not all of them, are not discarded from the Google search results set. Instead, they are assigned an identification element which organises them into clusters. When Google responds to a user query, it first identifies the search results by the normal means, then eliminates documents which belong to the same cluster as another result, since these would only repeat near-duplicate content. The factors used to determine which document is presented to the user are the same as those used for eliminating pure duplicate content.
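As a rough sketch of the query-time step described above, the following keeps one result per near-duplicate cluster. The `Result` fields and the `collapse_clusters` function are hypothetical names chosen for this example, and comparing PageRank first and recency second is an assumption about how the two factors are combined.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Result:
    url: str
    cluster_id: Optional[int]  # None means the page is in no near-duplicate cluster
    pagerank: float
    last_modified: date


def collapse_clusters(results):
    """Keep one result per cluster, preserving the original ranking order."""
    best = {}
    for r in results:
        if r.cluster_id is None:
            continue
        current = best.get(r.cluster_id)
        # Assumption: prefer higher PageRank, then the more recent document.
        if current is None or (r.pagerank, r.last_modified) > (
            current.pagerank,
            current.last_modified,
        ):
            best[r.cluster_id] = r
    return [r for r in results if r.cluster_id is None or best[r.cluster_id] is r]
```

Results that belong to no cluster pass through untouched, so only the repeated near-duplicate content is removed from the list shown to the user.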

Google's method of identifying near-duplicate documents is transitive, i.e. if document A is considered a near duplicate of document B, and document C is considered a near duplicate of document B, then documents A and C are also considered near duplicates by transitive association via document B.
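One common way to model this kind of transitive grouping is a union-find (disjoint-set) structure. The sketch below is an illustration only: the summary does not say how Google implements clustering, and `cluster_ids` is a hypothetical name.

```python
def cluster_ids(near_duplicate_pairs):
    """Map each document to a cluster representative, merging pairs transitively."""
    parent = {}

    def find(doc):
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path compression
            doc = parent[doc]
        return doc

    for a, b in near_duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    return {doc: find(doc) for doc in list(parent)}


# A and C are never compared directly, but both are near duplicates of B.
ids = cluster_ids([("A", "B"), ("C", "B"), ("D", "E")])
same_cluster = ids["A"] == ids["C"]
```

Here `same_cluster` is true because A and C are linked through B, while D and E form a separate cluster.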

Conclusion

Repeating content, intentionally or accidentally, is something that should be avoided. Although there is no way of identifying the exact level of duplication to avoid, a rule of thumb is not to repeat a snippet of five words more than three times from one document to another, and not to repeat a run of more than eight words at all. For example, if your document has five words which match five words of another document, ensure that there are no more than three other five-word snippets in your document which also match. To make this more complicated from an SEO point of view, but more accurate from a search engine perspective, Google removes stop words from the equation, and you may also need to account for orthographic variation and inflections [Google System for improving search quality].
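The rule of thumb can be checked with a small script such as the sketch below. It is one possible reading of the rule, not a description of what Google does: the stop-word list is an illustrative assumption, the function names are hypothetical, overlapping snippets are counted separately, and orthographic variation and inflections are not handled.

```python
import re

# Illustrative stop-word list only; the list Google uses is not given here.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for"})


def content_words(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def snippets(words, n):
    """All runs of n consecutive words, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_snippets(doc_a, doc_b, n=5):
    """Runs of n words (after stop-word removal) that appear in both documents."""
    return snippets(content_words(doc_a), n) & snippets(content_words(doc_b), n)


def within_rule_of_thumb(doc_a, doc_b):
    """At most three shared five-word snippets and no shared run longer than eight words."""
    return len(shared_snippets(doc_a, doc_b, 5)) <= 3 and not shared_snippets(doc_a, doc_b, 9)
```

Counting overlapping snippets separately makes the check stricter than counting non-overlapping ones, so a pair that passes this version should also pass a more lenient reading.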

 

 

 

 
    All rights reserved iSSS.co.uk 2005
Last Modified 05/01/2006