Google patent summary
Techniques for finding related documents [US 6,754,873 B1]
The methods documented in this Google patent identify and rank related
documents using link-based analysis.
The Google patent describes how a system like Google can identify related
documents. The backlinks of a document are analyzed to find a set of
index documents. The forward links from those index pages are then
analyzed to find a related page set, and the link structure of the related
pages is analyzed to score and further refine that set. Pages are finally
identified as being related when they are connected by link structure
and scored similarly.
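
The candidate-gathering steps described above can be sketched in a few
lines of Python. This is only an illustration of the idea, not the
patent's implementation: the function name related_candidates and the
links dictionary (page mapped to the set of pages it links to) are
assumptions made for the example.

```python
from collections import defaultdict


def related_candidates(target, links):
    """Return {candidate_page: number_of_index_pages_linking_to_it}.

    `links` is a hypothetical in-memory link graph: page -> set of pages
    it links to. A real system would read this from a crawl index.
    """
    # Reverse the graph so we can look up the backlinks of any page.
    backlinks = defaultdict(set)
    for source, outgoing in links.items():
        for destination in outgoing:
            backlinks[destination].add(source)

    # Index set: every page that links to the target.
    index_set = backlinks.get(target, set())

    # Related page set: everything the index pages link to, except the
    # target itself, with a raw count of index pages pointing at it.
    counts = defaultdict(int)
    for index_page in index_set:
        for candidate in links.get(index_page, ()):
            if candidate != target:
                counts[candidate] += 1
    return dict(counts)
```

The raw counts are only a starting point; the patent summary goes on to
describe how each link is weighted before the scores are compared.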
The score of a page is derived from the number of inbound links it
receives from the index set. The ranking system employs a score reduction
scheme similar to Google PageRank: multiple inbound links from a single
host reduce the value of each inbound link from that host, and links from
index pages with a high volume of links are worth less per link.
Pages with a high volume of related web pages may be classed as popular
web pages. A page classed as popular will only be displayed by Google as
related to another page if the relationship is symmetrical. For example,
suppose your page has an inbound link from a page which also links to a
popular site like Yahoo, and the scores attained for both pages are
similar when the backlink scores are calculated. An association would
normally be made; however, Google disassociates popular pages unless the
relationship is symmetrical, which can only be achieved if the scores for
your site are also similar when the backlinks to Yahoo are calculated.
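
The two score reductions described above can be illustrated with a small
Python sketch. The summary does not give the exact formula, so the
weights used here are assumptions chosen for clarity: each index page's
vote is divided by the number of index pages on the same host, and again
by the number of links on that index page. The function name
score_candidates is also hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def score_candidates(target, index_pages, links):
    """Score pages linked from the index set, with two dampening factors.

    Assumed weighting (not taken from the patent text):
      host_weight = 1 / number of index pages on the same host
      link_weight = 1 / number of links on the index page
    """
    # Count how many index pages share each host.
    host_counts = defaultdict(int)
    for page in index_pages:
        host_counts[urlsplit(page).hostname] += 1

    scores = defaultdict(float)
    for page in index_pages:
        outgoing = links.get(page, set())
        if not outgoing:
            continue
        host_weight = 1.0 / host_counts[urlsplit(page).hostname]
        link_weight = 1.0 / len(outgoing)
        for candidate in outgoing:
            if candidate != target:
                scores[candidate] += host_weight * link_weight
    return dict(scores)
```

Under these assumed weights, two index pages on the same host that both
link to a candidate contribute no more than a single index page on a
separate host with the same number of links.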
When ranking the relationship between web pages, a preference is described
which favors symmetrical relationships, i.e. page A has a list of related
web pages which includes page B, and page B also has a related page list
in which page A is included.
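
The symmetry preference, together with the rule that popular pages are
only shown when the relationship is symmetrical, might look like the
following Python sketch. The names is_symmetric and visible_related, the
related dictionary (page mapped to its ordered related list) and the
popular set are all assumptions made for illustration.

```python
def is_symmetric(a, b, related):
    """True when b is in a's related list and a is in b's related list."""
    return b in related.get(a, ()) and a in related.get(b, ())


def visible_related(page, related, popular):
    """Order a page's related list and hide one-way popular pages.

    Symmetrical relationships are moved to the front; the sort is stable,
    so the original order is kept within each group. Pages in `popular`
    are dropped unless the relationship with `page` is symmetrical.
    """
    ordered = sorted(
        related.get(page, []),
        key=lambda other: not is_symmetric(page, other, related),
    )
    return [
        other
        for other in ordered
        if other not in popular or is_symmetric(page, other, related)
    ]
```

Treating symmetry as a simple sort key is a simplification; the summary
only says that symmetrical relationships are preferred, not how strongly.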
Conclusion
To view the results of this Google patent, input
'related:www.domainname.domain' into Google.
Anyone can see that this functionality is not highly valuable to a user
and has only limited use to a webmaster. Therefore, in order for it to
be worthy of development by Google, it must be included in the main
algorithm used by Google to rank search results. It seems logical that
this development is a refinement of the Google PageRank algorithm, which
implements a limited score for sites which are related.