Google patent summary
Methods and systems for information extraction [US 2005/0131764 A1]
This Google patent documents methods for extracting shopping information
from web-based articles.
The Google patent describes how the Google search engine identifies
web-based documents that present details of items or services for sale,
and how it extracts data elements including the price, image, SKU and
version of the item. The document also describes how this information
can be ranked and grouped for display to a client device in response
to a client query.
A shopping document is first identified during or after a web crawl by
the presence of at least one price representation and a shopping characteristic
in a link or form element. The price representation is identified by
the presence of a string with a currency identifier [£, $, €]
and a decimal point or comma followed by two digits. The shopping characteristic
string can be present in the URL, parameter or value of an HTML element
such as an <A>, <FORM>, <IMG> or <INPUT>. The
string of a shopping characteristic would typically be 'Add to Basket',
'Add to Cart', 'Buy now' or similar.
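As a rough illustration of the two signals described above, the following Python sketch flags a page when its text contains a price-like string and one of its link, form, image or input elements carries a shopping phrase in an attribute. It is not the patent's implementation: the regular expression, the phrase list, the way attribute values are normalised, and the names PRICE_RE, ShoppingSignalParser and is_shopping_document are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: a currency symbol, digits with optional thousands groups,
# then '.' or ',' followed by exactly two digits.
PRICE_RE = re.compile(r"[£$€]\s?\d+(?:[.,]\d{3})*[.,]\d{2}(?!\d)")

# Phrases from the summary, reduced to lowercase letters and digits so that
# "Add to Cart", "add-to-cart" and "addToCart" all compare equal.
SHOPPING_PHRASES = ("addtobasket", "addtocart", "buynow")
CANDIDATE_TAGS = {"a", "form", "img", "input"}


def normalise(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


class ShoppingSignalParser(HTMLParser):
    """Collects page text and notes shopping phrases in element attributes."""

    def __init__(self):
        super().__init__()
        self.has_shopping_characteristic = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in CANDIDATE_TAGS:
            return
        # Check every attribute value (href, action, src, value, name, ...).
        for _name, value in attrs:
            if value and any(p in normalise(value) for p in SHOPPING_PHRASES):
                self.has_shopping_characteristic = True

    def handle_data(self, data):
        self.text_parts.append(data)


def is_shopping_document(html):
    """True when both a price string and a shopping characteristic are found."""
    parser = ShoppingSignalParser()
    parser.feed(html)
    parser.close()
    has_price = bool(PRICE_RE.search(" ".join(parser.text_parts)))
    return has_price and parser.has_shopping_characteristic
```

For example, a page containing `<p>Widget £12.99</p>` and an `<input>` whose value is "Add to Basket" would be flagged, while a page with only the price, or only the button, would not.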
The Google patent further describes the methods used to isolate the associated
elements of the item for sale. The document describes how Google identifies
potential price attributes by their format and their proximity to keywords
such as 'sale', 'price', 'our price' or words to that effect, and how
Google dismisses inappropriate prices by the proximity of strings like
'was', 'save', 'from', 'starting at' or 'shipping'. Association of images
is determined by the proximity, size, frequency and aspect ratio of
potential images. Images that occur frequently, are unusually long or tall,
or are unusually small or large are dismissed as logos, icons or other
non-product images and are not used as the product image.
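The price and image heuristics above can be sketched in Python as follows. This is only an illustration under stated assumptions: the summary gives no numeric thresholds, so WINDOW, min_side, max_side, max_aspect and max_occurrences are hypothetical values, and classify_prices, ImageCandidate and could_be_product_image are names invented for the example. "Proximity" is approximated here as the text immediately before each price.

```python
import re
from dataclasses import dataclass

PRICE_RE = re.compile(r"[£$€]\s?\d+(?:[.,]\d{3})*[.,]\d{2}(?!\d)")
# Keywords taken from the summary, matched as whole words or phrases.
SUPPORTING = re.compile(r"\b(?:our price|sale|price)\b")
DISMISSING = re.compile(r"\b(?:was|save|from|starting at|shipping)\b")
WINDOW = 20  # hypothetical: characters of preceding text treated as "near"


def classify_prices(text):
    """Return (price, label) pairs; label is 'likely', 'dismissed' or 'unknown'."""
    results = []
    previous_end = 0
    for match in PRICE_RE.finditer(text):
        # Look only at text between the previous price and this one, so a
        # keyword belonging to an earlier price is not counted twice.
        start = max(previous_end, match.start() - WINDOW)
        context = text[start:match.start()].lower()
        if DISMISSING.search(context):
            label = "dismissed"
        elif SUPPORTING.search(context):
            label = "likely"
        else:
            label = "unknown"
        results.append((match.group(), label))
        previous_end = match.end()
    return results


@dataclass
class ImageCandidate:
    url: str
    width: int
    height: int
    occurrences: int  # times the same image was seen across crawled pages


def could_be_product_image(img, min_side=60, max_side=1200,
                           max_aspect=2.5, max_occurrences=3):
    """Reject images that are frequent, tiny, huge, or extremely elongated."""
    if img.width <= 0 or img.height <= 0:
        return False
    if img.occurrences > max_occurrences:
        return False  # repeated images are probably logos or icons
    short, long_ = sorted((img.width, img.height))
    if short < min_side or long_ > max_side:
        return False
    return long_ / short <= max_aspect
```

With these assumed settings, the text "Was £20.00 Our price £15.00 Shipping £3.50" yields a dismissed first price, a likely second price and a dismissed third price, and a 728x90 banner or a 16x16 icon is rejected while a 300x300 image seen once is kept.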
Conclusion
Although you may find some speculative documentation that claims a correlation
between commercial web sites and their placement in Google's natural SERPs,
we find no evidence of this and advise no special formatting of web pages
to avoid a Google shopping document flag. If your document has items for sale,
use shopping characteristics, price formats and image formats that clearly
identify your sale items. If your document is not a shopping document but uses
some characteristics that may be associated with this type of document,
be sure to identify those characteristics as non-sale attributes.