Search engines have a short list of critical operations that allow them to
provide relevant web results when searchers use their systems to find
information.
- Definition: Crawling the Web
Search engines run automated programs, called "bots" or
"spiders", that use the hyperlink structure of the web
to "crawl" the pages and documents that make up the
World Wide Web. Estimates are that of the approximately
20 billion existing pages, search engines have crawled
between 8 and 10 billion.
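The crawl process described above amounts to a breadth-first traversal of the web's link graph. Below is a minimal sketch of that idea, using a hypothetical in-memory "web" (the `PAGES` dictionary and its URLs are invented for illustration) instead of live HTTP requests:

```python
from html.parser import HTMLParser

# A tiny in-memory "web" (all URLs and page contents are hypothetical).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed):
    """Breadth-first traversal of the hyperlink structure from a seed URL."""
    seen, frontier = {seed}, [seed]
    while frontier:
        url = frontier.pop(0)
        extractor = LinkExtractor()
        extractor.feed(PAGES.get(url, ""))
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen

print(sorted(crawl("http://example.com/")))
```

A production spider adds politeness delays, robots.txt checks, and revisit scheduling on top of this same traversal.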
- Definition: Indexing
Once a page has been crawled, its contents can be
"indexed" - stored in a giant database of documents that
makes up a search engine's "index". This index needs to
be tightly managed so that requests which must search
and sort billions of documents can be completed in
fractions of a second.
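The classic data structure behind such an index is the inverted index, which maps each term to the set of documents containing it. A toy sketch (the documents below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical crawled documents: doc id -> text.
DOCS = {
    1: "car and driver magazine reviews",
    2: "driver safety tips",
    3: "magazine subscriptions for car lovers",
}

def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index(DOCS)
print(sorted(index["car"]))  # ids of documents containing "car"
```

Looking up a term is then a dictionary access rather than a scan over billions of documents, which is what makes sub-second retrieval feasible.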
- Definition: Processing Queries
When a request for information comes into the search
engine (hundreds of millions do each day), the engine
retrieves from its index all the documents that match the
query. A match is determined if the terms or phrase are
found on the page in the manner specified by the user.
For example, a search for car and driver magazine at
Google returns 8.25 million results, but a search for
the same phrase in quotes ("car and driver magazine")
returns only 166 thousand results. In the first system,
commonly called "Findall" mode, Google returned all
documents that contained the terms "car", "driver", and
"magazine" (the engine ignores the term "and" because
it does little to narrow the results), while in the
second search, only those pages with the exact phrase
"car and driver magazine" were returned. Other advanced
operators (Google has a
list of 11) can change which results a search engine
will consider a match for a given query.
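The difference between "Findall" matching and quoted phrase matching can be sketched in a few lines (the documents below are hypothetical, and real engines use far more elaborate matching):

```python
# Hypothetical documents: doc id -> text.
DOCS = {
    1: "car and driver magazine reviews",
    2: "the driver of the car read a magazine",
    3: "driver magazine",
}

def findall_match(query_terms, text):
    """'Findall' mode: every query term appears somewhere in the document."""
    words = text.lower().split()
    return all(term in words for term in query_terms)

def phrase_match(phrase, text):
    """Quoted search: the exact phrase appears, in order."""
    return phrase.lower() in text.lower()

terms = ["car", "driver", "magazine"]  # "and" dropped as a stop word
print([d for d, t in DOCS.items() if findall_match(terms, t)])
print([d for d, t in DOCS.items() if phrase_match("car and driver magazine", t)])
```

Document 2 matches in Findall mode but not as a phrase, mirroring why the quoted query above returns far fewer results.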
- Definition: Ranking Results
Once the search engine has determined which results are
a match for the query, the engine's algorithm (a set of
mathematical instructions for scoring and sorting) runs
calculations on each of the results to determine which
is most relevant to the given query. The engine then sorts
the results pages in order from most relevant to least
so that users can make a choice about which to select.
Although a search engine's list of operations is short, systems
like Google, Yahoo!, AskJeeves, and MSN are among the most complex,
processing-intensive computer systems in the world, managing millions of
calculations each second and serving the information demands of an enormous
group of users.
Speed Bumps & Walls
Certain types of navigation may hinder or entirely prevent search engines
from reaching your website's content. As search engine spiders crawl the web,
they rely on the architecture of hyperlinks to find new documents and revisit
those that may have changed. In the analogy of speed bumps and walls, complex
links and deep site structures with little unique content may serve as "bumps."
Data that cannot be accessed by spiderable links qualify as "walls."
Possible "Speed Bumps" for SE Spiders:
- URLs with 2+ dynamic parameters; e.g.
http://www.url.com/page.php?id=4&CK=34rr&User=%Tom% (spiders may be
reluctant to crawl complex URLs like this because they often produce
errors for non-human visitors)
- Pages with more than 100 unique links to
other pages on the site (spiders may not follow each one)
- Pages buried more than 3 clicks/links from
the home page of a website (unless there are many other external links
pointing to the site, spiders will often ignore deep pages)
- Pages requiring a "Session ID" or Cookie to
enable navigation (spiders may not be able to retain these elements as a
browser user can)
- Pages that are split into "frames" can
hinder crawling and cause confusion about which pages to rank in the
results
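As a rough illustration of the first "speed bump", a crawler might count a URL's dynamic parameters before deciding whether to fetch it. This sketch uses Python's standard `urllib.parse` module; the two-parameter threshold follows the list above:

```python
from urllib.parse import urlparse, parse_qs

def count_dynamic_params(url):
    """Number of query-string parameters in a URL."""
    return len(parse_qs(urlparse(url).query))

# The complex URL from the list above vs. a hypothetical clean rewrite.
complex_url = "http://www.url.com/page.php?id=4&CK=34rr&User=%Tom%"
clean_url = "http://www.url.com/page/4/"

print(count_dynamic_params(complex_url))  # 3 parameters -- may deter spiders
print(count_dynamic_params(clean_url))    # 0 -- crawler-friendly
```

Rewriting dynamic URLs into static-looking paths, as in `clean_url`, is a common remedy for this particular bump.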
Possible "Walls" for SE Spiders:
- Pages accessible only via a select form and submit button
- Pages requiring a drop-down menu (HTML select element) to access them
- Documents accessible only via a search box
- Documents blocked purposefully (via a robots
meta tag or robots.txt file)
- Pages requiring a login
- Pages that re-direct before showing content
(search engines call this cloaking or bait-and-switch and may actually ban
sites that use this tactic)
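The "purposeful blocking" wall is the one wall a site owner controls directly, via the robots.txt protocol. Python's standard `urllib.robotparser` can show how a spider decides whether a URL is off limits (the file contents and URLs below are hypothetical):

```python
from urllib import robotparser

# A hypothetical robots.txt that purposefully blocks one directory.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "http://www.example.com/public/page.html"))   # allowed
print(rp.can_fetch("*", "http://www.example.com/private/page.html"))  # blocked
```

A well-behaved spider performs exactly this check before fetching a page, which is why robots.txt acts as a wall rather than a bump.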
The key to ensuring that a site's contents are fully crawlable is to provide
direct, HTML links to each page you want the search engine spiders to index.
Remember that if a page cannot be accessed from the home page (where most
spiders are likely to start their crawl), it is likely that it will not be
indexed by the search engines. A sitemap (which is discussed later in this
guide) can be of tremendous help for this purpose.
Measuring Relevance and Popularity
Modern commercial search engines rely on the science of information retrieval
(IR). That science has existed since the middle of the 20th century, when
retrieval systems powered computers in libraries, research facilities, and
government labs. Early in the development of search systems, IR scientists
realized that two critical components made up the majority of search
functionality:
- Relevance - the degree to which the content of
the documents returned in a search matched the user's query intention and
terms. A document's relevance increases if the queried term or phrase
occurs multiple times and shows up in the title of the work or
in important headlines or subheaders.
- Popularity - the relative importance, measured
via citation (the act of one work referencing another, as often occurs in
academic and business documents) of a given document that matches the user's
query. The popularity of a given document increases with every other
document that references it.
These two items were translated to web search 40 years later and manifest
themselves in the form of document analysis and link analysis.
In document analysis, search engines look at whether the search terms are
found in important areas of the document - the title, the meta data, the heading
tags, and the body of text content. They also attempt to automatically measure
the quality of the document (through complex systems beyond the scope of
this guide).
In link analysis, search engines measure not only who is linking to a site or
page, but what they are saying about that page/site. They also have a good grasp
on who is affiliated with whom (through historical link data, the site's
registration records, and other sources), who is worthy of being trusted (links
from .edu and .gov pages are generally more valuable for this reason), and
contextual data about the site the page is hosted on (who links to that site,
what they say about the site, etc.).
Link and document analysis combine and overlap hundreds of factors that can
be individually measured and filtered through the search engine algorithms (the
set of instructions that tells the engines what importance to assign to each
factor). The algorithm then determines scoring for the documents and (ideally)
lists results in decreasing order of importance (rankings).
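The scoring-and-sorting step just described can be sketched as a weighted sum over measured factors. The factor names and weights below are purely illustrative; real engines combine hundreds of undisclosed signals:

```python
# Hypothetical factor weights -- real engines use hundreds of factors
# with undisclosed weights; these numbers are purely illustrative.
WEIGHTS = {"term_in_title": 3.0, "term_in_body": 1.0, "inbound_links": 0.5}

def score(doc_factors):
    """Weighted sum of the measured factors for one document."""
    return sum(WEIGHTS[f] * v for f, v in doc_factors.items())

results = {
    "page-a": {"term_in_title": 1, "term_in_body": 4, "inbound_links": 10},
    "page-b": {"term_in_title": 0, "term_in_body": 9, "inbound_links": 2},
}

# Rank in decreasing order of score, as described above.
ranked = sorted(results, key=lambda p: score(results[p]), reverse=True)
print(ranked)
```

The interesting part of a real algorithm is not the sum but the choice and tuning of the weights, which is exactly what engines keep secret.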
Information Search Engines Can Trust
As search engines index the web's link structure and page contents, they find
two distinct kinds of information about a given site or page - attributes of the
page/site itself and descriptives about that site/page from other pages. Since
the web is such a commercial place, with so many parties interested in ranking
well for particular searches, the engines have learned that they cannot always
rely on websites to be honest about their importance. Thus, the days when
artificially stuffed meta tags and keyword-rich pages dominated search results
(pre-1998) have vanished and given way to search engines that measure trust via
links and content.
The theory goes that if hundreds or thousands of other websites link to you,
your site must be popular, and thus, have value. If those links come from very
popular and important (and thus, trustworthy) websites, their power is
multiplied to even greater degrees. Links from sites like NYTimes.com, Yale.edu,
Whitehouse.gov, and others carry with them inherent trust that search engines
then use to boost your ranking position. If, on the other hand, the links that
point to you are from low-quality, interlinked sites or automated garbage
domains (aka link farms), search engines have systems in place to discount the
value of those links.
The most well-known system for ranking sites based on link data is the
simplistic formula developed by Google's founders - PageRank. PageRank,
which relies on a mathematical model (the likelihood of reaching a given
document by randomly clicking on links), is
described by Google in their technology section:
PageRank relies on the uniquely democratic nature of
the web by using its vast link structure as an indicator of an individual
page's value. In essence, Google interprets a link from page A to page B as
a vote, by page A, for page B. But, Google looks at more than the sheer
volume of votes, or links a page receives; it also analyzes the page that
casts the vote. Votes cast by pages that are themselves "important" weigh
more heavily and help to make other pages "important."
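The "votes weighted by the importance of the voter" idea in Google's description corresponds to the classic power-iteration computation of PageRank. This is a minimal sketch over a hypothetical three-page web, not Google's production algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a link graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            # Each page splits its current rank evenly among its "votes".
            share = damping * rank[p] / len(outlinks) if outlinks else 0.0
            for q in outlinks:
                new[q] += share
        rank = new
    return rank

# Hypothetical three-page web: both A and C "vote" for B.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # B collects the most weighted votes
```

Note how B outranks C even though each receives links: B's votes come from two pages, and one of them (C) is itself well linked, exactly the weighting the quoted passage describes.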
Google uses a PageRank "proxy" value, which logarithmically translates the
actual PageRank of a document to a value between 0 and 10, to rank Web sites
listed in its
directory (which offers a PageRank order or an Alphabetical order for
listings) and in its toolbar (below).
Google's toolbar includes an icon that shows a PageRank value from 0-10.
PageRank is, in essence, a rough system for estimating the value of a given
link based on the links that point to the host page. Since PageRank's inception
in the late '90s, more subtle and sophisticated link analysis systems have taken
the place of PageRank. Thus, in the modern era of SEO, the PageRank measurement
in Google's toolbar, directory, or through sites that query the service is of
limited value. Pages with PR8 can be found ranked 20-30 positions below pages
with a PR3 or PR4. In addition, the toolbar numbers are updated only every 3-6
months by Google, making the values even less useful. Rather than focusing on
PageRank, it's important to think holistically about a link's worth.
Here's a small list of the most important factors search engines look at when
attempting to value a link:
- The Anchor Text of Link - Anchor text describes the
visible characters and words that hyperlink to another document or location
on the web. For example, in the phrase "CNN
is a good source of news, but I actually prefer the BBC's
take on events," two unique pieces of anchor text exist - "CNN" is the
anchor text pointing to http://www.cnn.com, while "the BBC's take
on events" points to http://news.bbc.co.uk. Search engines use this
text to help them determine the subject matter of the linked-to document. In
the example above, the links would tell the search engine that when users
search for "CNN", SEOmoz.org thinks that http://www.cnn.com is a
relevant site for the term "CNN" and that http://news.bbc.co.uk is
relevant to "the BBC's take on events". If hundreds or thousands of sites
think that a particular page is relevant for a given set of terms, that page
can manage to rank well even if the terms NEVER appear in the text itself
(for example, see the BBC's explanation of why Google ranks certain pages
for the term "Miserable Failure").
- Global Popularity of the Site - More popular sites, as
denoted by the number and power of the links pointing to them, provide more
powerful links. Thus, while a link from SEOmoz may be a valuable vote for a
site, a link from bbc.co.uk or cnn.com carries far more weight. This is one
area where PageRank (assuming it was accurate) could be a good measure, as
it's designed to calculate global popularity.
- Popularity of Site in Relevant Communities - In the
example above, the weight or power of a site's vote is based on its raw
popularity across the web. As search engines became more sophisticated and
granular in their approach to link data, they acknowledged the existence of
"topical communities"; sites on the same subject that often interlink with
one another, referencing documents and providing unique data on a particular
topic. Sites in these communities provide more value when they link to a
site/page on a relevant subject rather than a site that is largely
irrelevant to their topic.
- Text Directly Surrounding the Link - Search engines
have been noted to weight the text directly surrounding a link as more
important and relevant than the other text on the page. Thus, a link from
inside an on-topic paragraph may carry greater weight than a link in the
sidebar or footer.
- Subject Matter of the Linking Page - The topical
relationship between the subject of a given page and the sites/pages linked
to on it may also factor into the value a search engine assigns to that
link. Thus, it will be more valuable to have links from pages that are
related to the site/page's subject matter than those that have little to do
with the topic.
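Several of the factors above start from the same raw data: each link's anchor text and target URL. A sketch of that extraction step, using Python's standard `html.parser` on the CNN/BBC example from the anchor-text discussion:

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (anchor text, href) pairs, as an engine might when valuing links."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link, self.href, self.text = True, dict(attrs).get("href"), []

    def handle_data(self, data):
        if self.in_link:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append(("".join(self.text).strip(), self.href))
            self.in_link = False

html = ('<a href="http://www.cnn.com">CNN</a> is a good source of news, but I '
        'actually prefer <a href="http://news.bbc.co.uk">the BBC\'s take on events</a>.')
p = AnchorTextParser()
p.feed(html)
print(p.links)
```

From pairs like these an engine can associate "CNN" with http://www.cnn.com; factors such as surrounding text and page topic are then layered on top.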
These are only a few of the many factors search engines measure and weigh
when evaluating links.
Link metrics are in place so that search engines can find information to
trust. In the academic world, greater citation meant greater importance, but in
a commercial environment, manipulation and conflicting interests interfere with
the purity of citation-based measurements. Thus, on the modern WWW, the source,
style, and context of those citations is vital to ensuring high quality results.
The Anatomy of a Hyperlink
A standard hyperlink in HTML code looks like this:
<a href="http://www.globalguideline.com">GlobalGuideLine</a>
In this example, the code simply indicates that the text "GlobalGuideLine" (called
the "anchor text" of the link) should be hyperlinked to the page
http://www.globalguideline.com. A search engine would interpret this code as a
message that the page carrying this code believed the page http://www.globalguideline.com to be relevant to the text on the page and
particularly relevant to the term "GlobalGuideLine".
A more complex piece of HTML code for a link may include additional
attributes such as:
<a href="http://www.globalguideline.com" title="Rand's Site" rel="nofollow">GlobalGuideLine</a>
In this example, new elements such as the link title and rel attribute
may influence how a search engine views the link, despite its appearance on
the page remaining unchanged. The title attribute may serve as an additional
piece of information, telling the search engine that http://www.globalguideline.com,
in addition to being related to the term "GlobalGuideLine", is also relevant to the
phrase "Rand's Site". The rel attribute, originally designed to describe the
relationship between the linked-to page and the linking page, has, with the
recent emergence of the "nofollow" descriptive, become more complex.
"Nofollow" is a tag designed specifically for search engines. When
ascribed to a link in the rel attribute, it tells the engine's ranking
system that the link should not be considered an editorially approved "vote"
for the linked-to page. Currently, 3 major search engines (Yahoo!, MSN, &
Google) all support "nofollow". AskJeeves, due to its unique ranking system,
does not support nofollow and ignores its presence in link code.
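On the engine side, nofollow handling amounts to a filter that separates editorial "votes" from links to ignore. A minimal sketch (the URLs below are hypothetical):

```python
from html.parser import HTMLParser

class NofollowChecker(HTMLParser):
    """Separate links that count as editorial votes from nofollow'd ones."""
    def __init__(self):
        super().__init__()
        self.votes = []    # links a ranking system would count
        self.ignored = []  # links marked rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            href = a.get("href")
            if href is None:
                return
            # rel can hold several space-separated tokens, e.g. "external nofollow".
            if "nofollow" in (a.get("rel") or "").split():
                self.ignored.append(href)
            else:
                self.votes.append(href)

checker = NofollowChecker()
checker.feed('<a href="http://one.example/" rel="nofollow">one</a> '
             '<a href="http://two.example/">two</a>')
print(checker.votes, checker.ignored)
```

Splitting the rel value on whitespace matters because rel is defined as a list of tokens, so `rel="external nofollow"` still triggers the filter.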
Some links may be assigned to images, rather than text:
<a href="http://www.globalguideline.com"><img src="../images/ggl.jpg"
alt="Global GuideLine for SEO"></a>
This example shows an image named "ggl.jpg" linking to the page -
http://www.globalguideline.com. The alt attribute, designed originally to
display in place of images that were slow to load or on voice-based browsers
for the blind, reads "Global GuideLine for SEO" (in many browsers,
you can see the alt text by hovering the mouse over the images). Search
engines can use the information in an image-based link, including the name
of the image and the alt attribute to interpret what the linked-to page is
about.
Other types of links may also be used on the web, many of which pass no
link value due to their reliance on scripts or other
technologies. A link that does not have the classic <a href="URL">text</a>
format, be it image or text, should generally be considered not to pass link
value via the search engines (although in rare instances, engines may attempt to
follow these more complex style links).
title="http://www.seomoz.org/" target="_blank" class="postlink">SEOmoz</a>
In this example, the redirect used scrambles the URL by writing it
backwards, but unscrambles it later with a script and sends the visitor to
the site. It can be assumed that this passes no search engine link value.
Another style of link calls a script function referenced in the document to
pull up a specified page.
It's important to understand that, based on a link's anatomy, search engines