- Crawling the Web
Search engines run automated programs, called "bots" or
"spiders", that use the hyperlink structure of the web to
"crawl" the pages and documents that make up the World Wide
Web. Estimates are that of the approximately 20 billion
existing pages, search engines have crawled between 8 and 10
billion.
- Indexing Documents
Once a page has been crawled, its contents can be "indexed",
that is, stored in a giant database of documents that makes up a
search engine's "index". This index needs to be tightly
managed so that requests which must search and sort billions
of documents can be completed in fractions of a second.
- Processing Queries
When a request for information comes into the search engine
(hundreds of millions do each day), the engine retrieves
from its index all the documents that match the query. A
match is determined if the terms or phrase are found on the
page in the manner specified by the user. For example, a
search for car and driver magazine at Google returns 8.25
million results, but a search for the same phrase in quotes
("car and driver magazine") returns only 166 thousand
results. In the first search, commonly called "Findall"
mode, Google returned all documents which had the terms
"car", "driver", and "magazine" (it ignores the term "and"
because it's not useful for narrowing the results), while in
the second search, only those pages with the exact phrase
"car and driver magazine" were returned. Other advanced
operators (Google has a list of 11) can change which results
a search engine will consider a match for a given query.
- Ranking Results
Once the search engine has determined which results are a
match for the query, the engine's algorithm (a mathematical
equation commonly used for sorting) runs calculations on
each of the results to determine which is most relevant to
the given query. The engine sorts these on the results pages in
order from most relevant to least so that users can make a
choice about which to select.
Although a search engine's operations are not particularly
lengthy, systems like Google, Yahoo!, AskJeeves, and MSN are
among the most complex, processing-intensive computers in the
world, managing millions of calculations each second and
funneling demands for information to an enormous group of users.
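The indexing and query-matching steps described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real engine is built: the sample documents, the stop-word list, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real engines apply their own rules.
STOP_WORDS = {"and", "the", "of", "a"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def match_all_terms(index, query):
    """Return ids of documents containing every non-stop-word query term."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

def match_phrase(docs, query):
    """Return ids of documents containing the query as an exact word sequence."""
    phrase = tokenize(query)
    n = len(phrase)
    if n == 0:
        return set()
    hits = set()
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            hits.add(doc_id)
    return hits

# Made-up documents standing in for crawled pages.
docs = {
    "a": "Car and Driver magazine reviews the newest cars",
    "b": "A magazine for every car owner and truck driver",
    "c": "Driver updates for your printer",
}
index = build_index(docs)
all_terms = match_all_terms(index, "car and driver magazine")  # unquoted search
exact = match_phrase(docs, "car and driver magazine")          # quoted search
```

In this sketch the unquoted search matches documents "a" and "b", while the quoted search matches only "a", which mirrors the difference in result counts described above.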
Speed Bumps & Walls
Certain types of navigation may hinder or entirely prevent
search engines from reaching your website's content. As search
engine spiders crawl the web, they rely on the architecture of
hyperlinks to find new documents and revisit those that may have
changed. In the analogy of speed bumps and walls, complex links
and deep site structures with little unique content may serve as
"bumps." Data that cannot be accessed by spiderable links
qualifies as "walls."
Possible "Speed Bumps" for SE Spiders:
- URLs with 2+ dynamic
parameters, e.g. http://www.url.com/page.php?id=4&CK=34rr&User=%Tom%
(spiders may be reluctant to crawl complex URLs like this
because they often result in errors with non-human visitors)
- Pages with more than 100
unique links to other pages on the site (spiders may not
follow each one)
- Pages buried more than 3
clicks/links from the home page of a website (unless there
are many other external links pointing to the site, spiders
will often ignore deep pages)
- Pages requiring a "Session
ID" or Cookie to enable navigation (spiders may not be able
to retain these elements as a browser user can)
- Pages that are split into
"frames" can hinder crawling and cause confusion about which
pages to rank in the results.
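As a rough way to check your own URLs against the first item in this list, the sketch below counts the dynamic parameters in a URL using Python's standard library. The threshold of two parameters comes from the list above; the function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

# The list above flags URLs with 2 or more dynamic parameters.
MAX_PARAMS = 1

def dynamic_param_count(url):
    """Count the query-string parameters in a URL."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True))

def is_speed_bump(url):
    """Return True if the URL has more parameters than MAX_PARAMS."""
    return dynamic_param_count(url) > MAX_PARAMS

example = "http://www.url.com/page.php?id=4&CK=34rr&User=%Tom%"
count = dynamic_param_count(example)
```

For the example URL from the list, this counts three parameters (id, CK, and User), so the URL would be flagged.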
Possible "Walls" for SE Spiders:
- Pages accessible only via a
select form and submit button
- Pages requiring a drop down
menu (HTML form element) to access them
- Documents accessible only
via a search box
- Documents blocked
purposefully (via a robots meta tag or robots.txt file)
- Pages requiring a login
- Pages that re-direct before
showing content (search engines call this cloaking or
bait-and-switch and may actually ban sites that use this
tactic)
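To see how a purposeful block via robots.txt looks from a spider's point of view, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt contents, the bot name "ExampleBot", and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks every spider from the /private/ folder.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is a hypothetical spider name.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
```

Here the home page may be fetched, while the page under /private/ may not, so a well-behaved spider would never reach it.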
The key to ensuring that a site's contents are fully
crawlable is to provide direct, HTML links to each page you want
the search engine spiders to index. Remember that if a page
cannot be accessed from the home page (where most spiders are
likely to start their crawl), it is likely that it will not be
indexed by the search engines. A sitemap (which is discussed
later in this guide) can be of tremendous help for this purpose.
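One simple way to audit crawlability along these lines is to measure how many clicks each page sits from the home page. The sketch below does this with a breadth-first search over a site's internal links. The site map is invented for the example, and the 3-click threshold is the rule of thumb given earlier in this section, not a documented limit of any search engine.

```python
from collections import deque

def click_depths(links, home):
    """Return the minimum number of clicks from the home page to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: each page maps to the pages it links to.
site = {
    "/": ["/products", "/about"],
    "/products": ["/products/widgets"],
    "/products/widgets": ["/products/widgets/blue"],
    "/products/widgets/blue": ["/products/widgets/blue/specs"],
    "/about": [],
    "/orphan": ["/"],  # links out, but nothing links to it
}

depths = click_depths(site, "/")
too_deep = sorted(p for p, d in depths.items() if d > 3)
unreachable = sorted(set(site) - set(depths))
```

In this made-up site, the specs page is four clicks deep and the orphan page cannot be reached from the home page at all, so both are candidates for a direct HTML link or a sitemap entry.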
Measuring Relevance and Popularity
Modern commercial search engines rely on the science of
information retrieval (IR). That science has existed since the
middle of the 20th century, when retrieval systems powered
computers in libraries, research facilities, and government
labs. Early in the development of search systems, IR scientists
realized that two critical components made up the majority of
search functionality:
Relevance - the degree to which
the content of the documents returned in a search matched
the user's query intention and terms. The relevance of a
document increases if the terms or phrase queried by the
user occurs multiple times and shows up in the title of the
work or in important headlines or subheaders.
Popularity - the relative
importance, measured via citation (the act of one work
referencing another, as often occurs in academic and
business documents) of a given document that matches the
user's query. The popularity of a given document increases
with every other document that references it.
These two items were translated to web search 40 years later
and manifest themselves in the form of document analysis and
link analysis.
In document analysis, search engines look at whether the
search terms are found in important areas of the document - the
title, the meta data, the heading tags, and the body of text
content. They also attempt to automatically measure the quality
of the document (through complex systems beyond the scope of
this guide).
In link analysis, search engines measure not only who is
linking to a site or page, but what they are saying about that
page/site. They also have a good grasp of who is affiliated with
whom (through historical link data, the site's registration
records, and other sources), who is worthy of being trusted
(links from .edu and .gov pages are generally more valuable for
this reason), and contextual data about the site the page is
hosted on (who links to that site, what they say about the site,
etc.).
Link and document analysis combine and overlap hundreds of
factors that can be individually measured and filtered through
the search engine algorithms (the set of instructions that tells
the engines what importance to assign to each factor). The
algorithm then determines scoring for the documents and
(ideally) lists results in decreasing order of importance
(rankings).
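To make the idea of combining document analysis with link analysis more concrete, here is a deliberately simplified sketch. The field weights, the use of a raw inbound link count as "popularity", and the way the two are multiplied together are all assumptions chosen for illustration; real search engine algorithms weigh hundreds of factors and are not public.

```python
import math
import re

# Hypothetical weights: a term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def document_score(doc, query):
    """Document analysis: weighted count of query terms in each field."""
    terms = set(tokenize(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(doc.get(field, ""))
        score += weight * sum(1 for t in tokens if t in terms)
    return score

def combined_score(doc, query):
    """Scale relevance by a dampened measure of link popularity."""
    relevance = document_score(doc, query)
    popularity = math.log1p(doc.get("inbound_links", 0))
    return relevance * (1.0 + popularity)

def rank(docs, query):
    """Return document names in decreasing order of combined score."""
    return sorted(docs, key=lambda name: combined_score(docs[name], query), reverse=True)

# Made-up documents for the example.
docs = {
    "lenders": {"title": "Home Loan Lenders", "headings": "Compare loan rates",
                "body": "Find a loan that fits", "inbound_links": 50},
    "blog": {"title": "My Blog", "headings": "",
             "body": "I took out a loan once", "inbound_links": 500},
    "recipes": {"title": "Soup Recipes", "headings": "Winter soups",
                "body": "Warm up with soup", "inbound_links": 1000},
}
order = rank(docs, "loan")
```

Note that the heavily linked recipes page still ranks last for "loan" in this sketch: popularity only helps a document that is relevant to the query in the first place.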
Information Search Engines Can Trust
As search engines index the web's link structure and page
contents, they find two distinct kinds of information about a
given site or page - attributes of the page/site itself and
descriptives about that site/page from other pages. Since the
web is such a commercial place, with so many parties interested
in ranking well for particular searches, the engines have
learned that they cannot always rely on websites to be honest
about their importance. Thus, the days when artificially stuffed
meta tags and keyword-rich pages dominated search results
(pre-1998) have vanished and given way to search engines that
measure trust via links and content.
The theory goes that if hundreds or thousands of other
websites link to you, your site must be popular, and thus, have
value. If those links come from very popular and important (and
thus, trustworthy) websites, their power is multiplied to even
greater degrees. Links from sites like NYTimes.com, Yale.edu,
Whitehouse.gov, and others carry with them inherent trust that
search engines then use to boost your ranking position. If, on
the other hand, the links that point to you are from
low-quality, interlinked sites or automated garbage domains (aka
link farms), search engines have systems in place to discount
the value of those links.
The most well-known system for ranking sites based on link
data is the simplistic formula developed by Google's founders -
PageRank. PageRank, which relies on a mathematical formula
(based around finding a given document in a random pattern of
clicking on links), is
described by Google in their technology section:
PageRank relies on the uniquely
democratic nature of the web by using its vast link
structure as an indicator of an individual page's value. In
essence, Google interprets a link from page A to page B as a
vote, by page A, for page B. But, Google looks at more than
the sheer volume of votes, or links a page receives; it also
analyzes the page that casts the vote. Votes cast by pages
that are themselves "important" weigh more heavily and help
to make other pages "important."
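The "voting" idea in this description can be sketched as a short iterative calculation. This is a textbook-style simplification and not Google's production system: the three-page link graph is invented, the damping factor of 0.85 is the value commonly cited from the original PageRank paper, and the handling of pages with no outgoing links is one common convention among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by repeatedly passing each page's score along its links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no links spreads its score across all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical web of three pages: A and C both link to B, and B links to A.
links = {"A": ["B"], "C": ["B"], "B": ["A"]}
ranks = pagerank(links)
```

In this tiny graph, B receives two votes and ends up with the highest score. A receives only one vote, but because it comes from the "important" page B, A still scores far above C, which receives none.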
Google uses a PageRank "proxy" value, which logarithmically
translates the actual PageRank of a document to a value between
0 and 10, to rank Web sites listed in its
directory (which offers a PageRank order or an Alphabetical
order for listings) and in its toolbar (below).

Google's toolbar includes an icon that shows a PageRank value from 0-10.
PageRank is, in essence, a rough system for estimating the
value of a given link based on the links that point to the host
page. Since PageRank's inception in the late '90s, more subtle
and sophisticated link analysis systems have taken the place of
PageRank. Thus, in the modern era of SEO, the PageRank
measurement in Google's toolbar, directory, or through sites
that query the service is of limited value. Pages with PR8 can
be found ranked 20-30 positions below pages with a PR3 or PR4.
In addition, the toolbar numbers are updated only every 3-6
months by Google, making the values even less useful. Rather
than focusing on PageRank, it's important to think holistically
about a link's worth.
Here's a small list of the most important factors search
engines look at when attempting to value a link:
- The Anchor Text of Link - Anchor text
describes the visible characters and words that hyperlink to
another document or location on the web. For example, in the
phrase "CNN
is a good source of news, but I actually prefer
the BBC's take on events," two unique pieces of anchor
text exist - "CNN" is the anchor text pointing to
http://www.cnn.com, while "the BBC's take on events"
points to http://news.bbc.co.uk. Search engines use
this text to help them determine the subject matter of the
linked-to document. In the example above, the links would
tell the search engine that when users search for "CNN",
SEOmoz.org thinks that http://www.cnn.com is a
relevant site for the term "CNN" and that http://news.bbc.co.uk
is relevant to "the BBC's take on events". If hundreds or
thousands of sites think that a particular page is relevant
for a given set of terms, that page can manage to rank well
even if the terms NEVER appear in the text itself (for
example, see the BBC's explanation of why Google ranks
certain pages for the term "Miserable Failure").
- Global Popularity of the Site - More
popular sites, as denoted by the number and power of the
links pointing to them, provide more powerful links. Thus,
while a link from SEOmoz may be a valuable vote for a site,
a link from bbc.co.uk or cnn.com carries far more weight.
This is one area where PageRank (assuming it was accurate)
could be a good measure, as it's designed to calculate
global popularity.
- Popularity of Site in Relevant Communities
- In the example above, the weight or power of a site's vote
is based on its raw popularity across the web. As search
engines became more sophisticated and granular in their
approach to link data, they acknowledged the existence of
"topical communities"; sites on the same subject that often
interlink with one another, referencing documents and
providing unique data on a particular topic. Sites in these
communities provide more value when they link to a site/page
on a relevant subject rather than a site that is largely
irrelevant to their topic.
- Text Directly Surrounding the Link -
Search engines have been noted to weight the text directly
surrounding a link with greater importance and relevance than
the other text on the page. Thus, a link from inside an
on-topic paragraph may carry greater weight than a link in
the sidebar or footer.
- Subject Matter of the Linking Page -
The topical relationship between the subject of a given page
and the sites/pages linked to on it may also factor into the
value a search engine assigns to that link. Thus, it will be
more valuable to have links from pages that are related to
the site/page's subject matter than those that have little
to do with the topic.
These are only a few of the many factors search engines
measure and weigh when evaluating links.
Link metrics are in place so that search engines can find
information to trust. In the academic world, greater citation
meant greater importance, but in a commercial environment,
manipulation and conflicting interests interfere with the purity
of citation-based measurements. Thus, on the modern WWW, the
source, style, and context of those citations are vital to
ensuring high quality results.
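The anchor text factor described above lends itself to a short illustration. The sketch below tallies the anchor text used by links pointing at each target page, which is roughly the raw material an engine would need to decide what terms other sites associate with that page. The linking site names are hypothetical, and treating anchor text as case-insensitive is an assumption of the example.

```python
from collections import Counter, defaultdict

def anchor_text_profile(links):
    """Tally, for each linked-to page, the anchor text that other pages use for it."""
    profile = defaultdict(Counter)
    for source, target, anchor in links:
        profile[target][anchor.lower()] += 1
    return profile

# Each entry is (linking page, linked-to page, anchor text). Sources are made up.
links = [
    ("http://site-one.example/", "http://www.cnn.com", "CNN"),
    ("http://site-two.example/", "http://www.cnn.com", "cnn"),
    ("http://site-two.example/", "http://news.bbc.co.uk", "the BBC's take on events"),
    ("http://site-three.example/", "http://www.cnn.com", "news"),
]

profile = anchor_text_profile(links)
top = profile["http://www.cnn.com"].most_common(1)
```

With this sample data, the most common anchor text for http://www.cnn.com is "cnn", used by two of the three links pointing to it.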
The Anatomy of a Hyperlink
A standard hyperlink in HTML code looks like this:
<a href="http://www.seomoz.org">SEOmoz</a>
In this example, the code simply indicates that the text "SEOmoz"
(called the "anchor text" of the link) should be hyperlinked
to the page http://www.seomoz.org. A search engine would
interpret this code as a message that the page carrying this
code believed the page http://www.seomoz.org to be relevant
to the text on the page and particularly relevant to the
term "SEOmoz".
A more complex piece of HTML code for a link may include
additional attributes such as:
<a href="http://www.seomoz.org" title="Rand's Site" rel="nofollow">SEOmoz</a>
In this example, new elements such as the link title and
rel attribute may influence how a search engine views the
link, despite its appearance on the page remaining
unchanged. The title attribute may serve as an additional
piece of information, telling the search engine that http://www.seomoz.org,
in addition to being related to the term "SEOmoz", is also
relevant to the phrase "Rand's Site". The rel attribute,
originally designed to describe the relationship between the
linked-to page and the linking page, has, with the recent
emergence of the "nofollow" descriptive, become more
complex.
"Nofollow" is a value designed specifically for search
engines. When ascribed to a link in the rel attribute, it
tells the engine's ranking system that the link should not
be considered an editorially approved "vote" for the
linked-to page. Currently, 3 major search engines (Yahoo!,
MSN, & Google) all support "nofollow". AskJeeves, due to its
unique ranking system, does not support nofollow, and
ignores its presence in link code.
Some links may be assigned to images, rather than text:
<a href="http://www.seomoz.org/randfish.php"><img src="rand.jpg"
alt="Rand Fishkin of SEOmoz"></a>
This example shows an image named "rand.jpg" linking to the
page - http://www.seomoz.org/randfish.php. The alt
attribute, designed originally to display in place of images
that were slow to load or on voice-based browsers for the
blind, reads "Rand Fishkin of SEOmoz" (in many browsers, you
can see the alt text by hovering the mouse over the images).
Search engines can use the information in an image-based
link, including the name of the image and the alt attribute,
to interpret what the linked-to page is about.
Other types of links may also be used on the web, many of
which pass no ranking or spidering value due to their use of
re-direct, JavaScript, or other technologies. A link that does
not have the classic <a href="URL">text</a> format, be it image
or text, should be generally considered not to pass link value
via the search engines (although in rare instances, engines may
attempt to follow these more complex style links).
<a href="redirect/jump.php?url=%2Fgro.zomoes.www%2F%2F%3Aptth"
title="http://www.seomoz.org/" target="_blank" class="postlink">SEOmoz</a>
In this example, the redirect used scrambles the URL by
writing it backwards, but unscrambles it later with a script
and sends the visitor to the site. It can be assumed that
this passes no search engine link value.
<a href="redirectiontarget.htm">SEOmoz</a>
This sample shows a very simple piece of JavaScript
code that calls a function referenced in the document to
pull up a specified page. Creative uses of JavaScript like
this can also be assumed to pass no link value to a search
engine.
It's important to understand that, based on a link's anatomy,
search engines can (or cannot) interpret and use the data
therein. Whereas the right sort of links can provide great
value, the wrong sort will be virtually useless (for search
ranking purposes).
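To show how a program might read the pieces of a link discussed in this section, here is a minimal sketch using Python's standard html.parser module. It pulls out the href, anchor text, title, alt text, and whether rel contains "nofollow" from the classic link examples above. The LinkExtractor class is written for this example only; it does not represent how any particular search engine parses pages, and it does not try to handle redirect or JavaScript links.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href, anchor text, title, alt text, and nofollow status for each link."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            rel = (attrs.get("rel") or "").lower().split()
            self._current = {
                "href": attrs["href"],
                "title": attrs.get("title"),
                "nofollow": "nofollow" in rel,
                "text": "",
                "alt": None,
            }
        elif tag == "img" and self._current is not None:
            # An image inside a link: its alt text describes the link.
            self._current["alt"] = attrs.get("alt")

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None

html = """
<a href="http://www.seomoz.org">SEOmoz</a>
<a href="http://www.seomoz.org" title="Rand's Site" rel="nofollow">SEOmoz</a>
<a href="http://www.seomoz.org/randfish.php"><img src="rand.jpg" alt="Rand Fishkin of SEOmoz"></a>
"""

extractor = LinkExtractor()
extractor.feed(html)
links = extractor.links
```

The first link yields only an href and anchor text, the second adds a title and is marked nofollow, and the third has no anchor text but carries the alt text of its image.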
Keywords and Queries
Search engines rely on the terms queried by users to
determine which results to put through their algorithms, order,
and return to the user. But, rather than simply recognizing and
retrieving exact matches for query terms, search engines use
their knowledge of semantics (the science of language) to
construct intelligent matching for queries. An example might be
a search for loan providers that also returned results
that did not contain that specific phrase, but instead had the
term lenders.
The engines collect data based on the frequency of use of
terms and the co-occurrence of words and phrases throughout the
web. If certain terms or phrases are often found together on
pages or sites, search engines can construct intelligent
theories about their relationships. Mining semantic data through
the incredible corpus that is the Internet has given search
engines some of the most accurate data about word ontologies and
the connections between words ever assembled artificially. This
immense knowledge of language and its usage gives them the
ability to determine which pages in a site are topically
related, what the topic of a page or site is, how the link
structure of the web divides into topical communities, and much,
much more.
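The co-occurrence idea can be illustrated with a very small sketch that counts how often pairs of words appear on the same page. The sample pages are invented, and counting simple word pairs per page is an assumption made to keep the example short; the engines' actual semantic analysis is far more sophisticated.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Return the set of distinct lowercase words on a page."""
    return set(re.findall(r"[a-z]+", text.lower()))

def cooccurrence_counts(pages):
    """Count, for each pair of words, the number of pages containing both."""
    counts = Counter()
    for text in pages:
        # Sorting makes each pair appear in a consistent (alphabetical) order.
        for pair in combinations(sorted(tokenize(text)), 2):
            counts[pair] += 1
    return counts

# Made-up page texts.
pages = [
    "Compare loan providers and mortgage lenders",
    "Top lenders for a home loan",
    "Loan rates from local lenders",
    "Soup recipes for winter",
]

counts = cooccurrence_counts(pages)
loan_lenders = counts[("lenders", "loan")]
loan_soup = counts[("loan", "soup")]
```

In this sample, "loan" and "lenders" appear together on three pages while "loan" and "soup" never do, which hints at how a search for loan providers could reasonably return pages about lenders.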
Search engines' growing artificial intelligence on the
subject of language means that queries will increasingly return
more intelligent, evolved results. This heavy investment in the
field of natural language processing (NLP) will help to achieve
greater understanding of the meaning and intent behind their
users' queries. Over the long term, users can expect the results
of this work to produce increased relevancy in the SERPs (Search
Engine Results Pages) and more accurate guesses from the engines
as to the intent of a user's queries.
Sorting the Wheat from the Chaff
In the classic world of Information Retrieval, when no
commercial interests existed in the databases, very simplistic
algorithms could be used to return high quality results. On the
World Wide Web, however, the opposite is true. Commercial
interests in the SERPs are a constant issue for modern search
engines. With every new focus on quality control and growth in
relevance metrics, there are thousands of individuals (many in
the field of SEO) dedicated to manipulating these metrics in
order to control the SERPs, typically by aiming to list their
sites/pages first.
The worst kind of results are what the industry refers to as
"search spam" - pages and sites with little real value that
contain primarily re-directs to other pages, lists of links,
scraped (copied) content, etc. These pages are so irrelevant and
useless that search engines are highly focused on removing them
from the index. Naturally, the monetary incentives are similar
to email spam - although few visit and fewer click on the links
(which are what provide the spam publisher with revenue), the
sheer quantity is the decisive factor in producing income.
Other "spam" results range from sites that are of low quality
or affiliate status that search engines would prefer not to
list, to high quality sites and businesses that are using the
link structure of the web to manipulate the results in their
favor. Search engines are focused on clearing out all types of
manipulation and hope to eventually achieve fully relevant and
organic algorithms to determine ranking order. So-called "search
engine spammers" engage in a constant battle against these
tactics, seeking new loopholes and methods for manipulation,
resulting in a never-ending struggle.
This guide is NOT about how to manipulate the search engines
to achieve rankings, but rather how to create a website that
search engines and users will be happy to have ranking
permanently in the top positions, thanks to its relevance,
quality, and user friendliness.
Paid Placement and Secondary Sources in the Results
The search engine results pages contain not only listings of
documents found to be relevant to the user's query, but other
content, including paid advertisements and secondary source
results. Google, for example, serves up ads from its well-known
AdWords program (which currently fuels more than 99% of
Google's revenues), as well as secondary content from its
local search, product search (called Froogle), and
image search results.
Below is a screenshot of Google's search engine results page.
The sites/pages ranking in the "organic" search results
receive the lion's share of searcher eyeballs and clicks -
between 60% and 70%, depending on factors such as the prominence of
ads, relevance of secondary content, etc. The practice of
optimization for the paid search results is called SEM, or
Search Engine Marketing, while optimizing to rank in the
secondary results requires unique, advanced methods of targeting
specific searches in arenas such as local search, product
search, image search, and others. While all of these practices
are a valuable part of any online marketing campaign, they are
beyond the scope of this guide. Our sole focus remains on the
"organic" results, although links at the bottom of this paper
can help direct you to resources on other subjects.