Search engine basics

Before getting to grips with search engine optimisation (SEO) techniques and how to apply them to your website, it's worth taking a few minutes to go over some search engine basics. Search engines deal with billions of pages of information from millions of websites. In order to understand how they process information for a specific page or site, let's look at:

How search engines find and process web page contents (spidering, indexing, caching)
How search engines rank web pages for a particular search query
How you can check what information a search engine has for a particular page

How do search engines work?

In order to return useful search results, a search engine must know as much as possible about the huge numbers of websites and webpages available on the internet. Because the number of webpages in existence is growing rapidly, and because content on existing webpages may change frequently, this is a complex and demanding task.

Crawling/spidering

Although many search engines allow users to submit URLs for newly-created websites, most processing is carried out by following links from sites and pages the search engine already knows about. The search engines use automated software tools called "crawlers" or "spiders" to visit these known webpages. When a spider visits a page, it can check whether the page has been updated since the previous visit; in other words, it checks whether the page content is the same as that stored in the search engine's index (the "index" is the term used to describe the pool of information stored about websites and their contents). If the search engine finds that the content of the page has changed it will:

store the new content (for later processing, so the index can be updated)
store the date at which the new content was found (see cache below)

When the new content is processed, one of the tasks that is carried out is to check if the page contained links to webpages that the search engine has never seen before. If such links are found, the address (URL) of the unseen webpages will be stored so that these pages can be crawled/spidered and added to the search engine's index at a later date. In this way, search engines are able to discover new websites and pages and include their contents in the search results.
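The crawl step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine is built: the function crawl_once, its fetch argument (any callable that returns a page's HTML) and the layout of the index dictionary are all hypothetical, and links are assumed to be absolute URLs.

```python
import hashlib
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_once(url, fetch, index, frontier, today):
    """Visit one known URL; store changed content and queue unseen links.

    fetch    -- hypothetical callable returning the page HTML as a string
    index    -- dict of URL -> {"hash", "html", "found"} for crawled pages
    frontier -- list of discovered URLs waiting to be crawled
    today    -- date string recorded when new content is found
    Returns True if the content was new or had changed.
    """
    html = fetch(url)
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    previous = index.get(url)
    if previous is not None and previous["hash"] == digest:
        return False  # same content as the stored copy: nothing to do

    # Store the new content and the date on which it was found.
    index[url] = {"hash": digest, "html": html, "found": today}

    # Queue any linked pages the engine has never seen before.
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:
        if link not in index and link not in frontier:
            frontier.append(link)
    return True
```

Comparing a hash of the content is just one simple way to detect a change; a real crawler might also rely on HTTP headers or other signals.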

In practice, a search engine will discover links to a large number of new (previously unseen) sites and pages as it crawls known sites. One of the difficult problems a search engine has to solve is finding a useful way of prioritising, so that the finite processing resources it has available are used in the most effective way. For example, should it focus on making sure it has up-to-date content information for the sites it already knows about, or should it focus on discovering and indexing the new (previously unseen) sites? And of the previously unseen sites, how should it prioritise the crawling and indexing of these sites?
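One way to picture this trade-off is as a priority queue. The sketch below is purely illustrative: the function name build_crawl_queue, the two weights and the scoring rule (days since the last visit for known pages, number of inbound links for unseen pages) are assumptions made for the example, not a description of any real engine's policy.

```python
import heapq


def build_crawl_queue(known_pages, unseen_pages,
                      revisit_weight=1.0, discovery_weight=2.0):
    """Return URLs ordered from highest to lowest crawl priority.

    known_pages  -- dict of URL -> days since the page was last visited
    unseen_pages -- dict of URL -> number of known pages linking to it
    Both weights are illustrative assumptions.
    """
    heap = []
    for url, days_since_visit in known_pages.items():
        # heapq is a min-heap, so scores are negated to pop the largest first.
        heapq.heappush(heap, (-revisit_weight * days_since_visit, url))
    for url, inbound_links in unseen_pages.items():
        heapq.heappush(heap, (-discovery_weight * inbound_links, url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Raising discovery_weight favours finding new pages; raising revisit_weight favours keeping existing pages fresh.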

Indexing

Once new page content has been discovered and stored by the search engine (either as an update to a previously known page or as a new page), the HTML contents must be "digested" and processed to extract the most useful information. Information about a particular page that's stored in the search engine's index will include:

the words used in the page (from both the visible text and within HTML tags)
the links pointing to this page from other sites/pages
the links from this page pointing to other sites/pages

The information stored in the index relating to the words used in a web page is one of the important ways that a search engine judges which pages are relevant to a particular search query. In doing this, it may use measures of word density and word frequency as well as looking at how words are combined to form phrases.

The places within the page where the words occur are also stored in the index, as this can have an important influence on how relevant a search engine judges a page to be for a specific query. For example, with the search query "blue widgets", if these words occur in places like the page title or in page headings, which are judged to be important by the search engine, this will have a greater effect on the search results than if the words occurred in the normal text of a paragraph.
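A toy index that records both how often a word occurs and where it occurs might look like the sketch below. The field names and the weights in FIELD_WEIGHTS are assumptions chosen for illustration; real search engines do not publish the values they use.

```python
import re
from collections import Counter, defaultdict

# Assumed weights: words in titles and headings count for more than body text.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: dict of URL -> dict of field name -> text.

    Returns word -> URL -> Counter mapping field name to occurrences.
    """
    index = defaultdict(lambda: defaultdict(Counter))
    for url, fields in pages.items():
        for field, text in fields.items():
            for word in tokenize(text):
                index[word][url][field] += 1
    return index


def on_page_score(index, query, url):
    """Sum the weighted occurrences of each query word in one page."""
    score = 0.0
    for word in tokenize(query):
        counts = index.get(word, {}).get(url, {})
        for field, occurrences in counts.items():
            score += FIELD_WEIGHTS.get(field, 1.0) * occurrences
    return score
```

With these weights, a page with "blue widgets" in its title scores higher for that query than a page that only mentions the phrase twice in its body text.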

As well as storing information about the contents of a page (the "on-page" factors), a search engine index will also contain information specific to that page that comes from so-called "off-page" factors. The most important of these relate to links from other web pages that point to the page in question. The anchor text of such links (the words you actually click on, normally highlighted) will be stored in the index for that page and used by the search engine to help it decide what the page is about.
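As a rough sketch of how anchor text could be gathered, the code below parses a set of pages and groups the text of each link under the page it points to. The class and function names are hypothetical, and the example assumes absolute URLs and simple, non-nested links.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_anchor_text(pages):
    """pages: dict of source URL -> HTML.

    Returns target URL -> list of anchor texts used in links to it.
    """
    inbound = defaultdict(list)
    for html in pages.values():
        parser = AnchorTextParser()
        parser.feed(html)
        for href, text in parser.anchors:
            inbound[href].append(text)
    return dict(inbound)
```

The resulting lists could then be stored in the index entry for each target page, alongside its on-page words.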

Link popularity and PR

Most search engines also use a measure of link popularity to judge the importance of a page. Pages with many thousands of links to them will be seen as more popular, and (all other things being equal), such pages will tend to be listed higher up in the search engine results than pages that have only a few links pointing at them. Google uses a measure known as PageRank to judge which pages are more popular (important) than others.
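The idea behind link popularity can be illustrated with a simplified version of the PageRank calculation. This is a sketch of the textbook iterative method only, and the way Google actually computes and uses PageRank is not public. The damping value of 0.85 is the commonly quoted figure; the function name and the number of iterations are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    links -- dict of page -> list of pages it links to
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

In this model a link from a high-scoring page passes on more score than a link from a low-scoring one, so the number of links is not the only thing that matters.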

It's important to keep in mind, however, that whatever measure a search engine uses for the link popularity of a page, this is a general property of the page and does not relate to any particular word or search query. When a search engine returns results for a specific search, it will combine such general, search-term independent factors with factors that relate directly to the search term (e.g. word density in the page content). In other words, just because a page is popular (has many links) doesn't mean it will rank well for a given search term. Pages with lower link popularity can easily rank higher in the results if their content is strongly relevant to the search term.
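The point can be made concrete with a small worked example. The weighted sum below is an assumed, deliberately simple way of combining the two kinds of factor (real ranking formulas are far more complex and are not published); both inputs are taken to be scores between 0 and 1.

```python
def combined_score(relevance, popularity,
                   relevance_weight=0.8, popularity_weight=0.2):
    """Blend a search-term dependent score with a search-term independent one.

    The weights are illustrative assumptions.
    """
    return relevance_weight * relevance + popularity_weight * popularity


def rank_pages(pages):
    """pages: dict of URL -> (relevance, popularity). Best match first."""
    return sorted(pages, key=lambda url: combined_score(*pages[url]),
                  reverse=True)


results = rank_pages({
    "popular.example": (0.1, 0.9),   # many links, weak match for the query
    "relevant.example": (0.9, 0.2),  # few links, strong match for the query
})
```

Here relevant.example is listed first, because its strong match for the search term outweighs the other page's higher link popularity.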

Searching

When a user enters a search term into a search engine, a complex and rapid calculation is carried out to find the set of web pages that is most relevant to their query. Factors that are used in this calculation can be separated into two groups: search-term dependent factors, and search-term independent factors.

Search-term dependent factors are, clearly, directly related to words used in the search term entered by the user. Pages that contain many occurrences of the words in the search query will be judged more relevant for the search, and where the words occur in important locations (title, headings) this will be given extra weighting by the search engines. The anchor text of links pointing to each web page is another important search-term dependent factor: in other words, if a page has many links to it from other sites where the words used in the anchor text match words in the search query, the page will tend to be ranked higher in the search results.

If the search query uses several words, the number of times those words are found close together in the page content (or anchor text) is also likely to be judged as important by the search engine.
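A minimal sketch of such a proximity measure is shown below. It counts how often two query words appear within a few words of each other; the function name and the default window of three words are assumptions for the example.

```python
import re


def proximity_count(text, word_a, word_b, window=3):
    """Count pairs of occurrences of two words at most `window` words apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return sum(
        1
        for a in positions_a
        for b in positions_b
        if 0 < abs(a - b) <= window
    )
```

For the text "blue widgets and more blue paint for widgets", the call proximity_count(text, "blue", "widgets") returns 3, while tightening the window to 1 returns 1, since only the opening "blue widgets" has the words directly next to each other.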

Search-term independent factors include all those general properties of a web page that are not related to a specific word or phrase. The link popularity of the page is one such factor, and others may include the age of the page content or the frequency with which it is updated.

Special search engine operators

Site

The site operator allows you to see all the pages from a particular site that a search engine has found and indexed. In short, it shows the extent and depth of a search engine's "knowledge" about the site. For large sites with many hundreds of pages, search engines may not index every last page. Using a "site" search will show which pages have been found and indexed.

Most search engines allow you to run a site search directly by entering a search term such as "site:www.bbc.co.uk".

Cache

The cache operator shows exactly what HTML content was retrieved by a search engine when it last visited a given page. The date at which the content was retrieved is usually also displayed. The Google cache for a specific page can be viewed by entering a query such as "cache:news.bbc.co.uk". The cached version of a page is available from most major search engines as a named link alongside the usual search results.

In cases where a particular website cannot be viewed (because the server is down or too many users are trying to access it), the search engine "cache" operator can be a useful way to see the contents of a web page.

When using search optimisation techniques it is often useful to know when a search engine last updated its index with the contents of a particular page. For example, if you've recently updated a page, you may want to check whether a search engine has visited the page since the content was updated. Viewing the cached copy of a page stored by a search engine will normally include the date at which the content was found, allowing you to confirm if an updated page has been found and indexed yet.

Link

The link operator is used to show which web pages have a link to a specific web page. Although all major search engines offer this option, there are some important differences in the results that different search engines return. In particular, when you use the "link" option with Google, it only displays a selection of the pages that link to the specified web page. Even if Google has come across many more pages that include a link to the specified page, it will still only show a small subset of these when you try a "link" search (a more complete list is available via the Google Webmaster console).

With Google and Yahoo! you can run a link check by entering a query such as "link:www.bbc.co.uk" directly into the search box.
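If you check these operators often, the queries can be built in a small script. The sketch below only constructs the query string and a search URL for the site, cache and link operators described in this section; the function names are hypothetical, and the default base address is an assumption that may need changing for other search engines.

```python
from urllib.parse import urlencode

SUPPORTED_OPERATORS = ("site", "cache", "link")


def operator_query(operator, target):
    """Build a query such as "site:www.bbc.co.uk"."""
    if operator not in SUPPORTED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{operator}:{target}"


def search_url(query, base="https://www.google.com/search"):
    """Return a URL that runs the query (the base address is an assumption)."""
    return f"{base}?{urlencode({'q': query})}"


for operator in SUPPORTED_OPERATORS:
    print(search_url(operator_query(operator, "www.bbc.co.uk")))
```

The colon in each query is percent-encoded in the resulting URL, so "site:www.bbc.co.uk" becomes "q=site%3Awww.bbc.co.uk".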

Last updated: March 2012