About Search Engines and Web Site Promotion: A Whitepaper
Introduction
Finding information on the huge and ever-growing World-Wide Web can come down to finding the proverbial needle in a haystack. The high volume of information coupled with the absence of a central organizing authority can be a roadblock to finding the highest-quality, most appropriate information for the task at hand. Search engines have been developed to help with the task. Some popular sites that permit searching are AltaVista, Yahoo, Excite, Infoseek, Lycos, WebCrawler, OpenText, Magellan, and HotBot. Between them they have indexed more than 80 million web pages.
But the process through which information providers' sites become known to the search engines is not entirely straightforward. This document describes the issues and makes recommendations of interest to information providers.
Factors Affecting Site Listing and Ranking in Response to Search Requests
A site's listing and ranking in search responses is governed by a number of factors:
- When "www.mysite.org" comes into existence, a site representative may submit its address to the search engine's site for indexing. Alternatively (or additionally) the search engine may discover the site's address during ongoing link-by-link explorations of the Web.
- Keywords and other information about www.mysite.org, either manually submitted or automatically extracted from the site's pages, are registered in the search engine's database.
- When people make search requests that relate to www.mysite.org, search engines use their pre-stored information to retrieve and arrange search results. Listed sites will appear, including www.mysite.org. Results are presented in an order determined by the search engine's design. This order may be alphabetical or chronological, but it may also represent an automatic ranking of relevancy in light of parameters in the requested search.
- As newer sites come along, they may supplant www.mysite.org and appear higher in search responses.
- If www.mysite.org changes address (its URL), search engines may be unable to find it. This will result in a broken link in the engine's listing. If site content has changed, and especially if there are new keywords, the site may no longer be properly searchable. Search engines revisit sites on an ongoing basis to see if they are still reachable or if contents have changed. Accordingly, they may reindex the site. Because of the size of the Web, content providers cannot depend on the timeliness of automatic reindexing.
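The registration and lookup steps described in the list above can be pictured with a toy index. This is only a minimal sketch: the class name TinyIndex, the second address www.othersite.org, and the alphabetical ordering of results are assumptions made for illustration, and real search engines are far more elaborate.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: keyword -> set of site addresses (illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> addresses containing it
        self.keywords_by_site = {}        # address -> keywords registered for it

    def index(self, address, text):
        """Register (or re-register) a site's keywords, replacing stale entries."""
        self.remove(address)
        words = set(text.lower().split())
        self.keywords_by_site[address] = words
        for word in words:
            self.postings[word].add(address)

    def remove(self, address):
        """Drop a site, for example when its old address is no longer reachable."""
        for word in self.keywords_by_site.pop(address, set()):
            self.postings[word].discard(address)

    def search(self, query):
        """Return addresses containing every query word, in alphabetical order."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        return sorted(hits)


idx = TinyIndex()
idx.index("www.mysite.org", "sustainable agriculture research")
idx.index("www.othersite.org", "agriculture news")
print(idx.search("agriculture"))  # both sites are listed

# After the content changes, the old keyword no longer finds the site
# once it has been reindexed.
idx.index("www.mysite.org", "crop genetics research")
print(idx.search("agriculture"))  # only www.othersite.org remains
```

Until the reindexing call is made, the index keeps answering from the old keywords, which is why a changed site can be poorly searchable for some time.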
These factors raise questions for information providers:
- What affects the relevancy rankings of sites as they appear in the indices? What does it take to appear at the top of the list?
- How can an information provider keep from being supplanted by a newer, more recently indexed site?
- How can an information provider get the search engine to reindex a site after changes have been made?
Of course, these issues are relevant to information consumers as well. A site may provide the best information on a particular subject, but if search engines cannot find it, it is nearly invisible.
Relevancy Rankings
Search engines often use proprietary methods to determine the relevancy of a site with respect to keywords. To make it difficult for Web site owners to manipulate the rankings of their sites in search results, the ranking methods are sometimes kept secret. Search engines might take some or all of the following into account when assigning relevancy:
- The number of times a keyword appears in the document, or the percentage of appearances in relation to the total text
- Whether the keyword appears in the title, the description and/or the keyword list in the <META> tag, or in anchor tags
- How close to the beginning of the keyword list the keyword is
- How close to the start of the document the keyword appears
- How near to one another two keywords are
- How many synonyms of a given keyword there are in the document, in addition to the keyword itself
- The length of the URL needed to reach the document (the number of levels deep the page is)
- How frequently the link from the search engine to the indexed site is followed by users of the search site
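To make a few of these factors concrete, here is a minimal sketch of a scoring function. It is not the method of any actual search engine, since those are proprietary. The function name relevancy_score and all of the weights are arbitrary assumptions, and it handles single-word keywords only.

```python
import re


def relevancy_score(keyword, title, meta_keywords, body, url):
    """Toy score combining a few of the factors listed above.

    The weights (10, 2, 0.1) are assumptions chosen only for illustration.
    """
    kw = keyword.lower()
    words = re.findall(r"[a-z0-9]+", body.lower())
    if not words:
        return 0.0

    count = words.count(kw)
    density = count / len(words)  # appearances relative to the total text

    in_title = kw in re.findall(r"[a-z0-9]+", title.lower())

    # An earlier position in the META keyword list earns more credit.
    meta = [k.strip().lower() for k in meta_keywords.split(",")]
    meta_bonus = 1.0 / (meta.index(kw) + 1) if kw in meta else 0.0

    # An earlier first appearance in the body earns more credit.
    position_bonus = 1.0 - words.index(kw) / len(words) if count else 0.0

    # Slashes after the host name roughly measure how many levels deep the page is.
    path = url.split("://", 1)[-1]
    depth = path.strip("/").count("/")

    score = 10 * density + 2 * in_title + meta_bonus + position_bonus - 0.1 * depth
    return round(score, 3)


shallow = relevancy_score(
    "agriculture",
    title="Sustainable Agriculture Institute",
    meta_keywords="agriculture, crops",
    body="Agriculture research and crop data",
    url="http://www.mysite.org/index.html",
)
deep = relevancy_score(
    "agriculture",
    title="News",
    meta_keywords="news, agriculture",
    body="Daily news about markets weather and agriculture",
    url="http://www.othersite.org/a/b/c/page.html",
)
print(shallow > deep)  # the first page scores higher on every factor used here
```

Keyword proximity, synonyms, and click-through frequency are left out because they need more data than a single page provides.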
In addition, the following aspects of page design can cause problems for search engines, or even prevent them from discovering pages to index:
- Frames and Image maps, which many search engines cannot follow
- <A HREF . . .><IMG SRC . . .></A> tags (links with image rather than text anchors), because many search engines can't follow these either
- Tables and JavaScript declarations, which may have the effect of pushing relevant textual content further down the page
- Dynamic content (output from CGI programs or database lookups), which probably won't get indexed
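Some of the constructs listed above can be spotted automatically before a page is published. The following is a rough sketch using Python's standard html.parser module. The class name DesignAudit, the helper audit, and the warning labels are hypothetical, and the check covers only frames, image maps, and image-only links.

```python
from html.parser import HTMLParser


class DesignAudit(HTMLParser):
    """Flags a few constructs that many search engines reportedly cannot follow."""

    def __init__(self):
        super().__init__()
        self.warnings = []
        self._in_link = False
        self._link_text = ""
        self._link_has_image = False

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if tag in ("frameset", "frame"):
            self.warnings.append("frames")
        elif tag == "map" or (tag == "img" and names & {"usemap", "ismap"}):
            self.warnings.append("image map")
        if tag == "a" and "href" in names:
            # Start tracking what this link uses as its anchor.
            self._in_link, self._link_text, self._link_has_image = True, "", False
        elif tag == "img" and self._in_link:
            self._link_has_image = True

    def handle_data(self, data):
        if self._in_link:
            self._link_text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # A link whose only anchor is an image has no text to index.
            if self._link_has_image and not self._link_text.strip():
                self.warnings.append("image-only link")
            self._in_link = False


def audit(html):
    """Return the distinct warnings for a page, sorted alphabetically."""
    parser = DesignAudit()
    parser.feed(html)
    parser.close()
    return sorted(set(parser.warnings))


page = (
    '<FRAMESET><FRAME SRC="a.html"></FRAMESET>'
    '<A HREF="b.html"><IMG SRC="b.gif"></A>'
    '<A HREF="c.html">Text link</A>'
)
print(audit(page))  # ['frames', 'image-only link']
```

A simple remedy for an image-only link is to add a text link to the same address elsewhere on the page.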
What Information Providers Need to Do to Optimize Searching
There are two things that information providers can do to maximize the likelihood that search sites index them correctly and rank them highly: (1) tune sites explicitly for the search engines, and (2) regularly resubmit sites to the search engines.
Tuning Your Site for Search Engines
- Make sure the site is visible. Sites are generally fully visible if a browser can reach them, as long as no "robots.txt" file excludes any of their directories. The robots.txt file is a standard file, placed at the top level of a web server, that tells the automated link-finding processes (robots) of cooperating search engines not to visit the directories it lists. CGNET has not placed robots.txt files on any web servers. A robots.txt file has syntax like the following:
User-agent: *
Disallow: /private/
This file would instruct any robot conforming to the robots exclusion standard not to visit any documents in the /private URL space (or any of its subdirectories).
- Make sure that each HTML page has a descriptive title (using the <TITLE> tag).
- Include HTML <META> tags. For example, in the <HEAD> section of HTML files, where the title is, include the following two lines:
<META name="description" content="A description of the site or page.">
<META name="keywords" content="a comma-separated series of keywords by which you want your Web site to be located">
Note that Excite and some other search engines ignore <META> tags because they have been abused by webmasters trying to attract more hits. If you do use <META> tags, choose your description and keywords carefully and accurately.
- Try not to duplicate the site name in the title, or title words in the keywords list. Avoiding repetition increases the number of distinct "keywords" under which the site is indexed.
- Choose your keywords carefully. "Keywords" should be at least two words long. Web surfers who search for a single word are often frustrated by the overwhelming number of irrelevant hits, and refine their searches by entering multi-word combinations or phrases.
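The visibility check in the first item of the list above can be automated with Python's standard urllib.robotparser module. This is a minimal sketch: the function name is_visible and the robot name ExampleBot are hypothetical, and the rules are read from a string rather than fetched from a live server.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text; a real check would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_visible(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if a robot honoring the exclusion standard may visit url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(is_visible("http://www.mysite.org/index.html"))           # True
print(is_visible("http://www.mysite.org/private/report.html"))  # False
```

Running such a check against each important page is a quick way to confirm that an old exclusion rule is not hiding content you want indexed.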
For more information about fine-tuning your web site for search engines (as well as tips on using search engines for your own searches), see the Search Engine Watch pages.
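The title, description, and keyword recommendations from the tuning list can also be checked mechanically. The sketch below uses Python's standard html.parser module. The names HeadAudit and audit_head and the wording of the messages are assumptions for illustration, and the rules are only the ones stated in this document.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collects the <TITLE> text and named <META> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def audit_head(html):
    """Return a list of problems based on the tuning advice in this document."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()
    problems = []
    title_words = set(parser.title.lower().split())
    if not title_words:
        problems.append("missing title")
    if not parser.meta.get("description", "").strip():
        problems.append("missing description")
    raw = parser.meta.get("keywords", "")
    keywords = [k.strip().lower() for k in raw.split(",") if k.strip()]
    if not keywords:
        problems.append("missing keywords")
    for kw in keywords:
        if len(kw.split()) < 2:
            problems.append(f"single-word keyword: {kw}")
        if set(kw.split()) <= title_words:
            problems.append(f"keyword repeats title words: {kw}")
    return problems


page = (
    "<HEAD><TITLE>Sustainable Agriculture Institute</TITLE>"
    '<META name="description" content="Research on crops.">'
    '<META name="keywords" content="agriculture, crop research"></HEAD>'
)
for problem in audit_head(page):
    print(problem)
```

For this sample page the audit reports that "agriculture" is a single-word keyword and that it repeats a title word, while "crop research" passes both checks.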
Resubmitting your Site
If your site has changed addresses or its content has changed significantly, you should resubmit site information to the major search engines. Some search engines have web forms for submitting or resubmitting a page, and some require submissions to be made by email. Some search engines accept a root URL for a site and automatically index the site by following links from the root. Others, including Infoseek, accept only single pages; sites with multiple pages must submit them by email as a list of URLs.
Resubmission of site information is only a one-time fix for updating search engine indices. It doesn't guarantee that search services will revisit the site by any particular date.
Reindexing sites takes time. The average time between submitting your URL and getting it into the database seems to be 5-8 weeks, although the services themselves quote shorter times. Lycos and Excite both say that it takes 2-4 weeks for a submitted site to appear in their indices. AltaVista says that a home page should be listed in one or two days, with the rest of the site being indexed "over time." Different services use different reindexing methods and decide differently how frequently to revisit a site to look for changes. If it takes a new site some weeks to make it into an index, a follow-up visit may take even longer. The decision to revisit a site may be influenced by how many hits the site gets, or how often it turns up in search queries.
Submission to Yahoo and some other search sites is a bit more involved. As a categorization service rather than a search service per se, Yahoo requires the submitter to specify which category or categories apply to a site. There can be multiple possibilities. Here are some possibilities for agricultural sites in Yahoo:
- Science:Agriculture
- Science:Agriculture:Agronomy:Institutes
- Science:Agriculture:Crops and Commodities
- Science:Agriculture:Organizations
- Science:Agriculture:Sustainable Agriculture
- Science:Agriculture:Sustainable Agriculture:Organizations
- Science:Agriculture:Sustainable Agriculture:Institutes
- Social Science:Environmental Studies:Institutes
- Business and Economics:International Economics:Development:Organizations
- Business and Economics:International Economics:Development:Sustainable Development
Non-profit or grant-making organizations might belong in some of the following categories:
- Society and Culture:Issues and Causes:Philanthropy:Organizations:Grant Making Foundations
- Society and Culture:Issues and Causes:Philanthropy:Organizations:Community Development
- Society and Culture:Cultures and Groups:Children:Organizations:
Web site owners should select categories carefully, as Yahoo and some other categorization services allow only a limited number of links to a given site.
Submission Services
There are services that will manage submissions and resubmissions for you. Some of these are Postmaster (http://www.netcreations.com/postmaster/), Submit It! (http://www.submit-it.com/), and WebPromote (http://www.webpromote.com/av.shtml). The prices of these services range from free (for a limited number of search engines) to several hundred dollars.