Internet Research

Simple tips to help you double the effectiveness of your on-line searches.

By Sue Eipert
2002

Originally published in Washington Criminal Defense, 17(3) 2003; and in The Investigator: The Official Newsletter of the Washington Association of Legal Investigators, Fall issue 2003. Links verified and updated February 2004.

Finding general answers to questions can be as simple as typing a couple of words into the Google search box (http://www.google.com). But be aware – not everything is on the Web, not everything on the Web can be found through a search engine, and the quality of what is found is extremely variable.

Before the existence of the World Wide Web, there were multiple types of information sources. Even today, although the Web is often the most convenient source, it is far from being the only one. Hard copy books and periodicals, and professional search services such as Lexis/Nexis, Dialog and Factiva, are still essential for many reasons, but in this discussion I’ll focus on how to optimize your search time on the Internet.

My familiarity with this topic stems from years of practice as an information professional. I offer research services to clients both in Seattle and elsewhere in North America, including marketing professionals and expert witnesses, among others. For many years prior to that I served as corporate librarian for an international engineering/environmental consulting company. I’ve searched for information on an enormous number and variety of subjects, especially technical science and engineering literature, competitive market information, industry surveys and statistical information.

Just as it is wrong to assume that if information is not on the Web, it doesn’t exist, it is also a mistake to assume that if you haven’t found something by ‘googling’ it, it’s not on the Web. Yes, Google has become so popular that it has been turned into a verb, and it deserves its popularity. Google is simple and often effective, and in many cases provides all that you need. But there are nearly always ways to improve your Web searching.

Search engines vary in how much of the Web they cover and in how current their results are. A classic study in 1999 showed that even the largest search engine covered only 16% of the publicly indexable Web [Lawrence and Giles 1999]. Google currently covers more of the Internet than other search engines; AlltheWeb is in second place [Notess 2003]. The overlap among the various search engines is not great; in one test, when the same search was run on 10 different search engines, half of the pages found were found by only one of the search engines, and it was not always the same one [Notess 2002].
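The kind of overlap measurement described above can be illustrated with a short Python sketch. It is only an illustration: the engine names and result lists are invented, and the function name unique_hit_fraction is hypothetical, not taken from any of the cited studies.

```python
from collections import Counter


def unique_hit_fraction(results_by_engine):
    """Return the fraction of distinct pages that only one engine returned."""
    counts = Counter()
    for urls in results_by_engine.values():
        counts.update(set(urls))  # count each page at most once per engine
    if not counts:
        return 0.0
    found_by_one = sum(1 for n in counts.values() if n == 1)
    return found_by_one / len(counts)


# Hypothetical result lists, invented for illustration only.
sample_results = {
    "engine_a": ["p1", "p2", "p3"],
    "engine_b": ["p2", "p4"],
    "engine_c": ["p3", "p5", "p6"],
}

print(unique_hit_fraction(sample_results))
```

A high fraction means the engines mostly return different pages for the same query, which is the reason to try more than one.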

In addition to the publicly indexable information covered by search engines, there is a huge invisible Web. The size of the invisible or deep web is estimated to be "between 2 and 50 times larger than the visible Web" [Sherman and Price 2001]. Much of the information in the invisible Web resides in databases that must be searched individually; their contents cannot be discovered by an ordinary search engine search. Common examples would be phone number and address directories such as Anywho.Com (http://www.anywho.com) or Infospace (http://www.infospace.com).

Timeliness is another important variable among search engines. When a query is entered into a search engine, the search engine does not search the current Web; it searches an index database built at some earlier time by crawling the Web. All major search engines work by sending a spider, or crawler, out onto the Web. The spider visits a web page and then continues to visit other pages within the site by following links. Changes in the web pages are registered whenever the spider returns, about every month or so. Whatever the spider finds is put into an index, which is actually a copy of all the pages the spider found. Then, when a search is entered into the search box at a search engine web site, it is the index that is being searched. All search engines query indexes of pages that existed in the past, but some refresh their index more often than others. Most claim to refresh monthly. Currently AlltheWeb claims to do its crawl every 7 to 11 days, compared with Google's 28 days [McHugh 2003].
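To make the crawl-then-search sequence concrete, here is a minimal Python sketch. It rests on several assumptions: the "Web" is a small in-memory dictionary with invented page names, the index is a simple word-to-pages lookup rather than a full copy of each page, and the function names crawl and search are hypothetical. It is not a description of how any real search engine is built.

```python
from collections import deque

# A toy "Web": page address -> (page text, links). All names are invented.
TOY_WEB = {
    "site/home": ("welcome to the post office", ["site/rates", "site/hours"]),
    "site/rates": ("postage rates for letters", ["site/home"]),
    "site/hours": ("office hours and locations", []),
}


def crawl(web, start):
    """Follow links from the start page and return an index: word -> pages."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        text, links = web[page]
        for word in text.split():
            index.setdefault(word, set()).add(page)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return the pages that, according to the index, contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


index = crawl(TOY_WEB, "site/home")

# The page changes after the crawl, but the index has not been refreshed.
TOY_WEB["site/hours"] = ("holiday closures", [])

stale = search(index, "office hours")  # answered from the old index
index = crawl(TOY_WEB, "site/home")    # the spider returns and re-indexes
fresh = search(index, "office hours")  # answered from the refreshed index
```

The first search still reports the changed page, because the query is answered from the old index; only after the second crawl do the results reflect the page as it now stands.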

Compared to publishing a book or article, the ease of putting information on the Internet means that many more organizations and people are able to make information public. This provides a wonderful burgeoning of available information, but at the same time means that there is likely to be less editorial review. So, it becomes more important than ever to use the same critical analysis you would use in evaluating non-Internet information. Who wrote this? What are their credentials? When was it written? Who is the intended audience? Are there clues regarding a hidden agenda?

Research tip: try directories rather than search engines when searching for general information on a topic

Google searches are known for their ability to find specific known items. A search on the term "post office" returns search results with the U.S. Post Office site at the very top of the list every time. Searching for general information on a topic, however, may turn out to be less useful, with many results of dubious relevance. The top item in the results of a search on "Internet crime" was (at the particular time I tried this search) the Internet Crime Archives, an Internet archive of information about serial murderers, not information about Internet crime, or cybercrime. Most of the other results on the first page were much more relevant, but there is another way to go about finding high quality sites with general information on a particular subject.

Web directories are hierarchical, searchable lists of links to sites chosen by humans. The Open Directory Project, for example (see About the Open Directory Project, http://dmoz.org/about.html), uses volunteer editors who are considered experts. Its listings are used by most of the major search engines. Google includes these links in its regular search results, but the directory can be used by itself to get a much shorter list of links that are actually about a particular subject and likely to be of higher quality.

Two directories covering a wide variety of topics are http://directory.google.com and http://www.about.com. FindLaw’s ‘Legal Subjects’, at http://www.findlaw.com/01topics/, is a directory of legal topics.

Research tip: try a specialized search engine

Like subject directories (and sometimes in combination with them), specialized search engines are designed to search only a specific portion of the Internet, so their search results will be more focused. Specialized search engines also have the potential to be more current than the larger general purpose search engines (such as Google), because they have a smaller universe of information to keep updated.

Research tip: use more than one search engine

Since search engines do not all cover the same fraction of the Web, try more than one, especially for obscure information. Try AlltheWeb (http://www.alltheweb.com) and Gigablast (http://www.gigablast.com), or others, in addition to www.google.com.
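If you collect result lists from several engines, combining them into one list without repeats is easy to automate. The Python sketch below is a rough illustration only: the function name merge_results and the example addresses are hypothetical, and treating two addresses as the same page when they differ only in letter case or a trailing slash is a crude assumption that will not always hold.

```python
def merge_results(result_lists):
    """Interleave ranked lists from several engines, dropping repeated pages."""
    merged = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                # Crude assumption: ignore case and a trailing slash
                # when deciding whether two addresses are the same page.
                key = results[rank].lower().rstrip("/")
                if key not in seen:
                    seen.add(key)
                    merged.append(results[rank])
    return merged


# Hypothetical results from two engines, invented for illustration.
engine_one = ["http://example.com/a", "http://example.com/b"]
engine_two = ["http://example.com/b/", "http://example.com/c"]

combined = merge_results([engine_one, engine_two])
```

Interleaving by rank keeps each engine's top picks near the top of the combined list, so no single engine's ordering dominates.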

Research tip: remember the invisible Web

Although the invisible Web is defined by the fact that its content cannot be searched by general purpose search engines, the entry points to this content can be found by doing, for example, a Google search, or through specialized search engines or directories. The results at topic-specific sites often include invisible Web sites, although they are not identified as such.

One way to find an invisible Web site specific to a topic is to check directories of invisible Web sites at:

Another way to locate an invisible Web site on a specific topic is to search Google and add the word ‘database’. For example, results from searching Google for ‘drug database’ include:

Research tip: use news sites for current articles not found in an ordinary search engine

Because the results found in ordinary search engines come from crawling the web at some time in the past, use sites which specialize in news to get up-to-the-minute information. Some good general sites—search engines specialized for news—are:

These sites will be adequate for many purposes, but news (as well as journal articles) will be covered much more comprehensively in commercial sources such as Lexis/Nexis, Dialog or Factiva than in any free Internet search.

Research tip: don’t expect everything to be free

An example of Internet information that is not free is http://www.idex.com/, an excellent site with listings of expert witnesses. It is available only to the defense bar, and only by subscription.

Other examples are the large professional systems such as Lexis/Nexis, Dialog or Factiva, which existed long before the World Wide Web, and are not actually part of the Web, but are available through the Web. The amount of content is phenomenal and the quality is high. These providers aggregate information from hundreds of different publishers.

Research tip: search the web as it used to be

Do you need to know what a web site used to look like before it was changed or taken off the Web? Here are two ways that may yield results:

References

Sherman, Chris and Gary Price, 2001, The Invisible Web: Uncovering Information Sources Search Engines Can't See, Medford, NJ: CyberAge Books.

Lawrence, Steve, and C. Lee Giles. "Accessibility and Distribution of Information on the Web," Nature 400(6740): 107-109, July 8, 1999.

Notess, Greg R., 2002, Search Engine Statistics: Database Overlap, SearchEngineShowdown. Retrieved 2/11/2003 from http://www.searchengineshowdown.com/stats/overlap.shtml

Notess, Greg R., 2003, Search Engine Statistics: Relative Size Showdown, SearchEngineShowdown. Retrieved 2/11/2003 from http://www.searchengineshowdown.com/stats/size.shtml

McHugh, Josh, 2003, "Google vs. Evil", Wired Magazine, Issue 11.01. Retrieved 2/11/2003 from http://www.wired.com/wired/archive/11.01/google_pr.html

About the author:

Sue Eipert provides business and scientific research services, using professional proprietary databases as well as the visible and invisible Web, to fulfill the information needs of clients in the U.S. and Canada, including environmental consultants, forensics professionals, expert witnesses in various fields, and manufacturing and e-commerce companies.

Eipert Information Services
seipert@eipertinfo.com

Copyright © 2000-2008 Sue Eipert