Originally published in: Crime Scene: Newsletter of the Northwest Association of Forensic Scientists, 28(2) 2002. Links verified and updated February 2004.
Search engines on the Web are great places to find information. The answer to many questions can be discovered on a web page found through a search engine, but if an answer is not found in this way, that certainly does not mean it is unavailable anywhere. So where is the rest of the published information?
I’ve searched for information on an enormous number and variety of subjects, especially within science and engineering fields, and for competitive market and industry information. For many years I was a corporate librarian for an engineering/environmental consulting company, and now am an independent researcher working for companies and clients. Before the World Wide Web existed, there was a multiplicity of information sources, and even today, Web search engines are far from the only source.
An intriguing study by faculty and students at the School of Information Management and Systems at the University of California at Berkeley (Lyman and Varian, 2000) estimated the world's total information on various media—print, film, optical, and magnetic. Let’s look at the portion of information that is used for research—retrievable public text in both electronic format and paper. [Note: the 2000 study discussed here was updated in 2003; both citations are listed below.]
According to this study, the surface Web, or that part of the Web capable of being searched by the general-purpose search engines, contains 10 to 20 terabytes of textual content (1 terabyte = 10^3 gigabytes = 10^6 megabytes = 10^9 kilobytes). We might ask, how does this compare to printed materials? The authors, using figures from booksinprint.com, found that there were 3.2 million titles in print in the U.S. in January 2000, comprising about 26 terabytes of information. The Library of Congress has a print media collection that includes almost 26 million books, or 208 terabytes. The annual worldwide production of information in publications is estimated as 8 terabytes in books, 25 terabytes in newspapers, 20 terabytes in magazines, and 2 terabytes in scholarly journals.
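The figures above can be cross-checked with a little arithmetic. The following is only a rough sketch in Python, assuming the decimal units given in the paragraph; the function and variable names are made up for illustration and do not come from the study.

```python
# Figures as quoted in this article from Lyman and Varian (2000).
# Assumption: decimal units, so 1 terabyte = 10**6 megabytes.
MB_PER_TB = 10**6


def avg_mb_per_book(total_tb, book_count):
    """Average size of one book in megabytes implied by a total and a count."""
    return total_tb * MB_PER_TB / book_count


# Both book figures imply roughly 8 megabytes of text per title.
books_in_print_mb = avg_mb_per_book(26, 3_200_000)
library_of_congress_mb = avg_mb_per_book(208, 26_000_000)

# Annual worldwide print production, in terabytes, summed across media.
annual_print_tb = {
    "books": 8,
    "newspapers": 25,
    "magazines": 20,
    "scholarly journals": 2,
}
annual_total_tb = sum(annual_print_tb.values())

# Library of Congress book collection relative to the surface Web estimate
# (10 to 20 terabytes), expressed as "times larger".
surface_web_tb = (10, 20)
loc_vs_surface = tuple(208 / tb for tb in surface_web_tb)

print(f"Books in print: {books_in_print_mb} MB per title")
print(f"Library of Congress: {library_of_congress_mb} MB per book")
print(f"Annual print production: {annual_total_tb} TB")
print(f"Library of Congress vs. surface Web: {loc_vs_surface} times larger")
```

On these numbers, the Library of Congress book collection alone is roughly 10 to 20 times the size of the searchable surface Web, and a single year of print publication is several times larger than it.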
Where do we find information that is beyond that found on the surface Web? First of all, we need to consider that centuries-old medium, paper, and the databases that index paper and digital articles. Most books are not available in digital form at all; the Library of Congress is digitizing selected parts of its collection and making them available on the Web, but that is no more than a tiny fraction of the library's holdings. Most scholarly journals are not accessible through Web search engines because many of those that are on the Web are available only through subscriptions to individual periodicals or to collections of periodicals. Also, many older publications will never be digitized. Older publications are often considered out-of-date and irrelevant, but they can be very important in some cases. Serious medical errors have been made because doctors searched only electronically available sources, thereby missing key older articles. Litigation often requires older data, such as engineering standards or medical guidelines that were in place in a particular year.
Professional proprietary indexing and abstracting services such as Dialog, Lexis-Nexis, and Dow Jones aggregate hundreds of databases (Biosis, Compendex, etc.) that index periodicals. Searching these aggregations of electronic databases was the core of literature searching by librarians and other information professionals for at least 20 years before the World Wide Web even existed, and now, even though the regular Web contains massive amounts of useful information and several of the databases are available by individual subscription, the aggregator services are still extremely important. They employ sophisticated search languages that make it possible for proficient searchers to seek information comprehensively and efficiently, and they also contain extensive valuable content including esoteric subject areas. The bulk of the world’s print scientific literature is indexed through these databases, and can be searched even though most of the articles themselves are not electronic.
Then there is the invisible, deep, or hidden Web: the part not penetrable by search engines. While the name given to this portion of the Web varies, as do the details of what is included in the definition, the important point is that, for various reasons, much of the information on the Web cannot be found directly from the general-purpose search engines that we use every day. Estimates of the size of the invisible or deep Web range from “between 2 and 50 times larger than the visible Web” (Gregory, 2001, book review of Sherman and Price, 2001) to “400 to 550 times larger than the information on the surface Web” (Bergman). By any of these estimates, the information on the Web not searchable by general-purpose search engines is substantial. The deep Web is the fastest growing category of new information on the Internet, and tends to include more authoritative and current sources (Bergman). In a subsequent issue, I’ll focus on how to find what’s available in this invisible Web.
For an example of the invisible Web, check out the NUCEXP database (http://www.ga.gov.au/oracle/nukexp_query.html). It’s a compilation by the Australian Geological Survey Organisation of all nuclear explosions recorded since 1945. Each page of search results is created from the database in response to a particular search. Because no such page exists until someone submits a query, it is not indexed by search engines.
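To see why query-generated pages stay out of reach, here is a toy sketch in Python. It is not how NUCEXP or any real search engine works; the site layout, the placeholder records, and the function names are all hypothetical. It assumes a crawler that discovers pages only by following links, and a search page that offers a form but no links to results.

```python
from collections import deque

# Hypothetical toy site: each static page maps to the links it contains.
STATIC_PAGES = {
    "/index.html": ["/about.html", "/search.html"],
    "/about.html": ["/index.html"],
    "/search.html": [],  # holds only a query form, no links to result pages
}

# Hypothetical placeholder records standing in for a database behind the form.
RECORDS = [
    {"id": 1, "topic": "alpha"},
    {"id": 2, "topic": "beta"},
    {"id": 3, "topic": "alpha"},
]


def results_page(topic):
    """Build a result page on demand; it exists only as the answer to a query."""
    hits = [record["id"] for record in RECORDS if record["topic"] == topic]
    return f"/results?topic={topic} -> records {hits}"


def crawl(start):
    """Breadth-first crawl that discovers pages only by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in STATIC_PAGES.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


indexed = crawl("/index.html")
print("Pages the crawler can index:", sorted(indexed))
print("Page a person gets by searching:", results_page("alpha"))
```

The crawler reaches the search form but never a results page, because nothing links to one. A person who types a query into the form gets content the crawler never saw.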
Bergman, Michael K., “The Deep Web: Surfacing Hidden Value”, BrightPlanet White Paper. Retrieved from http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp on June 14, 2002. [Link update: page moved to http://www.brightplanet.com/technology/deepweb.asp.]
Gregory, Gwen M., 2001, “Uncovering the Invisible Web: General-purpose Search Engines Only Index a Small Portion of the Internet. (Book Review),” Information Today, December 2001.
Lyman, Peter and Hal R. Varian, 2000, How Much Information? Retrieved from http://www.sims.berkeley.edu/how-much-info on June 14, 2002. [Updated version at Lyman, Peter and Hal R. Varian, 2003, How Much Information 2003? See http://www.sims.berkeley.edu/how-much-info-2003.]
Sherman, Chris and Gary Price, 2001, The Invisible Web: Uncovering Information Sources Search Engines Can't See, Medford, NJ: CyberAge Books.
Sue Eipert provides business and scientific research services, using professional proprietary databases as well as the visible and invisible Web to fulfill the information needs of clients, including engineering companies, environmental consultants, forensics professionals, expert witnesses, manufacturers and Internet e-commerce companies.
Eipert Information Services
seipert@eipertinfo.com
Copyright © 2000-2008 Sue Eipert