How Big is the Internet?
By Marsha Miller
On our current Internet Search Tools page, you will find a link, "Learn more about search engines and how they work!" We are working [slowly] on a major revision of the search tools page generally; this link takes you to the bottom of the current page, where I quote a 2000 estimate: "The Internet contains over 2.1 billion pages (July 2000 estimate by Cyveillance). Every day more than 7 million new pages are being added. It is predicted that there will be more than 4 billion pages by early 2001." Obviously, if I want to keep this sort of information attached to the new page, I need some new information! So, last week I set out to see what I could find. Here are some of the results. |
| Size of the Internet/World Wide Web: Estimating the size of the Internet and/or the World Wide Web depends on what you are counting: the number of actual individual "pages" {e.g., www.nua.ie/surveys/how_many_online/index.html}; the number of entities that have established a web site {e.g., http://www.nua.ie}; the number of ISPs {Internet Service Providers}/hosts; the number of sites by domain* {e.g., how many .edu* sites are out there}; and so on. Estimating how many 'pages' an Internet Search Engine hits also depends on many factors, including how often the Search Engine 'reaches out' to a particular page or site, what search algorithm it uses to 'record' what it finds, and what search parameters are available to you, the searcher. The Internet Software Consortium counted 162,128,493 hosts {July 2002, www.isc.org/ds/WWW-200207/index.html}. As a point of reference, two years earlier the figure was 93,047,785, and their earliest figure {January 1993} was 1,313,000 hosts. |
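For anyone who wants to check the growth implied by those host counts, here is a minimal Python sketch. The three counts are the ISC figures quoted above; the function names, the decimal-year keys (January as .0, July as .5), and the assumption of steady compound growth are mine, for illustration only.

```python
# Host counts quoted in the text (Internet Software Consortium surveys).
# Keys are decimal years: an illustrative convention, not ISC's own.
HOSTS = {
    1993.0: 1_313_000,     # January 1993
    2000.5: 93_047_785,    # July 2000
    2002.5: 162_128_493,   # July 2002
}


def growth_factor(start, end, counts=HOSTS):
    """How many times larger the host count is at `end` than at `start`."""
    return counts[end] / counts[start]


def annual_growth_rate(start, end, counts=HOSTS):
    """Average yearly growth, assuming steady compound growth (a simplification)."""
    years = end - start
    return growth_factor(start, end, counts) ** (1 / years) - 1


if __name__ == "__main__":
    print(f"July 2000 to July 2002: x{growth_factor(2000.5, 2002.5):.2f}")
    print(f"Average per year, 2000-2002: {annual_growth_rate(2000.5, 2002.5):.1%}")
    print(f"Average per year, 1993-2002: {annual_growth_rate(1993.0, 2002.5):.1%}")
```

Under those assumptions, the host count grew by a factor of about 1.74 between July 2000 and July 2002, or roughly 32% per year.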
| For the serious searcher/researcher, Search Engine Watch can give you the best idea of how search engines are designed, so that you have a better idea of how to choose the best engine(s) for your particular job. |
| Just Because Everyone Else {popularity vs. actual research value?}… How many of you "just use Google?" And if you do, how many of our students are using it and assuming that they are finding everything that needs to be found? If you think that Internet users are information-literate searchers, you need to visit some of the sites that let you eavesdrop on the invisible searcher; they generate lists of active searches, refreshing every few seconds [try Metaspy at http://www.metaspy.com/; Ask Jeeves Peek Through the Keyhole at http://www.ask.com/docs/peek/; Search Spy at http://www.kanoodle.com/spy/spy; PrimeTimeSearch [use caution/uncensored] at http://www.primetimesearch.com/livesearch.htm]. The searches shown certainly demonstrate general/recreational/personal uses of the Internet/WWW, but in an academic library we're interested in people searching efficiently for items of academic/scholastic/scholarly worth [aren't we??!!]. One of the reasons libraries have pages such as ISU's Internet Search Tools is to help people sort out where they really need to be, as opposed to whichever search engine is doing the best job of [self]-promotion. Browser News http://www.upsdell.com/BrowserNews/stat_search.htm has some interesting info: |
| Which search engines should you register with? Search Engine Watch says (Oct. 2002) the most used engines are (in descending order): Google, Yahoo, Overture, DMOZ (ODP), Inktomi, LookSmart, Teoma, AltaVista, and All the Web (FAST). Note that Overture is a for-pay engine, and that, unless you pay, it is extremely hard to get listed with Yahoo. |
| How Much Information? A special project at the University of California is looking at the origin and generation of all kinds of 'information'. Their section on the Internet { http://www.sims.berkeley.edu/research/projects/how-much-info/internet.html} has this fascinating information {dated 2000; if you look at the section below, online, there are embedded footnote links}: |
| There are two groups of Web content. One, which we would call the "surface" Web is what everybody knows as the "Web," a group that consists of static, publicly available web pages, and which is a relatively small portion of the entire Web. Another group is called the "deep" Web, and it consists of specialized Web-accessible databases and dynamic web sites, which are not widely known by "average" surfers, even though the information available on the "deep" Web is 400 to 550 times larger than the information on the "surface." |
| The "surface" Web consists of approximately 2.5 billion documents up from 1 billion pages at the beginning of the year, with a rate of growth of 7.3 million pages per day. Estimates of the average "surface" page size vary in the range from 10 kbytes per page to 20 kbytes per page. So, the total amount of information on the "surface" Web varies somewhere from 25 to 50 terabytes of information [HTML-included basis]. If we want to obtain a figure for textual information, we would use a factor of 0.4, which leads to an estimate of 10 to 20 terabytes of textual content. At 7.3 million new pages added every day, the rate of growth is [taking an average estimate] 0.1 terabytes of new information [HTML-included] per day. |
| If we take into account all web-accessible information, such as web-connected databases, dynamic pages, intranet sites, etc., collectively known as "deep" Web, there are 550 billion web-connected documents, with an average page size of 14 kbytes, and 95% of this information is publicly accessible. If we were to store this information in one place, we would need 7,500 terabytes of storage, which is 150 times more storage than we would need for the entire "surface" Web, even taking the highest estimate of 50 terabytes. 56% of this information is the actual content [HTML excluded], which gives us an estimate of 4,200 terabytes of high-quality data. Two of the largest "deep" web sites - National Climatic Data Center and NASA databases - contain 585 terabytes of information, which is 7.8% of the "deep" web. And 60 of the largest web sites contain 750 terabytes of information, which is 10% of the "deep" web. |
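The arithmetic in the quoted passage is easy to re-run. The sketch below is my own check, not part of the Berkeley project: the input figures are the ones quoted above, while the variable names, the helper function, and the use of decimal units (1 kbyte = 1,000 bytes, 1 terabyte = 10^12 bytes) are assumptions made for illustration.

```python
KB = 1_000                     # bytes per kbyte (decimal units assumed)
TB = 1_000_000_000_000         # bytes per terabyte (decimal units assumed)

SURFACE_PAGES = 2_500_000_000  # "surface" Web documents
NEW_PAGES_PER_DAY = 7_300_000  # daily growth of the "surface" Web
DEEP_PAGES = 550_000_000_000   # "deep" Web documents
DEEP_TB = 7_500                # quoted "deep" Web storage estimate


def terabytes(pages, kb_per_page):
    """Total size in terabytes for `pages` documents of `kb_per_page` kbytes each."""
    return pages * kb_per_page * KB / TB


surface_low = terabytes(SURFACE_PAGES, 10)      # low page-size estimate
surface_high = terabytes(SURFACE_PAGES, 20)     # high page-size estimate
text_low, text_high = 0.4 * surface_low, 0.4 * surface_high
daily_growth = terabytes(NEW_PAGES_PER_DAY, 15)  # 15 kbytes = midpoint of 10 and 20

deep_vs_surface = DEEP_TB / surface_high        # compared with the highest surface estimate
deep_content = 0.56 * DEEP_TB                   # HTML excluded
deep_from_pages = terabytes(DEEP_PAGES, 14)     # recomputed from page count and size

if __name__ == "__main__":
    print(f"Surface Web: {surface_low:.0f} to {surface_high:.0f} TB")
    print(f"Surface text only: {text_low:.0f} to {text_high:.0f} TB")
    print(f"Surface growth: {daily_growth:.2f} TB per day")
    print(f"Deep vs. surface: {deep_vs_surface:.0f} times")
    print(f"Deep content, HTML excluded: {deep_content:.0f} TB")
    print(f"Deep Web from page count: {deep_from_pages:.0f} TB")
```

The surface figures reproduce the quoted 25 to 50 terabytes and about 0.1 terabytes per day. Multiplying 550 billion documents by 14 kbytes gives about 7,700 terabytes under these unit assumptions, a little above the quoted 7,500, so the published figure was presumably rounded or based on slightly different inputs.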
| How Many Online? While this doesn't affect your specific searching, it may interest you to know that a September 2002 chart at http://www.nua.ie/surveys/how_many_online/index.html estimates that 605 million people are now online, 182 million of them from the U.S. and Canada. |
| Further reading: |
| A Nation Online: How Americans Are Expanding Their Use of the Internet {Washington, D.C. February 2002} |
| UCLA Internet Report |
| The Major Search Engines |
| *Domains = the .edu/.com/.org, etc. and the country codes {.jp for Japan, .us for U.S., etc.}. So, for example, in the July 2002 stats, .edu, which represents U.S. institutions of higher education, had 7,381,306 'unique' listings; .org had 1,238,739; .gov had 700,107; and .com had 43,814,657 { http://www.isc.org/ds/WWW-200207/dist-bynum.html }. |
| Are there any countries with NO hosts/domains registered? Yep! Bouvet Island, Equatorial Guinea, Haiti, Maldives, Sudan, Svalbard And Jan Mayen Islands, Somalia, Syrian Arab Republic, Chad, United States Minor Outlying Islands, Wallis And Futuna Islands, Mayotte and Zaire {July 2002} |