TECHNICAL SERVICES LAW LIBRARIAN
Volume 27, No. 2 (December 2001)

  THE INTERNET
Sizing Up the Web
Kevin Butterfield
University of Illinois
butterfi@law.uiuc.edu

OCLC's Web Characterization Project conducts an annual Web sample to analyze trends in the size and content of the Web. Analysis based on the sample is publicly available. OCLC obtains the sample by creating a list of randomly generated IP addresses and then attempting to connect to each address to identify the presence of public Web services. If a Web service is identified, harvesting software captures the site and stores it for future analysis.
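The sampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCLC's harvesting software: the function names, the optional seed, and the choice to test for a Web service by opening a TCP connection on port 80 are all hypothetical, and a real survey would presumably also skip reserved and private address ranges.

```python
import ipaddress
import random
import socket


def random_ipv4_sample(n, seed=None):
    """Draw n distinct IPv4 addresses uniformly from the 32-bit address space."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        chosen.add(ipaddress.IPv4Address(rng.getrandbits(32)))
    return sorted(chosen)


def has_public_web_service(address, port=80, timeout=3.0):
    """Return True if a TCP connection succeeds (an assumed stand-in test)."""
    try:
        with socket.create_connection((str(address), port), timeout=timeout):
            return True
    except OSError:
        return False


# Generate a small sample; probing is left out here to avoid network traffic.
for address in random_ipv4_sample(5, seed=2001):
    print(address)
```

Passing each sampled address to `has_public_web_service` would give a rough count of addresses that answer on port 80, the kind of hit list that harvesting software could then work through.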

Statistics begin with 1998 and cover a wide range of areas. The project notes that the overall number of web sites has grown from 2,851,000 in 1998 to 8,745,000 in 2001. The number of unique web sites has grown from 2,636,000 in 1998 to 8,443,000 in 2001. According to OCLC, it is not uncommon for the same public Web site to be duplicated at multiple IP addresses - e.g., for server load balancing. To ensure that each unique Web site has the same probability of being selected for the sample, OCLC adopted the following rule: if a site is located at multiple IP addresses, the site is retained in the sample only if the numerically lowest IP address is in the sample. Several diagnostic tests were developed to assist in identifying sites with multiple IP addresses.
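The lowest-address rule is simple to express in code. The sketch below assumes the set of IP addresses for a site is already known (the diagnostic tests OCLC used to discover them are not described here); the function name and the example addresses are hypothetical.

```python
import ipaddress


def retain_in_sample(site_addresses, sampled_addresses):
    """Keep a site only if its numerically lowest IP address was sampled."""
    lowest = min(ipaddress.IPv4Address(a) for a in site_addresses)
    sampled = {ipaddress.IPv4Address(a) for a in sampled_addresses}
    return lowest in sampled


# Hypothetical site mirrored at three addresses (documentation-range IPs).
mirrors = ["192.0.2.40", "192.0.2.7", "198.51.100.3"]

print(retain_in_sample(mirrors, ["192.0.2.7", "203.0.113.9"]))    # True
print(retain_in_sample(mirrors, ["192.0.2.40", "198.51.100.3"]))  # False
```

Because a site counts only when one particular address is drawn, a site served from ten addresses has the same chance of selection as a site served from one. Note that the addresses are compared as numbers, not as text; a plain string comparison would wrongly rank 192.0.2.40 below 192.0.2.7.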

Overall, the number of web sites grew 457% from 1997 to 2001, but the annual growth reported by the project is slowing. The number of sites on the web grew by only 18% in 2000-2001, compared with an average annual growth rate of 68% from 1997 through 1999.
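Growth figures of this kind are straightforward to check. The 1997 base behind the 457% figure is not quoted in this column, so the sketch below uses only the 1998 and 2001 totals given above; the function names are hypothetical, and the compound rate shown is one common way to average annual growth, not necessarily the method the project used.

```python
def percent_growth(start, end):
    """Total growth from start to end, as a percentage of start."""
    return (end - start) / start * 100


def average_annual_growth(start, end, years):
    """Compound average annual growth rate, as a percentage."""
    return ((end / start) ** (1 / years) - 1) * 100


# Totals quoted above for all public web sites.
sites_1998 = 2_851_000
sites_2001 = 8_745_000

total = percent_growth(sites_1998, sites_2001)
annual = average_annual_growth(sites_1998, sites_2001, 3)
print(f"1998-2001 total growth: {total:.1f}%")
print(f"1998-2001 average annual growth: {annual:.1f}%")
```

On these figures the web roughly tripled in three years, a total increase of about 207%, or about 45% a year when compounded.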

The project's country-of-origin and language statistics also bear out the dominance of the United States on the web. The United States is the country of origin for 47% of the web sites surveyed in 2001. The categories "Unknown" and "Others" together make up 33% of the sites surveyed. After these, the closest finisher is Germany at 5%. In 1999, the figure for the United States was about the same at 46%, but the combined "Unknown" and "Others" percentage was only 26%. These numbers support the idea that provenance continues to be difficult to prove on the web. In OCLC's sample, the country of origin refers to the geographical location of the organization or individual responsible for the intellectual content of the Web site; in other words, the entity that "published" the Web site. The country of origin applies only to the Web site publisher, not the physical location of the Web server upon which the site's content is stored.

The languages used to express the textual content of the site are also surveyed by OCLC. Note that more than one language may be used. English continues to dominate in 2001 as the language used on 73% of web sites. The second place language is German at 7%.

The project also tracks economic activity on the web. The classes of economic activity used are taken from the North American Industry Classification System (NAICS). The highest percentage belongs to the Others and Unknown categories, which combine for 18.7%. Information accounts for 15.5% of the economic activity, followed by Professional, Scientific, and Technical Services at 14.2%, Other Services (except public administration) at 12.8%, Retail Trade at 11.8%, Manufacturing at 8.5%, Education Services at 6.6%, and so on. The category of economic activity applied by OCLC is the one that best characterizes the organization or individual publishing the Web site. Typically, this categorization takes the form of the industry to which the Web site publisher belongs, but it also includes not-for-profit economic activity (e.g., households, professional societies, etc.). OCLC's categorization applies to the publisher of the site, not the content.

The top fifty referred sites are also tracked. Microsoft is at the head of the list that ranks the public Web sites most frequently linked to from other public Web sites (based on the 2001 sample of public Web sites). Search engines and commercial sites dominate the category. A few newspapers (New York Times, Washington Post and USA Today) along with news services such as CNN also made the list. There were no libraries or educational institutions listed.

These statistics can be read a number of ways. The meaning I take from them is that what was once explosive growth has now slowed. The emphasis would seem to be on refinement and consolidation now rather than new construction. When we deal with the web, we are still trying to catalog, acquire and provide access to moving targets. We have always known that content on the web evolves and that our descriptions must evolve with it.

I found OCLC's statistics interesting after reading an article by Timothy C. Craven, a professor in the Faculty of Information and Media Studies at the University of Western Ontario. Craven's article, "Changes in Metatag Descriptions Over Time," studied how web pages containing descriptive metadata changed over a period of time and whether or not this descriptive metadata adapted to the new content of the pages.

Craven asserts that, unlike scholarly articles and other traditional published documents, Web pages are frequently dynamic, subject to regular, or irregular, updating. Thus, authors of Web pages may benefit from assistance, not only with the initial creation of metatag descriptions, but also with the revision of these descriptions as pages evolve and are revised over time.

One question raised by Craven regarding such revision is how often it is in fact required. A Web page may be revised frequently, and yet its overall description may remain entirely valid and in no need of further attention. A possible indication of the frequency with which Web page descriptions should be revised is the frequency with which the web sites themselves are in fact revised.

Craven began his study by collecting sets of web pages. Each set was divided into two categories: those having metatag descriptions in the summer of 2000 and those not having metatag descriptions at that time. The pages were then revisited a year later, to determine the types of changes that might have taken place: what proportion had lost descriptions, what proportion had gained descriptions, and what changes had been made to descriptions. When a requested page was returned, specially designed software logged data that included the metatag description and the URL.
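The kind of check described here can be sketched with Python's standard HTML parser. This is not Craven's software, whose design is not described in this column; the class and function names, the category labels, and the sample HTML snippets are all hypothetical, and fetching the pages is left out.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collect the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = (attributes.get("content") or "").strip()


def metatag_description(html):
    """Return the metatag description in an HTML string, or None if absent."""
    parser = DescriptionParser()
    parser.feed(html)
    return parser.description


def classify_change(before_html, after_html):
    """Compare two snapshots of a page taken some time apart."""
    before = metatag_description(before_html)
    after = metatag_description(after_html)
    if before is None and after is None:
        return "never described"
    if before is None:
        return "gained"
    if after is None:
        return "lost"
    return "unchanged" if before == after else "changed"


# Hypothetical snapshots of one page, taken a year apart.
page_2000 = '<head><meta name="description" content="Law library catalog"></head>'
page_2001 = '<head><meta name="description" content="Law library catalog and guides"></head>'
print(classify_change(page_2000, page_2001))  # changed
```

Tallying these labels across a set of pages gives the proportions the study reports: descriptions lost, gained, and changed. Judging whether a change is a major rewrite or a lesser modification would need a further, more subjective comparison.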

Overall, Craven observed no indication of either a net decline or a net increase in the use of metatag descriptions, at least over the time period covered. Craven hypothesizes that various possible developments might disturb this apparent steady state: changes in search engine policies; the addition of metatag display to browsing software; the advent of page editing software that makes metatags more prominent or assists in their creation; the inclusion of the metatag description as a required element in HTML, the omission of which would be flagged by validation services; and the supplanting of present metatag descriptions by another kind of metadata, such as Dublin Core, or by external descriptions generated by commercial indexing services. In terms of updating descriptions, about one third of the changes observed involved major rewriting, and about two thirds involved lesser modifications.

For Further Information

OCLC's Web Characterization Project
http://wcp.oclc.org
Changes in Metatag Descriptions Over Time by Timothy C. Craven
http://www.firstmonday.org/issues/issue6_10/craven/index.html


Updated: February 3, 2002.
URL: http://www.aallnet.org/sis/tssis/tsll/27-02/inet.htm