No. WP-98-05

How public is the Web?:
Robots, access, and scholarly communication
 


Herbert Snyder
Internet: hsnyder@indiana.edu
Phone: 812.855.3250
Fax: 812.855.6166 
Howard Rosenbaum
Internet: hrosenba@indiana.edu
Phone: 812.855.3250
Fax: 812.855.6166

Center for Social Informatics
SLIS
Indiana University
Bloomington, IN USA 47405-1801

(C) 1997 Snyder and Rosenbaum


Use this table to navigate through the paper:
Abstract Introduction Methodology Findings Discussion
Conclusions Bibliography Appendix A: Working with "robots.txt" file Appendix B: Putting it Together Appendix C: The Interview Questions

 
Abstract

This paper examines the use of "Robot Exclusion Protocol" to restrict the access of search engine robots to 10 major American university websites belonging to institutions recently named among "America's Most Wired" universities (Gan, 1997). An analysis of web site searching and interviews with web server administrators at these sites shows that the decision to use this procedure is largely technical and is typically made by the web server administrator. The implications of this decision for openness in scholarly communication and for the future of academic, university-based web publishing are discussed.

Return to Contents

Introduction

Increasingly large amounts of scholarly output are being made publicly available on the world wide web. However, at the same time that information is becoming more accessible, technical developments occurring at universities and colleges are making it potentially more difficult for scholars and others to access information posted on servers in these institutions' domains. This paper examines the phenomenon of restricting access by resource discovery robots (e.g. search engine crawlers and spiders) to university and college computing networks through the use of the Robot Exclusion Protocol's (REP) "/robots.txt" file or the ROBOTS META tag with a NOINDEX value (Koster, 1997). The purpose of REP, in either form, is to prevent search engines from indexing the public contents of web sites, effectively excluding the sites from search engine databases.

The study examines the 10 colleges and universities with the best internet connectivity and institutional computing infrastructures as defined by a recent ZDNet survey. These institutions were ranked according to 35 criteria grouped into four categories: academics, hardware and wiring, social use, and student services (Gan, 1997). These criteria are used here as an indicator of an institution's resources for network connectivity and its potential for disseminating scholarly output. For more on how the original survey was conducted, see Appendix B: Putting it Together.

What is Robot Exclusion Protocol?

Before discussing REP, it is necessary to answer a prior question: what is a robot? A robot is (Koster, 1998a):

A program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Robots are used for a number of purposes including, but not restricted to, search engine indexing, URL and HTML validation, link checking, and monitoring changes in web pages. They are crucial to the operation of the web because they "power large text search engines" and "allow organizations ...to compile statistics on the size and makeup of the web" (Stein, 1997; 247). There are currently approximately 50 robots known to be roaming the web (Fischer, 1998). What is important about robots is how they are written and implemented. Poorly written robots can overload servers as they move through a web site and use more than their share of local computing cycles. Well written robots do their job, which is to collect information, and move on. Robots are written in different ways, so they will have different methods for gathering data from a website. Typically, they will use a selection of URLs from the search engine's database to begin moving through the web; most will follow links from these documents to new documents, gather data, and then follow the links found in each new document. When the robot downloads a new page, it (Stein, 1995; 121):
Systematically identifies all links that point to HTML documents, and some information about each link is added to a growing database (some robots just record the title while others index the entire text). After exhausting the contents of a site, the robot chooses a link that points to a site that it's not seen before and jumps there...The upshot of this is that a robot is likely to find your site if anyone on the web has ever made a link that points to you.
In addition, some robots may index the full markup including the META tag, or other special hidden tags.
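The traversal described above can be illustrated with a short sketch. The following is written in modern Python and is not the implementation of any actual search engine robot; the starting URL, the page limit, and the decision to record only page titles are assumptions made for the example.

import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkAndTitleParser(HTMLParser):
    """Collects the page title and the targets of <A HREF="..."> links."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url, limit=10):
    """Breadth-first traversal: visit a page, record its title, queue its links."""
    queue, seen, index = [start_url], set(), {}
    while queue and len(index) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                       # unreachable page; move on
        parser = LinkAndTitleParser()
        parser.feed(html)
        index[url] = parser.title.strip()  # some robots record only the title
        for href in parser.links:
            queue.append(urljoin(url, href))  # follow links to new documents
    return index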

As a result of a cooperative agreement reached in 1994 between the authors of robots and crawlers and others with an interest in "bots," technical procedures are available to prevent a robot, crawler, or spider from accessing all or part of a website. Robots had overloaded servers with high numbers of HTTP requests, accessed areas containing temporary or duplicate information, tried to index synthetic URLs (such as those generated in response to a search engine query), and interacted with cgi-bin scripts in ways that disrupted the scripts' functionality. In response, participants on the listserv <robots-request@nexor.co.uk> began discussing strategies to prevent these incidents from recurring. The Robot Exclusion Protocol (REP) was the outcome of their discussion; it is not an "official" standard and there is no enforcement mechanism in the event that it is not followed. Ironically, most of the major search engines use REP to prevent robots from accessing their own sites, in an effort to avoid the problems that occur when a robot tries to download pages that the search engines generate dynamically. See Appendix A: Working with "robots.txt" file for examples of search engine positions on REP.

REP provides web server administrators with two ways to restrict robot access to their sites. The first method requires that a file called "/robots.txt" be placed in the server's top-level directory; then (Koster, 1998b):

When a compliant Web Robot visits a site, it first checks for a "/robots.txt" URL on the site. If this URL exists, the Robot parses its contents for directives that instruct the robot not to visit certain parts of the site.
This file has a standard format and syntax that allows the server administrator to include two parameters: "User-agent," which specifies the robot (or, if all robots are to be addressed, the wild card character "*" is used), and "Disallow," which defines the file paths that are off limits (a value of "/" blocks access to the entire site). The format for a file that would prevent all compliant robots from accessing three sections of the website is as follows:
# Robots.txt file for www.somedomain.edu
User-agent: *
Disallow: /cgi-bin
Disallow: /temp
Disallow: /host.dept/private
The "/robots.txt" file can be used in a variety of ways. It can exclude all or specific robots from all sections of the web site. It can restrict access to certain files or pages and allow access to others. It can also be used to permit access to specific robots and not others. See Appendix A: Working with "robots.txt" files for details and examples.
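To illustrate how a compliant robot interprets such a file, the following minimal sketch uses the robots.txt parser in Python's standard library (a modern convenience, not the mechanism used by any particular robot of the period); the domain and paths are the hypothetical ones from the example above.

import urllib.robotparser

robot_rules = urllib.robotparser.RobotFileParser()
robot_rules.set_url("http://www.somedomain.edu/robots.txt")
robot_rules.read()  # fetch and parse the site's /robots.txt file

# Paths matched by a Disallow line are off limits to all robots ("*")...
print(robot_rules.can_fetch("*", "http://www.somedomain.edu/temp/draft.html"))    # False
# ...while paths not matched by any Disallow line remain fair game.
print(robot_rules.can_fetch("*", "http://www.somedomain.edu/papers/index.html"))  # True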

The second method is a local procedure which requires that the web page author include terms in the <HEAD> of her page that will be recognized by robots when they come upon the page. It involves the use of the <META> tag with two attributes, each of which takes a specific value: the NAME attribute should be "ROBOTS" and the CONTENT attribute should be "NOINDEX." This tag is used as follows:

<HTML>
<HEAD>
<TITLE>Your hidden page</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX">
</HEAD>
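A robot that honors this convention has to detect the tag while parsing the page's markup. The sketch below, again in Python and purely illustrative, shows one way an indexer might check for the ROBOTS/NOINDEX directive; the parser class and the sample page are invented for the example.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <META NAME="ROBOTS" CONTENT="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag.lower() != "meta":
            return
        fields = dict((name.lower(), value or "") for name, value in attrs)
        if fields.get("name", "").upper() == "ROBOTS":
            for token in fields.get("content", "").split(","):
                self.directives.add(token.strip().upper())

page = '<HTML><HEAD><TITLE>Your hidden page</TITLE><META NAME="ROBOTS" CONTENT="NOINDEX"></HEAD></HTML>'
meta = RobotsMetaParser()
meta.feed(page)
if "NOINDEX" in meta.directives:
    print("The author has asked that this page not be indexed.")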
What does this mean for the average user at an institution that uses REP (in the form of "/robots.txt")? Consider the case of a faculty member who places her work on the web, whether academic papers, creative writings, or artwork. The work is placed on the web in an effort to make it available to a wider audience and, if the intended audience is not restricted to a class, access to the work is intended to be global. One of the important ways the work can come to the attention of the intended audience is through a search engine, and, if REP is in place, the work is for the most part invisible; the audience is restricted to those who have the precise URL or who stumble across it while browsing. Perhaps the faculty member could work around this problem by submitting her URL directly to the search engine, which most indexing services allow. However, when the URL is submitted manually, it is queued and will not be entered into the search engine's database until the designated pages have been visited and verified by the search engine's robot. When the robot comes upon the "/robots.txt" file, it turns away, and the URL is consigned by the search engine to the digital wastebasket.

Suddenly, the decision to use REP has interesting implications for the growth of digital scholarly communication on the web. This paper seeks to explore several of these implications.

Return to Contents

Methodology

We chose the sample of 10 "most wired" universities as exemplars of best practices and not as a representative sample of universities. The point was to select institutions that encourage web access and use among their faculty and staff. This sample is a set of case study observations that indicates the direction in which these organizations may be moving. As a sampling frame, we chose the top ten universities from the ZDNet survey. The purpose of this study is not to validate or challenge the ZDNet survey methods. While there may be controversy over some schools' relative positions in the rankings, those ranked at the top clearly represent a high degree of internet connectivity. See Appendix B: Putting it Together for a discussion of the original survey and the listing of the top twenty universities and colleges.

There were two strategies used to collect data. First, the university websites were examined using a search strategy designed to uncover the presence of a "/robots.txt" file. This involved finding the home pages for the main servers that constituted the university's web site and using the following in the location box of the browser:

http://host.university.edu/robots.txt
This request would either produce the "/robots.txt" file or an error message, which was interpreted as indicating that the file did not exist on the server. If the file was found, it was printed.
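In practice, this first strategy amounts to requesting the "/robots.txt" URL for each host and noting whether the server returns the file or an error. A minimal sketch of such a check follows, written in modern Python; the host names are placeholders, not the institutions in the sample.

import urllib.request
import urllib.error

hosts = ["www.university-a.edu", "www.university-b.edu"]  # hypothetical hosts

for host in hosts:
    url = "http://%s/robots.txt" % host
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            print(url, "-> file found:")
            print(response.read().decode("utf-8", errors="replace"))
    except urllib.error.HTTPError as err:
        # an error response (e.g. 404) was interpreted as "no robots.txt on this server"
        print(url, "-> not found (HTTP %s)" % err.code)
    except urllib.error.URLError as err:
        print(url, "-> could not connect (%s)" % err.reason)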

The second strategy involved email and telephone interviews with the web server administrators of these sites. The questions were intended to uncover the reasons for using REP or for choosing not to use it. Ten respondents were interviewed. See Appendix C: Interview questions.

Return to Contents

Findings

What areas of university web sites exclude search engine robots and why?

Universities which exclude search engine robots

The ostensible reason for using REP is, of course, to exclude search engines from accessing the university's web pages. The degree to which the engines were excluded from the sites varied widely. Among the universities that excluded robots, the most circumscribed restrictions were limited to sites under development. According to the webmaster at the university where this limited restriction was applied, student pages were not excluded from search engines; however, the student pages frequently included links to incomplete sites within the university. It was to these incomplete sites alone that access was blocked.

More extreme cases were found in universities which blocked access to any pages other than official university pages. This included both student web pages and internal department pages containing faculty members' personal pages. Two main reasons emerged for this type of exclusion: conservation of computer resources and control of external access. In the first instance, webmasters frequently mentioned student pages that overtaxed the university's computer resources due to their popularity, although no specific examples were supplied. In the second instance, no explanation was offered beyond the intent to exclude any directories or pages not officially prepared by the university from indexing by internal or external search engines.

One university that did exclude search engine access to student and faculty web pages supplied a separate server expressly for the purpose of promoting individual web pages and making them available to the public via search engines. The alternative resource was prominently displayed as part of the posted restrictions policy (see the section below on notification to users), and test searches of the alternative website by the researchers indicated that the site allowed access to search engines.

In one instance a technical reason was also supplied for using REP: the prevention of "infinite loops." According to the university webmaster, the school's official calendar was excluded because "each page on the calendar has links to surrounding days, and there's no limit on the limit of event dates we'll display."

Universities which did not exclude search engine robots

The universities which did not exclude search engines were unanimous in their response that it was desirable for outside users to be able to find and access material from the website without hindrance. Universities that did not block access were also notable for responding that they had not noticed a significant increase in use of their servers as the result of spider traffic. Although the data do not allow rigorous analysis according to the size of the institution, there appears to be some evidence that smaller schools are more liberal in their web access policies.

The universities which offered unimpeded access were also cautious concerning the future need to exclude robots, and acknowledged that they could not rule out the possibility of using REPs as spider traffic increased. As one webmaster noted, "as robots representing commercial agents become more widespread, it [REP] will likely become a necessity."

User notification

Among the universities which excluded search engine robots, only one officially notified users of the exclusion policies. The remainder had policies of notifying students and faculty if asked about the policy, but did not otherwise disseminate the information. Little explanation was offered for not notifying users; the single exception was one university where the system administrator stated, "It is not in the least bit difficult for someone to add their own web pages to external search engines. We don't prohibit this in any way."

The single institution which did officially notify users that robots were excluded also supplied a separate server for students and faculty to make their pages publicly available. The notification was included as part of the user documentation for obtaining and using webpage accounts on the university server.

How and by whom was the decision made to exclude robots?

In every case in which a university excluded search engine robots, the decision was made by the webmaster and/or technical staff. The decisions were made using technical criteria and in none of the institutions was there an official policy or review mechanism for the decision.

Return to Contents

Discussion

As with many technological advances, the capabilities and uses of internet technology have outstripped the policy mechanisms in place to deal with them. Universities are ostensibly in the business of promoting scholarly communication and intellectual freedom, but at the same time decisions have been made for technical reasons which have the effect of restricting access to university web resources. Despite the rhetoric of openness in scholarly communication in a networked environment, institutions where a significant amount of scholarly publishing is taking place are using REPs to block search engine access to their web sites, making the materials invisible to search engines. (Given the vast stores of information available on the WWW, this has the practical effect of making the information invisible to anyone other than searchers who already know it exists.)

Nor is future access guaranteed even where there are currently no restrictions. As the use of commercial robot-agents grows and they consume more server capacity, institutions which explicitly support unimpeded access to web based materials may be forced, for economic and technical reasons, to adopt REP-like technologies to conserve their computing resources.

An additional, disturbing implication of the findings is that users who are struggling to come to terms with publishing in a networked environment may not be aware that access to their work is being blocked for all who do not have the specific URL of the work or who do not stumble across the work while browsing. This may have particularly far-reaching consequences for authors who seek to make a research impact through web publication and/or for institutions which use citation-like measures for evaluation.

Indeed, authors who work in a robot-excluded web environment may face the prospect of moving their work to an alternative site if they hope to be indexed. Contrary to the claim made earlier in this paper by a university webmaster, many search engines cannot add or retain a submitted URL if the host site excludes robot agents. If a web page is protected by either a "/robots.txt" file or a ROBOTS META tag, then submitting its URL to a search engine does not make the page visible to the robot that comes to check the page, because the robot must stop at the first sign of REP. In time, the search engine will also remove previously indexed pages if REP is put in place after those pages were originally indexed by the robot.

Return to Contents

Conclusions

It seems clear from an analysis of the data that university webmasters do not make the decision to use REPs because they explicitly seek to restrict scholarly communication; nevertheless, these decisions do have serious policy implications. Web administrators are operating from a technical worldview and are attempting to maximize the use and utility of their institution's computing resources. Whether or not it is reasonable to expect them to consider the effects on access and intellectual freedom that result from technical decisions, at present they clearly do not do so.

Two recommendations to improve and protect access seem clear. First, those bodies in universities which are charged with preserving intellectual freedom must become involved in decision-making for information technology. As with other technologies, such as medicine or genetic engineering, information technology decisions have consequences that extend too broadly to be made solely according to technical criteria. Second, users need to become more aware of the capabilities and restrictions of the information technologies they use.

Return to Contents


Bibliography

Fischer, K.D. (1997). The WWW Robot and Search Engine FAQ.
http://science.smsu.edu/robot/faq/robot.html
Gan, D. (1997). America's 100 most wired colleges. ZDnet.
http://www.zdnet.com/yil/content/college/intro.html
Koster, M. (1998a). Robot exclusion. Webcrawler.
http://info.webcrawler.com/mak/projects/robots/exclusion.html
Koster, M. (1998b). The Web Robot FAQ.
http://info.webcrawler.com/mak/projects/robots/faq.html
Koster, M. (1998c). The Web Server Administrator Guide to the Robot Exclusion Protocol.
http://info.webcrawler.com/mak/projects/robots/exclusion-admin.html
Stein, L. (1995). How to Set Up and Maintain a World Wide Web Site: The Guide for Information Providers. Reading, MA: Addison Wesley.
Stein, L. (1998). Web Security: A Step by Step Reference Guide. Reading, MA: Addison Wesley.

Return to Contents

Appendix A: Working with "robots.txt" file

Statements from a selective list of major search engines on the use of REP:

From Infoseek

Removing a Web Page from Infoseek

If a Web page needs to be removed from Infoseek (or never found by Infoseek's indexing robot), add the URL to the robots.txt file for the site. Some commercial Internet service providers may use a robots.txt file to prevent Web robots, such as Infoseek, from indexing their users' Web pages.

If a server is using a robots.txt file and the page you submit is protected by it, then Infoseek cannot add your page.

From Hotbot:

2.How do I keep my pages out of HotBot?

HotBot honors the "robots.txt" file standard. This file can be placed on your site to tell search robots which directories they are allowed to add to their databases and which they must not index.

If you prefer that your site not be indexed by HotBot, ask your webmaster to create a robots.txt file for your site. HotBot's crawler will fetch and obey this command file. It will obey any entry with a User-Agent of "*" or containing the word "SmartCrawl" (the name of HotBot's crawler).

HotBot also honors the proposed robots noindex META tag, which keeps HTML files out of HotBot's database index. This can be added to the head section of an html document...

You can also just remove the offending page from your Web server or restrict public access to your server (talk to your server administrator, system administrator, or dealer if you don't know how to do this).

If your page has already been indexed by HotBot, doing any of the above will cause HotBot to remove your page from its index. But this won't happen instantly! Your page will remain on HotBot until the next time the Web crawler visits your site, which can take up to two weeks. Unless there is a real emergency, we cannot remove pages immediately.

From Alta Vista :

Our spider will find any URL connected to the main body of the Web through even one link. If you don't want your entire site to be indexed, we strongly advise that you take advantage of the Robots Exclusion Standard by setting up a /robots.txt file. It only takes a minute, and gives you complete control over what fraction of your site is indexed. The file looks like:

User-agent: *    # directed to all spiders, not just Scooter
Disallow: /cgi-bin/sources
Disallow: /access_stats
Disallow: /cafeteria/lunch_menus/
Any URL matching one of these patterns will be ignored by robots visiting your site. This file is read by Scooter every few days, so changes may not take immediate effect.

From Lycos :

Robot Exclusion Policy

The robots exclusion policy can be used to indicate to our spider (and to spiders from other search services) where it is allowed to travel on your website. In the absence of these instructions, our spider may index parts of your site that you do not want in Lycos (e.g., personal directories). By using a special file called "robots.txt", you can indicate which parts of a site should or should not be visited by a robot. Please read on to learn how to use a robots.txt file to let us know what parts of your website can be indexed.

Robot Exclusion File

The robots.txt file indicates which areas of a server should not be accessed during a robot's visit. It can be used to disallow access to personal directories, temporary information or CGI-scripts--anywhere a system administrator would prefer robots not travel. There are generic configurations that apply to all robots, as well as very specific configurations for excluding particular robots and directories.

Robot Error Messages

If you receive an error like this when you submit a webpage to Lycos:

Lycos error: Excluded by robots.txt
it means that the administrator of your server has disallowed the Lycos spider access to your webpages--by employing a robot exclusion file. When our spider sees instructions in the file that specifically restrict it, accessing your site and adding your pages to the Lycos catalog is not possible.

From Excite:

Restricting indexing on your site

What is a robots.txt file?

Web spiders (also called robots) are automated computer programs that retrieve URLs. Excite's spider crawls the Web daily, accessing an enormous number of URLs. Because it's one of the most powerful spiders on the Web, it can index most of each site, not just a page or two.

But what if certain parts of your Web site contain confidential information? What if several pages are still under construction and are not ready for public review? How can you protect parts of your territory or the whole site from the Excite spider and the other robots out there?

By using the structured text file known as robots.txt, part of the Robots Exclusion Standard. The robots.txt file tells robots that your Web site, or specific parts of the site, are off-limits.

Do not put material on the Internet that absolutely should not be seen by an unauthorized person. You will not be fully protected. The robots.txt file is not a shield against unauthorized entry: It's merely a recommendation addressed to the Web community about how a spider should operate. This standard is not backed by an official organization or covered by law. Keep in mind the possibility that someone out there might simply choose to not follow the standard.

Using /robots.txt:

From Web Robot Pages:

What to put into the robots.txt file

The "/robots.txt" file usually contains a record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines, as they are used to delimit multiple records.

Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server

User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.

To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/docs/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
Return to Contents

Appendix B: Putting it Together

From: PUTTING IT TOGETHER: How we arrived at the Top 100

Our survey addressed a total of 35 factors organized under four main categories. ACADEMICS accounted for 45 percent of the total score, while the HARDWARE AND WIRING and SOCIAL USE of the Net categories comprised 22.5 percent each. STUDENT SERVICES accounted for 10 percent. The content, aesthetics, or navigability of the colleges' home pages did not factor in the rankings at all, because not only can looks be deceiving, but home pages often may belie the extent of services available to the student.

Kenneth Green, visiting scholar at the Claremont Graduate School and director of the annual Campus Computing Survey (a widely cited report on technology use in higher education), provided us with an invaluable tutorial which informed our inquiries. Ranking was based on total scores determined from survey results, with no credit given for incomplete or missing answers. Due to the imprecise nature of information in this young industry, we would like to stress that while some colleges routinely collect data about computer usage by conducting their own campus surveys, the accuracy of the responses of other colleges may have been affected by a varying degree of optimism on the part of the individual respondents.

The top twenty colleges in the ranking were:
1. MIT
2. Northwestern
3. Emerson
4. Rensselaer
5. Dartmouth
6. U. of Oregon
7. NJIT
8. IU Bloomington
9. Middlebury
10. Carnegie Mellon
11. Colby
12. Princeton
13. Case Western Reserve
14. U. of Arizona
15. Pomona
16. UC-Berkeley
17. U. of Connecticut
18. Skidmore
19. Iowa State U.
20. Reed

Return to Contents

Appendix C: The Interview Questions

Interview questions

Name: _____________________________________
Position: __________________________________
Institution: ________________________________
Phone #: ___________________________________
Date: _____________________________________

Hello, my name is _____ and I'm from the School of Library and Information Science at Indiana University. I'm calling to ask if you would participate in a brief interview about the use of "Robot Exclusion Protocol" on your University's official web site. Our conversation will not be recorded and your answers will be anonymized and aggregated in the research report. You may also end this interview at any time, although I hope you don't.

Do you use "Robot Exclusion Protocol" or a "robots.txt" file to exclude robots from your University's web site?

If yes:

Why do you use Robot Exclusion Protocol/robots.txt?

What prompted you to decide to use REP?

What areas of the web are blocked off?

What types of information are blocked off?

Why did you decide to block these sections of the web site?

Do your users know that you use REP to block access to their pages?

How was the decision made to exclude robots?

At what level of the organization was this decision made?

Is this an official policy?

Is it a technical decision made by computing services?

If no:

Why have you chosen not to use REP?

Do you have any plans to use it?

Are there any costs that you've noticed because you allow search engine robots full access to your site?

Return to Contents


This page prepared by Howard Rosenbaum
Last update: 12.1.98
hrosenba@indiana.edu