I, Robot

Robot is a generic term for those sometimes helpful, sometimes pesky
programs that index Web pages across the Internet to feed the 
databases of search engines and other indexing sites. They're also 
sometimes known as spiders, crawlers, or bots. And while it's often a 
good thing that they crawl through your Web site, gathering the 
information to help other Netizens find you on the Web, there may be 
certain pages you don't want visiting robots to access at all. Most 
major robots respect a file called robots.txt, which you can place in 
the root of your Web site. This file gives instructions to robots about
which pages or directories they're allowed to index. The file format 
is self-explanatory, so let's look at an example:

# Robots must obey the following:
User-agent: * # The wildcard means ALL robots
Disallow: /test.asp # Do not index this particular page
Disallow: /administrative # This directory is off-limits!
Disallow: /jaf/test.asp # Don't index this page, but the rest 
# of the directory is OK

This robots.txt shows both cases: disallowing an entire directory and 
disallowing individual files. Each Disallow value is matched as a path 
prefix, so /administrative covers everything beneath that directory. 
The pound sign (#) introduces a UNIX-style comment. Be aware that 
robots.txt is advisory: not all robots honor it, particularly spiders 
run by small or personal sites, and anyone can read the file. It will 
keep your pages and directories out of the major search indexes, but 
it is not a substitute for real access control.
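
One way to check rules like these before publishing them is to run them through a parser. The sketch below feeds the example above to Python's standard urllib.robotparser module. The helper name is_allowed and the sample paths such as /index.asp and /jaf/default.asp are made up for illustration; they are not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this article, as they would appear in robots.txt.
ROBOTS_TXT = """\
# Robots must obey the following:
User-agent: * # The wildcard means ALL robots
Disallow: /test.asp # Do not index this particular page
Disallow: /administrative # This directory is off-limits!
Disallow: /jaf/test.asp # Don't index this page, but the rest
# of the directory is OK
"""

# Parse the rules from memory instead of fetching them from a server.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, user_agent="*"):
    """Hypothetical helper: True if the rules let user_agent fetch path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Sample paths are assumptions chosen to exercise each rule.
    for path in ("/index.asp", "/test.asp", "/administrative/users.asp",
                 "/jaf/test.asp", "/jaf/default.asp"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Because this parser matches each Disallow value as a prefix, a path such as /administrative.html would be reported as blocked as well, which is worth keeping in mind when naming files near a disallowed directory.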


ElementK Journals
