Have you ever wondered why your web server's error log keeps filling up with entries like this one?
[error] [client 204.62.245.187] File does not exist: /usr/local/etc/httpd/htdocs/mysupersite/robots.txt
When you submit your website to a spider engine, the spider engine "visits" your site to register it. In doing so, most spider engines automatically request the robots.txt file. If the file does not exist, the server logs the error shown above.
The robots.txt file is not compulsory, though. Instead of a file, you can use the "robots" meta tag. But if you do not provide a robots.txt file and submit your page to hundreds of spider engines (e.g. with Hello Engines!), you will also receive hundreds of error messages. Keep in mind that your website is probably visited by several search engines every day, so the error.log file can quickly grow very large as it fills up with irrelevant error messages.
In the robots.txt file of your site, you can define which pages are to be excluded from indexing. Please note that only one robots.txt file is taken into account per server and that it must be located at the top level of the site. On a UNIX system, for example, it could be stored at
/usr/local/etc/httpd/htdocs/robots.txt
The syntax of robots.txt is very simple and generally looks like this:
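As an illustration, a file of this shape might read as follows. The directory names /cgi-bin/ and /images/ are placeholders chosen for this sketch, not paths taken from any particular site.

```text
# Applies to every robot
User-agent: *
# Placeholder directories (assumed for illustration)
Disallow: /cgi-bin/
Disallow: /images/
```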
In the above case, two directories are excluded from indexing. For each directory that the spider engine should not index, you must add a separate "Disallow" line.
Example: to block all robots from accessing and indexing your website, enter the following lines in the robots.txt file:
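A minimal sketch of such a file, using the wildcard user agent and the site root as the disallowed path:

```text
# Every robot is addressed by the wildcard
User-agent: *
# "/" covers the whole site
Disallow: /
```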
To allow all robots to access and index all pages of your website, enter the following lines in robots.txt:
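A minimal sketch of such a file. The Disallow line is left empty, which means that nothing is excluded:

```text
# Every robot is addressed by the wildcard
User-agent: *
# Empty value: no path is disallowed
Disallow:
```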
To prevent a specific robot from accessing your directories, enter the following:
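A minimal sketch, in which "BadBot" is a made-up robot name used only for illustration. In practice, the name would have to match the user-agent token that the robot announces about itself.

```text
# "BadBot" is a hypothetical robot name
User-agent: BadBot
Disallow: /
```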
To allow only one specific robot to index your directories (thus blocking all others), enter the following lines:
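A minimal sketch with two records, in which "ExampleBot" is a made-up robot name used only for illustration. The first record lets that robot in, the second turns everyone else away.

```text
# "ExampleBot" is a hypothetical robot name; it may index everything
User-agent: ExampleBot
Disallow:

# All other robots are blocked from the whole site
User-agent: *
Disallow: /
```

A robot follows the record that names it specifically and falls back to the wildcard record only if no such record exists.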
Similarly, you can exclude specific pages from indexing:
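A minimal sketch, in which both file names are placeholders assumed for illustration:

```text
User-agent: *
# Placeholder page names (assumed for illustration)
Disallow: /private.html
Disallow: /drafts/notes.html
```

Each path is given relative to the site root, so every Disallow value should begin with a slash.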