Block bots from crawling your website using robots.txt

Before search engines can index your website, they have to crawl it. To do that they use programs called bots (also known as crawlers or spiders). Sometimes you might not want all of your pages to appear in search engines. In that case, create a robots.txt file in the root directory of your website and list the URLs that bots should not crawl.

In this file you can block either all bots or only the bots of specific search engines. See also: All options for user-agent in robots.txt file.

Each section (group of rules) in the robots.txt file is separate and does not build upon previous sections; a crawler follows only the group that matches its user-agent most specifically. For example:

User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/

In this example, only URLs matching /folder2/ would be disallowed for Googlebot: because a specific group exists for it, Googlebot ignores the rules in the * group.
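You can check this behaviour with Python's standard urllib.robotparser module. The sketch below is only a quick test harness with made-up example.com URLs; note that this module implements the basic rules shown here and does not understand the * and $ wildcards used later in this article.

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot follows only its own group, so /folder1/ is not blocked for it.
print(parser.can_fetch("Googlebot", "https://example.com/folder1/page.html"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/folder2/page.html"))  # False

# Every other bot falls back to the * group.
print(parser.can_fetch("Bingbot", "https://example.com/folder1/page.html"))    # False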

  • To block the entire site, use a forward slash.
Disallow: /
  • To block a directory and everything in it, follow the directory name with a forward slash.
Disallow: /junk-directory/
  • To block a page, list the page.
Disallow: /private_file.html
  • To remove a specific image from Google Images, add the following:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg 
  • To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: / 
  • To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
  • To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories whose names begin with private (a small matching sketch follows this list):
User-agent: Googlebot
Disallow: /private*/
  • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
  • To match the end of a URL, use a dollar sign ($). For instance, to block any URLs that end with .xls:
User-agent: Googlebot 
Disallow: /*.xls$
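
The * and $ wildcards in the rules above are easiest to reason about if you read them as a simple pattern language: * stands for any sequence of characters, and a trailing $ anchors the pattern to the end of the URL. The following Python sketch only illustrates that matching logic against made-up paths; it is not Google's actual parser, and it assumes the query string is part of the matched path.

import re

def pattern_matches(pattern, path):
    # Translate a robots.txt path pattern into a regular expression:
    # "*" becomes ".*" (any characters), a trailing "$" anchors the end of the URL.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "^" + ".*".join(re.escape(part) for part in core.split("*")) + ("$" if anchored else "")
    return bool(re.match(regex, path))

print(pattern_matches("/*.gif$", "/images/photo.gif"))         # True  - blocked: ends in .gif
print(pattern_matches("/*.gif$", "/images/photo.gif?v=2"))     # False - does not end in .gif
print(pattern_matches("/private*/", "/private-files/a.html"))  # True  - subdirectory begins with private
print(pattern_matches("/*?", "/search?q=test"))                # True  - URL contains a ?
print(pattern_matches("/*.xls$", "/reports/2019.xls"))         # True  - ends in .xls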

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain one to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set up your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
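
When both an Allow and a Disallow rule match the same URL, Google's documentation says the most specific rule (the one with the longest path pattern) wins, which is why Allow: /*?$ overrides Disallow: /*? for URLs that end in a ?. The sketch below reuses the same pattern translation as above to illustrate that decision; it is only an illustration, not the real crawler logic, and the tie-breaking in favour of Allow is an assumption based on Google's documented behaviour.

import re

def pattern_matches(pattern, path):
    # Same translation as in the previous sketch: "*" -> ".*", trailing "$" anchors the end.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "^" + ".*".join(re.escape(part) for part in core.split("*")) + ("$" if anchored else "")
    return bool(re.match(regex, path))

def crawl_allowed(path, allow_patterns, disallow_patterns):
    # Collect every rule that matches; the longest pattern wins, and Allow wins a tie.
    candidates = [(len(p), True) for p in allow_patterns if pattern_matches(p, path)]
    candidates += [(len(p), False) for p in disallow_patterns if pattern_matches(p, path)]
    if not candidates:
        return True               # no rule matches, so crawling is allowed
    return max(candidates)[1]     # True = allowed, False = blocked

print(crawl_allowed("/page?",     ["/*?$"], ["/*?"]))  # True  - Allow: /*?$ is the longer match
print(crawl_allowed("/page?id=5", ["/*?$"], ["/*?"]))  # False - only Disallow: /*? matches
print(crawl_allowed("/page.html", ["/*?$"], ["/*?"]))  # True  - no rule matches at all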


