Amending your robots.txt file

I have amended my robots.txt file for this website in an effort to improve the number and quality of pages that are being indexed for this site in the major search engines.

Several days ago I talked about how I had added a Global Translator plugin to this website, and then provided a quick update on how happy I had been with the extra pages that were being indexed by the search engines.

Over the last few days, however, I have noticed that for each new post I add, only ONE of the translated pages is being added to Google’s index for this site. On closer examination, which page gets chosen seems quite random, e.g. it could be the Japanese version, the Spanish version, etc. I can also see that the category page, e.g. the ‘plugin’ category, is being indexed.

I decided to investigate and had a look around the internet for answers. It appears that the way WordPress is set up out of the box can create a duplicate content issue. When you add a new post, it is added to several places in your site’s hierarchy, and it could appear in the search engines’ results from any of these places, i.e. the original single post page, the category page, the archive page, the feed, and the home page of the website.

Consequently, Google sees this as duplicate content and only indexes the new post in one of the places where it is found, which again seems random (although I’m sure it is down to a Google algorithm). Google is smart enough to know that it does not want to present the same post four or more times to a valued user of its search engine.

My investigations turned up the same advice over and over again: alter your robots.txt file (or the appropriate meta tag) to prevent search engine crawlers from crawling certain areas of your site, and thus reduce or remove this potential duplication issue.

Therefore I have amended my robots.txt file and added the extra lines shown below:

User-agent: *
Disallow: /feed
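Disallow: /category/
# Disallow all URLs with a ? in them (dynamic pages).
# No blank line above: some parsers treat a blank line as the end of the User-agent record.
Disallow: /*?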

The first line is standard: it states that the rules that follow apply to all search engine crawlers.
The second line prevents crawling of the feed folder, as the RSS feed (containing XML) is of no value to most search engine users.
The third line prevents crawling of the category folder.
The fourth line was added because I noticed that some ‘old’ pages were being listed with the old dynamic ‘?’ in the URL, e.g. ‘tomakemoneyonline.net/?page_id=6’. This should stop any dynamic pages from being crawled, and hopefully from being listed, by search engines that support the ‘*’ wildcard, such as Google.
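
To sanity-check rules like these before relying on them, a small script can help. What follows is only a minimal sketch, not how any search engine actually implements matching. It assumes the wildcard-aware behaviour that Google documents: ‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the URL, and everything else is a plain prefix match. It ignores Allow rules and per-crawler sections entirely. The function names and most of the test paths are made up for illustration; only /?page_id=6 comes from this post.

```python
import re

# Disallow patterns copied from the robots.txt rules in this post.
DISALLOW_PATTERNS = ["/feed", "/category/", "/*?"]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics (wildcard-aware crawlers): '*' matches any run of
    characters, a trailing '$' anchors the end of the URL, and anything
    else is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with '.*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, patterns=DISALLOW_PATTERNS):
    """Return True if any Disallow pattern matches the path (plus query)."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


if __name__ == "__main__":
    # Hypothetical paths; only /?page_id=6 appears in the post itself.
    for path in ["/", "/feed/", "/category/plugin/", "/?page_id=6",
                 "/some-post/", "/some-post/feed/", "/feedback"]:
        print(path, "->", "blocked" if is_blocked(path) else "allowed")
```

Two things stand out under these assumptions. Because ‘Disallow: /feed’ is a prefix match, it would also block an unrelated path such as /feedback. And it would not block a feed that lives below a post, such as /some-post/feed/, since that path does not start with /feed.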

After updating the robots.txt file, I will monitor the pages that Google and others now index and report back in several days.
