'What is Robots.txt?'

Every Web publisher ought to be thinking about how to improve the traffic that they get from search engines. Even the most strident “I’m trying to appeal only to people in my local community” publishers should recognize that some people within their community, as is the case in any community, are using search engines to find local content.

Which brings us to this week’s reader question. Actually, it isn’t from a reader, but from a fellow participant in last week’s NewsTools 2008 conference. He asked the question during the session with Google News’ Daniel Meredith, and I thought it worth discussing on OJR, because I saw a lot of heads nodding in the room as he asked it.

Meredith had mentioned robots.txt as a solution to help publishers control which content on their websites Google’s indexing spiders would see. A hand shot up.

“What is robots-dot-text?”

Meredith gave a quick and accurate answer, but I’m going to go a little more in depth, for the benefit of not-so-tech-savvy online journalists who want the hard work they put into their websites to earn the best possible position in search engine results.

Note that I wrote “the best possible position,” and not “the top position.” There’s a difference, and I will get to that in a moment.

First, robots.txt is simply a plain-text file that a Web publisher should put in the root directory of their website. (E.g. http://www.ojr.org/robots.txt. It’s there; feel free to take a look.) The text file includes instructions that tell indexing spiders, or “robots,” what content and directories on that website they may, or may not, look at.

Here’s an example of a robots.txt file:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /*.doc$
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /ads

This file tells the “Mediapartners-Google” spider that it can look at anything on the website. (That’s the spider that Google uses to assist in the serving of AdSense ads.) Then it tells all other spiders that they should not look at any Microsoft Word documents, GIF or JPG images, or anything in the “ads” directory on the website. The asterisk, or *, is a “wild card” that means “any value.”

Let’s say a search engine spider finds an image file in a story that it is looking at on your website. The image file is located on your server at /news/local/images/mugshot.jpg; that is, it is a file called mugshot.jpg, located within the images directory, within the local directory, within the news directory on your Web server.

Your robots.txt file told the spider not to look at any file that matches the pattern /*.jpg$. This file is /news/local/images/mugshot.jpg, so it matches that pattern (the asterisk taking the place of news/local/images/mugshot, and the dollar sign anchoring the match to the end of the URL). So the spider will ignore this, and any other .jpg file it finds on your website.
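To make the matching concrete, here is a minimal sketch, using hypothetical paths, of what that one rule would and would not block:

# The rule:
User-agent: *
Disallow: /*.jpg$

# Blocked by this rule (hypothetical paths):
#   /news/local/images/mugshot.jpg
#   /photos/slideshow/01.jpg
# Still crawlable under this rule:
#   /news/local/story.html
#   /about/staff.html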

So why is this important to an online journalist? Remember that Meredith said Google penalizes websites for duplicate content. If you want to protect your position in Google’s search engine results and in Google News, you want search engine spiders to focus on content that is unique to your website, and ignore the stuff that isn’t.

So, for example, you might want to configure your robots.txt so that spiders ignore all AP and other wire stories on your website. The easiest way to do this is to configure your content management system to route all wire stories into a Web directory called “wire.” Then put the following lines into your robots.txt file:

User-agent: *
Disallow: /wire
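Note that robots.txt rules are prefix matches, so that single line covers /wire and everything beneath it; a hypothetical story filed at /wire/ap/20080414-quake.html would be skipped along with the rest of the directory.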

Boom. Duplicate content problem for wire stories solved. Now, this does mean that Web searchers will no longer be able to find wire stories on your website through search engines. But many local publishers would see that result as a feature, not a bug. I’ve heard many newspaper publishers argue that visitors who come to their sites from search engine links to wire content do not convert for site advertisers and simply hog bandwidth.

If you are using a spider to index your website for an internal search engine, though, you will need to allow that spider to see the wire content, if you want it included in your site search. If that’s the case, add these lines above the previous ones in your robots.txt:

User-agent: name-of-your-spider
Allow: /wire

Or, use

User-agent: name-of-your-spider
Allow: /

… if you wish it to see and index all of the content on your site.
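Put together, a minimal sketch of such a file might look like this. The spider name SiteSearchSpider is hypothetical; substitute the user-agent string your own site-search crawler actually sends:

# Hypothetical internal site-search spider; it may crawl the wire directory
User-agent: SiteSearchSpider
Allow: /wire

# Everyone else skips the wire directory
User-agent: *
Disallow: /wire

Because a spider obeys only the record that matches its own user-agent name, your site-search crawler follows the first record, and every other spider falls under the catch-all record below it.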

Sometimes, you do not want to be in the top position in the search engine results, or even in those results at all. On OJR, we use robots.txt to keep robots from indexing images, as well as a few directories where we store duplicate content on the site.

Other publishers might effectively use robots.txt to exclude premium content directories, files stored on Web servers that aren’t meant for public use, or files that you want visitors to reach only by finding or following a link from within another page on your website.
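As a sketch, with hypothetical directory names standing in for whatever your own site uses, such a file might include lines like these:

User-agent: *
Disallow: /premium
Disallow: /internal
Disallow: /drafts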

Unfortunately, many rogue spiders roam the Internet, ignoring robots.txt and scraping content from sites without pause. Robots.txt won’t stop those rogues, but most Web servers can be configured to ignore requests from selected IP addresses. Find the IP addresses of those spiders, and you can block them from your site. But that’s a topic for another day.

You don’t have to put up with search engines finding and indexing content that you want only your existing site visitors or other selected individuals to see. Nor do you have to suffer duplicate content penalties because you run a wire feed on your site. A thoughtful robots.txt strategy can help Web publishers get more from their search engine optimization efforts.

Want more information on creating or fine-tuning a robots.txt file? There’s a good FAQ [answers to frequently asked questions] on robots.txt at http://www.robotstxt.org/faq.html.

Got a question for the online journalism experts at OJR? E-mail it to OJR’s editor, Robert Niles, via ojr(at)www.ojr.org

About Robert Niles

Robert Niles is the former editor of OJR, and no longer associated with the site. You may find him now at http://www.sensibletalk.com.