Battling the 'bots? Try bringing out the big guns

Can a website have too much traffic?

Brett Tabke says, unequivocally, yes. That’s why the editor of WebmasterWorld.com closed off his site to search engine crawlers last month. It might seem an ironic move for a website built in large part by webmasters sharing tips on how to build traffic to their sites. But a conversation with Tabke quickly revealed the frustration that prompted the ban.

“WebmasterWorld was knocked offline twice in the past month, and a couple of bots hit the site for more than 12 million page views during [WebmasterWorld’s Las Vegas PubCon] conference,” Tabke said.

It is hardly unprecedented for news sites to decide that they have too much traffic. Some newspaper websites have implemented registration systems in part to discourage out-of-market traffic referred through search engines in favor of repeat local readers. But Tabke’s move targets an overlooked, and by some accounts growing, component of Web traffic — spiders and other automated user agents.

“The problem affects anyone running dynamic content,” Tabke said. “You’re not just serving a page off a server. You’re generating a page on the fly. And that causes a pretty good system load.”

Automated agents hit Web servers with loads that dwarf those of human readers, explained Rob Malda, editor of Slashdot.org.

“A normal user will read the index, a couple stories, and maybe a few comments,” Malda wrote in an e-mail interview. “This might take them 5 minutes. A robot will hit the index, 10 articles, and a thousand comments. … We’ve had robots spider 150-200k pages in a matter of hours. Humans don’t do that.”

“Some of the worst abusers are monitoring sites – services that look for updates, or businesses that look for trademarks,” Tabke said. “There are services out there whose entire business model is to crawl forums looking for the use of their clients’ trademarks in discussions.”

“These offline browsers start to drag a server down,” Tabke said.

Solutions – and new problems

The robots exclusion protocol allows webmasters to control access to their websites by automated agents. Using a robots.txt file in the Web server’s root directory, a webmaster can declare which robots are — or are not — allowed to access certain content on a site. Tabke last month set WebmasterWorld’s robots.txt file to ban access to all the site’s content by all robots. This solution came only after repeated attempts to slow down the automated onslaught, Tabke said.
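The blanket ban Tabke describes takes only two directives in the standard robots.txt format; this is the generic form of such a file rather than a copy of WebmasterWorld’s:

    User-agent: *
    Disallow: /

Compliance is voluntary, though: well-behaved crawlers fetch the file and obey it, which is also why the ban knocked the site out of search engine indexes, while badly behaved ones can simply ignore it.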

“First, we required cookies, to force a log-in to view pages on the site,” Tabke said. “Right there, that took care of 95 percent of the bots. But that other five percent wouldn’t be stopped.”

“Before we started addressing the problem, we were at about 10 to one robots versus human” traffic, Tabke said. “After we addressed it, we got that number down to about 20 percent. But if we let it go, if we stopped trying to keep the bots out, it would easily go back to 20 to one, 20 bots for every human reader.”
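Tabke did not describe how WebmasterWorld’s log-in requirement is built, but the general technique is easy to sketch: refuse to serve content to any client that has not come back with a session cookie, since most crawlers never store cookies at all. The snippet below is a minimal illustration of that idea as Python WSGI middleware; the cookie name, login path and structure are assumptions for the example, not the site’s actual code.

    # Minimal sketch of cookie-gating, assuming a WSGI application.
    # Requests that arrive without a session cookie are redirected to a
    # login page, which stops any crawler that does not hold cookies.
    from http.cookies import SimpleCookie

    def require_login_cookie(app, cookie_name="session_id", login_url="/login"):
        """Wrap a WSGI app so requests lacking the cookie are redirected."""
        def gated_app(environ, start_response):
            cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
            if cookie_name not in cookies and environ.get("PATH_INFO") != login_url:
                # No session cookie: bounce the visitor (or bot) to the login page.
                start_response("302 Found", [("Location", login_url)])
                return [b""]
            return app(environ, start_response)
        return gated_app

Any agent that discards cookies never gets past the redirect, which squares with Tabke’s estimate that the requirement alone stopped about 95 percent of the bots.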

The problem with Tabke’s solution is that, with search engine spiders among those no longer welcome at his site, WebmasterWorld.com has disappeared from Google and other search engines’ results. The switch also has taken down the site’s internal search, which relied on Google. Yet the move has relieved the excess load on the site’s servers.

“I got some of my best sleep in months,” Tabke said of the nights after the switch.

Tabke notes the dilemma that heavy automated traffic creates for webmasters. Sites put up with the load from spiders and automated agents in exchange for larger human audiences down the line. Fortunately for Tabke, WebmasterWorld.com is a well-established destination for Internet publishers, and no longer relies on strong search engine performance to attract visitors. But he acknowledges that not every webmaster facing heavy bot loads would be able to write off search engine traffic the way he has.

“Here we tried to write software to make our site crawlable to get the search engine traffic and the eyeballs, and now we’re rolling all that back to stop the bots,” Tabke said.

And, of course, not all automated traffic ultimately benefits publishers. Many unscrupulous developers unleash site rippers designed to harvest e-mail addresses, flood comment sections with spam messages, or snatch content to be rebuilt into pages targeted at pay-per-click advertising systems.

But trying to keep the “bad” bots out while letting the “good” bots in can lead to the sort of administrative headaches that drove Tabke to shut the door on all of them. Ultimately, webmasters will have to decide for themselves how much automated traffic they are willing to serve.

“I think webmasters are obligated to create some sort of limit because a handful of ill-behaved robots can degrade performance for other users,” Slashdot’s Malda wrote.
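Neither Malda nor Tabke spelled out what such a limit looks like in practice, but a common form is a cap on requests per client over a sliding time window. The Python sketch below illustrates that general idea; the threshold, window and class names are assumptions, not anyone’s production throttle.

    # Illustrative per-client request limit, not Slashdot's or WebmasterWorld's
    # actual code. Clients exceeding max_hits requests within window seconds
    # are refused until old requests age out of the window.
    import time
    from collections import defaultdict, deque

    class RequestThrottle:
        def __init__(self, max_hits=120, window=60.0):
            self.max_hits = max_hits        # assumed threshold (requests)
            self.window = window            # sliding window length (seconds)
            self.hits = defaultdict(deque)  # client key -> request timestamps

        def allow(self, client_key):
            """Return True if this client (IP or user-agent) is under the limit."""
            now = time.monotonic()
            timestamps = self.hits[client_key]
            # Discard requests that have aged out of the window.
            while timestamps and now - timestamps[0] > self.window:
                timestamps.popleft()
            if len(timestamps) >= self.max_hits:
                return False  # over the limit: serve an error page instead
            timestamps.append(now)
            return True

A server would call allow() with the client’s IP address or user-agent string before doing the expensive on-the-fly page generation Tabke describes, and return an error response whenever it comes back False.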

Tabke is looking to his readers for help.

“We’ve had a lot of good ideas from our members,” he said. “Some have even given us code that we might use to address the problem.”

“It’s not something that the average webmaster can do. But we’re certainly going to share ideas and try.”

About Robert Niles

Robert Niles is the former editor of OJR, and no longer associated with the site. You may find him now at http://www.sensibletalk.com.

Comments

  1. Coastsider was nearly booted by my hosting service because of robot-related traffic. The bots were also wildly inflating my traffic numbers, making it difficult to set realistic ad rates. I was able to get my CPU load down to a reasonable level by caching some content, excluding bots from certain directories, and excluding certain bots altogether. It was a painful and time-consuming process, and one I doubt is complete. But I could not afford to exclude robots altogether.

    This also raises the question of how much robot traffic advertisers are paying for.

  2. Metrics becomes a huge issue here. Webmasters routinely knock survey-based readership numbers, often with comments along the lines of “I know what my traffic is — just look at my log files!” But if you are not factoring bot traffic out of your log files, you are almost certainly overstating your human traffic, possibly by an order of magnitude if your site is getting hit like WebmasterWorld’s.

    When I took over OJR and spent some time with our log files, I was surprised to find that almost three-fourths of the site’s traffic was bots. I’d had sites get hit hard by bots before, but never as much as OJR was. Now, serving that traffic wasn’t a problem for us, and we’ve taken only a few, standard steps toward keeping out the bad guys. But it did require quite a bit of work on my part setting up new log file filters when we switched ISPs, in order to get more accurate traffic numbers.
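Getting bot traffic out of the numbers can start with something as crude as sorting logged requests by user-agent string. The short Python sketch below illustrates the idea against an Apache combined-format access log; the marker list and file name are assumptions for illustration, not OJR’s actual filters.

    # Rough sketch: split a combined-format access log into bot and human
    # requests by matching common substrings in the user-agent field.
    import re

    BOT_MARKERS = ("bot", "crawl", "spider", "slurp", "wget", "libwww")

    # In the combined log format, the user-agent is the last quoted field.
    UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

    def count_bots(log_path="access.log"):
        bots = humans = 0
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                match = UA_PATTERN.search(line)
                agent = match.group(1).lower() if match else ""
                if any(marker in agent for marker in BOT_MARKERS):
                    bots += 1
                else:
                    humans += 1
        return bots, humans

    if __name__ == "__main__":
        bots, humans = count_bots()
        print(f"Bot requests: {bots}; human requests: {humans}")

A real filter needs a much longer marker list, and it still misses agents that lie about what they are, but even a first pass like this shows how far raw log counts can overstate human readership.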