Seven essential resources to help protect your website from technical attack

Kevin Roderick publishes the widely read and highly acclaimed (among Los Angeles-area journalists, at least) blog LA Observed. But this week, Roderick’s been living a Web journalist’s nightmare: many Web browsers started blocking access to his website after Google determined that LA Observed included links to sites distributing malware, malicious code that could infect readers’ computers with viruses and other nasty stuff.

Roderick ultimately traced the problem to ads running on his site, and took that section down while he worked with his hosting provider to purge the links. A day after the cleanup, Google cleared LA Observed, and traffic is again flowing normally to Roderick’s site.

I don’t want to write about Roderick’s specific situation, beyond using it as a peg to remind all independent online publishers of the importance of keeping an eye on the tech side of publishing.

Tools such as Blogger and Movable Type have allowed writers with no tech training to become popular and self-sustaining online publishers. But tech gremlins can attack anyone, and even novices need to pay attention to the threats.

In that spirit, here are seven essential resources for online publishers who don’t want to get burned:

1. Google’s Webmaster Tools
http://www.google.com/webmasters/tools/

Google drives more traffic than any other site online, so it makes sense to use every tool Google provides to improve your website’s position in its search results. Google’s toolset allows you to submit and track sitemaps of your website’s content, see how Google ranks your content via a simple interface, and learn of any problems Google is having with your site, problems that could push your site down in its search results or get it blocked altogether.
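
If you’ve never built a sitemap, the standard format is simple. Here is a minimal example in the sitemaps.org XML format, with placeholder URLs you would swap for pages on your own site:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.yoururl.com/</loc>
      </url>
      <url>
        <loc>http://www.yoururl.com/archives/</loc>
      </url>
    </urlset>

Save a file like this as sitemap.xml at the root of your site, then submit its URL through Webmaster Tools so Google knows which pages you want crawled.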

2. Google’s Safe Browsing Diagnostic Tool
http://www.google.com/safebrowsing/diagnostic?site=http://www.yoururl.com

There is no landing page for this valuable service, so you’ll have to copy and paste the URL above, substituting your URL for “www.yoururl.com”. This page shows you what your readers will see if Google ends up blocking your site for reasons similar to those that got LA Observed blocked. But if you check this page regularly, you won’t have to wait for your traffic to tank, or for readers to e-mail you, to discover that you have a problem. The page also details the problem for you, letting you isolate and remove it more efficiently.
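
If you publish more than one site, a few lines of Python can spare you the copying and pasting. This is just a convenience sketch that assumes the diagnostic URL format shown above; the site list is a placeholder for your own domains:

    # Build the Safe Browsing diagnostic URL for each site you manage
    # and open it in a browser tab for a quick visual check.
    import webbrowser

    SITES = ["www.yoururl.com"]  # replace with the domains you publish

    for site in SITES:
        url = "http://www.google.com/safebrowsing/diagnostic?site=http://" + site
        webbrowser.open(url)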

3. Google’s Online Security Blog
http://googleonlinesecurity.blogspot.com/

This is where to go for an overview of website security issues, as they can affect your presence on Google. You’ll want to bookmark the blog’s post on recovering from a website hack in case you ever find your site infected by malware, or blocked because it is linking to such sites.

4. Stop Badware’s Link Clearinghouse
http://stopbadware.org/home/clearinghouse

You can avoid linking to “bad neighborhoods” online, including malware sites, by checking links through this page before adding them to your site. Obviously, if you permit user-generated content on your website and allow readers to post links, you won’t be able to control every outbound link from your site. But this tool can help you keep bad links out of your own work on the site.
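
Checking links one at a time gets tedious, so it helps to pull every outbound link out of a post before you publish it. Here’s a minimal sketch using only Python’s standard library; the draft file name is a placeholder:

    # Collect the outbound links from a draft post so each one can be
    # checked against the clearinghouse before the post goes live.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and value.startswith("http"):
                        self.links.append(value)

    collector = LinkCollector()
    with open("draft-post.html", encoding="utf-8") as draft:  # placeholder file
        collector.feed(draft.read())

    for link in collector.links:
        print(link)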

5. Webmaster World
http://www.webmasterworld.com

I’ve recommended Webmaster World before, and want to do so again today. It’s the best forum I’ve found for highly detailed news and analysis about how to prevent, and recover from, tech attacks on multiple common online publishing platforms. Browse the forums relevant to your CMS on a regular basis to stay aware of breaking threats to your website. If you need to post an emergency patch to your CMS, this is likely the place where you’ll find out about it.

6. Matt Cutts’ Blog
http://www.mattcutts.com/blog/

Cutts is one of Google’s software engineers and the company’s go-to guy for combating Web spam. A popular speaker at webmaster conferences, Cutts details many of the threats facing online publishers and offers guidance on how to deal with them.

7. Search Engine Land
http://searchengineland.com/

Danny Sullivan’s website offers much more than security advice; it’s a great bookmark for staying on top of many technical aspects of Web publishing, notably improving and protecting your position in search engine results.

There are no 100 percent guarantees online. You could follow all these links on a regular basis and still end up hacked. But reading and using these resources will greatly improve the odds in your favor.

It also pays to choose your Web hosting partner carefully: find someone with a track record of protecting client websites, and with whom you can communicate comfortably, in case the day ever comes when you need help recovering from a website attack.

Battling the 'bots? Try bringing out the big guns

Can a website have too much traffic?

Brett Tabke says, unequivocally, yes. That’s why the editor of WebmasterWorld.com closed off his site to search engine crawlers last month. It might seem an ironic move for a website built in large part by webmasters sharing tips on how to build traffic to their sites. But a conversation with Tabke quickly revealed the frustration that prompted the ban.

“WebmasterWorld was knocked offline twice in the past month, and a couple of bots hit the site for more than 12 million page views during [WebmasterWorld’s Las Vegas PubCon] conference,” Tabke said.

It is hardly unprecedented for news sites to decide that they have too much traffic. Some newspaper websites have implemented registration systems in part to discourage out-of-market traffic referred through search engines in favor of repeat local readers. But Tabke’s move targets an overlooked, and by some accounts growing, component of Web traffic — spiders and other automated user agents.

“The problem affects anyone running dynamic content,” Tabke said. “You’re not just serving a page off a server. You’re generating a page on the fly. And that causes a pretty good system load.”

Automated agents hit Web servers with loads that dwarf those of human readers, explained Rob Malda, editor of Slashdot.org.

“A normal user will read the index, a couple stories, and maybe a few comments,” Malda wrote in an e-mail interview. “This might take them 5 minutes. A robot will hit the index, 10 articles, and a thousand comments. … We’ve had robots spider 150-200k pages in a matter of hours. Humans don’t do that.”

“Some of the worst abusers are monitoring sites: services that look for updates, or businesses that look for trademarks,” Tabke said. “There are services out there whose entire business model is to crawl forums looking for the use of their clients’ trademarks in discussions.”

“These offline browsers start to drag a server down,” Tabke said.

Solutions – and new problems

The robots exclusion protocol allows webmasters to control access to their websites by automated agents. Using a robots.txt file in the Web server’s root directory, a webmaster can declare which robots are, or are not, allowed to access certain content on a site. Tabke last month set WebmasterWorld’s robots.txt file to ban all robots from all of the site’s content. This solution came only after repeated attempts to slow the automated onslaught, Tabke said.
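
In robots.txt terms, the blanket ban Tabke describes takes only two lines. This is the standard keep-everyone-out file (not necessarily WebmasterWorld’s exact file):

    User-agent: *
    Disallow: /

Well-behaved spiders read that file and stop requesting pages; badly behaved bots simply ignore it, which is why Tabke also resorted to the log-in requirement he describes below.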

“First, we required cookies, to force a log-in to view pages on the site,” Tabke said. “Right there, that took care of 95 percent of the bots. But that other five percent wouldn’t be stopped.”

“Before we started addressing the problem, we were at about 10 to one robots versus human” traffic, Tabke said. “After we addressed it, we got that number down to about 20 percent. But if we let it go, if we stopped trying to keep the bots out, it would easily go back to 20 to one, 20 bots for every human reader.”

The problem with Tabke’s solution is that, with search engine spiders among those no longer welcome at his site, WebmasterWorld.com has disappeared from Google and other search engine results. The switch also has taken down the site’s internal search, which relied on Google. Yet the move has relieved the excess load on the site’s servers.

“I got some of my best sleep in months,” Tabke said of the nights after the switch.

Tabke notes the dilemma that heavy automated traffic creates for webmasters. Sites put up with the load from spiders and automated agents in exchange for larger human audiences down the line. Fortunately for Tabke, WebmasterWorld.com is a well-established destination for Internet publishers, and no longer relies on strong search engine performance to attract visitors. But he acknowledges that not every webmaster facing heavy bot loads would be able to write off search engine traffic the way he has.

“Here we tried to write software to make our site crawlable to get the search engine traffic and the eyeballs, and now we’re rolling all that back to stop the bots.”

And, of course, not all automated traffic ultimately benefits publishers. Many unscrupulous developers unleash site rippers designed to harvest e-mail addresses, flood comment sections with spam, or snatch content to be rebuilt into pages targeting pay-per-click advertising systems.

But trying to keep the “bad” bots out while letting the “good” bots in can lead to the sort of administrative headaches that finally led Tabke to shut the door on all of them. Ultimately, webmasters will have to decide for themselves how much automated traffic they are willing to serve.

“I think webmasters are obligated to create some sort of limit because a handful of ill-behaved robots can degrade performance for other users,” Slashdot’s Malda wrote.
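
What such a limit looks like depends on the publishing platform, but the core idea is simple: count requests per client over a short window and stop serving clients that blow past a ceiling, while exempting the crawlers you actually want. The sketch below is illustrative only, not Slashdot’s or WebmasterWorld’s code, and the window, ceiling and allowed-agent list are assumptions you would tune for your own site:

    # Illustrative per-client throttle: track recent request timestamps
    # per client and refuse requests once a client exceeds the ceiling.
    import time
    from collections import defaultdict

    WINDOW_SECONDS = 60              # assumed measurement window
    MAX_REQUESTS_PER_WINDOW = 120    # assumed per-client ceiling
    ALLOWED_AGENTS = ("Googlebot",)  # crawlers you choose to exempt

    _recent_hits = defaultdict(list)  # client key (e.g. IP) -> timestamps

    def should_serve(client_key, user_agent):
        """Return True to serve the request, False to throttle it."""
        if any(agent in user_agent for agent in ALLOWED_AGENTS):
            return True
        now = time.time()
        hits = [t for t in _recent_hits[client_key] if now - t < WINDOW_SECONDS]
        hits.append(now)
        _recent_hits[client_key] = hits
        return len(hits) <= MAX_REQUESTS_PER_WINDOW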

Tabke is looking to his readers for help.

“We’ve had a lot of good ideas from our members,” he said. “Some have even given us code that we might use to address the problem.”

“It’s not something that the average webmaster can do. But we’re certainly going to share ideas and try.”