Ban all robots to stop the rogues?

Almost all Web publishers successful enough to have to pay bandwidth charges have struggled with how to deal with traffic from robots. These are the automated programs, sent by search engines, crackers, spammers, sloppy developers and even overeager handheld owners, to scan, index and even download thousands of pages from your website.

When I arrived at OJR, I was surprised to find that nearly two-thirds of the site’s traffic came not from human readers, but from robots. Some of that traffic was welcome, such as robots from major search engines like Google and Yahoo News. But much of it was from rogue spiders — spammers trolling for e-mail addresses, attempts to download the entire site for duplication on various scraper sites, and such. I spent a fair amount of time tweaking OJR’s robots.txt file to ban identified rogue spiders, and OJR’s stats software to filter hits from the rest.
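For readers who haven’t fought this battle: banning an identified rogue means adding a couple of lines per offender to robots.txt. A sketch, with example user-agent names standing in for whatever shows up in your own logs:

    # Let well-behaved robots crawl everything
    User-agent: *
    Disallow:

    # Ban identified rogues by the user-agent string they report
    # (these names are examples; substitute the ones in your own logs)
    User-agent: EmailSiphon
    Disallow: /

    User-agent: SiteSnagger
    Disallow: /

The catch, of course, is that robots.txt is a request, not a lock: only robots polite enough to read the file will honor it.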

Well, this week WebmasterWorld.com has taken the radical step of banning all spiders from its site. In a post on the site, administrator Brett Tabke reported that despite spending five to eight hours a week fending off rogue spiders, the site was still hit with 12 million unwanted spider page views last week.
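If the ban is implemented through robots.txt, it’s the shortest such file imaginable:

    User-agent: *
    Disallow: /

Though, again, that only stops the spiders polite enough to read the file; the rogues are what ate up those five to eight hours a week.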

The move, presumably, will result in WebmasterWorld disappearing from major search engine results, and will drop the site from archives such as Archive.org.

WebmasterWorld has established a large and loyal audience. One could argue that the site doesn’t need search engine traffic. But how loyal will its readership turn out to be if members can’t search for the site, or its archives, through Google et al.?

As Brett titled his post announcing the change, “lets try this for a month or three…”

Then we will see.

Proposal: New standards and tools for distributed online reporting

One of the Internet’s strengths as a medium for journalism is its ability to support widely distributed, grassroots news reporting. Whenever a significant earthquake hits Southern California, tens of thousands of residents log on to the local U.S. Geological Survey website to report what they felt. The USGS site processes these surveys in real time to generate zip-code-level shake maps that depict the intensity of the quake throughout the region.

There’s no need to install sensors all over town. And no wait for a costly phone survey. The Internet enables the USGS to engage a small army of citizen reporters to collect this information. Journalists, of course, can do the same with their reporting projects.

But such efforts run into problems when there’s no single obvious source for grassroots reporters to submit their information. We saw this with the dozens of websites that attempted to compile missing persons lists after Hurricane Katrina. No one publication had a comprehensive list of the missing. And attempts to aggregate the lists required either finger-numbing cutting and pasting, or equally tedious RegEx coding to parse the data from the various websites.

Sure, the USGS managed to establish its website as the place to go to report earthquakes. But, for most stories, readers and journalists face the “Katrina conundrum” — too many sources trying to collect the same information, without coordination.

It doesn’t have to be that way. Of course, some journalists always will want to go their own way, searching for a scoop. But others see the value in cooperation, in working together to best provide comprehensive information for the public. To do that, website publishers need:

  • A simple online tool with which to collect fielded information from the public;
  • A way to share that information with others collecting similar information; and
  • A way for all those information collectors to know when other collectors have gathered fresh information.

Today, I want to propose that OJR lead an initiative to address these three needs.

Right now, to collect information the way the USGS does, or to compile a Katrina missing persons list, you need to be a coder who can put together an HTML input form and a script to dump the submissions into a database. What online journalism needs is a free, open-source tool that does for grassroots reporting what Blogger.com did for online journals – making it easy for a non-coder to set up a grassroots reporting input page with no HTML or database experience.
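To make that hurdle concrete, here is roughly the script a coder has to write today, sketched in Python with made-up field names, and stripped of the validation, error handling and spam protection any real site would need. It serves a bare form and dumps each submission into a SQLite file:

    # collect.py: a bare-bones grassroots-reporting collector
    # (field names and file names are hypothetical; for illustration only)
    import sqlite3
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs

    FORM = b"""<form method="post">
    Name: <input name="name"> Last seen: <input name="last_seen">
    Details: <input name="details"> <input type="submit" value="Report">
    </form>"""

    class Collector(BaseHTTPRequestHandler):
        def do_GET(self):
            # Serve the input form
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(FORM)

        def do_POST(self):
            # Parse the submission and append it to the database
            length = int(self.headers["Content-Length"])
            fields = parse_qs(self.rfile.read(length).decode())
            db = sqlite3.connect("reports.db")
            db.execute("CREATE TABLE IF NOT EXISTS reports"
                       " (name TEXT, last_seen TEXT, details TEXT)")
            db.execute("INSERT INTO reports VALUES (?, ?, ?)",
                       (fields.get("name", [""])[0],
                        fields.get("last_seen", [""])[0],
                        fields.get("details", [""])[0]))
            db.commit()
            db.close()
            self.send_response(303)  # redirect back to the form
            self.send_header("Location", "/")
            self.end_headers()

    HTTPServer(("", 8000), Collector).serve_forever()

If that looks like more work than most newsrooms want to take on for every story, that is exactly the point.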

Second, the information that tool collects ought to be recorded in a standard fielded format, so that it can be easily shared with other collection efforts. There’s no need to build a common database or central server to support this. All that’s needed is for each site collecting data to be able to export it as XML, using a common set of fields. Tools can be written, along the lines of RSS aggregators, to collect those XML files and aggregate them into comprehensive databases.
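To make “a common set of fields” concrete, a single exported record might look something like this strawman. Every element name and value below is invented for illustration; settling the real field list is exactly the industry discussion I propose below:

    <report type="missing-person">
      <collected-by>example-news-site.com</collected-by>
      <collected-at>2005-09-17T14:32:00-05:00</collected-at>
      <subject>
        <name>Jane Doe</name>
        <last-seen-location>New Orleans, LA 70117</last-seen-location>
        <last-seen-date>2005-08-28</last-seen-date>
      </subject>
      <details>Family seeking contact; evacuated before landfall.</details>
      <contact>tips@example-news-site.com</contact>
    </report>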

Personally, I believe that the RSS standard itself does not support nearly enough fields to transmit an entire database of incident reports. We need something more expansive. Dave Winer’s OPML moves in that direction, but I don’t know that it offers the granularity needed for this project. The point is, I think a common XML format is the solution to the problem, but that we need to have some industry discussion as to what that format might be. Obviously, it ought to be flexible enough to accommodate everything from missing persons lists to fraud reports to (my pet project) theme park accidents. Let’s start talking about what that format might look like. (And, rest assured, I don’t want to recreate the overkill of NewsML.)

Third, we need a weblogs.com-type destination site that information collectors can ping to let identically tagged information collection projects know that they’ve been updated.
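For the mechanics, weblogs.com’s model would do: it accepts a one-line XML-RPC call each time a blog updates, and a reporting hub could accept the same kind of ping plus a topic tag. A Python sketch, with the hub address, method name and tag all invented for illustration:

    # ping.py: sketch of a weblogs.com-style update ping
    # (hub URL, method name and topic tag are hypothetical)
    import xmlrpc.client

    hub = xmlrpc.client.ServerProxy("http://pings.example.org/RPC2")
    # Tell the hub a collection tagged "katrina-missing" has new records
    result = hub.collectionUpdates.ping(
        "Example News Katrina List",                 # human-readable project name
        "http://example-news-site.com/missing.xml",  # feed that just changed
        "katrina-missing")                           # shared topic tag
    print(result)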

We could build a development tool that handles the second and third needs itself. But I think it is important that any development tool work with collection efforts that do not use the tool. That’s why the blogosphere works so well. You don’t have to use Blogger, or any other specific individual tool, to link to and aggregate other blogs. Our distributed reporting efforts should work the same way.
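On the consuming side, that tool-agnosticism falls out naturally if aggregators key on the shared format alone. A rough Python sketch, reusing the hypothetical element names and feed URLs from the strawman record above:

    # aggregate.py: sketch of a format-agnostic report aggregator
    # (feed URLs and element names are hypothetical)
    import urllib.request
    import xml.etree.ElementTree as ET

    feeds = [
        "http://example-news-site.com/missing.xml",
        "http://another-paper.example/katrina/reports.xml",
    ]

    records = []
    for url in feeds:
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        # Any feed of <report> elements in the common format qualifies,
        # no matter what software generated it
        for report in tree.iter("report"):
            records.append({
                "source": url,
                "name": report.findtext("subject/name", ""),
                "last_seen": report.findtext("subject/last-seen-location", ""),
            })

    # Crude merge: de-duplicate on name; a real tool would do better
    merged = {r["name"]: r for r in records}
    for r in merged.values():
        print(r["name"], "|", r["last_seen"], "| via", r["source"])

An RSS reader doesn’t care which software published a blog; these aggregators shouldn’t care which tool collected a report.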

So, who’s interested in helping me refine this idea, and build these tools? The development of blogging tools showed us the power that could be unleashed when we liberated online narrative publishing from the HTML coders and opened it to everyone. Let’s do the same with distributed data reporting. E-mail me at rniles@usc.edu, and let’s get started.

Online story idea: Check the advertisers, too

Today’s post is for our self-appointed media watchdogs around the blogosphere. Don’t forget you can check out a publication’s advertisers, too.

After years of cajoling, my wife’s convinced me to get my wretched, non-functioning nose fixed. (Stay with me, I’m not changing the topic. Really.) But before I picked a surgeon, I did some online research to confirm that he was board-certified.

Here’s the story idea: Your readers might be surprised to discover the number of plastic surgeons advertising on billboards and in newspapers and magazines in your town who are not board-certified in their field. Why not grab the local paper and look up the status of the physicians advertising there?

Check those names against the list of diplomates of the American Board of Plastic Surgery, which you can find on the American Board of Medical Specialties website at http://www.abms.org/login.asp, or the American Board of Facial Plastic and Reconstructive Surgery, at http://www.abfprs.org/certified/index.cfm. (A physician need not belong to both to be board-certified. One’s good enough.) A couple minutes on these sites revealed that a very prominent local physician in my hometown of Pasadena, Calif., one who’s had billboards up all over town, is not board-certified. Funny how that detail isn’t mentioned in any of his ads.

(And if you’re looking for a graphic take on why patients should choose a board-certified surgeon, allow me to recommend “Skin Tight,” by the Miami Herald’s Carl Hiaasen.)

Take it up a level and check out the physicians’ state medical license status, too. You can look those up at http://www.docboard.org/docfinder.html. Are there any advertisers who’ve been in trouble with the state medical board before?

The same concept works with other types of physicians and state-licensed professionals, too, including dentists and attorneys. If you try this story on your blog or website, let me know and I’ll link back to it. Good luck!