Wanted: human editors. Scrapers and robots need not apply

My world is awash in crap data.

Several times a week, I open my snail mail box to find bulk-mail solicitations addressed to some member of one of my websites but sent to the site’s street address. Every month or so, I’ll get a series of calls to my business phone (which is listed on my website) in which the caller asks for a name I’ve never heard of. For the rest of that week, I’ll get dozens of similar calls from different people, all calling on behalf of some work-at-home scheme and all asking for the same fake name.

And whenever I’m stuck searching for information via Google or Bing, I inevitably have to scroll past link after link to scraped websites – pages written not by any human being, but slapped together by scripts created to blend snippets from other webpages into something that will fool Google’s or Bing’s algorithm into promoting them.

If Google really wants to make its search engine results pages more meaningful, forget about adding links from my Google+ friends. How about creating a scraper-free search engine, instead?

I have no doubt that the reason I get all those misaddressed letters and wrong-number phone calls is that some fly-by-night “data” company scraped together a database by mashing up names, street addresses and phone numbers it crawled from various websites. That database gets laundered through some work-at-home company, which sells it to suckers via the Internet as a “lead list” for commission sales.

It’s bad enough to take phone calls from these poor chumps, who think that they’ve taken a step toward earning some honest income. But I’m stunned when I see the bogus-name letters coming to my office from established colleges and non-profit institutions, which clearly have also bought crap mailing lists.

(FWIW, all my phone numbers are on the National Do-Not-Call Registry, and I’m opted out of commercial snail mail with the Direct Marketing Association, so no legitimate data company should be selling my contact information to businesses and organizations I’ve not dealt with before.)

Maybe it’s too much to hope for a solution that frees me from having to throw away all these unwanted letters and beg off all these unwanted phone calls. (Not to mention saving the people contacting me the expense of pursuing bogus leads.) But maybe I can hope for a scraper-free Internet experience instead.

I know it’s possible, because there used to be a scraper-free search engine – one that searched just hand-picked websites created by actual human beings. It was called Yahoo!, and the latest crew of managers at Yahoo! could do far worse than to recreate a 2012 version of that Web directory, then use it to populate a Google-killing search engine.

For an example of the garbage polluting search engines today, this site came up high in the SERPs when I searched recently for my wife’s name and the name of her website.

Scraper site screen grab

If you know anything about the violin, you should be ROTFL now. For those who aren’t violin fans, allow me to explain that Ivan Galamian, one of the great violin pedagogues of the 20th century, has been dead for over 30 years. While we would have loved to have someone of his stature working for us at Violinist.com, only an idiot scraper script would think he works for us now.

It kills me that good websites, blogs and journals written by thoughtful correspondents get pushed down in the SERPs – and overlooked by potential fans – because of this garbage.

I want a search engine that knows better – that excludes Web domains populated by scraped data and instead searches online sites written by actual human beings. I wouldn’t limit such a search engine to sites written by paid, professional staff. There’s too much rich content to be found in the conversations of others. But blogs, discussion boards and rating-and-review sites included in this search engine should be composed of information submitted by human beings, not scraped from other websites and edited together by bots.

The original Yahoo! lost when start-up rival Google indexed more pages than Yahoo, giving Google an edge over its established competition. But I – and, I suspect, many others – don’t care about the size of a search engine’s database any longer. Google’s right on in its attempt, announced today, to build a more human-driven search engine. But I’m not convinced that adding Google+ links to the SERPs is enough of a change to make a difference in quality.

First, not enough people use Google+. Its 18-and-over-only age limit also disqualifies the millions of teen-agers who help drive the digital conversation. And I fear that Google’s new “Search Plus Your World” approach simply will encourage spammers to flood Google+ with even more bogus accounts and friend requests, in order to boost their reach into the Google SERPs those new “friends” see.

It’s great to use social media to help bring more people into the process of selecting which websites should be indexed in a search engine. But at this point, organizations still need more aggressive in-house human oversight to back-check the results.

Google lost its quality control over its SERPs long ago. Whether it’s search engine results or business lead lists, there’s too much crap data on the market today. That illustrates the continued need for more, and better, human leadership of data cultivation. There’s a market need out there. So who’s going to step forward to fulfill it?

Battling the 'bots? Try bringing out the big guns

Can a website have too much traffic?

Brett Tabke says, unequivocally, yes. That’s why the editor of WebmasterWorld.com closed his site off to search engine crawlers last month. It might seem an ironic move for a website built in large part by webmasters sharing tips on how to build traffic to their sites. But a conversation with Tabke quickly revealed the frustration that prompted the ban.

“WebmasterWorld was knocked offline twice in the past month, and a couple of bots hit the site for more than 12 million page views during [WebmasterWorld’s Las Vegas PubCon] conference,” Tabke said.

It is hardly unprecedented for news sites to decide that they have too much traffic. Some newspaper websites have implemented registration systems in part to discourage out-of-market traffic referred through search engines in favor of repeat local readers. But Tabke’s move targets an overlooked, and by some accounts growing, component of Web traffic — spiders and other automated user agents.

“The problem affects anyone running dynamic content,” Tabke said. “You’re not just serving a page off a server. You’re generating a page on the fly. And that causes a pretty good system load.”

Automated agents hit webservers with loads that dwarf those of human readers, explained Rob Malda, editor of Slashdot.org.

“A normal user will read the index, a couple stories, and maybe a few comments,” Malda wrote in an e-mail interview. “This might take them 5 minutes. A robot will hit the index, 10 articles, and a thousand comments. … We’ve had robots spider 150-200k pages in a matter of hours. Humans don’t do that.”

“Some of the worst abusers are monitoring sites – services that look for updates, or businesses that look for trademarks,” Tabke said. “There are services out there whose entire business model is to crawl forums looking for the use of their clients’ trademarks in discussions.”

“These offline browsers start to drag a server down,” Tabke said.

Solutions – and new problems

The robots exclusion protocol allows webmasters to control automated agents’ access to their websites. Using a robots.txt file in the Web server’s root directory, a webmaster can declare which robots are — or are not — allowed to access certain content on a site. Last month, Tabke set WebmasterWorld’s robots.txt file to ban all robots from all of the site’s content. This solution came only after repeated attempts to slow the automated onslaught, Tabke said.
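
For readers who have never written one, here is a minimal sketch of what such a blanket ban looks like and how a compliant crawler is expected to honor it, using Python’s standard urllib.robotparser. The URL and crawler names are illustrative only, not WebmasterWorld’s actual setup.

    # Sketch of the robots exclusion protocol: the two rules below mirror
    # the kind of blanket ban described above, telling every compliant
    # robot to stay away from the entire site.
    from urllib.robotparser import RobotFileParser

    ROBOTS_RULES = [
        "User-agent: *",   # applies to every robot
        "Disallow: /",     # ...and bars them from the whole site
    ]

    parser = RobotFileParser()
    parser.parse(ROBOTS_RULES)

    # A well-behaved crawler checks the rules before fetching anything;
    # example.com stands in for any site publishing this robots.txt.
    for agent in ("Googlebot", "SomeMonitoringBot"):
        allowed = parser.can_fetch(agent, "https://example.com/forum/thread-123")
        print(agent, "->", "allowed" if allowed else "blocked")

Note that the protocol is purely voluntary: it keeps out only the robots that choose to obey it, which is why Tabke’s fight didn’t end there.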

“First, we required cookies, to force a log-in to view pages on the site,” Tabke said. “Right there, that took care of 95 percent of the bots. But that other five percent wouldn’t be stopped.”
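
As a rough illustration of that first step (not WebmasterWorld’s actual code), a cookie gate can be as simple as refusing to generate a page for any visitor who doesn’t present a session cookie, bouncing them to a log-in form instead. The cookie name and paths below are invented for the example; most simple bots never store cookies, so they never get past the redirect.

    # Hedged sketch of cookie-gating dynamic pages with Python's standard
    # library: no session cookie, no page.
    from http.cookies import SimpleCookie
    from wsgiref.simple_server import make_server

    SESSION_COOKIE = "member_session"   # invented name, for illustration only

    def app(environ, start_response):
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        if SESSION_COOKIE in cookies:
            # Logged-in visitor: generate the dynamic page as usual.
            start_response("200 OK", [("Content-Type", "text/html")])
            return [b"<html><body>Forum thread rendered here</body></html>"]
        # No cookie: send the visitor to the log-in form instead of
        # spending server time building the page.
        start_response("302 Found", [("Location", "/login")])
        return [b""]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()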

“Before we started addressing the problem, we were at about 10 to one robots versus human” traffic, Tabke said. “After we addressed it, we got that number down to about 20 percent. But if we let it go, if we stopped trying to keep the bots out, it would easily go back to 20 to one, 20 bots for every human reader.”

The problem with Tabke’s solution is that, with search engine spiders among those no longer welcome at his site, WebmasterWorld.com has disappeared from Google and other search engine results. The switch has also taken down the site’s internal search, which relied on Google. Yet the move has relieved the excess load on the site’s servers.

“I got some of my best sleep in months,” Tabke said of the nights after the switch.

Tabke notes the dilemma that heavy automated traffic creates for webmasters. Sites put up with the load from spiders and automated agents in exchange for larger human audiences down the line. Fortunately for Tabke, WebmasterWorld.com is a well-established destination for Internet publishers, and no longer relies on strong search engine performance to attract visitors. But he acknowledges that not every webmaster facing heavy bot loads would be able to write off search engine traffic the way he has.

“Here we tried to write software to make our site crawlable to get the search engine traffic and the eyeballs, and now we’re rolling all that back to stop the bots.”

And, of course, not all automated traffic ultimately benefits publishers. Many unscrupulous developers unleash site rippers designed to harvest e-mail addresses, flood comment sections with spam, or snatch content to be rebuilt into pages targeting pay-per-click advertising systems.

But trying to keep the “bad” bots out while letting the “good” bots in can create the sort of administrative headaches that finally led Tabke to shut the door on all of them. Ultimately, webmasters will have to decide for themselves how much automated traffic they are willing to serve.
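
To give a sense of the bookkeeping that distinction requires, here is a hedged sketch of the simplest allow-list approach: classify each request by its User-Agent string against lists the webmaster has to curate by hand. The bot names below are illustrative assumptions, not anyone’s actual configuration, and a forged User-Agent defeats the check entirely, which is part of why the headache grows.

    # Illustrative only: sort requests into "allow", "block" and
    # "assume-human" buckets by User-Agent substring. The lists need
    # constant hand-curation, and misbehaving bots can simply lie.
    GOOD_BOTS = ("googlebot", "msnbot", "slurp")     # crawlers worth the load
    BAD_BOTS = ("emailcollector", "siteripper")      # harvesters and rippers

    def classify(user_agent: str) -> str:
        ua = user_agent.lower()
        if any(name in ua for name in BAD_BOTS):
            return "block"
        if any(name in ua for name in GOOD_BOTS):
            return "allow"
        return "assume-human"

    if __name__ == "__main__":
        for ua in ("Mozilla/5.0 (compatible; Googlebot/2.1)",
                   "EmailCollector/1.0",
                   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"):
            print(ua, "->", classify(ua))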

“I think webmasters are obligated to create some sort of limit because a handful of ill-behaved robots can degrade performance for other users,” Slashdot’s Malda wrote.

Tabke is looking to his readers for help.

“We’ve had a lot of good ideas from our members,” he said. “Some have even given us code that we might use to address the problem.”

“It’s not something that the average webmaster can do. But we’re certainly going to share ideas and try.”