Wanted: human editors. Scrapers and robots need not apply

My world is awash in crap data.

Several times a week, I open my snail mail box to find bulk-mail solicitations for some member of one of my websites, but sent to the site’s street address. Every month or so, I’ll get a series of calls to my business phone (which is listed on my website), but the caller will ask for a name I’ve never heard. For the rest of that week, I’ll get dozens of similar calls, from different people calling on behalf of some work-at-home scheme, all asking for the same fake name.

And whenever I’m stuck searching for information via Google or Bing, I inevitably have to scroll past link after link to scraped websites – pages written not by any human being, but slapped together by scripts created to blend snippets from other webpages into something that will fool Google’s or Bing’s algorithm into promoting them.

If Google really wants to make its search engine results pages more meaningful, forget about adding links from my Google+ friends. How about creating a scraper-free search engine, instead?

I have no doubt that the reason why I get all those misaddressed letters and wrong-number phone calls is that some fly-by-night “data” company scraped together a database by mashing up names, street addresses and phone numbers it crawled on various websites. That database gets laundered through some work-at-home company, which sells it to its customers (read: suckers) via the Internet as a “lead list” for commission sales.

It’s bad enough to take phone calls from these poor chumps, who think that they’ve taken a step toward earning some honest income. But I’m stunned when I see the bogus-name letters coming to my office from established colleges and non-profit institutions, who clearly also have bought crap mailing lists.

(FWIW, all my phone numbers are on the National Do-Not-Call Registry, and I’m opted out of commercial snail mail with the Direct Marketing Association, so no legitimate data company should be selling my contact information to businesses and organizations I’ve not dealt with before.)

Maybe it’s too much to hope for a solution that frees me from having to throw away all these unwanted letters and beg off these unwanted phone calls. (Not to mention saving the people contacting me the expense of pursuing bogus leads.) But maybe I can hope for a scraper-free Internet experience instead.

I know it’s possible, because there used to be a scraper-free search engine – one that searched just hand-picked Web sites created by actual human beings. It was called Yahoo!, and if they’re smart, the latest crew of new managers at Yahoo! could do far worse than trying to recreate a 2012 version of their Web directory, then using it to populate a Google-killing search engine.

For an example of the garbage polluting search engines today, this site came up high in the SERPs when I searched recently for my wife’s name and the name of her website.

[Image: scraper site screen grab]

If you know anything about the violin, you should be ROTFL now. For those who aren’t violin fans, allow me to explain that Ivan Galamian, one of the great violin pedagogues of the 20th century, has been dead for over 30 years. While we would have loved to have someone of his stature working for us at Violinist.com, only an idiot scraper script would think he works for us now.

It kills me that good websites, blogs and journals written by thoughtful correspondents get pushed down in the SERPs – and overlooked by potential fans – because of this garbage.

I want a search engine that knows better – that excludes Web domains populated by scraped data and instead searches online sites written by actual human beings. I wouldn’t limit such a search engine to sites written by paid, professional staff. There’s too much rich content to be found in the conversations of others. But blogs, discussion boards and rating-and-review sites included in this search engine should be composed of information submitted by human beings, not scraped from other websites and edited together by bots.

The original Yahoo! lost when start-up rival Google indexed more pages than Yahoo, giving Google an edge over its established competition. But I – and, I suspect, many others – don’t care about the size of a search engine’s database any longer. Google’s right on in its attempt, announced today, to build a more human-driven search engine. But I’m not convinced that adding Google+ links to the SERPs is enough of a change to make a difference in quality.

First, not enough people use Google+. Its 18-and-over-only age limit also disqualifies the millions of teen-agers who help drive the digital conversation. And I fear that Google’s new “Search Plus Your World” approach simply will encourage spammers to flood Google+ with even more bogus accounts and friend requests, in order to boost their reach into the Google SERPs those new “friends” see.

It’s great to use social media to help bring more people into the process of selecting which websites should be indexed in a search engine. But, ultimately, at this point organizations still need more aggressive in-house human oversight in back-checking the results.

Google lost its quality control over its SERPs long ago. Whether it’s search engine results or business lead lists, there’s too much crap data on the market today. That illustrates the continued need for more, and better, human leadership of data cultivation. There’s a market need out there. So who’s going to step forward to fulfill it?

Should Microsoft buy Yahoo?

The big news roiling online publishing? Microsoft’s attempt to take over Yahoo!, the latest move in the software giant’s ongoing battle with search engine leader Google.

Let’s talk about it. What would the deal mean for the online news business? For online entrepreneurs? For the economy?

[Sorry — It looks like Twiigs.com, the company that hosts the poll, has eaten the results a couple of times due to some server issues it’s had over the weekend. So please do vote again if you see the input form below (which means that your old vote was among those eaten).]

Not all that Wired about it: Communication technology gets the short end at NextFest

Apparently robots and moonrovers are more important than wireless communication and media delivery technology. Or so it would seem after a visit to Wired’s annual ooh-aah technology convention NextFest, going on this week at the Los Angeles Convention Center.

For a magazine/Web outlet designed to bring information to readers, Wired sure selected a media-light crowd of exhibitors this year. Just eight out of 162 exhibits had anything to do with communications. And really, only Yahoo’s presentation had much of interest to anyone working in online media. (The rest were cool 3D displays, cellphone-activated lightshows, installation art of instant messaging, etc.)

What gives? Where were all the next-gen social media applications, the iPhonery, the streaming video delivery stuff? NextFest opted for the wow-factor of robots and lightshows and missed out on what actually changes our lives.

I had a chat with Ben Clemens, Director of the Design Innovation Team at Yahoo, who also did a stint at the online portion of the New York Times.

Ben explained that his team is working on a unique app that will visually chart Web searches in real time and map them onto a model of the globe. Playing back the data will give an insight into how searches spread and develop over geographic space and over time. I thought it would be tremendously useful for journalists following the news cycle of a story, so I asked him about the model. (Partial transcript follows the video.)

Ben Clemens: The idea is that there are search burst events, which are lots and lots of people looking for the same thing at the same time, and we want to be able to visualize that and show what’s the geographic pattern that they are looking for.

OJR: What sort of application might this have for tracking the way people follow a news story, for example?

Ben: Right now what you’re seeing is a fairly coarse level of data, but what we’d like to get to is the point where we can actually see, as a story unfolded, the spread of search queries pegged to some sort of more local event. One of the data sets that we’re actually working on right now is the bridge collapse. We wanted to track on a very local basis how it was that the searches spread, because that started as a very local event and then became a national event. Right now we don’t have the fine grain of geo-coding we would need to actually do that, but that’s the next thing we are working on.

OJR: Would then news websites want to tailor their news offerings based upon real time what people are interested in specific locations?

Ben: I think probably journalists will make their own decisions, but I think it’s good information to get from actual user data. This is what people are actually doing, as opposed to what they say they are interested in.

OJR: Does this connect with Yahoo News at the moment?

Ben: This is an experiment; it is not part of any Yahoo product. We would like to take advantage of it in Yahoo products going forward but for now we’re just at the bleeding edge trying to figure out how we would use this. Just the mechanism to get the data and to individualize it are a lot of the mechanics that we are working on right now. If that gets to a good enough state, then we would talk to products.

OJR: How far down the line is that?

Ben: (Laughs.) I really can’t say.

Ben then showed me an austere white-on-white globe of the earth with slow-moving blue specks shooting out from the surface of the North American continent. He explained that each speck represents a search query instance and that the speed and thickness of the particle streams indicate the popularity of the search. The data set at hand was a Yahoo search for “Mattel,” immediately following the lead-in-toys story that drove worried parents by the thousands to the Internet to search for their child’s toy.
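For readers who want to tinker with the idea, here is a minimal sketch of how “search burst events” might be spotted in a query log: count searches for one term per time window and region, then flag the windows that cross a threshold. It is not Yahoo’s method, which Ben did not describe in any detail. The log format, the function names `count_by_window` and `find_bursts`, the one-hour window, the threshold of three and every row of sample data are assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical query log rows: (timestamp, region, query).
# Every value here is invented for illustration; none of it is real search data.
LOG = [
    ("2000-01-01 09:05", "CA", "mattel"),
    ("2000-01-01 09:20", "CA", "mattel"),
    ("2000-01-01 09:40", "CA", "mattel"),
    ("2000-01-01 09:45", "NY", "mattel"),
    ("2000-01-01 09:50", "CA", "weather"),
    ("2000-01-01 10:01", "NY", "mattel"),
    ("2000-01-01 10:15", "NY", "mattel"),
    ("2000-01-01 10:30", "NY", "mattel"),
    ("2000-01-01 10:55", "NY", "mattel"),
]


def count_by_window(log, term, window_minutes=60):
    """Count searches for `term` per (window start, region)."""
    counts = Counter()
    for stamp, region, query in log:
        if query != term:
            continue
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # Round the timestamp down to the start of its time window.
        minutes = (t.hour * 60 + t.minute) // window_minutes * window_minutes
        window = t.replace(hour=minutes // 60, minute=minutes % 60)
        counts[(window, region)] += 1
    return counts


def find_bursts(counts, threshold=3):
    """Return the (window start, region) keys at or above `threshold`, in time order."""
    return sorted(key for key, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for window, region in find_bursts(count_by_window(LOG, "mattel")):
        print(window.strftime("%H:%M"), region)
```

Playing the flagged windows back in time order gives a crude version of the spread Ben described. A finer region code than a state abbreviation would be needed to follow a story that starts out truly local.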

Interesting stuff, and sure to give us too much information about ourselves down the line.

Other than that, NextFest was a bit of a bust from a journalist’s perspective. I mean, don’t get me wrong: the Google Lunar X Prize announcement was uber-cool (journalist was third on my list of childhood aspirations, astronaut and paleontologist being numbers one and two), but really, the lack of media eyecandy was disappointing. I would have thought it would have been a perfect fit for OJR – Wired is journalism that brings you technology and OJR is journalism about technology that brings you journalism – but eh, so it goes. I suppose I shouldn’t be surprised.