Are search engines stealing newspapers' content?

Sam Zell might be new to the newspaper business, but he’s already publicly embraced the “Internet is a parasite on newspapers” meme.

“If all the newspapers in America did not allow Google to steal their content for nothing, what would Google do, and how profitable would Google be?” the Los Angeles Times reported Zell saying to a Stanford University class this week. Zell has agreed to take over Tribune Co., which owns The Times.

Zell’s comment is as ill-informed as the remarks outgoing ASNE president Dave Zeeck made to his organization last week. I’ve been searching the Web through Google, and reading Google News, for years, and can’t recall Google publishing complete newspaper stories under the Google brand.

Yes, Google hyperlinks page titles and publishes short snippets of those pages’ content beneath them on its search engine result and Google News pages. Those links have helped millions of readers find newspaper stories that they would not have read otherwise.

Without newspaper content on Google News and in Google’s search engine result pages [SERPs], newspapers would face an even more dire future, as those millions of readers would instead find other, non-newspaper sources of news and information. And Google wouldn’t lose much at all. Google News doesn’t run ads. And I suspect that Google’s AdWords program would continue to haul in billions of dollars annually even without newspaper.coms in the SERPs.

Surely, not everyone in the newspaper industry shares Zell’s extreme view. For all the ignorance that certain newspaper managers exhibit in public forums, the newspaper industry employs many more sharp individuals with deep knowledge of how the Web works and how to make money from it. They’re found in the online departments of newspaper.coms and they deserve their chance to call the shots on how newspapers will approach the Internet.

The Zeeck and Zell attitude won’t save newspapers, and will serve only to further isolate them from a new and growing generation of Web-savvy readers.

I e-mailed several managers to ask them what they thought of Zell’s comment, and how they think the newspaper industry ought to approach the search engines.

Chris Jennewein
Vice President, Internet Operations, Union-Tribune Publishing Co.

I think newspapers should welcome search engines because they drive traffic to our sites. Our business on the Internet is all about building audience, and audiences find our sites through search engines. Fighting the new reality of Internet search is ultimately self-defeating for our industry.

Ken Sands
Online publisher,

Well, first reaction is that Zell is talking primarily about Google News, not Google itself. He suggests Google wouldn’t have anything without the content from media, and that’s just not true. Google News aggregates links. It doesn’t steal the content so much as organize it in a way that’s meaningful for lots of people. Mainstream media could have done the same thing but didn’t have the vision, the organizational capabilities or the technical skill.

Google argues that it’s a partner of the MSM, sending readers to each site (for free!). I can’t argue with that. Almost 40 percent of the visitors to our site come through Google searches. Those are typically one-time, one-story readers, but they do account for a huge amount of our traffic. There’s a legitimate question about whether these inflated numbers are meaningful… not sure this does our local advertisers much good.

On balance, I’d say Google does more good than harm. The Google Ad Sense for Newspapers alpha project is a good example. They’ve sent advertising to print newspapers for several months without taking a cut of the action. When the program goes to beta, they will begin taking a small percentage. They seem to understand that their best way of making money is to make sure that everyone else makes money, too.

Will Google some day dominate the media world? Quite possibly. But for now at least, it’s sure hard to pass up their goodies.

Steve Yelvington
Internet strategist, Morris Communications

Sam Zell’s comments make a lot of editors feel good, and editors need something to feel good about these days. But news search and aggregation are just a piddling part of Google’s portfolio of services. Google doesn’t even run advertising in its news channel, so claiming that newspaper content is behind Google’s profitability is just saber-rattling.

Google is a significant threat, but not because it’s “stealing” newspaper content. They’re rolling up a lot of local small business money that newspapers can and should be chasing. We’re not going to get that business by sitting around in our newspaper offices and wishing for the return of sideburns and bell-bottom trousers. We need to be out creating new solutions and coming up with better ideas.

I’ll post additional comments as they hit my in-box. Or you can post your thoughts through the comment button below.

Non-traditional sources cloud Google News results

OJR readers may remember a September 2004 article by contributing editor J.D. Lasica that suggested a political bias in the popular online news portal Google News. Searching on the term “John Kerry,” Lasica cited several stories from “second-tier” online-only news and commentary sites that appeared to have a conservative tilt. Among them were headlines such as “Swift Boat Veterans for Truth Expose John Kerry’s Lies.”

Lasica’s piece got me thinking about ways to measure bias in search results. His observations became the basis for my recently completed master’s thesis, the findings of which may be of interest given the ongoing debate about the quality of Google News’s sources. While my analysis does not indicate an overall conservative or liberal slant, it does confirm Lasica’s suspicion that non-traditional news sources are injecting ideologically biased articles into Google News search results. The data show that articles returned in Google News searches are significantly more likely to have an ideological bias than those returned in searches on Yahoo News. (See below for detailed study results.)

The study examines search results for evidence of bias. By analyzing the content of articles returned in searches on the major-party presidential candidates in the days leading up to the 2004 election, it aims to assess the aggregator’s level of political bias. The study looks at balance within these articles as an indicator of bias, using results from the same searches on Yahoo News as a benchmark.

Notably, almost all of the additional bias in articles returned by Google News searches can be attributed to the site’s use of non-traditional news sources. In other words, if we consider only sources affiliated with old-media companies, the average bias scores for articles on Google News and Yahoo News are virtually identical.

Google News, still in beta three and a half years after its launch, tracks the top stories on some 4,500 English-language news sites, updating its index roughly every 15 minutes. The ability to effectively search this huge collection of timely information has helped make Google News one of the Internet’s most popular news portals, drawing about 5.9 million visitors a month.

But its groundbreaking method of identifying top stories based on how frequently they appear on sites in its index – and doing so entirely without human intervention – put the portal in critics’ crosshairs from the beginning.

That its algorithms are able to automatically determine relative importance of stories and present a front page with top stories in different subject areas has been interpreted by some as an ominous sign that computers will someday make human editors obsolete. At the same time, users have ridiculed bugs that cause the site to occasionally attach a photo to an unrelated article or elevate a relatively minor story to a prominent spot on its front page.

(It should be noted that my research concerns only the site’s search results and is unrelated to Google News’s practice of automatically ranking the top stories on its front page and section fronts.)

Because it uses no human editors, Google News has considered itself immune to bias.

“The algorithms do not understand which sources are right-leaning or left-leaning,” Google News inventor Krishna Bharat told Lasica last year. “They’re apolitical, which is good.”

But choosing which sites to index is perhaps as subjective an editorial decision as selecting the stories to play on the front page of a newspaper or website.

Google News does not share the list of sites it crawls, a practice that has resulted in a lot of speculation about its criteria for inclusion and the notion that there might be some ideological imbalance in its list of sources.

In an attempt to shed some light on the question, one blogger has written a script that grabs the news portal’s front page regularly and logs all the sources that it finds. The count stood at 2,256 as of Wednesday night, indicating that about half of the 4,500 sources have been identified.

Along with the mainstream sites in the list are a number of relatively obscure, online-only news sources (some of which are best described as weblogs), including the opinion sites and

Earlier this year, Google News dropped several sites, including the white supremacist journal National Vanguard, from its index after users complained that hate speech was turning up in searches.

It seems the news portal has been making plenty of its own news lately – albeit unwittingly. In March, Agence France-Presse charged in a lawsuit that Google was infringing its copyright by displaying AFP material on Google News pages. Days later, Google announced it would stop using AFP content. Since then, the Associated Press also has expressed “concern” about Google’s use of its material without payment.

And just this month we learned of a patent application filed by Google scientists in 2003, laying claim to methods of “improving the ranking of news articles” based on the “quality” of the articles’ sources – an apparent admission that relevance alone is not a satisfactory measure of an article’s value.

Google’s patent application offers the following variables, among others, as possible measures of a source’s quality: the volume of traffic it receives, the amount of content it produces, the speed at which it responds to breaking news, the size of its editorial staff and the number of bureaus it maintains. Any of these factors would appear to favor traditional media outlets.

If this is an admission that non-traditional sources are of lower quality, how does that square with Google News’s stated goal of increasing the diversity of viewpoints presented on its pages?

Google News currently does not distinguish opinion from fact in its search results (though it now attempts to identify press releases and satire). Hence, editorials and other opinion pieces frequently appear alongside straight news stories in search results. It is not clear that average users can make the distinction between the two, especially given the many online-only sources that peddle a confusing mixture of fact and opinion.

Ranking news stories based on some measure of quality may be a step in the right direction, but to maintain its credibility, Google News needs transparency – both in its selection criteria and its list of sources.

Key findings of the study

I was intrigued by the notion that a site without human editors might still be biased, and I wanted to test it scientifically. To do this, I analyzed the content of articles returned in searches on “George W. Bush” and “John Kerry” in the weeks leading up to the 2004 election. [More complete results and a detailed description of the research process are available in the full study (PDF).]

I wrote a crawler script to retrieve the results from Google News and Yahoo News for the search terms “George W. Bush” and “John Kerry” at four-hour intervals. The program was run during the two weeks preceding the Nov. 2 presidential election, resulting in a total of 80 “snapshots.” Each snapshot contained four sets of search results: “George W. Bush” on Google News, “George W. Bush” on Yahoo News, “John Kerry” on Google News and “John Kerry” on Yahoo News. The program also downloaded the full text of the top articles returned in each result list.

For each of five snapshots, chosen randomly, the first five articles from each of the four result lists were analyzed, ensuring an equal number of Bush and Kerry results and an equal number of Google News and Yahoo News results. This resulted in a sample of 100 articles, which then were examined sentence-by-sentence. Overall, 1,587 sentences were coded in one of five ways:

  • Favorable to Bush
  • Unfavorable to Bush
  • Favorable to Kerry
  • Unfavorable to Kerry
  • Neutral
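
The sampling step described above can be sketched in a few lines of Python. This is only an illustration, not the script used in the study: the snapshot layout (a dict keyed by portal and query), the function name `draw_sample` and the placeholder article ids are all assumptions made for the example.

```python
import random

PORTALS = ("Google News", "Yahoo News")
QUERIES = ("George W. Bush", "John Kerry")


def draw_sample(snapshots, n_snapshots=5, top_n=5, seed=None):
    """Pick n_snapshots at random and keep the top_n articles of each result list.

    `snapshots` is assumed to be a list of dicts keyed by (portal, query);
    each value is that result list's articles in ranked order.
    """
    rng = random.Random(seed)
    chosen = rng.sample(snapshots, n_snapshots)  # sampling without replacement
    sample = []
    for snap in chosen:
        for portal in PORTALS:
            for query in QUERIES:
                for article in snap[(portal, query)][:top_n]:
                    sample.append((portal, query, article))
    return sample


# Placeholder data only: 80 snapshots, each with four ranked lists of ten ids.
snapshots = [
    {(p, q): [f"snap{i}/{p}/{q}/{rank}" for rank in range(10)]
     for p in PORTALS for q in QUERIES}
    for i in range(80)
]

sample = draw_sample(snapshots, seed=0)
print(len(sample))  # 5 snapshots x 4 result lists x 5 articles = 100
```

Taking the same number of articles from every result list is what keeps the sample evenly split between the two candidates and the two portals.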

Using the values for each sentence, two scores were calculated for each article, measuring the degree of the article’s overall favorability to each candidate. These favorability scores could take values from –1 (completely unfavorable) to 1 (completely favorable), with 0 being neutral. For instance, a Kerry favorability score of –0.3 for an article would indicate that, on balance, 30% of the content of the article is unfavorable to John Kerry.
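
One way to get a score with these properties is to subtract the number of unfavorable sentences from the number of favorable ones and divide by the total sentence count. That formula is an assumption that fits the description and the –0.3 example; the exact calculation is in the full study. The code labels and the ten-sentence article below are invented for illustration.

```python
# Hypothetical labels for the five sentence codes.
FAV_BUSH, UNFAV_BUSH = "fav_bush", "unfav_bush"
FAV_KERRY, UNFAV_KERRY = "fav_kerry", "unfav_kerry"
NEUTRAL = "neutral"


def favorability(codes, favorable, unfavorable):
    """Assumed formula: (favorable - unfavorable sentences) / all sentences.

    Returns a value between -1 and 1; an article with no coded
    sentences is treated as neutral (0.0).
    """
    if not codes:
        return 0.0
    return (codes.count(favorable) - codes.count(unfavorable)) / len(codes)


# A made-up ten-sentence article, not taken from the study's data:
# one sentence favorable to Kerry, four unfavorable, five neutral.
article = [FAV_KERRY] + [UNFAV_KERRY] * 4 + [NEUTRAL] * 5

kerry_score = favorability(article, FAV_KERRY, UNFAV_KERRY)  # (1 - 4) / 10
bush_score = favorability(article, FAV_BUSH, UNFAV_BUSH)     # (0 - 0) / 10
print(kerry_score, bush_score)
```

Under this assumed formula, neutral sentences pull a score toward 0 because they count in the denominator but not the numerator.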

Two charts – one for Google News and the other for Yahoo News – provide a basic summary of the data. They show the two candidates’ favorability scores for each article, plotted against each other. This facilitates comparison of the overall favorability of the two portals’ search results.

Favorability plots by news portal



Each data point represents an article, and its placement on the chart represents its favorability to the two candidates:

  • Upper left quadrant: Article is favorable to Kerry and unfavorable to Bush
  • Upper right quadrant: Article is favorable to both
  • Lower right quadrant: Article is favorable to Bush and unfavorable to Kerry
  • Lower left quadrant: Article is unfavorable to both

In other words, articles in the upper right and lower left are more balanced than those in the upper left and lower right. Articles closer to the center are more neutral. The circular boundary is a density ellipse drawn to make it easier to see patterns in the data.
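
The quadrant descriptions imply that Bush favorability runs along the horizontal axis and Kerry favorability along the vertical axis. Under that reading, a small helper can place any pair of scores. The function name and the handling of points that sit exactly on an axis are assumptions for the example.

```python
def quadrant(bush, kerry):
    """Name the chart quadrant for a pair of favorability scores.

    Assumes Bush favorability on the horizontal axis and Kerry
    favorability on the vertical axis, as the quadrant list implies.
    """
    if bush == 0 or kerry == 0:
        return "on an axis: neutral toward at least one candidate"
    if kerry > 0:
        if bush > 0:
            return "upper right: favorable to both"
        return "upper left: favorable to Kerry, unfavorable to Bush"
    if bush > 0:
        return "lower right: favorable to Bush, unfavorable to Kerry"
    return "lower left: unfavorable to both"


print(quadrant(bush=-0.5, kerry=0.25))
print(quadrant(bush=0.5, kerry=0.25))
```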

To determine the direction of bias in a particular story, we compare favorability scores for Bush and Kerry. Where they are similar, the article is more balanced. Each article is assigned a balance score, which is the difference between the two favorability scores. A balance score greater than 0 would indicate bias toward Kerry while a negative score shows bias toward Bush. Both Google News and Yahoo News have average article balance scores that are very close to 0, indicating balanced search results. In other words, both the Google News and Yahoo News searches returned articles that were, on the whole, equally favorable to both George W. Bush and John Kerry. This is what we would expect to see of balanced search results at a time when public opinion was pretty evenly divided between the two candidates.
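
Because a positive balance score indicates bias toward Kerry, the difference is presumably taken as Kerry favorability minus Bush favorability. A minimal sketch under that assumption follows; the function names and the optional tolerance are illustrative, not part of the study.

```python
def balance(bush, kerry):
    """Balance score, assumed to be Kerry favorability minus Bush favorability."""
    return kerry - bush


def direction(score, tolerance=0.0):
    """Describe which way a balance score leans.

    `tolerance` is a hypothetical cutoff for treating near-zero scores
    as balanced; the study text does not define one.
    """
    if score > tolerance:
        return "toward Kerry"
    if score < -tolerance:
        return "toward Bush"
    return "balanced"


score = balance(bush=0.5, kerry=-0.25)
print(score, direction(score))
```

An article that is equally favorable, or equally unfavorable, to both candidates gets a balance score of 0, which matches the point that the upper right and lower left quadrants hold the more balanced articles.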

Balance scores

However, the spread of articles’ balance scores reveals an important difference: Articles returned by Google News tend to be significantly more biased in one direction or the other than articles from Yahoo News.

Besides being coded for favorability, articles were also classified by whether they came from an independent, online-only source or from a website affiliated with a traditional news source. A traditional news source is defined as a wire service, newspaper, magazine, TV station, radio station, broadcast network or cable network. (Content from one of these sources that is syndicated on a news aggregator such as Yahoo News is also considered traditional.) Of the articles returned by Google News, 40% were from non-traditional news sources, while only 24% of the Yahoo News results came from non-traditional sources.

When articles from non-traditional sources are omitted from the comparison, there is no significant difference in the spread of the article balance scores between Google News and Yahoo News. This indicates that virtually all of the difference in bias between articles returned by Google News and those returned by Yahoo News is attributable to Google’s use of non-traditional news sources.
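
The comparison can be pictured as measuring the spread of balance scores for each portal twice: once with all articles and once with traditional sources only. The sketch below uses standard deviation as the measure of spread, which is an assumption, since this summary does not say which statistic or significance test the study used. The records are invented placeholders chosen only to show the mechanics; they are not the study's data.

```python
from statistics import pstdev


def spread_by_portal(articles, traditional_only=False):
    """Standard deviation of balance scores for each portal.

    `articles` is assumed to be a list of dicts with the hypothetical
    keys 'portal', 'traditional' (bool) and 'balance' (float).
    """
    groups = {}
    for article in articles:
        if traditional_only and not article["traditional"]:
            continue
        groups.setdefault(article["portal"], []).append(article["balance"])
    return {portal: pstdev(scores) for portal, scores in groups.items()}


# Invented placeholder records, NOT results from the study.
articles = [
    {"portal": "Google News", "traditional": True, "balance": 0.25},
    {"portal": "Google News", "traditional": True, "balance": -0.25},
    {"portal": "Google News", "traditional": False, "balance": 0.75},
    {"portal": "Google News", "traditional": False, "balance": -0.75},
    {"portal": "Yahoo News", "traditional": True, "balance": 0.25},
    {"portal": "Yahoo News", "traditional": True, "balance": -0.25},
    {"portal": "Yahoo News", "traditional": True, "balance": 0.25},
    {"portal": "Yahoo News", "traditional": True, "balance": -0.25},
]

print(spread_by_portal(articles))
print(spread_by_portal(articles, traditional_only=True))
```

In both portals the average balance score in this toy data is 0, so any difference shows up only in the spread, which is the pattern the study describes.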