Non-traditional sources cloud Google News results

OJR readers may remember a September 2004 article by contributing editor J.D. Lasica that suggested a political bias in the popular online news portal Google News. Searching on the term “John Kerry,” Lasica cited several stories from “second-tier” online-only news and commentary sites that appeared to have a conservative tilt. Among them were headlines such as “Swift Boat Veterans for Truth Expose John Kerry’s Lies.”

Lasica’s piece got me thinking about ways to measure bias in search results. His observations became the basis for my recently completed master’s thesis, the findings of which may be of interest given the ongoing debate about the quality of Google News’s sources. While my analysis does not indicate an overall conservative or liberal slant, it does confirm Lasica’s suspicion that non-traditional news sources are injecting ideologically biased articles into Google News search results. The data show that articles returned in Google News searches are significantly more likely to have an ideological bias than those returned in searches on Yahoo News. (See below for detailed study results.)

The study examines Google News search results for evidence of bias. By analyzing the content of articles returned in searches on the major-party presidential candidates in the days leading up to the 2004 election, it aims to assess the aggregator’s level of political bias. The study looks at balance within these articles as an indicator of bias, using results from the same searches on Yahoo News as a benchmark.

Notably, almost all of the additional bias in articles returned by Google News searches can be attributed to the site’s use of non-traditional news sources. In other words, if we consider only sources affiliated with old-media companies, the average bias scores for articles on Google News and Yahoo News are virtually identical.

Google News, still in beta three and a half years after its launch, tracks the top stories on some 4,500 English-language news sites, updating its index roughly every 15 minutes. The ability to effectively search this huge collection of timely information has helped make Google News one of the Internet’s most popular news portals, drawing about 5.9 million visitors a month.

But its groundbreaking method of identifying top stories based on how frequently they appear on sites in its index – and doing so entirely without human intervention – put the portal in critics’ crosshairs from the beginning.

That its algorithms are able to automatically determine the relative importance of stories and present a front page with top stories in different subject areas has been interpreted by some as an ominous sign that computers will someday make human editors obsolete. At the same time, users have ridiculed bugs that cause the site to occasionally attach a photo to an unrelated article or elevate a relatively minor story to a prominent spot on its front page.

(It should be noted that my research concerns only the site’s search results and is unrelated to Google News’s practice of automatically ranking the top stories on its front page and section fronts.)

Because it uses no human editors, Google News has considered itself immune to bias.

“The algorithms do not understand which sources are right-leaning or left-leaning,” Google News inventor Krishna Bharat told Lasica last year. “They’re apolitical, which is good.”

But choosing which sites to index is perhaps as subjective an editorial decision as selecting the stories to play on the front page of a newspaper or website.

Google News does not share the list of sites it crawls, a practice that has fueled speculation about its criteria for inclusion and about a possible ideological imbalance in its list of sources.

In an attempt to shed some light on the question, one blogger has written a script that grabs the news portal’s front page regularly and logs all the sources that it finds. The count stood at 2,256 as of Wednesday night, indicating that about half of the 4,500 sources have been identified.

Along with the mainstream sites in the list are a number of relatively obscure, online-only news sources (some of which are best described as weblogs), including the opinion sites MichNews.com and Useless-Knowledge.com.

Earlier this year, Google News dropped several sites, including the white supremacist journal National Vanguard, from its index after users complained that hate speech was turning up in searches.

It seems the news portal has been making plenty of its own news lately – albeit unwittingly. In March, Agence France-Presse charged in a lawsuit that Google was infringing its copyright by displaying AFP material on Google News pages. Days later, Google announced it would stop using AFP content. Since then, the Associated Press also has expressed “concern” about Google’s use of its material without payment.

And just this month we learned of a patent application filed by Google scientists in 2003, laying claim to methods of “improving the ranking of news articles” based on the “quality” of the articles’ sources – an apparent admission that relevance alone is not a satisfactory measure of an article’s value.

Google’s patent application offers the following variables, among others, as possible measures of a source’s quality: the volume of traffic it receives, the amount of content it produces, the speed at which it responds to breaking news, the size of its editorial staff and the number of bureaus it maintains. Any of these factors would appear to favor traditional media outlets.

If this is an admission that non-traditional sources are of lower quality, how does that square with Google News’s stated goal of increasing the diversity of viewpoints presented on its pages?

Google News currently does not distinguish opinion from fact in its search results (though it now attempts to identify press releases and satire). Hence, editorials and other opinion pieces frequently appear alongside straight news stories in search results. It is not clear that average users can make the distinction between the two, especially given the many online-only sources that peddle a confusing mixture of fact and opinion.

Ranking news stories based on some measure of quality may be a step in the right direction, but to maintain its credibility, Google News needs transparency – both in its selection criteria and its list of sources.


Key findings of the study

I was intrigued by the notion that a site without human editors might still be biased, and I wanted to test it scientifically. To do this, I analyzed the content of articles returned in searches on “George W. Bush” and “John Kerry” in the weeks leading up to the 2004 election. [More complete results and a detailed description of the research process are available in the full study (PDF).]

I wrote a crawler script to retrieve the results from Google News and Yahoo News for the search terms “George W. Bush” and “John Kerry” at four-hour intervals. The program was run during the two weeks preceding the Nov. 2 presidential election, resulting in a total of 80 “snapshots.” Each snapshot contained four sets of search results: “George W. Bush” on Google News, “George W. Bush” on Yahoo News, “John Kerry” on Google News and “John Kerry” on Yahoo News. The program also downloaded the full text of the top articles returned in each result list.

For each of five snapshots, chosen randomly, the first five articles from each of the four result lists were analyzed, ensuring an equal number of Bush and Kerry results and an equal number of Google News and Yahoo News results. This resulted in a sample of 100 articles, which then were examined sentence-by-sentence. Overall, 1,587 sentences were coded in one of five ways:

  • Favorable to Bush
  • Unfavorable to Bush
  • Favorable to Kerry
  • Unfavorable to Kerry
  • Neutral
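The sampling arithmetic can be sketched in a few lines of Python. This is only an illustration of the design described above, not the script used in the study: the function name `draw_sample`, the tuple layout and the fixed random seed are all assumptions made for the example.

```python
import random

PORTALS = ["Google News", "Yahoo News"]
QUERIES = ["George W. Bush", "John Kerry"]


def draw_sample(num_snapshots=80, snapshots_to_pick=5, top_n=5, seed=0):
    """Return (snapshot, portal, query, rank) tuples for the articles to code.

    Hypothetical helper: picks snapshots at random, then takes the first
    top_n results from each of the four result lists in every snapshot.
    """
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    chosen = sorted(rng.sample(range(num_snapshots), snapshots_to_pick))
    return [
        (snap, portal, query, rank)
        for snap in chosen
        for portal in PORTALS
        for query in QUERIES
        for rank in range(1, top_n + 1)
    ]


sample = draw_sample()
# 5 snapshots x 4 result lists x 5 articles = 100 articles
print(len(sample))
```

Because every chosen snapshot contributes the same number of results from each list, the sample is automatically split evenly between the two portals and between the two candidates.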

Using the values for each sentence, two scores were calculated for each article, measuring the article’s overall favorability to each candidate. These favorability scores can range from –1 (completely unfavorable) to 1 (completely favorable), with 0 being neutral. For instance, a Kerry favorability score of –0.3 would indicate that, on balance, 30% of the article’s content is unfavorable to John Kerry.
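One plausible way to compute such a score is sketched below in Python. The exact formula is not spelled out here, so this assumes the score is the number of favorable sentences minus the number of unfavorable sentences, divided by all sentences in the article. The code labels (`fav_kerry`, `unfav_bush` and so on) and the function name are invented for the example.

```python
from collections import Counter


def favorability(codes, candidate):
    """Net share of an article's sentences that favor a candidate.

    codes: one label per sentence, e.g. "fav_kerry", "unfav_bush", "neutral"
    candidate: "bush" or "kerry"
    Assumed formula: (favorable - unfavorable) / total sentences.
    """
    if not codes:
        return 0.0
    counts = Counter(codes)
    net = counts[f"fav_{candidate}"] - counts[f"unfav_{candidate}"]
    return net / len(codes)


# Hypothetical 10-sentence article: 1 sentence favorable to Kerry,
# 4 unfavorable to Kerry, 5 neutral.
article = ["fav_kerry"] + ["unfav_kerry"] * 4 + ["neutral"] * 5
print(favorability(article, "kerry"))  # -0.3
print(favorability(article, "bush"))   # 0.0
```

Under this assumption, an article made up entirely of sentences unfavorable to a candidate scores –1 and one made up entirely of favorable sentences scores 1.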

Two charts – one for Google News and the other for Yahoo News – provide a basic summary of the data. They show the two candidates’ favorability scores for each article, plotted against each other. This facilitates comparison of the overall favorability of the two portals’ search results.

[Charts: Favorability plots by news portal (Google News, Yahoo News)]

Each data point represents an article, and its placement on the chart represents its favorability to the two candidates:

  • Upper left quadrant: Article is favorable to Kerry and unfavorable to Bush
  • Upper right quadrant: Article is favorable to both
  • Lower right quadrant: Article is favorable to Bush and unfavorable to Kerry
  • Lower left quadrant: Article is unfavorable to both

In other words, articles in the upper right and lower left are more balanced than those in the upper left and lower right. Articles closer to the center are more neutral. The circular boundary is a density ellipse drawn to make it easier to see patterns in the data.

To determine the direction of bias in a particular story, we compare favorability scores for Bush and Kerry. Where they are similar, the article is more balanced. Each article is assigned a balance score, which is the difference between the two favorability scores. A balance score greater than 0 would indicate bias toward Kerry while a negative score shows bias toward Bush. Both Google News and Yahoo News have average article balance scores that are very close to 0, indicating balanced search results. In other words, both the Google News and Yahoo News searches returned articles that were, on the whole, equally favorable to both George W. Bush and John Kerry. This is what we would expect to see of balanced search results at a time when public opinion was pretty evenly divided between the two candidates.
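As a rough sketch of the balance score, the Python below subtracts the Bush favorability score from the Kerry favorability score, which matches the sign convention described above (positive toward Kerry, negative toward Bush). The three article score pairs are made up for illustration and are not study data.

```python
from statistics import mean


def balance_score(kerry_fav, bush_fav):
    """Positive: leans toward Kerry. Negative: leans toward Bush. Zero: balanced."""
    return kerry_fav - bush_fav


# Hypothetical (kerry_fav, bush_fav) pairs for three articles
articles = [(0.4, -0.2), (-0.5, 0.1), (0.0, 0.0)]
scores = [balance_score(k, b) for k, b in articles]
average = mean(scores)

# Two strongly slanted articles in opposite directions cancel out,
# so the average sits near 0 even though individual articles are biased.
print([round(s, 2) for s in scores], round(average, 2))
```

The example also shows why an average near 0 is not the whole story: a set of search results can be balanced overall while the individual articles in it lean heavily one way or the other.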

[Chart: Balance scores]

However, the spread of articles’ balance scores reveals an important difference: Articles returned by Google News tend to be significantly more biased in one direction or the other than articles from Yahoo News.

Besides being coded for favorability, articles were also classified by whether they came from an independent, online-only source (such as Salon.com) or a website affiliated with a traditional news source. A traditional news source is defined as a wire service, newspaper, magazine, TV station, radio station, broadcast network or cable network. (Content from one of these sources that is syndicated on a news aggregator such as Yahoo News is also considered traditional.) Of the articles returned by Google News, 40% were from non-traditional news sources, while only 24% of the Yahoo News results came from non-traditional sources.

When articles from non-traditional sources are omitted from the comparison, there is no significant difference in the spread of the article balance scores between Google News and Yahoo News. This indicates that virtually all of the difference in bias between articles returned by Google News and those returned by Yahoo News is attributable to Google’s use of non-traditional news sources.
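The comparison can be illustrated with a small Python sketch that measures spread as the sample standard deviation of balance scores, first for all articles and then for traditional sources only. The records below are invented to show the pattern and are not the study's data. The statistical test used in the study is not named here, so this sketch only compares the raw spreads and does not test for significance.

```python
from statistics import stdev

# Hypothetical (portal, source_type, balance_score) records
records = [
    ("google", "traditional", 0.05),
    ("google", "traditional", -0.10),
    ("google", "traditional", 0.10),
    ("google", "nontraditional", 0.60),
    ("google", "nontraditional", -0.55),
    ("yahoo", "traditional", 0.10),
    ("yahoo", "traditional", -0.05),
    ("yahoo", "traditional", -0.10),
    ("yahoo", "traditional", 0.05),
    ("yahoo", "nontraditional", 0.20),
]


def spread(rows, portal, traditional_only=False):
    """Sample standard deviation of balance scores for one portal."""
    scores = [
        score
        for p, kind, score in rows
        if p == portal and (not traditional_only or kind == "traditional")
    ]
    return stdev(scores)


for portal in ("google", "yahoo"):
    print(
        portal,
        round(spread(records, portal), 2),
        round(spread(records, portal, traditional_only=True), 2),
    )
```

In this made-up data, the gap between the two portals' spreads largely disappears once the non-traditional sources are filtered out, which is the pattern the study reports.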

About Eric Ulken

Eric Ulken left his job as editor for interactive technology at the Los Angeles Times in November 2008 to travel and report on trends and best practices in online journalism. He is a 2005 graduate of the communication management M.A. program at USC's Annenberg School for Communication, where he was an editor and producer for OJR and Japan Media Review. He has been a web monkey at newsrooms in six states, including his native Louisiana.

Comments

  1. The findings need to be further refined. However, with the patent that Google has (or has applied for, I don’t remember which), it’s clear that Google realizes that they have a problem and need a weighting mechanism.

    Let’s be honest, Google is using the citation analysis techniques first developed by ISI. However, even ISI uses an editorial review process to determine which sources are included in their databases. This is a no-brainer next step for Google.

    Why not just take the 7,000 sources listed in Bacon’s Internet Media Directory and put them through an editorial review process? This would seem like the best solution.

    Also, Google (and Yahoo, for that matter) can offer a feature similar to CyberAlert’s IntelliClips whereby users, if they want, can select which sources they want scanned.