Audit suggests Google favors a small number of major outlets - Columbia Journalism Review

In the last week of April, nearly 23 percent of all traffic to news sites tracked by web analytics firm Parse.ly came from search engines. Google alone accounts for nearly half of external referral traffic to news sites, that is, traffic that comes from platforms, apps, and other outside sources. Together with the fact that Facebook referral traffic is on the wane, this means that Google’s search algorithm is now perhaps the most powerful mediator of online attention to news.

But for all the influence Google has in directing attention, we know painfully little about how its algorithm selects and curates news. Which sites does it direct traffic toward? And how does Google’s news curation impact the diversity of information found?

To find out, the Computational Journalism Lab at Northwestern, including Daniel Trielli and me, undertook an audit study of the “Top Stories” box on Google search. Top Stories often shows up in the prime real estate at the top of search results, presenting a carousel of news articles relevant to the query.

To audit Top Stories, we scraped Google results for more than 200 queries related to news events in November 2017. We selected the queries by checking Google Trends every day and manually choosing terms related to hard news events. These included names of people in the news such as “colin kaepernick,” breaking news events such as “earthquake,” and issue-specific queries such as “tax reform” or “healthcare gov.” We set up our scraper to minimize the potential for result personalization (the process by which Google tailors its search results to an account or IP address based on past use), and ran each query once per minute for a full 24 hours.

In total, we collected 6,302 unique links to news articles shown in the Top Stories box. We count one article impression each time one of those links appears in the box.

The data shows that just 20 news sources account for more than half of article impressions. The top 20 percent of sources (136 of 678) accounted for 86 percent of article impressions. And the top three accounted for 23 percent: CNN, The New York Times, and The Washington Post. These statistics underscore the degree of concentration of attention to a relatively narrow slice of news sources.
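
Concentration figures of this kind can be computed from a log of impressions in a few lines of Python. What follows is a minimal sketch, not the code used in the audit: the function name `top_k_share` and the toy data are hypothetical, and it assumes each impression has already been reduced to the name of its source.

```python
from collections import Counter

def top_k_share(impression_sources, k):
    """Return the fraction of all impressions held by the k most frequent sources."""
    counts = Counter(impression_sources)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(k) returns the k (source, count) pairs with the highest counts.
    return sum(n for _, n in counts.most_common(k)) / total

# Toy data invented for illustration: one list entry per observed impression.
toy = ["cnn.com"] * 5 + ["nytimes.com"] * 3 + ["example-local.com"] * 2
print(top_k_share(toy, 1))  # 0.5
print(top_k_share(toy, 2))  # 0.8
```

Run over the full set of impressions with k set to 3 or 20, the same calculation would yield shares comparable to those reported above.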

Of course, the concentration of sources also varies depending on the query. On average there were 19 sources per query, but 30 percent of queries had 10 or fewer sources. And sometimes, even if there were more sources, most of the impressions could go to just a handful. For instance, the query “rex tillerson” had 38 sources, but just two of those sources—the Times and CNN—were responsible for 75 percent of the article impressions.
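
The per-query view can be sketched the same way. The example below is an illustration under stated assumptions rather than the lab's method: it assumes impressions are available as (query, source) pairs, and the name `per_query_summary` and the toy data are made up.

```python
from collections import Counter, defaultdict

def per_query_summary(impressions, top_n=2):
    """Map each query to (distinct source count, impression share of its top_n sources)."""
    by_query = defaultdict(Counter)
    for query, source in impressions:
        by_query[query][source] += 1

    summary = {}
    for query, counts in by_query.items():
        total = sum(counts.values())
        top = sum(n for _, n in counts.most_common(top_n))
        summary[query] = (len(counts), top / total)
    return summary

# Toy data invented for illustration.
toy_impressions = (
    [("query a", "source1")] * 6
    + [("query a", "source2")] * 3
    + [("query a", "source3")] * 1
    + [("query b", "source1")] * 2
    + [("query b", "source4")] * 2
)
summary = per_query_summary(toy_impressions)
print(summary["query a"])  # (3, 0.9)
```

A query can list many sources and still be dominated by two of them, which is the pattern the “rex tillerson” example describes.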

Prior research has shown that search engines can affect users’ attitudes, shape opinions, alter perceptions and reinforce stereotypes, as well as affect how voters come to be informed during elections. As such, media diversity is an important aspect to the way that Google—or any news aggregator—curates sources and perspectives.

To get at this issue in our audit, we looked at the diversity of sources surfaced in Google Top Stories in terms of their ideological lean. More specifically, we used ratings data published in an earlier study that identifies the ideological alignment of the top 500 most-shared news sites on Facebook. The ratings don’t measure the slant of the media outlet per se, but rather reflect the self-reported political affiliation of Facebook users sharing content from those sources. The criteria were published in the peer-reviewed journal Science in June 2015 by Eytan Bakshy, Solomon Messing, and Lada Adamic from Facebook’s Core Data Science team.

Our data shows that 62.4 percent of article impressions were from sources rated by that research as left-leaning, whereas 11.3 percent were from sources rated as right-leaning. The remaining 26.3 percent of impressions were from news sources that didn’t have ratings. But even if every one of those unrated impressions happened to come from right-leaning sources, the trend would still be clear: A higher proportion of left-leaning sources appear in Top Stories. (Again, this means news sources shared on Facebook more often by people with a left-leaning political affiliation.)
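
The worst-case reasoning here is simple arithmetic, spelled out below using only the percentages given in the text. The variable names are illustrative.

```python
# Shares of article impressions, in percent, as reported in the text.
left = 62.4
right = 11.3
unrated = 26.3

# Worst case for the finding: assume every unrated impression is right-leaning.
worst_case_right = right + unrated

print(round(worst_case_right, 1))  # 37.6
print(left > worst_case_right)     # True
```

Even under that extreme assumption, left-leaning sources would still hold the larger share of impressions.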

Is there simply more news produced on the left? It appears so. We confirmed this by searching the GDELT database of news articles for the same queries we used to audit Google. In GDELT there were 2.2 times as many articles from left-leaning sources as right-leaning sources. But in Google Top Stories that ratio was 3.2, indicating that the curation algorithm was slightly magnifying the left-leaning skew in comparison to the GDELT baseline.
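
One way to express the size of that magnification is to divide one ratio by the other. This is a back-of-the-envelope calculation from the two figures in the text, not a statistic reported by the audit; the name `amplification` is illustrative.

```python
gdelt_ratio = 2.2        # left-to-right article ratio in GDELT (from the text)
top_stories_ratio = 3.2  # the same ratio in Google Top Stories (from the text)

# How many times larger the Top Stories skew is than the GDELT baseline.
amplification = top_stories_ratio / gdelt_ratio
print(round(amplification, 2))  # 1.45
```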

Another aspect of Google’s news curation is the timeliness of articles selected. Just how quickly does Google churn through news content? Since the Top Stories box provides the approximate age of each article (e.g. “2 hours ago”), we were able to tabulate the recency of articles. We found that 83.5 percent of articles were less than 24 hours old and 13.1 percent were less than an hour old. This means that organizations that can generate fresh copy may be more apt to have that material selected by the curation algorithm.
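
Tabulating recency from labels like “2 hours ago” requires turning each label into a number. The sketch below shows one way to do that. It is an assumption-laden illustration, not the audit's parser: the label grammar (a number, a unit of minutes, hours, or days, then “ago”) is a guess at the format, and the function names and sample labels are hypothetical.

```python
import re

# Assumed label grammar: "<number> <unit>[s] ago". Real labels may differ.
_UNIT_MINUTES = {"min": 1, "minute": 1, "hour": 60, "day": 1440}
_PATTERN = re.compile(r"^(\d+)\s+(min|minute|hour|day)s?\s+ago$")

def age_in_minutes(label):
    """Convert a label such as '2 hours ago' to minutes, or None if unrecognized."""
    match = _PATTERN.match(label.strip().lower())
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_MINUTES[match.group(2)]

def share_younger_than(labels, limit_minutes):
    """Fraction of parseable labels whose age is below limit_minutes."""
    ages = [a for a in (age_in_minutes(label) for label in labels) if a is not None]
    if not ages:
        return 0.0
    return sum(a < limit_minutes for a in ages) / len(ages)

# Sample labels invented for illustration.
labels = ["35 mins ago", "2 hours ago", "1 day ago", "5 hours ago"]
print(share_younger_than(labels, 24 * 60))  # 0.75
print(share_younger_than(labels, 60))       # 0.25
```

Because the labels are approximate, any shares computed this way inherit that imprecision.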

In the last part of our analysis, we looked at how much traffic an appearance in Top Stories actually generates. To do this, we combined our scraped data with referral data provided by Chartbeat. Across queries, the number of people searching, and therefore the number who could be referred, varies widely. As an example, in our data, “Matt Lauer” generated 3,961 referrals for each article impression we observed, far more than the average. But a majority of search terms (58 percent) averaged fewer than 100 referrals per impression.
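
A referrals-per-impression rate of this kind can be sketched as a simple division per query. The example below is a hypothetical illustration: it assumes referral totals and impression counts are already aggregated by query, which may not match how the Chartbeat data was actually joined, and all names and numbers in it are invented.

```python
def referrals_per_impression(referrals, impressions):
    """Divide each query's referral total by its observed impression count."""
    return {
        query: referrals[query] / impressions[query]
        for query in referrals
        if impressions.get(query, 0) > 0  # skip queries with no observed impressions
    }

def share_below(rates, threshold):
    """Fraction of queries whose rate falls below the threshold."""
    if not rates:
        return 0.0
    return sum(rate < threshold for rate in rates.values()) / len(rates)

# Toy totals invented for illustration.
toy_referrals = {"query a": 5000, "query b": 300, "query c": 90}
toy_counts = {"query a": 10, "query b": 6, "query c": 3}

rates = referrals_per_impression(toy_referrals, toy_counts)
print(rates)  # {'query a': 500.0, 'query b': 50.0, 'query c': 30.0}
print(round(share_below(rates, 100), 2))  # 0.67
```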
