Studying Online News Popularity

Srini Nariangadu
3 min read · Oct 26, 2020

The How, What and When of Article Popularity

The Dataset

In 2015, researchers from the Universities of Porto and Minho in Portugal collected data on roughly 39,000 articles that had been published on Mashable between January 7, 2013 and January 7, 2015.

The data was the basis for research that resulted in the paper “A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News”.

In it, they proposed and validated an Intelligent Decision Support System (IDSS) that could analyze a Mashable article and propose structural changes to improve its popularity. One example cited in the paper is changing the number of words in the title.

A lot has changed since 2015, especially Mashable’s Channels: other than Tech, Entertainment and Culture, none of the Channels from 2015 have survived. However, the dataset is still an information-rich bundle and worth another look.

Some Assembly Required

The Channel information for each article was spread over multiple columns, one per Channel, each flagging whether the article belonged to that Channel. The first task was to merge that information into a single column that identified the Channel for each article.
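As a rough illustration of that merge, here is a minimal pandas sketch. It assumes the flag columns share a common prefix such as `data_channel_is_` and hold 0/1 values; the helper name `merge_one_hot` and the exact column names are my own assumptions for this example, not necessarily what the original analysis used.

```python
import pandas as pd


def merge_one_hot(df, prefix):
    """Collapse 0/1 flag columns that share `prefix` into one label column.

    Rows where no flag is set come back as NaN, so missing labels
    stay visible instead of being silently assigned to a channel.
    """
    cols = [c for c in df.columns if c.startswith(prefix)]
    has_flag = (df[cols] > 0).any(axis=1)
    # idxmax returns the name of the column holding the row maximum;
    # for an all-zero row that is just the first column, hence the mask.
    labels = df[cols].idxmax(axis=1).str[len(prefix):]
    return labels.where(has_flag)


# Tiny made-up frame; the column names are assumed, not verified.
articles = pd.DataFrame({
    "data_channel_is_tech": [1, 0, 0],
    "data_channel_is_world": [0, 1, 0],
    "shares": [100, 200, 300],
})
articles["channel"] = merge_one_hot(articles, "data_channel_is_")
```

Counting the NaN entries in the new column (for example with `articles["channel"].isna().sum()`) is one way to spot articles with no Channel at all.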

Once that was done, it turned out that Channel information was missing for around 6,000 articles. This was too large a portion of the dataset to ignore, so the missing data had to be gathered. Luckily, this was a reasonably simple task thanks to the requests and BeautifulSoup Python packages.
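A sketch of what such a lookup could look like is below. The post does not say where on the page the Channel was found, so the `article:section` meta tag used here is purely an assumption for illustration, as are the function names `extract_channel` and `fetch_channel`. The parsing is kept separate from the download so it can be tried on a saved HTML string.

```python
import requests
from bs4 import BeautifulSoup


def extract_channel(html):
    """Return the channel named in the page's metadata, or None.

    Assumption: the page exposes its section in a
    <meta property="article:section" content="..."> tag.
    The real page layout may differ.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"property": "article:section"})
    if tag is None or not tag.get("content"):
        return None
    return tag["content"].strip().lower()


def fetch_channel(url, timeout=10):
    """Download an article page and extract its channel."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_channel(response.text)


sample = '<html><head><meta property="article:section" content="Tech"></head></html>'
channel = extract_channel(sample)
```

For a few thousand URLs it would be polite to pause between requests and to cache the results, so a failed run does not have to start over.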

As with the Channel information, the day of the week on which an article was published was also spread over multiple columns (7 in this case) and required similar merging into a single weekday column.
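For the weekday column it can help to keep the days in calendar order, so that later tables and plots do not sort them alphabetically. A small sketch, assuming the seven flag columns are named `weekday_is_monday` through `weekday_is_sunday` (an assumption on my part) and that exactly one of them is set per article:

```python
import pandas as pd

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]


def merge_weekday(df, prefix="weekday_is_"):
    """Collapse seven 0/1 weekday flags into one ordered categorical."""
    cols = [prefix + day for day in DAYS]
    # idxmax picks the flagged column; strip the prefix to get the day.
    names = df[cols].idxmax(axis=1).str[len(prefix):]
    return pd.Categorical(names, categories=DAYS, ordered=True)


# Three made-up articles published on Sunday, Monday and Friday.
flags = pd.DataFrame(0, index=range(3),
                     columns=["weekday_is_" + day for day in DAYS])
flags.loc[0, "weekday_is_sunday"] = 1
flags.loc[1, "weekday_is_monday"] = 1
flags.loc[2, "weekday_is_friday"] = 1
weekday = merge_weekday(flags)
```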

Partitioning — Low, Medium and High Popularity

Once a complete dataset was available, it was possible to split it into low, medium and high popularity entries, using some simple statistical techniques, based on the number of times each article had been shared.
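The exact technique is not spelled out here, so the following is only one plausible way to do such a split: cutting the share counts at their terciles with `pd.qcut`, which puts roughly a third of the articles into each partition. The function name `partition_by_shares` and the choice of terciles are assumptions for this sketch.

```python
import pandas as pd


def partition_by_shares(shares, labels=("low", "medium", "high")):
    """Split share counts into equally sized quantile bins.

    Assumption: quantile-based bins (terciles for three labels).
    Other cut points, such as fixed thresholds, are equally valid.
    """
    return pd.qcut(shares, q=len(labels), labels=list(labels))


shares = pd.Series([100, 200, 300, 900, 1500, 40000])
popularity = partition_by_shares(shares)
```

Quantile bins are robust to the very long tail that share counts tend to have, since a handful of viral articles cannot stretch the bin edges the way they would with equal-width bins.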

Distribution by Channel

Here’s a look at the distribution of the articles by Channel in each of the low, medium and high popularity partitions.

Tech and Entertainment articles do quite well in the medium and high popularity partitions, with Entertainment doing particularly well in the high popularity stakes. It is also noticeable that while there were not many articles on Culture, a large percentage of those wound up in the medium and high popularity partitions. This is probably why these are the only surviving Channels from the early 2010s.
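Both views used in that reading, raw counts and the percentage of a Channel's articles landing in each partition, can be tabulated with `pd.crosstab`. This is a sketch on made-up rows; the column names `channel` and `popularity` and the helper `distribution` are assumptions, and the same call works for a weekday column.

```python
import pandas as pd


def distribution(df, by, partition="popularity"):
    """Return (counts, row fractions) of `partition` within each `by` group."""
    counts = pd.crosstab(df[by], df[partition])
    # normalize="index" makes every row sum to 1, which shows what
    # fraction of a group's articles ended up in each partition.
    fractions = pd.crosstab(df[by], df[partition], normalize="index")
    return counts, fractions


# Made-up rows purely for illustration, not real results.
toy = pd.DataFrame({
    "channel": ["tech", "tech", "culture", "world", "world", "world"],
    "popularity": ["high", "low", "high", "low", "low", "medium"],
})
counts, fractions = distribution(toy, "channel")
```

The row fractions are what make a small Channel comparable with a large one, since raw counts alone would always favor the Channels with the most articles.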

Distribution by Day of Publishing

Here’s another look at the distribution of articles, this time based on the day of the week on which they were published.

In this case, the picture is a little boring in that most of the articles in all 3 partitions were published during the so-called “work week”, i.e. Mon-Fri. However, the weekend, especially Sunday, looks like a particularly good time to publish an article if you want to get into the medium or high popularity partitions.
