Analysing global news headlines with NLTK’s Vader library.
Sentiment analysis is a branch of Natural Language Processing that allows data scientists to extract human emotion from a corpus of text.
It is a valuable area of machine learning because it gives insight into what people have published, and with it a better understanding of the digital ecosystem.
In this project I have used sentiment analysis to provide a real-time snapshot of the state of the world news based on three online news publications.
The headlines were extracted with Beautiful Soup from The Guardian world news page, the BBC world news page and The Times home page.
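import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news/world'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
# find the first <h3> tag that carries the headline class
right_text = soup.find('h3', class_="gs-c-promo-heading...")
bbc_list = []
# append() returns None, so its result must not be assigned back to the list
bbc_list.append(right_text.text)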
I did this by inspecting the relevant page and targeting the HTML tag that held the headline I wanted to extract.
Once I had the headline in string format I applied the NLTK Vader module. I chose Vader for two reasons. First, it was designed to perform well on social media text; headlines are not social media, but they are short and snappy like Tweets. Second, it processes text very quickly, which suits this project as we are looking for instant feedback.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# requires the Vader lexicon: nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
bbc_string = str(bbc_list)
bbc_world = sid.polarity_scores(bbc_string)['compound']
# print the headline
print(bbc_string)
# output: ["Australia 'not intimidated' by Facebook news ban"]
# print the compound score
print(bbc_world)
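Passing `str(bbc_list)` to Vader also feeds it the list's brackets and quote marks. One possible alternative, sketched below, is to score each headline on its own. The helper names `make_analyzer` and `score_headlines` are illustrative and not part of the original project, and the sketch assumes the Vader lexicon has already been downloaded with `nltk.download('vader_lexicon')`.

```python
def make_analyzer():
    """Build a Vader analyser (imported here so the scorer below stays dependency-free)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def score_headlines(headlines, analyzer):
    """Return a {headline: compound score} mapping, scoring one headline at a time."""
    return {
        headline: analyzer.polarity_scores(headline)["compound"]
        for headline in headlines
    }
```

With the list built earlier, this would be called as `score_headlines(bbc_list, make_analyzer())`.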
Vader provides polarity scores that break the text down and report how positive or negative it is. The positive, negative and neutral scores represent the proportion of the text that falls into each bucket, and the three add up to 1. The compound score is the sum of the lexicon ratings of the words in the text, normalised to lie between -1 (most negative) and +1 (most positive).
I am focusing on the compound score because it summarises the text in a single value, which makes for easy comparison across the publications and for an overall score.
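The normalisation step can be sketched as follows. To the best of my knowledge this mirrors the formula in the Vader implementation, x / sqrt(x² + α) with α = 15, but the function name `normalise` is my own and the real library also adjusts the raw sum for things like punctuation and capitalisation before normalising.

```python
import math


def normalise(raw_sum, alpha=15):
    """Squash an unbounded sum of word ratings into the open range (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# A raw sum of zero stays at zero; large sums approach but never reach +/-1.
neutral = normalise(0)
strongly_positive = normalise(20)
strongly_negative = normalise(-20)
```

The larger the raw sum, the closer the compound score gets to the ends of the scale, so a handful of strongly rated words can dominate a short headline.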
Once I had the compound scores for all three publications I plotted them to get a clear picture of the headlines’ sentiments.
For good measure, I took the average of the three scores to give an overall measure of the sentiment of the global news.
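As a minimal sketch of the averaging step, the snippet below takes one compound score per publication and reduces them to a single value and label. The scores are made up for illustration and are not results from this project; the function name `overall_sentiment` is hypothetical, and the ±0.05 cut-off is the convention commonly suggested for Vader compound scores rather than something this project defines.

```python
from statistics import mean

# Hypothetical compound scores, one per publication (not real results).
compound_scores = {"BBC": 0.2, "The Guardian": -0.3, "The Times": 0.4}


def overall_sentiment(scores, threshold=0.05):
    """Average the compound scores and attach a coarse label."""
    average = mean(list(scores.values()))
    if average >= threshold:
        label = "positive"
    elif average <= -threshold:
        label = "negative"
    else:
        label = "neutral"
    return average, label


average, label = overall_sentiment(compound_scores)
```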
Given the times we live in, that’s a good news day!
Points to note:
- Vader is designed to analyse social media posts, not headlines. Even though the two share some characteristics, they are produced with different intentions. Social media is more emotive: people are communicating with their friends on topics they feel passionate about. Headlines are designed to inform readers in a succinct, matter-of-fact way. So Vader may not be the right library for this project.
- Headlines are very short and provide very little information on the subject they refer to. It might be better to include an extract from the article itself so that Vader can analyse the subject in more detail.
- A compound score of 0 comes up often because headlines are often neutral. This makes sense if the press is well balanced, but it may be better to look at all of the polarity scores rather than just the compound score.
- Only three publications have been used in this project. To get a better understanding of the state of the global news we need far more information than this.
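On the third point, one way to look at all of the polarity scores would be to average each of them across the headlines, as in the sketch below. The function name `average_polarity` is hypothetical and the example dictionaries are made up; they only copy the shape of what `polarity_scores` returns.

```python
def average_polarity(score_dicts):
    """Average each Vader polarity key across a list of score dictionaries."""
    if not score_dicts:
        raise ValueError("need at least one set of scores")
    keys = ("neg", "neu", "pos", "compound")
    return {
        key: sum(scores[key] for scores in score_dicts) / len(score_dicts)
        for key in keys
    }


# Made-up scores in the shape returned by polarity_scores (not real results).
example_scores = [
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.5},
]
summary = average_polarity(example_scores)
```

A summary like this shows how much of the text was neutral as well as which way the rest leaned, which a compound score of 0 on its own cannot do.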
Next steps: This project is very scalable and customisable. As an extension, I am planning to create a web application where anyone can click one button and get a score that represents the sentiment of the global news at that moment in time.
Thank you for reading this article. If you would like to get in touch, please connect with me on LinkedIn.
GitHub repo for code.