Extracting and cleaning data from the Twitter API in Python

I’ve heard from a tonne of people that they don’t read the news because it’s negative, and the constant presence of articles about assault, killings and other terrible things brings their mood down. So I started thinking: how could we filter the news to show only the things we want to see?

The way I see it, we have two options:

  1. Filter out specific keywords which people may find offensive or upsetting
  2. Use machine learning to identify the positive and negative articles

Let’s talk about option 2. Take a Naive Bayes algorithm, for example. It works by looking at how frequently each word appears in the articles labeled as positive and how frequently it appears in those labeled as negative. We can then compare the probability of a word appearing in positive news with the probability of it appearing in negative news. However, the really key thing here is that we would need the article, all of it, and I have not been able to find a source for that.
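
To make the word-frequency idea concrete, here is a minimal sketch of a Naive Bayes classifier in plain Python. It is not something I have run on real articles: the four training headlines are made up, the function names `train` and `predict` are my own, and add-one (Laplace) smoothing is an assumption I have added so that unseen words don’t produce a zero probability.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of documents with that label
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def predict(text, word_counts, doc_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the prior: how common this label is overall.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy data, purely for illustration.
TRAINING = [
    ("team wins cup final", "positive"),
    ("charity raises record funds", "positive"),
    ("man killed in assault", "negative"),
    ("fatal crash closes motorway", "negative"),
]
model = train(TRAINING)
print(predict("record cup wins", *model))
print(predict("assault on motorway", *model))
```

With only a headline to go on, there are very few words to count, which is exactly why the full article text matters for this approach.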

With all the news APIs that I have found on the web, you can get the article title, link and description, but you can’t get the full article text. Hence, I have a new approach:

  1. Extract the tweets from news companies from Twitter and filter by certain keywords
  2. Loop through the comments of each tweet & assess sentiment to give a negativity score

In this article, I am going to run through how we extract the tweets and filter them based on our keywords. Then, in a follow-up article, we’ll look at how we would extract the comments and run sentiment analysis on them.

So here we go then. The first step, as always, is to import our required libraries. We then set up our Twitter API details, which you can obtain by signing up as a Twitter developer and creating a Twitter app.
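
As a rough sketch of that setup step, assuming the Tweepy library is the client in use (nothing above names one) and that the four keys from your Twitter app live in environment variables. The variable names and the helpers `load_credentials` and `build_api` are hypothetical.

```python
import os

# Hypothetical environment variable names; use whatever you stored yours under.
CREDENTIAL_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Read the four app credentials, failing loudly if any are missing."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}


def build_api(creds):
    """Create an authenticated client (assumes Tweepy is installed)."""
    import tweepy  # imported here so the module loads even without Tweepy

    auth = tweepy.OAuthHandler(
        creds["TWITTER_CONSUMER_KEY"], creds["TWITTER_CONSUMER_SECRET"]
    )
    auth.set_access_token(
        creds["TWITTER_ACCESS_TOKEN"], creds["TWITTER_ACCESS_TOKEN_SECRET"]
    )
    return tweepy.API(auth)
```

You would then call `api = build_api(load_credentials())` once at the top of the script. Keeping the keys out of the source file means you can share the code without leaking them.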

Next, we are going to use the user_timeline call to pull the last 200 tweets from both BBCSport and BBCNews. I won’t include retweets, as I don’t want to see articles retweeted by BBCNews or BBCSport. I’ll also set tweet_mode to extended so that the tweet text isn’t truncated.
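
A minimal sketch of that call, assuming `api` is an authenticated Tweepy `API` object. The wrapper name `fetch_timeline` is hypothetical, and `_json` is the raw payload Tweepy attaches to each status; it is a private attribute, so treat relying on it as a convenience rather than a guarantee.

```python
def fetch_timeline(api, screen_name, count=200):
    """Return the raw JSON dicts for an account's most recent tweets."""
    statuses = api.user_timeline(
        screen_name=screen_name,
        count=count,            # 200 is the most a single call returns
        include_rts=False,      # leave out retweets
        tweet_mode="extended",  # full text lives in "full_text", untruncated
    )
    # Keep only the raw payload so it can be normalized into a table later.
    return [status._json for status in statuses]
```

Usage would look like `sport = fetch_timeline(api, "BBCSport")` and `news = fetch_timeline(api, "BBCNews")`. One thing to be aware of: retweets are filtered out after the count is applied, so you may get fewer than 200 tweets back.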

Then, we take the output data from each and normalize the JSON into a Pandas dataframe. We then union (concatenate) the BBCSport and BBCNews dataframes to get one big table.
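
Here is a small sketch of the normalize-and-concatenate step using `pd.json_normalize` and `pd.concat`. The sample tweets are made up and heavily trimmed, and the helper name `timelines_to_frame` is hypothetical.

```python
import pandas as pd


def timelines_to_frame(*timelines):
    """Flatten each list of tweet dicts and stack them into one dataframe."""
    frames = [pd.json_normalize(timeline) for timeline in timelines]
    # ignore_index gives the combined table a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Made-up sample payloads standing in for the real API output.
sport = [
    {"id": 1, "full_text": "Cup final tonight", "user": {"screen_name": "BBCSport"}},
]
news = [
    {"id": 2, "full_text": "Budget announced", "user": {"screen_name": "BBCNews"}},
    {"id": 3, "full_text": "Storm warning issued", "user": {"screen_name": "BBCNews"}},
]

tweets = timelines_to_frame(sport, news)
print(tweets.columns.tolist())
```

Nested dictionaries become dotted column names such as `user.screen_name`, while nested lists are left as lists inside a single cell.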

Now, we’re going to run a load of apply functions on our dataframe to give us YES/NO answers for specific types of news. This will allow us to have ‘switches’ so the user can turn on/off certain types of news that they specifically don’t want to see.
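
As an illustration of those switches, here is a sketch that adds one YES/NO column per news type. The topic names, keyword lists and helper functions are all hypothetical; swap in whatever categories you care about. It assumes the tweet text is in a column called `full_text`.

```python
import re

import pandas as pd

# Hypothetical topics and keywords, purely for illustration.
TOPIC_KEYWORDS = {
    "is_crime": ["assault", "murder", "killed", "stabbing"],
    "is_politics": ["election", "parliament", "minister"],
}


def flag_topic(text, keywords):
    """Return "YES" if any keyword appears in the text as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(word) for word in keywords) + r")\b"
    return "YES" if re.search(pattern, text, flags=re.IGNORECASE) else "NO"


def add_topic_flags(df, text_column="full_text"):
    """Add one YES/NO column per topic without modifying the input frame."""
    out = df.copy()
    for column, keywords in TOPIC_KEYWORDS.items():
        out[column] = out[text_column].apply(lambda text: flag_topic(text, keywords))
    return out


sample = pd.DataFrame(
    {"full_text": ["Man killed in assault", "Minister resigns", "Skilled striker signs"]}
)
flagged = add_topic_flags(sample)
print(flagged)
```

The word boundaries in the pattern stop "Skilled" from matching "killed". Hiding a news type is then a one-line filter, for example `flagged[flagged["is_crime"] == "NO"]`.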

As the output has a tonne of columns, we’ll select just a few of them and rename them to something easier to use.
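
A sketch of the select-and-rename step follows. The source column names are the ones I would expect from normalizing the Twitter payload, but which columns you keep and the new names are my own choices, so treat the mapping as an assumption.

```python
import pandas as pd

# Assumed mapping: normalized Twitter column -> friendlier name.
COLUMN_NAMES = {
    "created_at": "posted_at",
    "user.screen_name": "account",
    "full_text": "text",
    "entities.urls": "urls",
    "entities.media": "media",
}


def select_columns(df):
    """Keep only the mapped columns, in order, under their new names."""
    # reindex adds any missing column filled with NaN, which helps when
    # none of the tweets carried media and the column never appeared.
    return df.reindex(columns=list(COLUMN_NAMES)).rename(columns=COLUMN_NAMES)


# Made-up single-row sample with an extra column and no media column.
raw = pd.DataFrame(
    {
        "id": [1],
        "created_at": ["Mon Jan 01 09:00:00 +0000 2024"],
        "full_text": ["Cup final tonight"],
        "user.screen_name": ["BBCSport"],
        "entities.urls": [[]],
        "lang": ["en"],
    }
)
tidy = select_columns(raw)
print(tidy.columns.tolist())
```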

We can now clean up the data to extract the URL, text and media in the format that we need.
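
What the cleanup looks like depends on the format you need, so the following is only one possible sketch. It assumes columns named `text`, `urls` and `media`, where `urls` and `media` hold the lists of dictionaries Twitter returns, and it keeps the first expanded link, the first media URL, and the text with any links stripped out. The sample URLs are made up.

```python
import re

import pandas as pd


def first_value(items, key):
    """Return items[0][key] if items is a non-empty list, else None."""
    if isinstance(items, list) and items:
        return items[0].get(key)
    return None


def strip_links(text):
    """Remove links from the tweet text and tidy up leftover whitespace."""
    return " ".join(re.sub(r"https?://\S+", "", text).split())


def clean_tweets(df):
    """Return a copy with plain url, media and text columns."""
    out = df.copy()
    out["url"] = out["urls"].apply(lambda items: first_value(items, "expanded_url"))
    out["media"] = out["media"].apply(
        lambda items: first_value(items, "media_url_https")
    )
    out["text"] = out["text"].apply(strip_links)
    return out.drop(columns=["urls"])


# Made-up rows: one with a link and an image, one with neither.
sample = pd.DataFrame(
    {
        "text": ["Cup final tonight https://t.co/abc123", "Budget announced"],
        "urls": [[{"expanded_url": "https://www.bbc.co.uk/sport/example"}], []],
        "media": [
            [{"media_url_https": "https://pbs.twimg.com/media/example.jpg"}],
            float("nan"),
        ],
    }
)
clean = clean_tweets(sample)
print(clean)
```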

Finally, let’s drop the duplicates. It’s possible that BBCNews and BBCSport sometimes post the same article, and we don’t want to include it twice.
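
A sketch of the de-duplication, assuming the article link is in a column called `url` and that two tweets pointing at the same link count as the same article. That definition of "duplicate" is my assumption, and the helper name and sample rows are made up.

```python
import pandas as pd


def drop_duplicate_articles(df):
    """Keep the first tweet per article URL; keep every tweet with no URL."""
    has_url = df["url"].notna()
    deduped = df[has_url].drop_duplicates(subset="url", keep="first")
    # Tweets without a link are not compared, otherwise they would all
    # be treated as duplicates of each other.
    combined = pd.concat([deduped, df[~has_url]]).sort_index()
    return combined.reset_index(drop=True)


# Made-up rows: the first two point at the same article.
sample = pd.DataFrame(
    {
        "account": ["BBCNews", "BBCSport", "BBCSport", "BBCNews"],
        "url": [
            "https://www.bbc.co.uk/sport/example",
            "https://www.bbc.co.uk/sport/example",
            None,
            None,
        ],
    }
)
unique = drop_duplicate_articles(sample)
print(unique)
```

Passing a `subset` also sidesteps a common gotcha: calling `drop_duplicates()` across every column fails if any column still holds lists, because lists are not hashable.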

That’s it! We’ve got our Twitter data, we’ve tidied it up and in an upcoming article, we’ll look at assessing comments for sentiment.