Scraping COVID-19 data from websites using Beautiful Soup

This article gives an overview of how to extract data from a website through screen scraping. Specifically, it focuses on the UK government’s open data website.

Scraping websites is a contentious topic: some websites don’t mind you doing it, while others would really rather you didn’t and put measures in place to try to stop you. In this case, the gov.uk website is OK with web scraping, as long as you do it in a reasonable manner. Please review the government website for changes to their policy before attempting this yourself.
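
One common way to check what a site allows is its robots.txt file. The snippet below is only a rough sketch of how Python's built-in urllib.robotparser reads such rules. The robots.txt content, the example.org URLs and the "my-scraper" user agent are all made up for illustration and are not taken from gov.uk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real gov.uk file
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network call is needed here
parser.parse(SAMPLE_ROBOTS.splitlines())

# "my-scraper" is a placeholder user agent name
allowed_home = parser.can_fetch("my-scraper", "https://example.org/")
allowed_private = parser.can_fetch("my-scraper", "https://example.org/private/page")
delay = parser.crawl_delay("my-scraper")

print(allowed_home, allowed_private, delay)
```

For a real site you would point the parser at the live file with set_url() and read() instead, and still read the site's written policy, since robots.txt does not cover everything.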

I should note that it’s easy to download the data from the government website using their API. The snippet of code below gives you exactly what you need.

import pandas as pd
pd.read_csv('https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=newAdmissions&metric=newCasesByPublishDate&metric=newDeaths28DaysByPublishDate&format=csv')
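
The long URL above repeats the metric parameter once per metric. As a small sketch, it could be assembled from a list instead, which makes it easier to add or drop metrics. The build_url helper and its argument names are my own invention for illustration; only the base URL and parameter values come from the snippet above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.coronavirus.data.gov.uk/v2/data"
metrics = ["newAdmissions", "newCasesByPublishDate", "newDeaths28DaysByPublishDate"]


def build_url(metrics, area_type="overview", fmt="csv"):
    """Hypothetical helper: build the query string with one 'metric' entry per metric."""
    params = [("areaType", area_type)]
    params += [("metric", m) for m in metrics]
    params.append(("format", fmt))
    return f"{BASE_URL}?{urlencode(params)}"


url = build_url(metrics)
print(url)
```

The resulting string can be passed to pd.read_csv in exactly the same way as the hard-coded URL.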


However, sometimes easy isn’t very fun or interesting! Below, I will describe how to scrape the same data from the government website. We start off by importing the libraries that we will need to complete our analysis. The two required for the web scraping are Beautiful Soup and requests.

The requests library is used to make HTTP calls and pull back big blocks of HTML, and the Beautiful Soup library is used to extract useful data from that HTML.

from bs4 import BeautifulSoup
import requests
import pandas as pd 
from datetime import datetime

Now we make an HTTP GET request to pull the page content into a variable called page. We then use the Beautiful Soup HTML parser to parse the page content into a variable called soup.

If we printed our page variable, it would simply show the response object and its status code (e.g. <Response [200]>), while the soup variable contains a whole lot of HTML.

page = requests.get("https://coronavirus.data.gov.uk/")
soup = BeautifulSoup(page.content, 'html.parser')

Upon analysing the page, I identified that all of our figures live within div elements with the class ‘total-figure2’. To find this, I simply right-clicked on the area of the page I was interested in and selected ‘Inspect’; you can pull the information you need from here.


Now that we know which class we are looking for, we can find all instances of it using the command below. I’ve stored the result in a variable called x.

x = soup.find_all("div", {"class": "total-figure2"})

If we print x, we see a list. Each item in the list is an occurrence of the class total-figure2.
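
To give a feel for what that list looks like without hitting the live site, here is a minimal sketch run against a tiny piece of made-up markup. The HTML is heavily simplified and hypothetical; the real page is far more complex.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page
html = """
<div class="heading total-figure2"><a href="#">19,202</a></div>
<div class="heading total-figure2"><a href="#">156,771</a></div>
<div class="heading other">ignored</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches any div whose class list contains total-figure2
x = soup.find_all("div", {"class": "total-figure2"})

print(len(x))  # number of matching divs
print(x[1])    # the second match, still as HTML

# The visible text of each match, with surrounding whitespace stripped
texts = [div.get_text(strip=True) for div in x]
print(texts)
```

The third div is skipped because it does not carry the class we asked for.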

If we were to print the second item in the list, x[1], we would see the markup below. It has the class total-figure2 and, within it, the value 156,771, which is the value we want to extract.

<div class="float govuk-heading-m govuk-!-margin-bottom-0 govuk-!-padding-top-0 total-figure2"> <a class="govuk-link--no-visited-state number-link" href="#" onclick="showHelp()">156,771 <span class="tooltiptext govuk-!-font-size-16"> Total number of people tested positive reported in the last 7 days (28 January 2021 – 3 February 2021)</span></a> </div>

To extract specific data from the webpage, we can apply some string functions to each of the items in our list. Below, I use string splits to select the exact pieces I need.

yesterday_cases = int(str(x[0]).split('href="#">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_cases = int(str(x[1]).split('howHelp()">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase = int(str(x[2]).split('(180deg)" width="12px"/>')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_pct = float(str(x[2]).split('<span class="govuk-!-font-weight-regular">(')[1].split(')')[0].replace(' ', '').replace(',','').replace('%',''))

I now have yesterday’s cases, the weekly cases and some data on the weekly change, which leaves us with four numbers and none of the clunky HTML content.

print(yesterday_cases, weekly_cases, weekly_increase, weekly_increase_pct)
19202 156771 -52530 -25.1
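
The string splits above depend on the exact attribute text in the markup. As an alternative sketch, the same number can be pulled out by walking the parsed tree instead. This uses the weekly cases markup printed earlier and assumes the figure is always the first piece of text inside the link, which I have only checked against that one sample.

```python
from bs4 import BeautifulSoup

# The weekly cases markup shown earlier in this post
html = """<div class="float govuk-heading-m govuk-!-margin-bottom-0 govuk-!-padding-top-0 total-figure2"> <a class="govuk-link--no-visited-state number-link" href="#" onclick="showHelp()">156,771 <span class="tooltiptext govuk-!-font-size-16"> Total number of people tested positive reported in the last 7 days (28 January 2021 – 3 February 2021)</span></a> </div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="total-figure2")
link = div.find("a")

# stripped_strings yields each text fragment in document order, whitespace removed;
# the first fragment is the figure, the second is the tooltip text
raw = next(link.stripped_strings)
weekly_cases = int(raw.replace(",", ""))

tooltip = link.find("span").get_text(strip=True)

print(weekly_cases)
print(tooltip)
```

This approach tends to survive small attribute changes better than splitting on literal strings, although it will still break if the page structure itself changes.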

We can do similar things for the deaths, admissions and tests from the website. The HTML for each is very slightly different. The full code is:

from bs4 import BeautifulSoup
import requests
import pandas as pd 
from datetime import datetime
page = requests.get("https://coronavirus.data.gov.uk/")
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.find_all("div", {"class": "total-figure2"})
'''
CASE DATA
'''
yesterday_cases = int(str(x[0]).split('href="#">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_cases = int(str(x[1]).split('howHelp()">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase = int(str(x[2]).split('(180deg)" width="12px"/>')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_pct = float(str(x[2]).split('<span class="govuk-!-font-weight-regular">(')[1].split(')')[0].replace(' ', '').replace(',','').replace('%',''))
'''
DEATHS DATA
'''
yesterday_deaths = int(str(x[3]).split('number-link number" href="#">')[1].split('<span class="tooltiptext govuk')[0].replace(' ', '').replace(',',''))
weekly_deaths = int(str(x[4]).split('howHelp()">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_deaths = int(str(x[5]).split('(180deg)" width="12px"/>')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_pct_deaths = float(str(x[5]).split('<span class="govuk-!-font-weight-regular">(')[1].split(')')[0].replace(' ', '').replace(',','').replace('%',''))
'''
ADMISSIONS DATA
'''
yesterday_admissions = int(str(x[6]).split('number-link number" href="#">')[1].split('<span class="tooltiptext govuk')[0].replace(' ', '').replace(',',''))
weekly_admissions = int(str(x[7]).split('howHelp()">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_admissions = int(str(x[8]).split('(180deg)" width="12px"/>')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_pct_admissions = float(str(x[8]).split('<span class="govuk-!-font-weight-regular">(')[1].split(')')[0].replace(' ', '').replace(',','').replace('%',''))
'''
VIRUS TEST DATA
'''
yesterday_tests = int(str(x[9]).split('number-link number" href="#">')[1].split('<span class="tooltiptext govuk')[0].replace(' ', '').replace(',',''))
weekly_tests = int(str(x[10]).split('howHelp()">')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_tests = int(str(x[11]).split('rotate(0deg)" width="12px"/>')[1].split('<span')[0].replace(' ', '').replace(',',''))
weekly_increase_pct_tests = float(str(x[11]).split('<span class="govuk-!-font-weight-regular">(')[1].split(')')[0].replace(' ', '').replace(',','').replace('%',''))

This gives me the below outputs:

print('cases')
print(yesterday_cases, weekly_cases, weekly_increase, weekly_increase_pct)
print('deaths')
print(yesterday_deaths, weekly_deaths, weekly_increase_deaths, weekly_increase_pct_deaths)
print('admissions')
print(yesterday_admissions, weekly_admissions, weekly_increase_admissions, weekly_increase_pct_admissions)
print('tests')
print(yesterday_tests, weekly_tests, weekly_increase_tests, weekly_increase_pct_tests)
cases
19202 156771 -52530 -25.1
deaths
1322 7448 -1149 -13.4
admissions
2651 20946 -5990 -22.2
tests
606382 4450020 477991 12.0

We can compare this against a picture of the website and it does look to be correct!
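
The full code repeats the same split-and-replace chain sixteen times. One possible tidy-up, sketched below, is to fold that chain into a small function. The extract_number name and its parameters are hypothetical, and the two sample strings are shortened stand-ins I wrote to mimic the markup, not copies of the real page.

```python
def extract_number(html, start, end="<span", cast=int):
    """Return the number found between the first `start` marker and the next `end` marker."""
    raw = html.split(start)[1].split(end)[0]
    cleaned = raw.replace(" ", "").replace(",", "").replace("%", "")
    return cast(cleaned)


# Shortened, hypothetical markup in the same shape as the scraped items
sample = '<a class="number-link" href="#" onclick="showHelp()">156,771 <span class="tooltiptext">...</span></a>'
pct_sample = '<span class="govuk-!-font-weight-regular">(-25.1%)</span>'

weekly_cases = extract_number(sample, 'howHelp()">')
weekly_increase_pct = extract_number(
    pct_sample,
    '<span class="govuk-!-font-weight-regular">(',
    end=")",
    cast=float,
)

print(weekly_cases, weekly_increase_pct)
```

With a helper like this, each figure becomes a one-line call such as extract_number(str(x[1]), 'howHelp()">'), and any change to the cleaning rules only needs to be made in one place.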

Now we can save the results to a CSV, which makes for easy analysis of trends etc.

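today = datetime.today().strftime('%Y-%m-%d')
# Collect today's date and every scraped figure, in the same order as the column names below
data = [today, yesterday_cases, weekly_cases, weekly_increase, weekly_increase_pct, yesterday_deaths, weekly_deaths, weekly_increase_deaths, weekly_increase_pct_deaths,
        yesterday_admissions, weekly_admissions, weekly_increase_admissions, weekly_increase_pct_admissions, yesterday_tests, weekly_tests, weekly_increase_tests, weekly_increase_pct_tests]
df = pd.DataFrame(data).T
df.columns = ['date', 'yday_cases', 'weekly_cases', 'cases_change', 'cases_change_pct', 'yday_deaths', 'weekly_deaths', 'deaths_change', 'deaths_change_pct','yday_admissions', 'weekly_admissions', 'admissions_change', 'admissions_change_pct','yday_tests', 'weekly_tests', 'tests_change', 'tests_change_pct']
df.to_csv('covid_data.csv')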
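
Note that to_csv overwrites the file on every run, so to analyse trends you would need to keep one row per day. Below is a rough sketch of one way to do that with the standard csv module. The append_row helper, the shortened column list and the covid_history.csv file name are my own assumptions, and the second row contains placeholder values rather than real figures.

```python
import csv
import tempfile
from pathlib import Path


def append_row(path, columns, row):
    """Hypothetical helper: append one row, writing the header only for a new or empty file."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(columns)
        writer.writerow(row)


# Shortened column list for illustration
columns = ["date", "yday_cases", "weekly_cases"]

# A temporary directory keeps this demo from touching real files
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "covid_history.csv"
    append_row(out, columns, ["2021-02-04", 19202, 156771])
    append_row(out, columns, ["2021-02-05", 0, 0])  # placeholder values, not real data
    lines = out.read_text(encoding="utf-8").splitlines()

print(lines)
```

Running the scraper once a day and appending in this way builds up a history that pandas can read back with pd.read_csv.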

As mentioned above, please do ensure that the website you are going to scrape is OK with you scraping it. Furthermore, please note that the code I provide on this website is largely untested and you use it at your own risk.
