Scraping websites is a great way to get hold of rich data that you don't have to pay for, and that you can then use for some serious analysis.
The problem is that many websites have measures in place to detect and block scrapers, so you need to take a few precautions.
Firstly, you can rotate through different user agents, which makes it less obvious that every request is coming from the same script. Below, we have a list of Chrome and Firefox user agent strings, which I found online. We can then use the random module to pick one of those user agents from the list for each request.
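As a minimal sketch of that idea, the snippet below keeps a short illustrative list of user agent strings (in practice you would paste in a much longer list) and picks one at random to build the request headers. The `random_headers` helper name is my own, not from the original script.

```python
import random

# Illustrative Chrome and Firefox user agent strings; in practice
# you would use a longer list gathered from a public source.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def random_headers():
    """Return a headers dict with a user agent chosen at random."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

You would then pass `random_headers()` as the headers argument of whatever HTTP call your scraper makes, so each request can present a different user agent.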
The next thing is to add a delay into the script. If you hit tens of pages on a website within a second, you're going to be flagged as a scraper. Hence, we can use the sleep function from the time module to add a delay of a few seconds between each request.
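A simple way to sketch this, assuming a `fetch` callable that does the actual HTTP request (hypothetical here, so the pause logic can be shown on its own):

```python
import time

def fetch_all(urls, fetch, delay_seconds=3):
    """Fetch each URL via the supplied callable, sleeping between
    requests so we don't hammer the server with rapid-fire hits."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)  # the x-second delay between reads
    return results
```

Randomising the delay (for example with `random.uniform(2, 5)`) makes the traffic pattern look even less mechanical than a fixed interval.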
We then do a bit of data wrangling to remove the HTML tags and split the response at a known marker. This leaves us with just the scraped content that we require.
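A rough sketch of that wrangling step, using a regex to strip tags and a split on an assumed marker string (the marker and helper name are illustrative, not from the original script; a parser such as BeautifulSoup is more robust on messy markup):

```python
import re
from html import unescape

def strip_tags(html_text):
    """Crudely remove HTML tags, then decode entities like &amp;."""
    return unescape(re.sub(r"<[^>]+>", "", html_text))

page = "<p>Article body here &amp; more.</p><div>Footer starts here</div>"
text = strip_tags(page)
# Keep only the content before a marker we know precedes the footer.
article = text.split("Footer starts here")[0].strip()
```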
Now that we have that content, we can use pysummarization to extract the key points from the article. The code below is taken directly from their documentation and dropped into my script; it works perfectly and reduces the amount of text significantly. As it's covered well in their own documentation, I won't go over it again here, so you can find out more on their website.