Data Scraping for Investigative Journalism
Some say that ‘knowledge is power’. But to others, knowledge is everything: it forms the basis of their entire career.
Take investigative journalists, for example: they require current, accurate and as-yet-unknown information to expose the next big story, yet it’s nearly impossible to source this using the same methods as everyone else.
A good investigative journalist needs to widen their search and find ways to be as thorough as possible. But this can result in information overload, with a lot of long hours and some killer eye strain.
Now, a better investigative journalist would figure out a way to make this mind-numbing, seemingly infinite task as efficient as possible. So how do we do that? Well, you can actually set up automated programs to do all that tedious searching for you.
If you have experience in the IT industry or with programming languages such as SQL or Python (or feel up to exploring a handful of YouTube videos to learn), you can set up low-cost web scrapers, or ‘spiders’, to crawl the web and collect any relevant information. These programs can then sort through what they’ve gathered and build a record of the data most applicable to your next article.
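To make that concrete, here is a minimal sketch of the scraping step using only Python’s standard library. A real spider would fetch live pages (for example with `urllib.request` or the popular `requests` library) and obey each site’s terms and robots.txt; the sample page, the `h2` tag and the `headline` class below are all invented for illustration.

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collects the text of every <h2 class="headline"> element on a page."""
    def __init__(self):
        super().__init__()
        self.headlines = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs parsed from the tag
        if tag == "h2" and ("class", "headline") in attrs:
            self._capture = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.headlines.append(data.strip())

# In a real spider you would download this HTML from a target site;
# here we parse a hard-coded sample page instead.
SAMPLE_PAGE = """
<html><body>
  <h2 class="headline">Council votes to rezone waterfront</h2>
  <p>Story text goes here.</p>
  <h2 class="headline">Budget figures released early</h2>
</body></html>
"""

scraper = HeadlineScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.headlines)
```

Point the same parsing logic at hundreds of downloaded pages and you have the ‘crawl and collect’ step the paragraph above describes, without any manual clicking.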
If you’re a little scared of messing with your computer, or don’t have the time to set up your own spider, that’s okay! There’s a solution for you too! You can easily hire a professional data scraper to manage all that fuss for you. Using a pro-geek has its own benefits too. You’ll be able to work with some trickier sites that might be too difficult to scrape on your own, giving you that powerful knowledge, and an edge over competing writers.
According to a 2017 study by Google, “42% of reporters use data to tell stories regularly (twice or more per week)” as an element of transparency, persuasion and accuracy. It is such a central part of the industry that “51% of all news organizations in the U.S. and Europe now have a dedicated data journalist – and this rises to 60% for digital-only platforms”.
Using data in investigative journalism
But how does it really work in the reporting industry? Well, investigative journalist Paul Bradshaw explains it quite nicely in the diagram below.
This illustration explains the process of extracting data from sources and turning it into an article. Bradshaw goes into beautiful detail, in understandable language, on his Online Journalism Blog, but we’ll run through an even simpler version here:
1. Compile: this is the process of finding and collecting the data you need for your story. In data-scraping terms, this is where spiders crawl the web searching for whatever you need to write your article. It forms the basis of your entire piece.
2. Clean: once you have all your data, it can often be quite messy and difficult to analyze: different formats, duplicates where several sites have republished the same figures, or plain human error. To make the data genuinely usable, you can set up programs to convert it all into a single format and weed out the duplicates, leaving something clean enough to work through.
3. Context: not all data is reliable or accurate. It can be biased and tainted. So it is important to go through the data and ask how it was gathered, when, and by whom? For what purpose, and what point is it trying to prove? While analyzing your information, you might decide to gather more data if you find something interesting you want to follow, or something suspicious.
4. Combine: sometimes data is more powerful when paired with even more data. If one source is strong enough, you might let it stand on its own for a specific impact, but every now and then, bringing multiple sources together has an even better effect. This is the part of the process where you decide what data goes with what (names, statistics, locations, and so on) and start to get a feel for the shape of your piece.
5. Communicate: how are you going to communicate your data to your audience? Will you visualize it in a chart or a map, or will you use it in your writing as a major point in your argument? How can you personalize it to your readers?
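The ‘clean’ and ‘combine’ steps above are the most mechanical, so they are the easiest to hand to a script. Here is a small sketch in Python: the company records, field names and date formats are all made up for illustration, but the moves (normalising formats, dropping duplicates, then joining two sources on a shared key) are exactly what those steps describe.

```python
# Hypothetical raw records scraped from two different sources: the name
# capitalisation and date formats vary, and one row is a duplicate.
contracts = [
    {"company": "ACME Ltd", "date": "2017-03-01", "value": "12000"},
    {"company": "acme ltd", "date": "01/03/2017", "value": "12000"},
    {"company": "Foo Corp", "date": "2017-05-10", "value": "8000"},
]
donations = [
    {"company": "acme ltd", "donation": "5000"},
]

def clean(rows):
    """Step 2 (clean): normalise names and dates, then drop duplicates."""
    seen, out = set(), []
    for row in rows:
        company = row["company"].strip().lower()
        date = row["date"]
        if "/" in date:                      # convert DD/MM/YYYY to ISO
            d, m, y = date.split("/")
            date = f"{y}-{m}-{d}"
        key = (company, date, row["value"])
        if key not in seen:                  # same record seen twice? skip it
            seen.add(key)
            out.append({"company": company, "date": date,
                        "value": int(row["value"])})
    return out

def combine(contract_rows, donation_rows):
    """Step 4 (combine): join the two sources on the company name."""
    gifts = {r["company"]: int(r["donation"]) for r in donation_rows}
    return [dict(r, donation=gifts.get(r["company"], 0)) for r in contract_rows]

merged = combine(clean(contracts), donations)
print(merged)
```

After cleaning, the duplicate ACME row collapses into one, and the join reveals that the same company holding a contract also made a donation, which is precisely the kind of pairing that turns two dull datasets into a story lead.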
In summary, data is essential to the news industry, and web scraping is just one way to make things drastically easier.