How to: plan a journalism project that needs data entry
This was first published on the Online Journalism Blog. Read it there if you prefer to avoid the Medium ‘free articles’ nonsense.
Data-driven reporting regularly involves some form of data entry — some of the stories I’ve been involved with, for example, have included entering information from Freedom of Information (FOI) requests, compiling data from documents such as companies’ accounts, or working with partners to collect information from a range of sources.
But you’ll rarely hear the challenges of managing these projects discussed in resources on data journalism.
Last week I delivered a session on exactly those challenges to a factchecking team in Albania, so I thought it might be useful to share the tips from that session here.
They include some steps to take to reduce the likelihood of problems arising, while also helping to ensure a data entry project takes as little time as possible.
The challenges of data entry projects
First, it’s worth outlining just what challenges data entry projects present.
Most obviously, data entry projects can take a lot of time — good planning can reduce that time significantly.
Secondly, the data in such projects can end up inconsistent and require cleaning — especially if more than one person is responsible for entering the data. Individuals can make errors that need identifying and correcting. Designing the data entry process well can make it much easier to prevent and spot those errors.
Finally, it’s not uncommon in data entry projects to find yourself going back to the source material to collect data that you should have collected to begin with. Again, good planning should avoid this.
Tip 1: Sketch out the table structure(s) to avoid having to repeat work
A project involving data should store that data in the correct format. So, let’s be clear: that means a spreadsheet — not a Word document (yes, I’ve seen FOI projects where journalists have collated the data in a Word document).
In a spreadsheet you can sort, filter, aggregate and calculate — all things that you cannot do with information trapped in a Word document.
But before you create that spreadsheet, it’s worth mapping out on paper what information it should contain. This should ensure that a) you only collect the data you need and b) you don’t end up having to go back to collect data that you might have overlooked.
Start by identifying the pieces of data (the “fields”) in your raw material. Try to be as specific as possible. Here are some examples of information from documents about ‘liabilities owed in the oil industry’ that could be separated into different fields:
- Liabilities amount
- Name of company owing money
- Name of company owed money
- Agreement owed under
In the case of dates try to be specific about what the date refers to. Is it the date of an event, or the date that something was recorded?
Consider having both a start and end date if the information you are collecting refers to events that take place over more than one day.
Consider what unique identifiers may exist in the information that you are dealing with. Unique identifiers are very useful in finding other information and helping you to combine data.
Company codes are a good example: if you have two company names that are similar, or the same company using different names, a company code is much more useful in helping distinguish or match those.
Contract codes and spending codes are other examples. They should definitely be in there.
One table — or multiple tables?
If your story relates to different types of entities — a classic example would be companies and company directors — then consider creating more than one table.
Why? Well consider the following scenario: a company often has more than one director; and it has different directors at different times. So we can’t simply add a field to a table about companies saying ‘director’.
Any situation where you are having to enter more than one entity in one cell (e.g. multiple names, codes etc) should make you ask whether it’s better to have a separate table for those things.
Here a second table is needed where the directorship is the focal element.
Now a company can appear more than once (against different directors), and a director can too (against different companies). The table might only have 4 fields: director name; company name; start date; and end date (if applicable).
Tip 2: Add columns to your table plans for attribution and checking, notes and newsworthiness
Once you’ve planned out the fields in your table, make sure you add a column titled ‘entered by’.
This is so that the person entering the data can put their name or initials in a row every time they enter data.
If problems are discovered at any point in the project, it means you can easily find out:
- Who entered the problematic data;
- If it is actually an error — or the data is what it seems;
- What may have caused any error;
- Whether the same mistake has been made by the same person elsewhere.
Add another column for ‘source’.
A source can be a URL or a description (e.g. “Agreement23.doc in the docs folder”). It should help you track down evidence later should you need it — especially useful when going through any legal discussions before publication.
Columns to check the data
Add checking columns for any data that should always be a particular type (e.g. numbers and dates). For example in our data plan we have a column which is supposed to contain a number, and a second column which is supposed to contain a date. Next to those we can specify a column checking that those have been entered correctly:
- Entered by
- Liabilities amount
- AMOUNT CHECK
- Name of company owing money
- Company number
- DATE CHECK
- Name of company owed money
- Agreement owed under
More on these later.
In any journalism project with multiple contributors it’s also worth adding a ‘newsworthiness’ column where people can enter a number to indicate that something is particularly interesting. Later the spreadsheet can be sorted by this column (biggest numbers top) to help you prioritise follow-ups.
A ‘notes’ column will also be needed: inevitably there will be instances where the journalist wants to add some contextual information or something that won’t fit in the other columns, e.g. “Amount doesn’t look right — need to check” or “In a later interview the CEO said they owed more [URL]”.
Now we can start to create the table(s)…
Creating the table — and the checking columns
Create a spreadsheet and type your column headings across the top.
The two checking columns will have formulae copied down them. These formulae will look at the cells next to them and say TRUE or FALSE that they are the right type of data.
Let’s say column B is being used for amounts, and column C has the check. In cell C2 then, you would type a formula like this:
When you press Enter, it should return FALSE, because B2 doesn’t have a number in it — it’s empty. Copy this formula all the way down the column, as far as you think you’ll need it.
Now say that in column F we are going to have dates, and in column G we will need a date check. There is no specific function to check if a cell contains a date — because a date in Excel is actually stored as a number (the number of days since 1 January 1900), and formatted to look like a date.
So we just need to use ISNUMBER again:
This will still return FALSE if someone types something like “June 2 1989” where Excel doesn’t convert that to a proper date. Unfortunately it won’t catch errors like people typing the year 2019 instead of a specific date.
But formatting will help us with that problem…
Tip 3: Formatting columns to avoid bad data entry
We want to make sure that data is formatted correctly to avoid the following two common errors:
- Company numbers, telephone numbers, contract codes and other ID codes being stored as numbers (which means leading zeroes are removed).
- Dates being stored as text
Although company numbers and telephone numbers are called “numbers” they are actually codes: there is no numerical value to them (one telephone number is not bigger than another; you don’t perform calculations with them).
In our plan earlier we had the company number in column E. To format that column so that it doesn’t treat those codes as numbers, do the following:
- In Excel: right-click on the column letter at the top of the column. Then select Format cells and on the window that appears select Text.
- In Google Sheets: select the whole column, then click on the Format menu and select the Number menu. This will open up another menu where you can specify the format you want (in this case, Plain Text)
Test it by entering a number beginning with 0. The number should stay left-aligned, and the zero should remain. In Excel you may get a little warning telling you that this is a ‘number stored as text’ but that’s fine — that’s what you want.
For dates, repeat the same process but this time select Date.
Again, you can test this by entering a date. It should format correctly.
Try entering just a year — 2019. Note that this is formatted as 11/07/1905. Why? Because that date is 2019 days from 1 January 1900!
Hopefully the person entering should notice — but you can add another column checking for unusually early dates if you want (for example
=F2>2030 which returns
FALSE if the number of days is less than 2,030, equivalent to the date 22/07/1905)
Tip 4: The brutal option: data validation
A final technique you can use to keep data clean and consistent is data validation. This means we limit the values that a person can enter in a cell — it’s especially useful when there are only a limited number of possibilities, such as categories or regions or countries.
Here’s how that could work for our ‘companies owed’ column.
- First, you need to create a list of values that you will allow. That’s best done in a separate sheet in Excel.
- Create a second sheet, then. In A1 in this new sheet type ‘Companies’ as your column heading for column A then in A2 type ‘Company A’ and in A3 type ‘Company B’. You now have a list of 2 companies.
- Now go back to your main sheet, and select all the cells below the heading in column H (‘Company owed’).
- Make sure you are in the Data tab in Excel, and click the button marked Data validation.
- A window will appear where you can specify what’s allowed in these cells. Select List.
- Click into the Source: box at the bottom. Now, while your cursor is still there, move your mouse away from the window and click on the second sheet in Excel, where your list is. Then click and drag to select the cells containing the list of company names you want to allow. The box should start to fill with the location of those cells: `=Sheet2!$A$1:$A$3`
- There are other options in the other tabs about the warning that is shown when someone tries to enter a value not in the list, and whether they can override that, but for now… Click OK.
Now, if you click on any of the cells in the main sheet in the column you applied this to, a dropdown menu should appear where the user can select the options from that list. This helps ensure consistency.
If they try to enter a value which isn’t in that list, they will get an error. They can either click ‘Yes’ to ignore that, or ‘No’ to accept and return to the cell to change it.
You can use data validation to ensure people only enter numbers or dates as well (you can specify the range of dates allowed), as an alternative to the ‘checking’ columns outlined above.
Tip 5: Using Google Forms to enter data
Google Sheets has a few advantages when it comes to data entry projects:
- Multiple people can work on the same spreadsheet at the same time
- The spreadsheet can be accessed from any computer as long as they have an internet connection
- Google Sheets stores a history of the sheet as it is changed (look under the File menu for ‘Version history’). This means that if mistakes are made you can return to a previous version of the spreadsheet, compare changes, and revert to earlier versions if you need to (removing any changes made after that version)
- Data can be entered using a form (Google Forms) rather than directly into the spreadsheet itself
To create a form for data entry, go into a spreadsheet in Google Sheets, click on the Tools menu and select Create a form (the first option).
This brings up a form creation tool. Edit the first question (for the first column of data): on the left specify the text on the form (e.g. “Your name”) and on the right select the type of answer you want: for things like names and places you want ‘Short answer’.
Repeat this process for each piece of information — add a new question each time by clicking on the + button in the right-hand menu.
- For dates remember to change to ‘Date’;
- For notes use ‘Paragraph’ so there’s enough space if they need it.
- For newsworthiness you can use the ‘Linear scale’ option that makes them choose between 1 and 5 or another range you specify.
- You can also use the Drop-down option to force them to specify from a range of options (such as countries).
- For yes/no answers use checkboxes.
When you’ve finished click the ‘eye’ icon in the upper right corner to preview the form. You can enter some data and submit it at the end.
Back in Google Drive you should be able to now see the form in the same location as the spreadsheet. If you open the spreadsheet you will find a new sheet where the responses are being stored.
To share the form, open it up again and click the ‘Send’ button in the upper right corner. You can input some emails, or switch to the ‘link’ tab to get a URL to share instead. There’s also an embed option if you are collecting responses online (for example in an article).
Taking it further: structured journalism
Although these tips relate to data entry projects as a whole, if you are tracking the development of an issue over time it is well worth looking at some of the resources on ‘structured journalism’. The webinar below provides more information on this and some further tips on how to approach a structured journalism project.