Skip to content

Do not Forget to Cover your Import Scripts with Tests!

Antoine Toubhans4 min read

A major issue during my last project was regressions in production occurring twice a week. The PO felt anxious, and the project was in danger,
even though the team was reactive and able to quickly fix regressions.

Why so many regressions?

A good explanation could be the lack of tests. However, both Node.js/Express.js back-end and AngularJS front-end were 80% covered by unit/end-to-end tests. Tests are supposed to prevent regressions so what’s wrong with them?

Something was clear: most regressions were related to the data.

Data regression: a case study

In our app, people can have phone numbers on their profile page. The app receives on a daily basis a file containing people IDs and phone numbers. The team wrote a script loading phone numbers into the database every night.

A regression occurred on the second day: people have the same phone number twice on their profile pages :( This is typically a data issue:

The culprit was the import script that did not remove old phone numbers before loading the new ones. Neither back-end tests nor front-end tests could have spotted this issue. If we look at the data flow of the application: import scripts lack tests!

data-flux

What do we need to test?

Data come from CSV files we received daily through a FTP. A Python script fetches the files and loads data in the database every night.

File encoding

Files are encoded in various formats: UTF-8, ISO-8859-1, Windows-1252 and even EBCDIC!. Checking that scripts can properly read, decode and reincode these files is a first step.

Data formatting

Data is usually processed before insertion into database. Due to having heterogeneous sources, phone numbers have various format e.g., “0123456789”, “123456789”, “0033123456789”, “+33123456789” or “6789 - 0123456789”. To prevent bad format errors from occurring later in the process, the import scripts format all phone number in a standard format, say “+33123456789”.

A good test should verify that loading “0033123456789” from a CSV file ends up with “+33123456789” in db.

Data consistency

The import scripts not only load data but also establish relations upon it. For instance, imagine the script loads two csv files respectively containing people names and phone numbers:

ID

lastname

firstname

ID

phone number

424242

Obama

Michelle

424242

+33123456789

Our test should check that Michelle Obama gets the phone number “+33123456789” in the database.

Data resilience

The import script is launched every day, hence all data is refreshed every day. Let say that the app received the following two files respectively on day 0 and day 1:

ID

phone number

ID

phone number

424242

+33123456789

424242

+33123456789

424242

+33123456790

424242

+33123456791

On day 0, Michelle Obama should have two phone numbers “+33123456789” and “+33123456790”. On day 1, we expect the import script to remove “+33123456790” and to add “+33123456791” so that Michelle ends up with two phone numbers again.

How do we write the test?

The directory structure of the test folder looks like this:

test
├── import-script-test.py
├── payloads
|   ├── day0 // Folder containing csv files that are loaded on day 0
|   |   └── names.csv
|   |   └── phone_numbers.csv
|   ├── day1 // Folder containing csv files that are loaded on day 1
|   |   └── names.csv
|   |   └── phone_numbers.csv
|   ├── day2 // Folder containing csv files that are loaded on day 2
|   |   └── ...
| ...
├── data
|   ├── day0 // Folder containing a dump of the expected data for day 0
|   |   └── people.json
|   ├── day1 // Folder containing a dump of the expected data for day 1
|   |   └── people.json
|   ├── day2 // Folder containing a dump of the expected data for day 2
|   |   └── ...
... ...

The test import-script-test.py is written in Python and consists in:

  1. emptying the test database;
  2. loading test/payload/day0/*.csv files with the import scripts;
  3. getting a JSON extraction of the database;
  4. comparing “field by field” the extraction with test/data/day0/people.json;
  5. starting over again steps 1, 2 and 3 for day1, day2, … and so on.

import-script-tests

It is an end-to-end test for the import script that allows us to verify all the aforementioned points. Payload files are kept small so that the test can be easily updated. Running consecutive imports allows to check data resilience.

Epilogue

Testing the import scripts prevents many regressions due to data.

Not all projects need import scripts tests. You should evaluate the data flow of your app to decide whether it is worth it.