The Regex warrior: transform a text file in <2 mins

Recently I needed to transform a csv file with some simple processing. If you want to transform a file line by line, replacing a few things, deleting a few others, regular expressions are your friend.

Sed is a unix command line utility for modifying files line by line with regular expressions. It’s simpler than awk, and works very similarly to the vim substitution you may be vaguely familiar with.

We’re going to take a csv file and do the following with regex + sed:

  1. Delete some rows that meet a condition
  2. Delete a column (alright, I admit excel’s not bad for this one)
  3. Modify some cells in the table
  4. Move some data to its own column

We’ll be using this mock csv file data set:

[code]
name;id;thumbnail;email;url;admin;enabled
Dan Smith;56543678;dan_profile.jpg;dan@test.com;http://site.com/user/dan;false;true
James Jones;56467654;james_profile.png;james@test.com;http://site.com/user/james;false;true
Clément Girard;56467632;clement_profile.png;clement@test.com;http://site.com/user/clement;false;false
Jack Mack;56485367;jack_profile.png;jack@test.com;http://site.com/user/jack;true;false
Chris Cross;98453;chris_profile.png;chrisk@test.com;http://site.com/user/chris;true;true
[/code]

Removing some users

First let’s remove users who are not enabled (last column == false). Sed lets us delete any line that return a match for a regex.

The basic structure of a sed command is like this:

[bash]sed ‘s/{regex match}/{substitution}/’ <input_file >output_file[/bash]

There’s also a delete command, deleting any line that has a match:

[bash]sed ‘/{regex match}/d’ <input_file >output_file[/bash]

Let’s use it

[bash]sed ‘/false$/d’ <test.csv >output.csv[/bash]

regex 1 delete unenabled users

That command deletes these line in the csv:

[code]
Clément Girard;56467632;clement_profile.png;clement@test.com;http://site.com/user/clement;false;false
Jack Mack;56485367;jack_profile.png;jack@test.com;http://site.com/user/jack;true;false
[/code]

Some of our users also have an old user Id (only 5 digits). Let’s delete those users from our csv too.

[bash]sed -r ‘/;[1-9]{5};/d’ <test.csv >output.csv[/bash]

regex 2 delete user ids

That command deletes this line in the csv:

[code]
Chris Cross;98453;chris_profile.png;chrisk@test.com;http://site.com/user/chris;true;true
[/code]

Here we’re using the OR syntax: []. This means that the next character in the match will be one of the chars between the braces, in this case it’s a range of possible digits between 1 and 9.

We’re also using the quantifier {}, it repeats the previous character match rule 5 times, so will match a 5 digit number.

Note we added the -r flag to sed so that we could use regex quantifier. This flag enabled extended regular expressions, giving us extra syntax.

Removing a column

Next we want to remove the admin column. Removing the ‘admin’ column title is easy enough, but let’s use regex to remove the data. Our csv has 2 boolean columns, admin and enabled, we want to match both of these, and replace the match with just the ‘enabled column’, which we want to keep.

[bash]sed -r ‘s/(true|false);(true|false)$/\2/’ <test.csv >output.csv[/bash]

regex 3 delete column

In this example we’ve used capture groups. By surrounding a match in parentheses, we can save it to a variable – the first capture is saved to ‘\1’, the second to ‘\2’ etc.

In the substitution section of our sed string, we replaced the entire match with capture group ‘\2’. In other words we’ve replaced the final two columns in each row with just the final column, thus removing the second-to-last column from the csv.

We’ve also used the pipe ‘|’ as an OR operator, to match ‘true’ or ‘false’.

We’re left with a csv that looks like this:

[code]
name;id;thumbnail;email;url;enabled
Dan Smith;56543678;dan_profile.jpg;dan@test.com;http://site.com/user/dan;true
James Jones;56467654;james_profile.png;james@test.com;http://site.com/user/james;true
[/code]

Modifying cells in the table

Next we’re going to modify the url column to store the relative url rather than the absolute path. We can use a regex like this:

[bash]sed ‘s_http:\/\/site[.]com\/\(.*\)_\1_’ <test.csv >output.csv[/bash]

This is very difficult to read because we have to escape each forward slash in the url with a backslash. Fortunately, we can change the sed delimiter from a forward slash to an underscore. This means we don’t have to escape forward slashes in the regex part of our sed command:

[bash]sed -r ‘s_http://site[.]com/(.*)_\1_’ <test.csv >output.csv[/bash]

That’s much more readable!

regex 4 modify cell

Here we match any characters after the base url using .* (this will match everything after the base url in the row). We save that in a capture group, so we now have a string starting with the relative url. By substituting the match with this, we’ve replaced the full url with the relative url.

We’re left with a csv that looks like this:

[code]
name;id;thumbnail;email;url;enabled
Dan Smith;56543678;dan_profile.jpg;dan@test.com;user/dan;true
James Jones;56467654;james_profile.png;james@test.com;user/james;true
[/code]

Moving data to its own column

Finally, let’s take a column and split it into 2, moving some of its data to the new column. Let’s replace the name column with ‘first name’ and ‘last name’. We can start by renaming the headers in the first row, then use sed + regex to split each row in our csv in 2 automatically!

We could start with this:

[bash]sed -r ‘s/^([a-zA-Z]* )/\1;/’ <test.csv >output.csv[/bash]

regex 5 make column

Here we use OR square bracket [] notation again. In this case we match a character in a range of upper or lower case alphabetical characters. On Ubuntu Linux, this matches international alphabet characters like é, but this depends on your environment.

We save everything up to a space character (which delimits the first name from the last name) into a capture group and substitute it with itself plus a ‘;’ – thus moving first name into its own column.

The problem with this is our first name is left with a trailing space character before the column delimiter (;). It would look like this:

[code]
first name;last name;id;thumbnail;email;url;enabled
Dan ;Smith;56543678;dan_profile.jpg;dan@test.com;user/dan;true
James ;Jones;56467654;james_profile.png;james@test.com;user/james;true
[/code]

Instead, we can do something like this:

[bash]sed -r ‘s/^([a-zA-Z]*)( )(.*)$/\1;\3/’ <test.csv >output.csv[/bash]

This matches the space character, but also saves it into capture group ‘\2’. We then substitute the whole match with \1;\3 – effectively putting everything back together without the space, and putting a ‘;’ character in its place. We now have our new columns, first name and last name.

There’s actually an even easier solution than this, we just replace the first empty space in each row with a ‘;’ like this:

[bash]sed -r ‘s/ /;/’ <test.csv >output.csv[/bash]

That was fast and easy!

We’re left with a csv that looks like this:

[code]
first name;last name;id;thumbnail;email;url;enabled
Dan;Smith;56543678;dan_profile.jpg;dan@test.com;user/dan;true
James;Jones;56467654;james_profile.png;james@test.com;user/james;true
[/code]

With only a few commands, we’ve managed to transform a csv from our terminal.