In this article we see how to make a quick data extraction from text files with structured data, organized in rows and with the elements of each row separated by a particular character. The classic CSV or similar files, for instance.
We often have to extract from this type of files a certain subset of columns of interest, ignoring the others. This kind of operation can easily be performed on unix-like systems using shell commands.
Let’s see a practical example; suppose we have a text file that contains information about users, organized as follows:
ID;Name;LastName;Age;Gender;City;Tel;Email; 0001;Davis;Molinari;32;M;Trino;000-123456789;firstname.lastname@example.org 0002;Mario;Bianchi;50;M;Milano;02-987654;email@example.com 0003;Paolo;Rossi;45;M;Torino;011-555443;firstname.lastname@example.org 0004;Maria;Neri;35;F;Roma;00-2121212;email@example.com
Our file has a first header row and then the rows that contain the anagrafic and contact user information. In each line the various elements are divided by a separator character which in this case is represented by ‘;’ (semicolon).
Now suppose we want to extract from the original file only a few columns, such as columns “Name”, “LastName” and “Email” and save the new subset of data to a new file.
The content of the new file will then be the following:
Name;LastName;Email Davis;Molinari;firstname.lastname@example.org Mario;Bianchi;email@example.com Paolo;Rossi;firstname.lastname@example.org Maria;Neri;email@example.com
We can easily obtain this result with the cut command.
This command allows us to specify, using the -d option, the delimiter that separates the fields of our lines and, using the -f option, the indexes of the columns we want to extract.
The column indixes start with 1, so to obtain our goal we have to extract the columns 2 (Name), 3 (LastName) and 8 (Email).
Let’s run the command: cut -d ‘;’ -f 2,3,8 ListaUtenti.txt > ListaOutput1.txt
As we can see from the previous image the file created with the command contains exactly the extract of the data we wanted.
The columns to be extracted indicated with the -f option can be listed individually, separated by a ‘,’ (comma) in the list or as intervals by separating in this case the bounds of the range with a ‘-‘. (Example: cut -d ‘;’ -f 1-4)
Now let’s add some other operation.
Often some of the values of the rows of a .csv file are enclosed in single quotes or double quotes. For example, suppose we have again the content of our original file, but this time with email addresses enclosed by double quotes. We modify the file “ListaUtenti.txt” so that it contains:
ID;Name;LastName;Age;Gender;City;Tel;Email; 0001;Davis;Molinari;32;M;Trino;000-123456789;"firstname.lastname@example.org" 0002;Mario;Bianchi;50;M;Milano;02-987654;"email@example.com" 0003;Paolo;Rossi;45;M;Torino;011-555443;"firstname.lastname@example.org" 0004;Maria;Neri;35;F;Roma;00-2121212;"email@example.com"
What we want to do is get the same output as before, so this time as well as select the columns of interest we must remove the quotation marks from the values of the Email column.
To do this we use the command tr (translate characters) with the -d option, with which we indicate to delete characters that we provide. We’ll provide to tr command the output of our cut command through the ‘|’ (Pipe).
Let’s run the command: cut -d ‘;’ -f 2,3,8 ListaUtenti.txt | tr -d ‘”‘ > ListaOutput2.txt
From the image above we can see that the content is equal to the previous one and we have eliminated the quotation marks that enclosed values of the email column.
Finally we add a further refinement.
Typically this type of file has a first header line that lists the names of the columns that the values of the following lines correspond to. Often we need to eliminate this header line.
So now, always starting from the users file still containing quotation marks around email values, we want to get the following output, where the header row has been deleted:
Davis;Molinari;firstname.lastname@example.org Mario;Bianchi;email@example.com Paolo;Rossi;firstname.lastname@example.org Maria;Neri;email@example.com
To do this we use the tail command. This command, by default, allows us to display the last 10 lines of a file. But it has the interesting -n option that allows us to specify:
- The number of lines to be extracted from the end of the file, by specifying an integer number
- The row number from which to start by specifying it preceded by a +
tail -n 4 extracts the last 4 lines
tail -n +2 extracts from the second row onward
And it is precisely the latter option that is right for us!
Let’s run the command: cut -d ‘;’ -f 2,3,8 ListaUtenti.txt | tail -n +2 | tr -d ‘”‘ > ListaOutput3.txt
As we can see from the picture we get the desired result because we have eliminated both the header row and the quotes.