
Thursday, September 14, 2006

Cleaning Up Data with Perl

We had a peculiar problem. We got a plain text file of about 150 GB. The file had many records in it, and most of the records contained control characters like ^M (the carriage return, \r), ^@ (the NUL byte, 0x00), and tabs. We tried to view the records and count the characters in each, but the character count varied from record to record because of the special and control characters. Some records did not have \n at the end and were merged with the next record. Trying to open the file in vi was horrible. The only thing we knew was the record size. We tried to remove the control characters with sed to see what would happen.
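I no longer have the exact sed command we tried, but it would have been something along these lines (the file names here are just placeholders):

sed 's/[[:cntrl:]]/ /g' bigfile.txt > cleaned.txt

Since sed processes its input line by line, the records with missing \n terminators would have looked to it like enormous single lines, which may be part of why it struggled.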

But sed hung after processing 6 GB and took a very long time (about 6 hours for that 6 GB). So I thought of doing this with Perl. I posted the question on the Perl forums and got a good solution: read n characters (where n is the record size we already knew), remove the special characters from them, append \n to the end, and write the result to a new file. That worked fine! Below is the Perl code I used:

#!/usr/bin/perl
# Script to remove control characters from the big data file and print
# the output.
# Note: redirect the output to a new file when calling this from bash.
open(MYINPUTFILE, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
$/ = \1000;                # read fixed-length 1000-byte records
while (<MYINPUTFILE>) {
    s/[[:cntrl:]]/ /g;     # replace each control character with a space
    print "$_\n";          # terminate the record with a newline
}
close(MYINPUTFILE);
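The trick is the line $/ = \1000: setting $/ to a reference to an integer puts Perl's readline operator into fixed-length record mode, so each pass through the loop gets exactly 1000 bytes (adjust that number to your own record size). To run it, redirect the output to a new file as the comment says; the script and file names here are just placeholders:

perl clean_records.pl bigfile.txt > bigfile.clean.txt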

I got this solution from the Perl forums.