
Thursday, September 14, 2006

Cleaning Up Data with Perl

We had a peculiar problem. We got a plain text file of about 150G. The file had many records in it. Most of the records contained control characters like ^M (the carriage return, \r), ^@ (the NUL byte, 0x00) and tabs. We tried to view the records and count the characters in each, but the character count varied from record to record because of the special and control characters. Some records did not have a \n at the end and were merged with the next record. Trying to open the file in vi was horrible. The only thing we knew was the record size. We tried removing the control characters with sed to see what would happen.

But sed hung after processing 6G and took a very long time (6 hours for that 6G). So I thought of doing this with Perl. I posted the question in the Perl forums and got a good solution: read n characters at a time (where n = the record size we already knew), remove the special characters from them, append a \n and write the result to a new file. That worked fine! Below is the Perl code I used:

#!/usr/bin/perl
# Script to remove control characters from the big data file and print
# the output.
# Note: redirect the output to a new file when calling this from bash.
open(MYINPUTFILE, $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
$/ = \1000;                  # read fixed-length records of 1000 bytes (our record size)
while (<MYINPUTFILE>) {
    s/[[:cntrl:]]/ /g;       # replace every control character with a space
    print "$_\n";            # write the cleaned record with a proper newline
}
close(MYINPUTFILE);
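
To run it (the script and file names here are just examples), save the code as clean.pl, make it executable and redirect the output to a new file:

$ chmod +x clean.pl
$ ./clean.pl bigfile.dat > bigfile.clean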

I got this solution from a Perl forum.

Saturday, August 26, 2006

PGP - Encryption, mails and files

This is about a peculiar problem I encountered.

One of my colleagues wanted to send a PGP public key to the client. I thought it was for encrypting mails, so I created a key pair in OpenPGP (Thunderbird) and sent the public key. But it turned out that the key was to be used for encrypting files, not mails.

We have to use the gpg command in Linux for encrypting/decrypting files. I tried importing the public key (the one I sent) into the gpg keyring on the machine that had the encrypted files:
$ gpg --import < public-key-file

and tried decrypting the file:
$ gpg -o output-file --decrypt encrypted_file

But it did not work. There was no way I could decrypt the files. Then a thought struck me: "Shouldn't I import the private key to decrypt the file, rather than the public key?" So I tried the key management tool in OpenPGP (Thunderbird). There I selected my key and exported it to a file. I compared that key file with my public key and noticed that the file exported from Thunderbird had both the private and the public key info. I copied the key file to the system having all the encrypted files. Then I imported the copied key into the keyring with:
# gpg --import keyfile        (e.g. gpg --import my.key.asc)

This imported both the public key and the secret key (private key). I could see the keys with:
# gpg --list-keys
# gpg --list-secret-keys

Then I used gpg to decrypt the files:
$ gpg -o output-file --decrypt encrypted_file

This did the trick, and I was able to decrypt all the files one by one.
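
For reference, the private key can also be exported with gpg itself instead of Thunderbird. A minimal sketch, assuming the key pair already lives in a gpg keyring (the KEYID and file names are placeholders):

$ gpg --export-secret-keys --armor KEYID > my.key.asc
$ gpg --export --armor KEYID > public.asc

The first file contains both the private and public key info and can be imported on the other machine with gpg --import, exactly as above.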

Wednesday, August 23, 2006

Converting a MySQL table to CSV

Exporting data from MySQL to CSV with SELECT ... INTO OUTFILE ends up creating the CSV file in the MySQL data directory (/var/lib/mysql) on the server. Use this method instead to convert individual tables from MySQL to CSV format:

$ mysql -u <username> -h <hostname> <databasename> -p -B -e "select * from table;" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > test.csv

Note: The whole command above must essentially be in a single line
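
As a concrete example (the user, database and table names here are made up), this dumps a customers table; the extra -N (--skip-column-names) makes the mysql client omit the column-header line that batch mode prints first:

$ mysql -u dbuser -h localhost mydb -p -B -N -e "select * from customers;" | sed 's/\t/","/g;s/^/"/;s/$/"/;s/\n//g' > customers.csv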

Thursday, August 17, 2006

My New Tech Blog

I had a tech blog at LiveJournal, and making a new post there was time consuming (I had to log in to LiveJournal to post new entries). Mailing entries to a journal ID was allowed only for paid customers.

I have a different blog for my personal events here, which allows me to post from anywhere by just sending a mail. I like this feature in Blogger very much, and I am moving all the posts I had at LiveJournal to Blogger (this blog). I will be using this new blog to post all my further findings/experiments in Linux and other technologies that interest me.

I am posting the entries here from LiveJournal with the same dates I posted them there, so as to remember things and days. So you will see back-dated posts here.

Thanks for reading!

Monday, May 08, 2006

Find Files

Commands to find files:

List files modified within the last 24 hours (note the -1: plain -mtime 1 would match files modified between 24 and 48 hours ago):

find / -mtime -1 -exec ls -l {} \;

The same, but copying the matches to /tmp:

find / -mtime -1 -exec cp {} /tmp \;

Other options (see the examples after this list):
-atime n Find files that were accessed n days ago. You can also use +n to find files that were accessed more than n days ago, or -n to find files that were accessed less than n days ago.
-mtime n Same as -atime, but matches files that were modified n days ago.
-ctime n Same as -atime and -mtime, but matches files whose status (inode data) was changed n days ago.
-newer file Find files that have been modified more recently than file.
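
A couple of combined examples (the paths and patterns are just illustrations):

find /etc -name '*.conf' -ctime -2        # .conf files whose status changed less than 2 days ago
find /var/log -newer /var/log/messages    # files modified more recently than the reference file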

More help here