Thursday, April 15, 2010

Converting XLS to CSV on Linux

While programming, its a real pain if the data that you want to process is present in xls spreadsheets. If its only one xls file, then you can open it using MS Excel or Open Office and save it as a csv file. Once you get the data in csv format, life becomes simple. Almost all the programming languages have libraries for parsing csv files. With scripting languages like Python or Ruby, you can literally play around with csv files with just a few lines of code.

But what if the data is contained in many xls files? This is a typical situation that a programmer may face when he is working with non-programmers. For example, in the research project that I am working on, we receive data on plant species from Botanists at the Smithsonian Institute. The data that they send is usually spread across 200 or so xls spreadsheets. In such situations, its great to have a command line tool that can convert xls to csv.

There are two command line tools that I found that do a pretty good job in converting xls to csv. They both have their advantages and disadvantages (which I will be talking about). Here they are:

1. xls2csv (by V.B.Wagner)

Here is the project webpage and here is the direct link to download the source.

After installing it, here is how I ran it:

$ xls2csv myfile.xls > myfile.csv

xls2csv prints the output to stdout which can then be redirected to a file

The good news:

1. Installation is straightforward.

2. It works great, even when the xls file has multiple worksheets. In such a case (xls file with multiple worksheets), xls2csv will print all the worksheets onto the csv file with the contents of each worksheet separated by a form-feed character (^L)

3. In any line, even if the first "n" columns are empty and data is present in the "(n+1)th" column, xls2csv will recognize that line and include it in the generated csv file

The bad news:

1. xls2csv has problems with dates. If the xls file has dates, like "03/31/85" or "31-Mar-1985" etc, then xls2csv will not reproduce the date as is. Instead it will convert the date into a number while generating the csv file. The man page for xls2csv mentions that using the -f flag to specify the date format will solve this problem. But I could never got it work; several others have also reported this problem in discussion boards.

So, if your xls files do not have dates, this is the thing for you. You will be happy with the results.

Here is the second tool:

2. xls2csv (by Ken Prows)

This tool is also called as xls2csv. But this is a different implementation by a different author. This is in Perl by Ken Prows whereas the first one was in C by V.B.Wagner (Note: Both these tools install to /usr/local/bin. So, if you will be installing both, make sure to configure them so that they have separate installation paths).

Here is the project webpage and here is the direct link to download the source.

Installing this can be a pain if you don't have the required Perl modules that xls2csv requires for its installation. However installing the required modules using CPAN is very easy. If you are having problems with the installation, here is a very nice tutorial on installing xls2csv.

After installing it, here is how I ran it:

$ xls2csv -x myfile.xls -c myfile.csv

The good news:

1. Unlike the 1st tool, this works great with dates. The generated csv file will have the dates in the exact same format as the original xls file.

The bad news:

1. Installation can become a bit tricky if you don't have the required Perl modules.

2. If the xls has multiple worksheets, by default, it will convert only the first worksheet to csv. However, it does support a -w flag wherein you can specify the sheet name that you would like to convert:

$ xls2csv -x myfile.xls -c myfile.csv -w worksheet_name

It also supports a -W flag using which you can list all the worksheets in the xls file. Usage:

$ xls2csv -W -x myfile.xls

3. If the first column is empty, the entire row is ignored. So if you have a xls file in which the first column is blank, but has tons of data from second column onwards, xls2csv will generate an empty csv file... That's bad !!

Well, if you have xls files which have lots of dates (and hopefully single worksheets per xls file), this is the tool for you.

Now lets batch-convert...

Here is my Ruby script to batch-convert xls files to csv. It takes 2 arguments: 1) the source directory that has all the xls files and 2) the target directory where you want to save the generated csv files. So, the usage is:

$ ruby xls2csv.rb "/home/vinay/myxls" "/home/vinay/mycsv"

where "myxls" is the directory which has all the xls files and "mycsv" is the directory into which all the csv files will be generated.

Here is the Ruby script:

# Author: Vinay Kumar Bettadapura

if ARGV.length != 2
    puts "usage: \"ruby xls2csv.rb source_dir target_dir\""
    exit -1
end

source_dir = ARGV[0]
target_dir = ARGV[1]

if !File.exists?(target_dir)
    puts "target_dir \"#{target_dir}\" is not a valid directory"
    exit -1
end

source_entries = []
begin
    source_entries = Dir.entries(source_dir).sort
    # To remove the first two array elements which 
    # will be "." and ".."
    source_entries.shift
    source_entries.shift
rescue Exception => e
    puts "source_dir \"#{source_dir}\" is not a valid directory"
    exit -1
end

if source_entries.empty?
    puts "source_dir \"#{source_dir}\" is empty"
    exit 0
end

source_entries.each{|file|
    source_file = source_dir + "/" + file
    target_file = target_dir + "/" + file.gsub(".xls", ".csv")

    puts "Converting \"#{source_file}\" to \"#{target_file}\""
    `xls2csv \"#{source_file}\" > \"#{target_file}\"`
}

puts "Done..."

The Ruby script uses the 1st xls2csv tool. If you want to use the 2nd xls2csv tool, then replace

`xls2csv \"#{source_file}\" > \"#{target_file}\"`

on line number 39 with

`xls2csv -x \"#{source_file}\" -c \"#{target_file}\"`

in the Ruby script.

16 comments:

  1. However installing the required modules using cspan is very easy

    I believe you meant CPAN.

    ReplyDelete
  2. Yes. I meant CPAN. Thanks for pointing it out.

    ReplyDelete
  3. For the xls2csv (by V.B.Wagner) there is a easy way to fix that bug with dates (counting from 1904 instead of 1900):

    https://bugs.launchpad.net/ubuntu/+source/catdoc/+bug/349016

    ReplyDelete
  4. Thanks for mentioning the row truncate stuff..I did not see that coming...
    However, in my opinion multiple sheet stuff is not a demerit for xll2csv...

    ReplyDelete
  5. Hello,

    I have, inside my Excel files, the following character ' . This is an issue for the converter xlx2csv tool. Does it exist a solution to solve this problem (ie replace this character before conversion).

    Thanks in advance.

    Best regards.

    ReplyDelete
    Replies
    1. The ' character could be translated to something else like an underscore character using the tr command like this:
      tr "'" "_" outputfile

      (Note that those are ordinary quotes. Some blogs automatically change them to so-called smart-quotes. If you try to use smart-quotes on the commandline it won't work.)

      Delete
  6. Option -f for the Perl xls2csv solve the issue with empty first column, documentation here : http://search.cpan.org/~ken/xls2csv/script/xls2csv

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. onverting "/home/credentek/Desktop/gaurav/xlstocsv/myxls/test.xls" to "/home/credentek/Desktop/gaurav/xlstocsv/mycsv/test.csv"

    sh: /home/credentek/Desktop/gaurav/xlstocsv/myxls/test.xls: cannot execute binary file
    Done...

    Dear sir,

    When i run this script above msg have been populated. cannot execute binary file even those my file has .xls and content are normal. please help me out and if there is any change in script plese tell me as soon as possilbe its greatfull for me.


    Regards
    gaurav Sahoo
    7276481825

    ReplyDelete
  9. Thanks Vinay. Your ruby script worked beautifully and saved me a ton of time!

    ReplyDelete
  10. ty! 2. xls2csv (by Ken Prows) saved my day o_0 THANKS A LOT YOU AWESOME CREATURE! =)

    ReplyDelete
  11. It's better to parse the output of xls2csv and create an HTML table page into a temporary file and view that using w3m or links; the lynx is not good for viewing HTML tables.

    I wrote a shell script for this, ruby is an ourkill for such tasks.

    ReplyDelete
  12. It's better to parse the output of xls2csv and create an HTML table page into a temporary file and view that using w3m or links; the lynx is not good for viewing HTML tables.

    I wrote a shell script for this, ruby is an ourkill for such tasks.

    ReplyDelete
  13. I suppose that is one of the most widespread hint for me. And im satisfied studying your article. however ought to observation upon some fashionable matters, The web web page fashion is unadulterated, the articles is really all-powerful : D. appropriate process, cheers free csv converter

    ReplyDelete
  14. It is amazing and wonderful to see your blog. Thanks for sharing this information,..
    advanced-xls-converter-crack

    ReplyDelete