Option to ignore errors
Created by: brandonpoc
I use the tools from 'csvkit' on very, very large files piped in via stdin - often tens of gigabytes and hundreds of millions of lines - and these processes can take a very long while to run, sometimes in excess of 12 hours. A major thorn in my side is checking on a job's progress and finding that it bailed out after just a few minutes because of some Python exception that wasn't caught. The older version of csvkit I had was full of these problems, and the latest version remedied many of them, but I still hit an error where a bad UTF-8 byte sequence caused the program to bail out. The error from csvgrep was as follows:
```
Your file is not "utf-8" encoded. Please specify the correct encoding with the -e flag. Use the -v flag to see the complete error.
```
I found the line and ran it against iconv / uni2ascii / recode etc., and it's unanimous: there was some bad byte pattern present in the input file for whatever reason. Using -e to specify different encodings (ascii, utf-8, and so on) did not work. Ultimately, because running iconv or recode over the whole file would take too long, and uni2ascii was bailing out, I just piped the file through the "strings" utility before passing it into csvgrep as ASCII.
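For anyone hitting the same wall, here is roughly what that workaround looked like. The file name, column, and pattern below are placeholders, and strings is a blunt instrument - it simply drops runs of non-printable bytes - but it got the data through:

```sh
# Keep only printable runs of bytes, then treat the result as plain ASCII.
# "huge.csv", "some_column", and "some_value" are placeholders.
strings huge.csv | csvgrep -e ascii -c some_column -m some_value > filtered.csv
```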
So, to prevent these types of errors from making the program exit outright (crash, in my opinion!), it would be nice to have an option common to all csvkit tools that forces all errors to be ignored, reporting them to stderr or a log file along with the content of the offending record(s), the line number(s), and the reason for the exception. The bad line would be left out of the output, and if needed it could be fixed manually before re-running the csvkit tools.
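To make the request concrete, here is how I would imagine invoking it. The --skip-errors flag below is purely hypothetical - no such option exists today - and the file names are placeholders:

```sh
# HYPOTHETICAL: --skip-errors is a made-up name for the proposed option.
# Bad records would be dropped from the output and reported on stderr
# with their line number and exception, redirected to a log file here.
csvgrep --skip-errors -c some_column -m some_value huge.csv > out.csv 2> bad-rows.log
```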
This would make csvkit much, much friendlier to run against large data sets. Again, when it takes, say, 12 hours to pipe just one of my data sets through csvgrep, it absolutely crushes me to see an error that stopped it cold in its tracks just 45 minutes in. Then I have to use grep to find the line it crashed on in the original file, do the subtraction to get the count of remaining lines, tail that count from the source file into another file, try to figure out the problem line, and re-run csvkit, only to AGAIN find that a SINGLE BAD BYTE crashed the dang thing.
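In shell terms, that recovery dance looks something like this every single time (the pattern, line numbers, and file names are placeholders):

```sh
# Locate the offending line; -a forces grep to treat the file as text
# despite the bad bytes. 'suspicious bytes' is a placeholder pattern.
grep -an 'suspicious bytes' huge.csv

# Suppose it reported line 123456: resume from the line after it, which
# spares the manual subtraction, then fix the bad line and re-run.
tail -n +123457 huge.csv > remainder.csv
```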
I hope you understand my frustration, and why an option to forcefully and explicitly continue in the face of errors, ignore the record(s) in error, and just output them to stderr and/or a log would be helpful.
Thank you!