csvclean should handle UnicodeDecodeErrors more gracefully
Created by: ghing
I was working with a CSV file prepared by a colleague. It started as an HTML table, was copied and pasted into Excel, manually edited, and finally saved as CSV. When I tried to load the file into a spatial database using ogr2ogr, I got an unhelpful error, so I tried to inspect the file with csvclean -n badfile.csv.
However, when I ran the command, I just got an error message stating:
'utf8' codec can't decode byte 0xa0 in position 20: invalid start byte
Not very helpful.
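For context, byte 0xa0 is the non-breaking space in Latin-1 and Windows-1252, so it is a plausible leftover from the HTML-to-Excel round trip (that is a guess about this particular file, not something I have confirmed). A minimal Python 3 sketch, using a made-up sample rather than the real file, of why the UTF-8 decoder rejects it:

```python
# Hypothetical sample: a Windows-1252 non-breaking space (0xa0) embedded in
# otherwise ASCII CSV text, as a non-UTF-8 export might produce it.
raw = b"name,city\nJane\xa0Doe,Chicago\n"

message = None
offset = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the first undecodable byte.
    message = str(exc)
    offset = exc.start

# The same bytes decode cleanly as Windows-1252; 0xa0 becomes U+00A0.
text = raw.decode("cp1252")

print(message)
print(offset, repr(text[offset]))
```

The position in the message is a byte offset into whatever chunk was handed to the decoder, which is why it does not help much in locating the line in the file.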
The error comes from an uncaught exception raised in csvkit.unicsv.UTF8Recoder.next() by the call to self.reader.next() (https://github.com/onyxfish/csvkit/blob/master/csvkit/unicsv.py#L22).
It seems that csvclean could at least report the line number of the CSV file with the wonky characters to allow for further inspection.
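As a rough illustration of the kind of report I mean, here is a standalone Python 3 sketch that scans a file in binary mode and lists the physical lines that do not decode. The function name find_undecodable_lines and the sample data are hypothetical, and none of this is csvkit code. It counts physical lines, so a quoted field containing a newline would shift the numbers relative to CSV rows.

```python
import io


def find_undecodable_lines(stream, encoding="utf-8"):
    """Yield (line_number, error) for each line of a binary stream that fails to decode."""
    for line_number, raw_line in enumerate(stream, start=1):
        try:
            raw_line.decode(encoding)
        except UnicodeDecodeError as exc:
            yield line_number, exc


# Made-up input standing in for badfile.csv; line 3 contains a stray 0xa0 byte.
sample = io.BytesIO(b"id,name\n1,ok\n2,bad\xa0value\n3,ok\n")
for line_number, exc in find_undecodable_lines(sample):
    print("line %d: %s" % (line_number, exc))
```

With a real file, the stream would be something like open("badfile.csv", "rb").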
Here's a patch that shows my hack to identify the messed up line:
diff --git a/csvkit/cleanup.py b/csvkit/cleanup.py
index 86cdc00..1426c46 100644
--- a/csvkit/cleanup.py
+++ b/csvkit/cleanup.py
@@ -58,32 +58,38 @@ class RowChecker(object):
     def checked_rows(self):
         """A generator which yields OK rows which are ready to write to output."""
-        for row in self.reader:
-            self.input_rows += 1
-            line_number = self.input_rows + 1 # add one for 1-based counting
+        try:
+            for row in self.reader:
+                self.input_rows += 1
+                line_number = self.input_rows + 1 # add one for 1-based counting
-            try:
-                if len(row) != len(self.column_names):
-                    raise LengthMismatchError(line_number,row,len(self.column_names))
-                # any other tests?
-                yield row
-            except LengthMismatchError, e:
-                self.errs.append(e)
-                # see if we can actually clean up those length mismatches
-                joinable_row_errors = extract_joinable_row_errors(self.errs)
-                while joinable_row_errors:
-                    fixed_row = join_rows([err.row for err in joinable_row_errors], joiner=' ')
-                    if len(fixed_row) < len(self.column_names): break
-                    if len(fixed_row) == len(self.column_names):
-                        self.rows_joined += len(joinable_row_errors)
-                        self.joins += 1
-                        yield fixed_row
-                        for fixed in joinable_row_errors:
-                            self.errs.remove(fixed)
-                        break
-                    joinable_row_errors = joinable_row_errors[1:] # keep trying in case we're too long because of a straggler
+                try:
+                    if len(row) != len(self.column_names):
+                        raise LengthMismatchError(line_number,row,len(self.column_names))
+                    # any other tests?
+                    yield row
+                except LengthMismatchError, e:
+                    self.errs.append(e)
+                    # see if we can actually clean up those length mismatches
+                    joinable_row_errors = extract_joinable_row_errors(self.errs)
+                    while joinable_row_errors:
+                        fixed_row = join_rows([err.row for err in joinable_row_errors], joiner=' ')
+                        if len(fixed_row) < len(self.column_names): break
+                        if len(fixed_row) == len(self.column_names):
+                            self.rows_joined += len(joinable_row_errors)
+                            self.joins += 1
+                            yield fixed_row
+                            for fixed in joinable_row_errors:
+                                self.errs.remove(fixed)
+                            break
+                        joinable_row_errors = joinable_row_errors[1:] # keep trying in case we're too long because of a straggler
-            except CSVTestException, e:
-                self.errs.append(e)
+                except CSVTestException, e:
+                    self.errs.append(e)
+
+        except UnicodeDecodeError, e:
+            e.line_number = line_number
+            e.msg = str(e)
+            self.errors.append(e)
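One caveat with the hack: the exception is raised while the loop is fetching the next row, so line_number still holds the number of the last row that was read successfully, and it is unbound if the very first row fails. A Python 3 sketch of the same idea with the counter kept outside the loop follows. The names rows_with_decode_errors and decoded are hypothetical stand-ins, not csvkit functions, and the sketch assumes the header occupies line 1 and each row is one physical line.

```python
def rows_with_decode_errors(reader, errors):
    """Yield rows from `reader`; on a decode failure, record the line number and stop."""
    line_number = 1  # assumption: the header on line 1 was already consumed
    rows = iter(reader)
    while True:
        try:
            row = next(rows)
        except StopIteration:
            return
        except UnicodeDecodeError as exc:
            # The failure happened while fetching the row after the last good one.
            errors.append((line_number + 1, str(exc)))
            return
        line_number += 1
        yield row


def decoded(lines):
    """Hypothetical stand-in for a decoding CSV reader: one row per byte line."""
    for raw in lines:
        yield raw.decode("utf-8").rstrip("\n").split(",")


errors = []
data = [b"1,ok\n", b"2,bad\xa0\n", b"3,ok\n"]  # made-up data rows (lines 2 to 4)
good = list(rows_with_decode_errors(decoded(data), errors))
print(good)
print(errors)
```

Stopping at the first failure matches the patch, since the underlying reader cannot be resumed once the exception has propagated out of it.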