Issue created Sep 22, 2011 by Administrator (@root, Contributor)

csvclean should handle UnicodeDecodeErrors more gracefully

Created by: ghing

I was working with a CSV file that was prepared by a colleague. It started as an HTML table, was copied and pasted into Excel, manually manipulated, and finally saved as CSV. When I tried to load the file into a spatial database using ogr2ogr, I got an unhelpful error, so I tried to inspect the file with csvclean -n badfile.csv

However, when I ran the command, I just got an error message stating:

'utf8' codec can't decode byte 0xa0 in position 20: invalid start byte

Not very helpful.
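For context, a likely explanation (an assumption, since the file itself is not shown here): byte 0xa0 is the non-breaking space in Latin-1 and Windows-1252, which is the kind of character that tends to survive a copy from HTML into Excel. In UTF-8, 0xa0 is only valid as a continuation byte, which is why the codec reports an "invalid start byte". A minimal Python 3 sketch that reproduces the message with a made-up sample line:

```python
# Hypothetical sample: 20 ASCII bytes followed by a Latin-1/cp1252
# non-breaking space (0xa0). This is not the data from the report.
raw = b"id,name,description," + b"\xa0" + b"value\n"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # 'utf-8' codec can't decode byte 0xa0 in position 20: invalid start byte
    print(exc)

# Decoding with the encoding the file was probably saved in succeeds,
# and the offending byte turns out to be a non-breaking space.
text = raw.decode("cp1252")
print(repr(text[20]))  # '\xa0'
```

If that assumption holds, re-saving the file as UTF-8 or telling the tool the real encoding would avoid the error, but the message still gives no hint about where in the file the byte lives.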

The error comes from an uncaught exception raised in csvkit.unicsv.UTF8Recoder.next() by the call to self.reader.next() (https://github.com/onyxfish/csvkit/blob/master/csvkit/unicsv.py#L22).

It seems that csvclean could at least report the number of the line in the CSV file that contains the wonky characters, to allow for further inspection.
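As a rough illustration of what such a report could look like, here is a standalone Python 3 sketch that scans a file in binary mode and lists the physical lines that fail to decode. The function name find_undecodable_lines and the sample bytes are hypothetical and not part of csvkit:

```python
import io


def find_undecodable_lines(stream, encoding="utf-8"):
    """Yield (line_number, error) for each physical line that fails to decode.

    `stream` must be a binary file-like object. Line numbers are 1-based and
    count physical lines, so they can differ from CSV row numbers when a
    quoted field spans several lines.
    """
    for line_number, raw_line in enumerate(stream, start=1):
        try:
            raw_line.decode(encoding)
        except UnicodeDecodeError as exc:
            yield line_number, exc


# Hypothetical in-memory stand-in for badfile.csv.
sample = io.BytesIO(b"a,b\n1,ok\n2,bad\xa0cell\n3,ok\n")
bad_lines = [(n, exc.start) for n, exc in find_undecodable_lines(sample)]
print(bad_lines)  # [(3, 5)]
```

Each tuple holds the line number and the byte offset within that line, which is enough to open the file in an editor and look at the offending cell.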

Here's a patch that shows my hack to identify the messed-up line:

diff --git a/csvkit/cleanup.py b/csvkit/cleanup.py
index 86cdc00..1426c46 100644
--- a/csvkit/cleanup.py
+++ b/csvkit/cleanup.py
@@ -58,32 +58,38 @@ class RowChecker(object):

     def checked_rows(self):
         """A generator which yields OK rows which are ready to write to output."""
-        for row in self.reader:
-            self.input_rows += 1 
-            line_number = self.input_rows + 1 # add one for 1-based counting
+        try:
+            for row in self.reader:
+                self.input_rows += 1 
+                line_number = self.input_rows + 1 # add one for 1-based counting

-            try:
-                if len(row) != len(self.column_names):
-                    raise LengthMismatchError(line_number,row,len(self.column_names))
-                # any other tests?
-                yield row
-            except LengthMismatchError, e:
-                self.errs.append(e)
-                # see if we can actually clean up those length mismatches
-                joinable_row_errors = extract_joinable_row_errors(self.errs)
-                while joinable_row_errors:
-                    fixed_row = join_rows([err.row for err in joinable_row_errors], joiner=' ')
-                    if len(fixed_row) < len(self.column_names): break
-                    if len(fixed_row) == len(self.column_names):
-                        self.rows_joined += len(joinable_row_errors)
-                        self.joins += 1
-                        yield fixed_row
-                        for fixed in joinable_row_errors:
-                            self.errs.remove(fixed)
-                        break
-                    joinable_row_errors = joinable_row_errors[1:] # keep trying in case we're too long because of a straggler
+                try:
+                    if len(row) != len(self.column_names):
+                        raise LengthMismatchError(line_number,row,len(self.column_names))
+                    # any other tests?
+                    yield row
+                except LengthMismatchError, e:
+                    self.errs.append(e)
+                    # see if we can actually clean up those length mismatches
+                    joinable_row_errors = extract_joinable_row_errors(self.errs)
+                    while joinable_row_errors:
+                        fixed_row = join_rows([err.row for err in joinable_row_errors], joiner=' ')
+                        if len(fixed_row) < len(self.column_names): break
+                        if len(fixed_row) == len(self.column_names):
+                            self.rows_joined += len(joinable_row_errors)
+                            self.joins += 1
+                            yield fixed_row
+                            for fixed in joinable_row_errors:
+                                self.errs.remove(fixed)
+                            break
+                        joinable_row_errors = joinable_row_errors[1:] # keep trying in case we're too long because of a straggler

-            except CSVTestException, e:
-                self.errs.append(e)
+                except CSVTestException, e:
+                    self.errs.append(e)
+
+        except UnicodeDecodeError, e:
+            e.line_number = self.input_rows + 2 # header is line 1; the row that failed to decode was never counted
+            e.msg = str(e)
+            self.errs.append(e)
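
The idea in the patch is to wrap the whole loop, because the decode error is raised while the reader is being advanced rather than inside the loop body. The same pattern can be sketched in Python 3 with the standard csv module instead of csvkit.unicsv. The helper names decoded_lines and checked_rows below are illustrative only, and the sketch simply stops at the first bad line:

```python
import csv
import io


def decoded_lines(stream, encoding="utf-8"):
    """Decode a binary stream line by line; may raise UnicodeDecodeError."""
    for raw_line in stream:
        yield raw_line.decode(encoding)


def checked_rows(lines, errors):
    """Yield parsed rows; on a decode failure, record the line and stop.

    `errors` is a list that receives (line_number, message) tuples.
    """
    reader = csv.reader(lines)
    try:
        for row in reader:
            yield row
    except UnicodeDecodeError as exc:
        # reader.line_num counts the lines consumed successfully so far,
        # so the line that failed to decode is the next one.
        errors.append((reader.line_num + 1, str(exc)))


# Hypothetical in-memory stand-in for a file with one bad line.
sample = io.BytesIO(b"a,b\n1,ok\n2,bad\xa0cell\n3,ok\n")
errors = []
rows = list(checked_rows(decoded_lines(sample), errors))
print(rows)          # [['a', 'b'], ['1', 'ok']]
print(errors[0][0])  # 3
```

Because the failure happens while fetching the next row, the line to report is the one after the last row that was read successfully, not the last row itself.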