ETL fails on value errors in single data point #184

Open
WillEngler opened this issue Sep 22, 2015 · 1 comment

@WillEngler (Contributor)

ETL jobs fail on one bad data point. For example, here's the end of a stack trace from a failed upload of 311 garbage cart requests. The source data contains an out-of-range integer, which causes the whole process to fail.

PlenarioETLError: value "2150009386" is out of range for type integer CONTEXT: COPY src_311_service_requests_garbage_carts, line 49124, column number_of_black_carts_delivered: "2150009386"

Why does this happen? In _insert_src_data(), we run a Postgres COPY operation. COPY is all or nothing: when it fails, we catch the exception and roll back the transaction.
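Roughly, the failing pattern looks like this (a minimal sketch, not the actual Plenario code; only _insert_src_data and PlenarioETLError come from the codebase/traceback, everything else is assumed):

```python
# Minimal sketch of the all-or-nothing COPY pattern (assumed, not the real
# _insert_src_data). One bad value aborts the whole COPY, so every row in
# the file is lost when we roll back.
import psycopg2


class PlenarioETLError(Exception):
    """Stand-in for the project's ETL error (name taken from the traceback)."""


def _insert_src_data(conn, table_name, csv_path):
    with open(csv_path) as f, conn.cursor() as cur:
        try:
            cur.copy_expert(
                "COPY {} FROM STDIN WITH CSV HEADER".format(table_name), f
            )
            conn.commit()
        except psycopg2.DataError as e:
            conn.rollback()  # COPY is transactional: nothing was inserted
            raise PlenarioETLError(str(e))
```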

How could we fix it?

One idea: when we see this particular kind of exception, load the input CSV into memory, then sanitize it before the COPY. That seems error-prone and processor-intensive. Maybe there is a nice library out there for it; I looked into csvkit, but it doesn't look like it will solve this problem.
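A rough sketch of what that sanitizing pass could look like (the column handling is illustrative, not Plenario's actual schema; the integer bound mirrors the error above):

```python
# Hypothetical pre-COPY sanitizer: read the CSV, blank out values that won't
# fit a Postgres 'integer' column, and hand the cleaned stream to COPY.
# Blanked fields come back as NULL under COPY's CSV format.
import csv
import io

PG_INT_MAX = 2147483647  # upper bound of Postgres 'integer'


def sanitize_csv(infile, integer_columns):
    reader = csv.DictReader(infile)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in integer_columns:
            val = (row.get(col) or "").strip()
            if not val:
                continue
            try:
                in_range = abs(int(val)) <= PG_INT_MAX
            except ValueError:
                in_range = False
            if not in_range:
                row[col] = ""  # e.g. "2150009386" gets blanked -> NULL
        writer.writerow(row)
    out.seek(0)
    return out
```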

Other idea: skip the troublemaking rows and just load in everything you can. pgloader (http://pgloader.io/index.html) is a tool that does this, but installation is a pain.
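If pgloader's install stays a blocker, the same skip-the-bad-rows behavior could be approximated in plain psycopg2 by falling back to row-at-a-time inserts wrapped in savepoints (just a sketch of the idea, not what pgloader actually does; the function name and SQL are invented, and it assumes the CSV columns line up with the table's columns):

```python
# Sketch of a pgloader-style fallback (invented helper): insert row by row,
# wrapping each row in a savepoint so one bad value only skips that row
# instead of aborting the entire load.
import csv
import psycopg2


def insert_skipping_bad_rows(conn, table_name, csv_path):
    skipped = []
    with open(csv_path) as f, conn.cursor() as cur:
        reader = csv.reader(f)
        header = next(reader)
        placeholders = ", ".join(["%s"] * len(header))
        insert_sql = "INSERT INTO {} VALUES ({})".format(table_name, placeholders)
        for lineno, row in enumerate(reader, start=2):
            cur.execute("SAVEPOINT bad_row")
            try:
                cur.execute(insert_sql, row)
                cur.execute("RELEASE SAVEPOINT bad_row")
            except psycopg2.DataError:
                cur.execute("ROLLBACK TO SAVEPOINT bad_row")
                skipped.append(lineno)
    conn.commit()
    return skipped  # line numbers to report back to the user
```

This trades COPY's speed for resilience, so it probably only makes sense as a fallback after the fast path fails.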

@Pinkalicious

Depends on the type of data and the data source.

If it is point data, the pgloader mechanism is okay, assuming a bad point can be corrected independently of other points. It surely doesn't work for polygon data.

Currently, I vote for the pgloader mechanism, with some process in place to inform the user so they can clean up the bad rows and confirm the upload of the remaining rows.

Robust rules for fixing data might be data-source dependent. For some standard sources, rules can be developed, but in general the problem is hard. Maybe an IFTTT-style framework can be developed for data ingest that gives a user a mechanism to quickly write a script correcting the data.
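Something like a registry of per-dataset "fixer" scripts could be that mechanism (purely illustrative; every name below is invented, only the column name comes from the traceback above):

```python
# Hypothetical per-dataset correction hooks in the spirit of the
# "IFTTT-style" idea: the ETL looks up a fixer for the dataset and runs it
# over each row before COPY. All names here are invented.
FIXERS = {}


def fixer(dataset_name):
    """Register a row-correcting function for one dataset."""
    def register(func):
        FIXERS[dataset_name] = func
        return func
    return register


@fixer("311_service_requests_garbage_carts")
def drop_bogus_cart_counts(row):
    # A rule a user might dash off: no one receives two billion garbage carts.
    val = row.get("number_of_black_carts_delivered", "")
    if val.isdigit() and int(val) > 2147483647:
        row["number_of_black_carts_delivered"] = ""
    return row


def apply_fixer(dataset_name, rows):
    fix = FIXERS.get(dataset_name, lambda r: r)
    return (fix(r) for r in rows)
```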

