ETL fails on value errors in single data point #184

Open
WillEngler opened this issue Sep 22, 2015 · 1 comment

@WillEngler (Contributor)

ETL jobs fail on one bad data point. For example, here's the end of a stack trace from a failed upload of 311 garbage cart requests. The source data contains an out-of-range integer, which causes the whole process to fail.

PlenarioETLError: value "2150009386" is out of range for type integer CONTEXT: COPY src_311_service_requests_garbage_carts, line 49124, column number_of_black_carts_delivered: "2150009386"

Why does this happen? In _insert_src_data(), we run a Postgres COPY operation. COPY is all or nothing: when it fails, we catch the exception and roll back the transaction.
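Roughly, the failing pattern looks like this (a minimal sketch, not the actual Plenario code; only _insert_src_data and PlenarioETLError come from the codebase/traceback, everything else is assumed):

```python
# Minimal sketch of the all-or-nothing COPY pattern (assumed, not the real
# _insert_src_data). One bad value aborts the whole COPY, so every row in
# the file is lost when we roll back.
import psycopg2


class PlenarioETLError(Exception):
    """Stand-in for the project's ETL error (name taken from the traceback)."""


def _insert_src_data(conn, table_name, csv_path):
    with open(csv_path) as f, conn.cursor() as cur:
        try:
            cur.copy_expert(
                "COPY {} FROM STDIN WITH CSV HEADER".format(table_name), f
            )
            conn.commit()
        except psycopg2.DataError as e:
            conn.rollback()  # COPY is transactional: nothing was inserted
            raise PlenarioETLError(str(e))
```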

How could we fix it?

One idea: when we see this particular kind of exception, load the input CSV into memory, then sanitize it before the COPY. That seems error-prone and processor-intensive. Maybe there is a nice library out there for it; I looked into csvkit, but it doesn't look like it will solve this problem.
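A rough sketch of what that sanitizing pass could look like (the column handling is illustrative, not Plenario's actual schema; the integer bound mirrors the error above):

```python
# Hypothetical pre-COPY sanitizer: read the CSV, blank out values that won't
# fit a Postgres 'integer' column, and hand the cleaned stream to COPY.
# Blanked fields come back as NULL under COPY's CSV format.
import csv
import io

PG_INT_MAX = 2147483647  # upper bound of Postgres 'integer'


def sanitize_csv(infile, integer_columns):
    reader = csv.DictReader(infile)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in integer_columns:
            val = (row.get(col) or "").strip()
            if not val:
                continue
            try:
                in_range = abs(int(val)) <= PG_INT_MAX
            except ValueError:
                in_range = False
            if not in_range:
                row[col] = ""  # e.g. "2150009386" gets blanked -> NULL
        writer.writerow(row)
    out.seek(0)
    return out
```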

Other idea: skip the troublemaking rows and just load in everything you can. pgloader (http://pgloader.io/index.html) is a tool that does this, but installation is a pain.
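If pgloader's install stays a blocker, the same skip-the-bad-rows behavior could be approximated in plain psycopg2 by falling back to row-at-a-time inserts wrapped in savepoints (just a sketch of the idea, not what pgloader actually does; the function name and SQL are invented, and it assumes the CSV columns line up with the table's columns):

```python
# Sketch of a pgloader-style fallback (invented helper): insert row by row,
# wrapping each row in a savepoint so one bad value only skips that row
# instead of aborting the entire load.
import csv
import psycopg2


def insert_skipping_bad_rows(conn, table_name, csv_path):
    skipped = []
    with open(csv_path) as f, conn.cursor() as cur:
        reader = csv.reader(f)
        header = next(reader)
        placeholders = ", ".join(["%s"] * len(header))
        insert_sql = "INSERT INTO {} VALUES ({})".format(table_name, placeholders)
        for lineno, row in enumerate(reader, start=2):
            cur.execute("SAVEPOINT bad_row")
            try:
                cur.execute(insert_sql, row)
                cur.execute("RELEASE SAVEPOINT bad_row")
            except psycopg2.DataError:
                cur.execute("ROLLBACK TO SAVEPOINT bad_row")
                skipped.append(lineno)
    conn.commit()
    return skipped  # line numbers to report back to the user
```

This trades COPY's speed for resilience, so it probably only makes sense as a fallback after the fast path fails.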

@Pinkalicious

Depends on the type of data and the data source.

If it is point data, the pgloader mechanism is okay, assuming a bad point can be corrected independently of other points. It surely doesn't work for polygon data.

Currently, I vote for the pgloader mechanism, with some process in place to inform the user so they can clean up the bad rows and confirm the upload of the remaining rows.

Robust rules for fixing data might be data-source dependent. For some standard sources, rules can be developed, but in general the problem is hard. Maybe an IFTTT-style framework can be developed for data ingest that gives a user a mechanism to quickly write a script correcting the data.
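Something like a registry of per-dataset "fixer" scripts could be that mechanism (purely illustrative; every name below is invented, only the column name comes from the traceback above):

```python
# Hypothetical per-dataset correction hooks in the spirit of the
# "IFTTT-style" idea: the ETL looks up a fixer for the dataset and runs it
# over each row before COPY. All names here are invented.
FIXERS = {}


def fixer(dataset_name):
    """Register a row-correcting function for one dataset."""
    def register(func):
        FIXERS[dataset_name] = func
        return func
    return register


@fixer("311_service_requests_garbage_carts")
def drop_bogus_cart_counts(row):
    # A rule a user might dash off: no one receives two billion garbage carts.
    val = row.get("number_of_black_carts_delivered", "")
    if val.isdigit() and int(val) > 2147483647:
        row["number_of_black_carts_delivered"] = ""
    return row


def apply_fixer(dataset_name, rows):
    fix = FIXERS.get(dataset_name, lambda r: r)
    return (fix(r) for r in rows)
```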

