ECONNRESET when uploading a large file. #27

Open
drob opened this issue May 29, 2014 · 9 comments

@drob

drob commented May 29, 2014

I'm getting ECONNRESET errors when uploading a 350mb file with knox-mpu.

In particular:

{
    "part": 1,
    "message": {
        "code": "ECONNRESET",
        "errno": "ECONNRESET",
        "syscall": "read"
    }
}

The part specified is different each time but is always between 1 and 4. (I am using the default batchSize of 4.)

Is there any other info that would be helpful in debugging this?

@mikermcneil
Collaborator

@drob hey man, just a guess, but have you tried:

// Max # of milliseconds client sockets (i.e. for our purposes: **requests**) should be allowed to stay connected to this particular route.
// 0 = infinite
res.setTimeout(0);

(see pillarjs/multiparty#49 (comment) for details)

@luccastera

I'm getting these errors as well.

Could it be caused by S3 rate limiting? See http://blog.blitline.com/post/29157492002/things-to-know-about-s3 or Automattic/knox#199

@mikermcneil
Collaborator

OK- so what I posted before is really the solution for a different issue, involving aborted requests (although you'll want to consider it as well). As for the issue at hand, here's the best of my understanding atm:

ECONNRESET started showing up in Node 0.10; it was stifled before that. It seems to be improved in 0.11, but it will still sometimes fire. The situation should be greatly improved in Node v0.12, but that doesn't help us now.

Anyways, ECONNRESET originates when a TCP client receives an unexpected RST signal -or- potentially (not sure on this) even if it receives a FIN before an expected ACK from an earlier SYN. Furthermore this seems to be an unavoidable result of dealing with S3, at least for the moment. This very well may be b/c of what @Dambalah just pointed out:

Could it be caused by S3 rate limiting?

So nonetheless, the question becomes "how do we address it?" @sgress454 put together a workaround, for which we're going to send another PR to knox-mpu soon (hopefully by Monday at the latest). We saw promising results in a test of a 160MB file upload, and just need to take it out for a few more spins. Essentially, the reason knox is crashing on ECONNRESET is two-fold:

  1. The res stream here needs an .on('error', ...) handler.

  2. The existing .on('error', ...) handler for the knox client itself (here) needs a condition variable to make sure the callback to batch (or if you're using @dustMason's fork, async) is called only once.
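The two points above can be sketched roughly as follows. This is only an illustration, not the actual knox-mpu patch: `callOnce` and `uploadPart` are hypothetical names, and plain `EventEmitter` objects stand in for the knox request and response.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: wrap a Node-style callback so it can only fire once.
function callOnce(callback) {
  let called = false;
  return function (err, result) {
    if (called) return; // ignore late errors and duplicate completions
    called = true;
    callback(err, result);
  };
}

// Hypothetical stand-in for uploading one part.
function uploadPart(req, callback) {
  const done = callOnce(callback);
  req.on('response', (res) => {
    // Point 1: give the response stream its own 'error' handler.
    res.on('error', done);
    done(null, { statusCode: res.statusCode });
  });
  // Point 2: the request-level handler shares the same guarded callback.
  req.on('error', done);
}

// Simulation: the part succeeds, then a reset arrives on both emitters.
const req = new EventEmitter();
const res = Object.assign(new EventEmitter(), { statusCode: 200 });
const calls = [];
uploadPart(req, (err, result) => calls.push({ err, result }));

req.emit('response', res);
const reset = Object.assign(new Error('read ECONNRESET'), { code: 'ECONNRESET' });
req.emit('error', reset);
res.emit('error', reset);

console.log(calls.length); // 1
```

Because both handlers go through the same guarded function, whichever event arrives first decides the outcome and anything later is dropped.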

Btw, here's some additional background on ECONNRESET in case anyone smarter than me comes along and knows more about what's going on here :)

From http://stackoverflow.com/questions/17245881/node-js-econnreset:

"ECONNRESET" means the other side of the TCP conversation abruptly closed its end of the connection. This is most probably due to one or more application protocol errors. You could look at the API server logs to see if it complains about something.

@mikermcneil
Collaborator

  1. The res stream here needs an .on('error', ...) handler.

  2. The existing .on('error', ...) handler for the knox client itself (here) needs a condition variable to make sure the callback to batch (or if you're using @dustMason's fork, async) is called only once.

@nathanoehlman are you cool w/ merging fixes to those two things?

@dustMason
Contributor

@mikermcneil Thanks for looking deeply into this one! I think your 2 suggestions are spot on.

@sgress454
Collaborator

Update: it doesn't appear that adding the .on('error') handler for the response stream prevents the ECONNRESET errors from occurring. However, our workaround involving checking that the callback is only called once was successful in handling the issue.

sgress454 added a commit to sgress454/knox-mpu that referenced this issue Jul 7, 2014
…x-MPU are only called once.

In some cases, even after a part has been successfully uploaded, S3 will send a response error, which currently causes Knox-MPU to consider the part a failure and either retry it or bail completely.   This fix causes Knox-MPU to ignore errors that come in on the response stream after it has already received a "success" status code.
See nathanoehlman#27 (comment)
@drob
Author

drob commented Jul 25, 2014

Fwiw, adding a maxRetries setting to my uploads fixed this issue for me. (That option wasn't documented when I first started using knox-mpu.)

I'm not sure how to fix the underlying issue, though, or if there even is one. (If I'm uploading a 350mb file, it's reasonable for one of the chunks to fail at some point, right?)

Is there a philosophical reason a default maxRetries of 3, e.g., might not be preferable?
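For anyone wondering what a per-part retry limit amounts to, here is a minimal sketch. It is not knox-mpu's implementation: `withRetries` and `flakyUpload` are hypothetical names, and the upload is simulated so the example runs on its own.

```javascript
// Hypothetical retry wrapper: run `task`, retrying up to `maxRetries` times on error.
function withRetries(task, maxRetries, callback) {
  let retriesUsed = 0;
  function run() {
    task((err, result) => {
      if (!err) return callback(null, result);
      if (retriesUsed >= maxRetries) return callback(err); // out of retries
      retriesUsed += 1;
      run();
    });
  }
  run();
}

// Simulated part upload: fails twice with ECONNRESET, then succeeds.
let attempts = 0;
function flakyUpload(cb) {
  attempts += 1;
  if (attempts <= 2) {
    return cb(Object.assign(new Error('read ECONNRESET'), { code: 'ECONNRESET' }));
  }
  cb(null, { part: 1, attempts: attempts });
}

let outcome;
withRetries(flakyUpload, 3, (err, result) => {
  outcome = { err, result };
});

console.log(outcome.result.attempts); // 3
```

With a limit of 0, the first ECONNRESET fails the whole upload, which matches the behaviour described at the top of this issue.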

@sgress454
Collaborator

The underlying issue is that sometimes a chunk will upload successfully, but later send an ECONNRESET error anyway. The knox-mpu code handles this by declaring that the chunk was invalid and retrying it, or by failing altogether, when really the event should just be ignored.
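A rough sketch of "ignore the event once the chunk has succeeded", under stated assumptions: `trackPart` is a hypothetical name, `EventEmitter` objects stand in for the real request and response, and a 200 status with an ETag header is assumed to mean the part was accepted.

```javascript
const { EventEmitter } = require('events');

// Hypothetical: report a part's outcome, ignoring errors that arrive after success.
function trackPart(req, callback) {
  let succeeded = false;
  let finished = false;

  function finish(err, result) {
    if (finished) return;
    finished = true;
    callback(err, result);
  }

  function onError(err) {
    if (succeeded) return; // part was already accepted; treat the late error as noise
    finish(err);
  }

  req.on('error', onError);
  req.on('response', (res) => {
    res.on('error', onError);
    if (res.statusCode === 200) {
      succeeded = true;
      finish(null, { etag: res.headers.etag });
    } else {
      finish(new Error('Upload failed with status ' + res.statusCode));
    }
  });
}

function fakeReset() {
  return Object.assign(new Error('read ECONNRESET'), { code: 'ECONNRESET' });
}

const outcomes = [];

// Case A: success first, then a late ECONNRESET that should be ignored.
const reqA = new EventEmitter();
trackPart(reqA, (err, result) => {
  outcomes.push(['A', err ? err.code : null, result ? result.etag : null]);
});
const resA = Object.assign(new EventEmitter(), {
  statusCode: 200,
  headers: { etag: '"abc"' },
});
reqA.emit('response', resA);
reqA.emit('error', fakeReset());

// Case B: ECONNRESET before any response is a genuine failure.
const reqB = new EventEmitter();
trackPart(reqB, (err) => {
  outcomes.push(['B', err ? err.code : null, null]);
});
reqB.emit('error', fakeReset());

console.log(outcomes.length); // 2
```

Case A reports success only, and the later reset is dropped. Case B still reports ECONNRESET, so a part that really failed can be retried.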

@mikermcneil
Collaborator

To add to that, node <= 0.8 didn't even announce these sorts of TCP errors. It has to do with unexpected packets being received after sending the FIN, e.g. if an ACK is late but still arrives, or S3 tries to give us more data than we wanted and shoots over an extra SYN or whatever.

