Skip to content

Commit

Permalink
Improve logging of errors in copy (istio#1188)
Browse files Browse the repository at this point in the history
Fixes istio#1183.

Before, we logged basically any error during copying. It turns out this
confuses users a lot when they see "connection reset by peer" just
because the other end shutdown. There are a number of sequences of
events that can lead to non-graceful termination of copying, and
exposing those to users just brings confusion based on some feedback.

For instance, Postgresql always closes connections with a hard RST.
This, previously, meant every postgres connection was reported as an
"error" in Ztunnel - no good.

Looking at Envoy codebase, they do NOT report these as errors: once the
connection is started, it can close for any reason silently (*except at
debug level, of course).

We take the same approach here. Also, add more context on where we are
failing (shutdown, write, etc)
  • Loading branch information
howardjohn authored Jun 28, 2024
1 parent fdab2fb commit fa32afd
Show file tree
Hide file tree
Showing 3 changed files with 53 additions and 19 deletions.
67 changes: 49 additions & 18 deletions src/copy.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.

use crate::proxy;
use crate::proxy::ConnectionResult;
use crate::proxy::Error::{BackendDisconnected, ClientDisconnected, ReceiveError, SendError};
use bytes::{Buf, Bytes, BytesMut};
Expand Down Expand Up @@ -149,42 +150,72 @@ where
{
let (mut rd, mut wd) = downstream.split_into_buffered_reader();
let (mut ru, mut wu) = upstream.split_into_buffered_reader();

let downstream_to_upstream = async {
let res = copy_buf(&mut rd, &mut wu, stats, false).await;
let translate_error = |e: io::Error| {
SendError(Box::new(match e.kind() {
io::ErrorKind::NotConnected => BackendDisconnected,
io::ErrorKind::WriteZero => BackendDisconnected,
io::ErrorKind::UnexpectedEof => ClientDisconnected,
_ => e.into(),
}))
};
let res = ignore_io_errors(copy_buf(&mut rd, &mut wu, stats, false).await)
.map_err(translate_error);
trace!(?res, "send");
ignore_shutdown_errors(shutdown(&mut wu).await)?;
ignore_shutdown_errors(shutdown(&mut wu).await)
.map_err(translate_error)
.map_err(|e| proxy::Error::ShutdownError(Box::new(e)))?;
res
};

let upstream_to_downstream = async {
let res = copy_buf(&mut ru, &mut wd, stats, true).await;
let translate_error = |e: io::Error| {
ReceiveError(Box::new(match e.kind() {
io::ErrorKind::NotConnected => ClientDisconnected,
io::ErrorKind::WriteZero => ClientDisconnected,
_ => e.into(),
}))
};
let res = ignore_io_errors(copy_buf(&mut ru, &mut wd, stats, true).await)
.map_err(translate_error);
trace!(?res, "receive");
ignore_shutdown_errors(shutdown(&mut wd).await)?;
ignore_shutdown_errors(shutdown(&mut wd).await)
.map_err(translate_error)
.map_err(|e| proxy::Error::ShutdownError(Box::new(e)))?;
res
};

// join!() them rather than try_join!() so that we keep complete either end once one side is complete.
let (sent, received) = tokio::join!(downstream_to_upstream, upstream_to_downstream);

// Convert some error messages to easier to understand
let sent = sent
.map_err(|e| match e.kind() {
io::ErrorKind::NotConnected => BackendDisconnected,
io::ErrorKind::UnexpectedEof => ClientDisconnected,
_ => e.into(),
})
.map_err(|e| SendError(Box::new(e)))?;
let received = received
.map_err(|e| match e.kind() {
io::ErrorKind::NotConnected => ClientDisconnected,
_ => e.into(),
})
.map_err(|e| ReceiveError(Box::new(e)))?;
let sent = sent?;
let received = received?;
trace!(sent, received, "copy complete");
Ok(())
}

// During copying, we may encounter errors from either side closing their connection. Typically, we
// get a fully graceful shutdown with no errors on either end, but can if one end sends a RST directly,
// or if we have other non-graceful behavior, we may see errors. This is generally ok - a TCP connection
// can close at any time, really. Avoid reporting these as errors, as generally users expect errors to
// occur only when we cannot connect to the backend at all.
fn ignore_io_errors<T: Default>(res: Result<T, io::Error>) -> Result<T, io::Error> {
use io::ErrorKind::*;
match &res {
Err(e) => match e.kind() {
NotConnected | UnexpectedEof | ConnectionReset | BrokenPipe => {
trace!(err=%e, "io terminated ungracefully");
// Returning Default here is very hacky, but the data we are returning isn't critical so its no so bad to lose it.
// Changing this would require refactoring all the interfaces to always return the bytes written even on error.
Ok(Default::default())
}
_ => res,
},
_ => res,
}
}

// During shutdown, the other end may have already disconnected. That is fine, they shutdown for us.
// Ignore it.
fn ignore_shutdown_errors(res: Result<(), io::Error>) -> Result<(), io::Error> {
Expand Down
3 changes: 3 additions & 0 deletions src/proxy.rs
Original file line number Diff line number Diff line change
Expand Up @@ -299,6 +299,9 @@ pub enum Error {
#[error("io error: {0}")]
Io(#[from] io::Error),

#[error("while closing connection: {0}")]
ShutdownError(Box<Error>),

#[error("destination disconnected before all data was written")]
BackendDisconnected,
#[error("receive: {0}")]
Expand Down
2 changes: 1 addition & 1 deletion src/proxy/inbound_passthrough.rs
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ impl InboundPassthrough {
}
.in_current_span();

assertions::size_between_ref(1500, 2750, &serve_client);
assertions::size_between_ref(1500, 3000, &serve_client);
tokio::spawn(serve_client);
}
Err(e) => {
Expand Down

0 comments on commit fa32afd

Please sign in to comment.