Pausing/Resuming partitions causes lots of errors #585

Closed · Tracked by #589
stampy88 opened this issue Oct 6, 2023 · 3 comments · Fixed by #601
Labels: enhancement (New feature or request)

Comments

stampy88 commented Oct 6, 2023

Hi,

I have noticed a problem with pausing and resuming partitions that causes lots of errors. When I stopped doing the pause/resume, the errors went away. Here is a screenshot of the franz-go Prometheus metrics while the application was actively pausing and resuming partitions:

[screenshot: pause_errors]

Here is a screenshot after commenting out the pause/resume, keeping everything else the same:

[screenshot: stop_pause]

Here is a summary of errors from one pod during a 5-minute period:

[
  {
    "msg": "read Fetch v6",
    "err": "context canceled",
    "count": 827
  },
  {
    "msg": "unable to open connection to broker",
    "err": "context canceled",
    "count": 117
  },
  {
    "msg": "wrote Fetch v6",
    "err": "context canceled",
    "count": 4
  },
  {
    "msg": "wrote Fetch v6",
    "err": "use of closed network connection",
    "count": 18
  },
  {
    "msg": "wrote Fetch v6",
    "err": "write tcp 10.73.0.182:35770->xx.xx.147.1:9093: use of closed network connection",
    "count": 1
  },
  {
    "msg": "wrote Fetch v6",
    "err": "write tcp 10.73.1.0:45126->xx.xx.147.1:9093: use of closed network connection",
    "count": 1
  },
  {
    "msg": "wrote Fetch v6",
    "err": "write tcp 10.73.1.0:54114->xx.xx.147.1:9093: use of closed network connection",
    "count": 1
  }
]
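
For reference, a minimal sketch of the kind of poll/pause/resume loop described above; the backpressure check, topic name, and partition numbers are hypothetical, while `PollFetches`, `PauseFetchPartitions`, and `ResumeFetchPartitions` are the kgo calls involved:

```go
package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// consume polls in a loop and toggles pausing based on some external
// backpressure signal; toggling on nearly every cycle like this is what
// produced the error spam above.
func consume(ctx context.Context, cl *kgo.Client, backpressured func() bool) {
	paused := map[string][]int32{"events": {0, 1, 2}} // hypothetical topic/partitions
	for ctx.Err() == nil {
		fetches := cl.PollFetches(ctx)
		fetches.EachRecord(func(r *kgo.Record) {
			_ = r // hand records off to a bounded worker pool, etc.
		})

		if backpressured() {
			cl.PauseFetchPartitions(paused)
		} else {
			cl.ResumeFetchPartitions(paused)
		}
	}
}

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // hypothetical
		kgo.ConsumerGroup("example-group"), // hypothetical
		kgo.ConsumeTopics("events"), // hypothetical
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()
	consume(context.Background(), cl, func() bool { return false })
}
```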

twmb added the enhancement (New feature or request) label on Oct 10, 2023
twmb mentioned this issue on Oct 10, 2023
twmb (Owner) commented Oct 21, 2023

We discussed this in Discord. This occurs because the client takes a simplistic approach to ensuring partitions are not returned after they are paused (solving your original issue #489): to eliminate the chance that any in-flight fetch request could return data for paused partitions, the client cuts all active fetch requests at the end of every pause.

An alternative to what's done in c3b083b would be to let these fetches return and strip the data afterward. An easyish way to do this: whenever we receive a fetch response and partitions are paused while processing it, drop any data for the paused partitions. This may not be too difficult.

An advanced approach would be to go further and be intelligent about what is in flight -- so that you are not waiting on a fetch response whose data is entirely for paused partitions. That is too complicated.
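
A rough sketch of the "let fetches return and strip after" idea above, with made-up types rather than franz-go's internals:

```go
package pausesketch

// topicPartition identifies one partition of one topic.
type topicPartition struct {
	topic     string
	partition int32
}

// partitionData stands in for the decoded records of one partition from a
// fetch response.
type partitionData struct {
	tp      topicPartition
	records []any
}

// stripPaused drops response data for partitions that are paused at the time
// the response is processed, so in-flight fetches no longer need to be cut
// when a pause happens.
func stripPaused(resp []partitionData, paused map[topicPartition]bool) []partitionData {
	kept := resp[:0]
	for _, pd := range resp {
		if paused[pd.tp] {
			continue // arrived for a now-paused partition: silently drop it
		}
		kept = append(kept, pd)
	}
	return kept
}
```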

twmb added a commit that referenced this issue Oct 21, 2023
Issue #489 asked to stop returning data after a partition was paused --
the original implementation of pausing kept returning any data that was
in flight or already buffered, and simply stopped fetching new data.

#489 was dealt with by bumping the consumer session, which kills all
in flight fetch requests. This was easy, but can cause a lot of
connection churn if pausing and resuming a lot -- which is #585.

The new implementation allows fetches to complete, but strips data
from fetches based on what is paused at the moment the fetches are being
returned to the client. This does make polling paused fetches very
slightly slower (a map lookup per partition), but there's only so much
that's possible. If a partition is paused, we drop the data and do not
advance the internal offset. If a partition is not paused, we keep the
data and return it -- same as before.
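
Roughly, the new poll path looks like this sketch (illustrative names only, not the actual kgo code): one map lookup per buffered partition, and paused data is dropped without advancing the internal offset, so it is re-fetched after a resume.

```go
package pollsketch

type tp struct {
	topic     string
	partition int32
}

type buffered struct {
	records    []any // stand-in for decoded records
	nextOffset int64
}

type consumer struct {
	buffered map[tp]buffered
	offsets  map[tp]int64
}

// poll drains whatever is buffered. Paused partitions cost one map lookup and
// are dropped without touching offsets, so resuming re-reads the same data;
// unpaused partitions are returned and advance their offset, same as before.
func (c *consumer) poll(paused map[tp]bool) [][]any {
	var out [][]any
	for p, b := range c.buffered {
		delete(c.buffered, p)
		if paused[p] {
			continue
		}
		c.offsets[p] = b.nextOffset
		out = append(out, b.records)
	}
	return out
}
```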
twmb (Owner) commented Oct 21, 2023

I think #601 solves this

stampy88 (Author) commented

I think what you have done makes sense. That is a better way to handle it. Thanks for turning it around quickly.

twmb closed this as completed in #601 on Oct 22, 2023
twmb added a commit that referenced this issue Oct 31, 2023
Follow up from #585, there was a bug in the commit for it. If any topic
was paused, then all non-paused topics would be returned once, but they
would not be marked as fetchable after that.

I _think_ the non-fetchability would eventually be cleared on a metadata
update, _but_ the source would re-fetch from the old position again. The
only way the topic would advance would be if no topics were paused after
the metadata update.

However this is a bit confusing, and overall this patch is required.

This also patches a second bug in PollFetches with pausing: if a topic
has a paused partition but the fetch response does NOT contain any
paused partitions, the logic would actually strip the entire topic.

The pause tests have been strengthened a good bit -- all lines but one
are hit, and the one line that is not hit could more easily be hit if
more partitions are added to the topic / a cluster of size one is used.
The line is currently not hit because it requires one paused partition
and one unpaused partition to be returned from the same broker at the
same time.

Lastly, this adds an error reason explaining why list or epoch is
reloading, which was used briefly while investigating test slowness.
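
A sketch of the corrected per-partition filtering (again with illustrative types, not kgo's own): only paused partitions are stripped, a topic with a mix of paused and unpaused partitions keeps its unpaused data, and a topic whose response contains no paused partitions is left untouched rather than dropped wholesale.

```go
package filtersketch

// fetchTopic stands in for one topic's worth of data in a fetch response.
type fetchTopic struct {
	topic      string
	partitions map[int32][]any // partition -> stand-in records
}

// filterTopic removes only the paused partitions of a topic. The second bug
// described above effectively dropped ft entirely whenever the topic had any
// paused partition, even if none of them appeared in this response.
func filterTopic(ft fetchTopic, paused map[string]map[int32]bool) fetchTopic {
	pausedParts := paused[ft.topic]
	if len(pausedParts) == 0 {
		return ft // nothing paused for this topic: keep everything
	}
	for p := range ft.partitions {
		if pausedParts[p] {
			delete(ft.partitions, p)
		}
	}
	return ft
}
```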
peczenyj pushed a commit to peczenyj/franz-go that referenced this issue Dec 21, 2023

kgo: no-op mark functions when not using AutoCommitMarks

Closes twmb#598.

kgo: pin AddPartitionsToTxn to v3 when using one transaction

KIP-890 has been updated such that v3 must be used by clients. We will
pin to v3 unless multiple transactions are being added, or unless any
transaction is verify only.

Closes twmb#609.

GHA: try redpandadata/redpanda

Latest stable

kgo: be sure to use topics when other topics are paused

sticky: further improvements

* Introduces separate functions for go 1.21+, allowing us to eliminate
  unremovable allocs from sort.Sort (see the sketch after this list). To keep
  it simple, this simplifies <=1.20 a little bit, so that path is **slightly**
  less efficient.
* Improves new-partition assignment further -- ensure we always place
  unassigned partitions on the least consuming member.
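
A hedged sketch of the go1.21 split mentioned in the first bullet; the types here are made up, and only the slices.SortFunc vs sort.Sort trade-off is the point. A sibling file guarded by `//go:build !go1.21` would keep the old sort-based path.

```go
//go:build go1.21

package sticky

import (
	"cmp"
	"slices"
)

// memberLoad is an illustrative stand-in for "member plus how many
// partitions it currently owns".
type memberLoad struct {
	member     string
	partitions int
}

// sortByLoad uses the generic slices.SortFunc, which avoids boxing the slice
// into a sort.Interface value -- the unremovable allocation that sort.Sort
// forces on every call.
func sortByLoad(ms []memberLoad) {
	slices.SortFunc(ms, func(a, b memberLoad) int {
		return cmp.Compare(a.partitions, b.partitions)
	})
}
```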

CHANGELOG: update for v1.15.2

chore: fix typo

define a public interface instead of using *logrus.Logger

add example

fix lint issue with exhaustive

add new api

improve tests, format code

Update klogrus.go

Improve existing documentation

Update klogrus.go

Fix typos

Update klogrus.go

remove period

kgo source: use the proper topic-to-id map when forgetting topics

Adding topics to a session needs to use the fetch request's topic2id
map (which then promotes IDs into the session t2id map).

Importantly, and previously this was wrong / not the case: removing
topics from a session needs to use the session's t2id map. The topic
does not exist in the request's topic2id map, because well, it's being
forgotten. It's not in the fetch request.

Adds some massive comments explaining the situation.

Closes twmb#620.

consuming: reset to nearest if we receive OOOR while fetching

If we receive OOOR while fetching after a fetch was previously
successful, something odd happened in the broker. Either what we were
consuming was truncated underfoot, which is normal and expected (for
very slow consumers), or data loss occurred without a leadership
transfer. We will reset to the nearest offset after our prior consumed
offset (by time!) because, well, that's what's most valid: we
previously had a valid offset, and now it is invalid.

Closes twmb#621.

use bytes buffer instead of ReadAll

CHANGELOG: note incoming v1.15.3

pkg/sr: improve base URL and resource path joining

* use `url.JoinPath()` to join the base URL with the path, cleaning any ./ or ../ elements (see the example after this list)
* also move hardDelete to a request context query parameter
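
For illustration, the standard library call referenced above (Go 1.19+); the registry base URL and subject name here are invented:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	base := "https://registry.example.com/apis/ccompat/v6/"
	// JoinPath cleans ./ and ../ elements and collapses duplicate slashes.
	joined, err := url.JoinPath(base, "subjects", "my-subject", "versions")
	if err != nil {
		panic(err)
	}
	fmt.Println(joined)
	// Output: https://registry.example.com/apis/ccompat/v6/subjects/my-subject/versions
}
```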

kfake: add SleepControl

This function allows you to sleep a function you are controlling until
your wakeup function returns. The control function effectively yields to
other requests. Note that since requests must be handled in order, you
need to be a bit careful to not block other requests (unless you
intentionally do that).

This basically does what it says on the tin. The behavior of everything
else is unchanged -- you can KeepControl, you can return false to say it
wasn't handled, etc.

The logic and control flow is a good bit ugly, but it works and is
fairly documented and "well contained".

In working on this, I also found and fixed a bug that resulted in
correlation errors when handling join&sync.

kgo group tests still work against kfake's "hidden" main.go, and I have
tested SleepControl with/without KeepControl, and with/without returning
handled=true.

build(deps): bump golang.org/x/crypto in /pkg/sasl/kerberos

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.14.0 to 0.17.0.
- [Commits](golang/crypto@v0.14.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>

build(deps): bump golang.org/x/crypto from 0.13.0 to 0.17.0 in /pkg/kadm

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.13.0 to 0.17.0.
- [Commits](golang/crypto@v0.13.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

build(deps): bump golang.org/x/crypto in /examples/bench

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.11.0 to 0.17.0.
- [Commits](golang/crypto@v0.11.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>

bump all deps, except klauspost/compress

bumping klauspost/compress requires go1.19. We'll do this bump with
v1.16.

fix go.*

kfake: add DropControl, SleepOutOfOrder, CoordinatorFor, RehashCoordinators

* Sleeping was a bit limited because if two requests came in on the same
  connection, you could not really chain logic. Sleeping out of order
allows you to at least run some logic to gate how requests behave with
each other. It's not the most obvious, so it is not the default.

* Adds SleepOutOfOrder

* Adds CoordinatorFor so you can see which "broker" a coordinator
  request will go to

* Adds RehashCoordinators to change where requests are hashed to

The latter two allow you to loop rehashing until a coordinator for your
key changes, if you want to force NotCoordinator requests.

kgo: do not cancel FindCoordinator if the parent context cancels

Some load testing in Redpanda showed a failure where consuming quit
unexpectedly and unrecoverably.

The sequence of events is:
* if OffsetCommit is issued just before Heartbeat
* and the group needs to be loaded so FindCoordinator is triggered,
* and OffsetCommit happens again, canceling the prior commit's context
Then,
* FindCoordinator would cancel
* Heartbeat, which is waiting on the same load, would fail with
  context.Canceled
* This error is seen as a group leave error
* The group management logic would quit entirely.

Now, the context used for FindCoordinator is the client context, which
is only closed on client close. This is also better anyway -- if two
requests are waiting for the same coordinator load, we don't want the
first request canceling to error the second request. If all requests
cancel and we have a stray FindCoordinator in flight, that's ok too,
because well, worst case we'll just eventually have a little bit of
extra data cached that is likely needed in the future anyway.

Closes redpanda-data/redpanda#15131
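
The underlying pattern, sketched generically (this is not kgo's code; singleflight is just a stand-in for "many callers waiting on one shared load"): run the shared load under a client-scoped context and let each caller give up individually.

```go
package coordsketch

import (
	"context"

	"golang.org/x/sync/singleflight"
)

type client struct {
	ctx   context.Context // closed only when the client itself closes
	loads singleflight.Group
}

// coordinator returns the coordinator for key. The shared load runs under the
// client context, so one caller canceling its own request cannot fail every
// other waiter with context.Canceled.
func (c *client) coordinator(callerCtx context.Context, key string) (string, error) {
	ch := c.loads.DoChan(key, func() (any, error) {
		return findCoordinator(c.ctx, key) // deliberately NOT callerCtx
	})
	select {
	case res := <-ch:
		if res.Err != nil {
			return "", res.Err
		}
		return res.Val.(string), nil
	case <-callerCtx.Done():
		// Only this caller gives up; the load keeps going and its result is
		// cached for the next caller, which likely needs it anyway.
		return "", callerCtx.Err()
	}
}

// findCoordinator stands in for the real FindCoordinator request.
func findCoordinator(ctx context.Context, key string) (string, error) {
	return "broker-0:9092", nil
}
```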

CHANGELOG: document incoming v1.15.4