Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: retry health check #594

Closed
sebest opened this issue Jun 19, 2015 · 6 comments
Closed

nsqd: retry health check #594

sebest opened this issue Jun 19, 2015 · 6 comments

Comments

@sebest
Copy link

sebest commented Jun 19, 2015

Hello,

The disk got full and /ping was reporting it even after the disk usage went down as we resume consuming the channel.

We add to restart the process, otherwise /ping kept reporting NOK

it is with version 0.3.2, so maybe it is fixed now?

thanx

@mreiferson
Copy link
Member

@sebest is it possible that you hadn't published any additional messages to that channel during that time?

The health is related to the ability to successfully write to disk. For better or worse, if you never again write to disk it won't currently clear that unhealthy state.

We could force an occasional retry?

@mreiferson mreiferson changed the title disk full and /ping nsqd: unhealthy state after channel resumes Jun 19, 2015
@sebest
Copy link
Author

sebest commented Jun 20, 2015

Got it, as i only write when the status is ok, and if the status is only updated when i write in it, i ended up in a dead lock.
So i guess the only way would be nsq to update its status on a regular basis.

@mreiferson
Copy link
Member

👍

@mreiferson mreiferson changed the title nsqd: unhealthy state after channel resumes nsqd: retry health check Oct 3, 2015
judwhite added a commit to judwhite/nsq that referenced this issue Oct 13, 2015
- channel and topic put reset ctx.nsqd.SetHealth
- only set if !IsHealthy; atomic load and flag check should be faster than mutex
- nsqd_test.go: change `exp` to `nexp` in `nequal`
- relates to nsqio#594
judwhite added a commit to judwhite/nsq that referenced this issue Oct 13, 2015
- channel and topic put reset ctx.nsqd.SetHealth
- only set if !IsHealthy; atomic load and flag check should be faster than mutex
- nsqd_test.go: change `exp` to `nexp` in `nequal`
- relates to nsqio#594
judwhite added a commit to judwhite/nsq that referenced this issue Oct 13, 2015
- channel and topic put reset ctx.nsqd.SetHealth
- only set if !IsHealthy; atomic load and flag check should be faster than mutex
- nsqd_test.go: change `exp` to `nexp` in `nequal`
- relates to nsqio#594
judwhite added a commit to judwhite/nsq that referenced this issue Oct 13, 2015
- channel and topic put reset ctx.nsqd.SetHealth
- only set if !IsHealthy; atomic load and flag check should be faster than mutex
- nsqd_test.go: change `exp` to `nexp` in `nequal`
- relates to nsqio#594
judwhite added a commit to judwhite/nsq that referenced this issue Oct 13, 2015
- channel and topic put reset ctx.nsqd.SetHealth
- change nsqd SetHealth/GetError to use atomic.Value; skip allocation in
  SetHealth if attempting to set an already healthy queue to healthy
- nsqd_test.go: change `exp` to `nexp` in `nequal` output
- relates to nsqio#594
@mreiferson
Copy link
Member

Now that #671 landed, I'm tempted to close this.

From a client's perspective, it seems like a more reasonable strategy to implement some sort of backoff and retry strategy for a failed PUB rather than trying to do something racey like waiting for healthiness after failure and then trying to PUB again (i.e. don't check then do something, just try to do it).

What do you think @jehiah (cc @judwhite)?

@ploxiln
Copy link
Member

ploxiln commented Apr 15, 2016

I think the client should have to try to PUB and detect failure of that, rather than try to detect health first, because as you say it's race-y to do the latter.

If #625 lands, then you could add a "wait_sync" option to PUB to not return success until the topic's WAL has synced to disk (and nsqd could still batch up a few seconds worth of PUBs before syncing and responding).

Of course, nsqd hasn't bothered with this option thus far because in a typical setup a publisher always PUBs to nsqd running on localhost, and nsqd is pretty darn reliable, and if the server running the publisher and nsqd has some problem then there's not much hope for the publisher anyway. I see now that "filesystem temporarily full" or "filesystem or volume is remote and temporarily unavailable" are actually reasonable cases where the publisher could wait/retry for nsqd to be able to accept the PUB.

@mreiferson
Copy link
Member

agreed, I'm sold, closing

dolfly pushed a commit to dolfly/nsq that referenced this issue Jul 27, 2017
- channel and topic put reset ctx.nsqd.SetHealth
- change nsqd SetHealth/GetError to use atomic.Value; skip allocation in
  SetHealth if attempting to set an already healthy queue to healthy
- nsqd_test.go: change `exp` to `nexp` in `nequal` output
- relates to nsqio#594
dolfly added a commit to dolfly/nsq that referenced this issue Jul 27, 2017
* nsqadmin: fix counter removal issues

* nsqadmin: fix memory depth is always 0

* nsqadmin: send version with UserAgent

* nsqadmin: eslint fixes

* internal/quantile: fix JSON decoding

* nsqadmin: round times to 2 decimal points

* nsqadmin: fix column count on /topic

* nsqadmin: handle/display AJAX errors

* nsqadmin: bin dater

* docs: update ChangeLog

* internal/clusterinfo: fix dedupe regression in GetLookupdTopicProducers

* build: bump go-nsq dependency to v1.0.5 stable

* nsqadmin: fix deferred count

* nsqadmin: compile static assets

* nsqadmin: fix nsqd backwards compatibility when querying nsqd directly

* nsqadmin: expand notify context

* gofmt

* nsqadmin: update header

* Fix broken merge from nsqio#654

* nsqadmin: bump gulp/gulp-sass deps

* nsqadmin: parse ?t=Xh timeframes

* nsqadmin: bin dater

* nsqadmin: Update statsd counters prefix from 'stats_counts' to 'stats.counters'

* internal/clusterinfo: Expose detailed connection errors

* nsqadmin: fix backwards compatible /info parsing

* docs: add Code of Conduct

* bump v0.3.6 stable

* Makefile: BINARIES and EXAMPLES are gone, install depends on APPS

* Makefile: no need to write apps/ dir everywhere

also, we can line up variables nicely

* nsqd: optimize NewGUID() division for "milliseconds"

64-bit division operations appear to be slow on some
ARM systems, so replace division by 1000000 with a
bitshift equivalent to division by 1048576.

adapt guid twepoch comment for revised guid timestamp scheme

* dist.sh: cleanup temp install dir

* dist.sh: little cleanups

* nsqd: bound reads from nsqlookupd peers to prevent OOM

* nsqd: flock --data-path for unix-like platforms

* dist: build freebsd binaries

* nsqadmin: added conf flags for formatting statsd counters and gauges

* docs: README tweak (wsup travis)

* nsqd topic/channel: reset health on successful backend write

- channel and topic put reset ctx.nsqd.SetHealth
- change nsqd SetHealth/GetError to use atomic.Value; skip allocation in
  SetHealth if attempting to set an already healthy queue to healthy
- nsqd_test.go: change `exp` to `nexp` in `nequal` output
- relates to nsqio#594

* nsqd: simplify health / loading checks

* nsqd: add msg header size to diskqueue MaxMsgSize

- purpose: diskqueue was enforcing a maxmsgsize of the whole message
  including header, while protocol enforces maxmsgsize on the body. this
  could lead to successful writes to the memory queue but failed writes
  to the diskqueue, intermittently if the body size was within the
  26 bytes range.
- channel / topic tests assert configuration is passed correctly to
  diskqueue; it doesn't test PutMessage directly because enforcing msg
  size is the responsibility of protocol.
- add maxMsgSize to diskqueue writeOne error message.

* nsqd / nsqlookupd protocol IOLoop updates

- nsqd: update IOLoop so FatalClientErr's are returned. current
  implementation shadows err, returned err was result of
  client.Reader.ReadSlice.
- nsqlookupd: update IOLoop err shadowing in similar manner, rename
  SendResponse err to sendErr to be closer to nsqd implementation. in this
  case FatalClientErr's were effectively ignored since they would fail the
  type assertion.
- add nequal to nsqlookupd_test
- http.go / nsqd.go: resolve some err shadowing which had no impact.

* docs: update README

* add common test funcs to internal/test

* nsqd: moved runlock in GetStats

* nsqd: don't need to RLock channel on every requeue

doRequeue is always called inside exitLock now

* nsqd: use queue-specific lock for inflight/deferred

* nsqd: tighten up stats related locking even more

just as we don't need to lock the whole instance for the lifetime
of the stats gathering, we don't need to lock the topic for the
lifetime of the channel stats gathering

* nsqd: separate client locks into write and meta locks

* bench: add requirements.txt; update bootstrap/options

* nsqd: fix stats data race

* bump 0.3.7-alpha

* should allow POST method, since the newer go tool pprof need it.

* 1.should use rlock while aggregate
2.try rlock first to reduce write lock

* nsq_to_*: update hasArg function to use flag.Visit

* enable block profile and change block rate from http endpoint

* nsqd/nsqlookupd: support running as windows service

* dist: add windows

* nsqd: fix flakey TestHTTPEmptyChannelPOST

* bump v0.3.7 stable

* docs: update changelog

* nsqadmin: fix connected durations >1h

* update npm dependencies & go-bindata

* Update README URLs based on HTTP redirects

* Makefile: consolidate apps sources deps

* Makefile: install: add .exe if building for windows, consolidate recipe

* dist.sh: build and install in one make command

so that install can see what GOOS was used

* nsqadmin: add dependency on gulp-task-listing

seems to need it to gulp build

* nsqadmin: make "rate" column work without --proxy-graphite

Requests to graphite /render endpoint for "rate" column would always
be proxied. Further, when not proxied, they need to use "jsonp" to
avoid CORS problems.

* nsqd: diskqueue syncs when only reads have occurred

* godeps: bump go-options to latest

* nsqd: use defaults from Options struct and go-options fixes

* nsqd: update test certs

* DecodeMessage: skip bytes Buffer, reducing alloc overhead

this is the nsqd equivalent of a very similar change in go-nsq:
nsqio/go-nsq#175

* nsqd: remove channel messagePump

performance improvement:

    SUB: 2014/08/17 10:28:16 duration: 3.875343474s - 49.218mb/s - 258041.644ops/s - 3.875us/op

* dist.sh: strip release builds of debug info

results in 30% smaller binaries

* (re)-handle SIGTERM

* nsqd/nsqlookupd: explicitly set exit signals

* bump v0.3.8-rc1

* godeps: bump go-nsq to latest

* bump v0.3.8-rc2

* docs: update changelog

* bump v0.3.8 stable

* docker: Adds support for alpine as base image

Alpine is equally as small as busybox (~5 MB) but has more features
including the ability to install other packages.

* docs: add wiredcraft and deis to README

* travis: bump golang versions/gpm

* nsqadmin: add /config API

Provides the ability to modify the configuration nsqadmin for setting
`nsqlookupd_http_addresses` to support dynamic configuration.

Closes nsqio#769

* nsqd: deflake TestDiskQueueSyncAfterRead

* nsqd: configurable HTTP client timeouts

Adds configuration options HTTPClientConnectTimeout and
HTTPClientRequestTimeout to control the connection and request timeout
repectively of the HTTP client.

Also added to the following app binaries:

- nsqadmin
- nsq_stat
- nsq_to_file
- nsq_to_http

Closes nsqio#715
Closes nsqio#680

* coverage.sh: adds Code Coverage generation

Adds script for generating code coverage statistics for Go packages.

Borrowed from https://github.com/hashicorp/vault

* docs: fix coveralls badge on README

* nsqlookupd: Adds test cases

Adds test cases to improve code coverage to protect against potential
compatibility conflicts moving forward.

Existing test cases refactored to leverage common test helpers within
the internal/test package to avoid code duplication and simplify
the testing effort.

* nsqd: adds delayed retry to TestDiskQueueSyncAfterRead

Adds a delay retry when attempting to opening metadata file as part
of TestDiskQueueSyncAfterRead to handle timing issue between the
ioloop and possible disk IO delays.

* nsqadmin: improve test coverage

Adds test cases to improve code coverage to protect against potential
compatibility conflicts moving forward.

Existing test cases refactored to leverage common test helpers within
the internal/test package to avoid code duplication and simplify
the testing effort.

* nsqd: improve test coverage

Adds test cases to improve code coverage to protect against potential
compatibility conflicts moving forward.

Existing test cases refactored to leverage common test helpers within
the internal/test package to avoid code duplication and simplify
the testing effort.

* nsqlookupd: Add conn.Close() in IOLoop

* docs: add sproutsocial to README

* apps/to_nsq: add --rate option for throttling message rate

* fix nsqadmin root ca verification

* add test to ensure nsqadmin connects to nsqd with verified tls

* nsq_to_file: require --topic or --topic-pattern

fixes nsqio#789

* travis: update build matrix

* nsqd: fix diskqueue benchmark test, maxMsgSize too small

* nsqadmin: Fix handling of IPv6 broadcast addresses

Fixes nsqio#815

* nsqd: deflake TestReconfigure

* nsqd: switch to golang/snappy

* nsqd: more intuitive channel message accounting

* nsqd: deflake TestDiskQueueSyncAfterRead

* nsqd: log when we can't get channels for pre-creation

* deps: bump go-options

* dist: scrub file ownership for tarball

* nsqd: bound MPUB messages to derived limit

* *: remove deprecated features

* *: remove go-simplejson

* internal/stringy: remove unused functions

* nsqadmin: remove --use-statsd-prefixes

* nsqadmin: go bin dater

* nsq_to_*: go fmt

* nsqlookupd: pass defaults from opts into flags

* release: bump verison 1.0.0-alpha

* nsqd: log errors if no nsqlookupd broadcast address

* Godeps: update rev of go-options (remove extra []float64 log)

* Godeps: update rev of go-nsq (switch to golang/snappy)

* nsqadmin: allow ctr/meta+click to open a new tab

* nsqadmin: generate bindata.go

* nsqd: replace missing requeue exit check; cleanup doRequeue

* nsqd: add benchmark for multiple topics

* nsqd: per-topic message IDs

* nsqd: read/write concurrently in MPUB benchmarks

* new --log-prefix flag / option for nsqd, nsqlookupd, nsqadmin

instantiate Logger in $APP.New() instead of $APP.NewOptions(),
only if not already present in opts

Now, setting Logger to nil in Options is not sufficient
to disable logging. A no-op logger must be created and assigned.

* nsqd: no longer need atomicRename() for windows

os.Rename() now does the same thing on Windows
that atomicRename() did (since go 1.5)

* nsqd: rename option --worker-id to --node-id

* nsqd: new metadata filename without ID

symlink old metadata filename to new
when loading, if both exist, ensure they match
this makes rollback possible without losing messages
(when rolling back forward, some manual intervention is required)

on windows, Symlink() needs Administrator privs,
so just write plain old metadata file

includes tests

* tests: ioutil.TempDir() adds a pseudo-random suffix

no need to add our own pseudo-random "unique" suffix

* nsqd: move diskqueue into nsqio/go-diskqueue

* nsqd: GUID benchmark update

* nsqadmin: add --allow-config-from-cidr flag/option

* bump v1.0.0-compat stable

* test: update travis build matrix

* apps/to_nsq: fix --rate divide by zero bug

* nsqd: clamp requeue timeout to range instead of dropping connection

* nsqd: remove back-compat for flag duration ints

--statsd-interval and --msg-timeout are the only two duration flags
that have special functionality to be backwards-compatible with
plain int arguments (meaning seconds or milliseconds). We can
clean this up before nsq-1.0

* nsqd: expose memory stats under /stats

* nsqd: refresh test TLS certs

* test more go versions on Travis CI

* test.sh: avoid redundant build of nsqadmin, use bash dir glob

bash glob trick to match only folders

nsqadmin moved to apps/ and is thus included in the glob already

* nsqd: retry sub to ephemeral topic/channel which is Exiting

quick hacky fix, instead of "proper" locking

* test: ignore failure to push to coveralls

avoid coveralls service failure making travis tests fail

* replace verbose with log-level to reduce log sizes

* nsqd: a few misc logging line cleanups

* nsq*: clean up, regularize logging Options

minor fix for --log-level flag help text

must set opts before logging errors in New()

re-order log stuff in Options structs
treat logLevel the same in nsqd, nsqlookupd, nsqadmin

Fix tests: multiple nsqlookupds need their own Options structs,
now that nsqlookupd has logLevel in its Options.
Otherwise the race detector complains when one of the nsqlookupds
writes the derived int logLevel while another reads it for logging.

* new internal/lg/ logging package for nsqd nsqlookupd nsqadmin

introduce LogLevel type
NilLogger for disabling logging
consolidate LogLevel tests into internal/lg/
consolidate TestNoLogger

* tests: more delay in TestClientMsgTimeout

* use app.logf() for internal packages for nsqd nsqlookupd nsqadmin

Instead of setting a Logger for github.com/nsqio/nsq/internal packages,
pass a logf() function, so it is called with and honors a LogLevel.
 * internal/clusterinfo/
 * internal/http_api/
 * internal/protocol/

nsqd lookupPeer also needed to be converted

Get rid of interal.app.Logger type, but internal/test/ needs
its own Logger definition to avoid circular import with
internal/lg/ tests.

* Fix segment.com logo URL in README.md

Instead segmentio_logo.png it should be segment_logo.png
See: https://github.com/nsqio/nsqio.github.io/blob/master/static/img/segment_logo.png

* fix vet error in clusterinfo/data.go

* nsqd: fix nil pointer when memstats enabled

* nsqadmin: X-Forwarded-User based ACL

* nsqadmin: update bindata

* nsqadmin: Update bindata for admin-user ACL changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants