cache unit test results [NET-4965] #18333

nfi-hashicorp · 2023-07-31T20:20:33Z

I was curious and started looking into setup-go's new on-by-default cache flag, and it turns out it probably doesn't work very well unfortunately. TLDR: it not only caches GOMODCACHE, but also GOCACHE; it isn't specific enough about input hashing and ends up caching a bunch of junk that actually slows the build down.

As part of repro, I experimented with this repo and ended up getting about halfway to a working implementation of GOCACHE including test caching, so I did a bunch more experimenting, tidied it up, and here we are.

There are some kind of big changes here to the CI strategy, so I'll justify them with data from my experimental PR.

Test result and build object caching

The main course. In Go, test results and build objects are stored in the same cache.

I made some attempts to keep them separate (e.g. go test -c to pre-compile tests), but to no avail. It would have been nice, since the build cache is 100s of MBs, and the test cache is 10s.

Pros

A run from an empty cache takes ~ 9 minutes, which is already better than main, which takes ~ 11 min. A second run takes ~ xxx min.

Cons

Cache blobs are scoped to branches (GHA policy). The way I've designed this, we will have a cache blob for each (Go version, GOARCH, module) tuple (at least), which at current count is 8. These caches are 100s of MBs, so I can imagine them getting out of hand. There is a 10GB repo-level cache size limit, but it apparently isn't enforced since we are currently sitting at about 50GB.

Cache blobs mostly only grow, until they are reaped by GHA's recency-based garbage collector. In this implementation, I do no GC inside the cache blob. Go itself does trim cache objects, but it is based on mtime and not disk usage. There's no way to set a maximum disk usage limit.

Cache blobs are relatively fast to restore/save. IME it's on the order of 15s for a typical 500 MB blob, either way.

For some reason, cache blobs are not overwriteable. So I worked around this by deleting and then pushing the amended blob. If you're doing something non-linear like rerunning an old commit while the current commit is running, last write wins, and there's no locking AFAICT, so you might end up writing a stale cache.

No shared dev build

I deleted the dev-build step. Now the tests just each build it. dev-build builds a consul binary and shares it as an artifact to all the tests. It's a well-intentioned idea, but unfortunately the binary is quite big 100s of MB) and GHA artifact upload/download is for some mysterious reason about 10x slower than cache save/restore.

Also, after push, the tests all need to wait for dev-build and can do no useful work until then.

I'd be willing to bet the consul binary shares some cache objects with the unit tests. So on subsequent builds, it should be 0s to build it.

No test splitting

Mostly I removed this because caching would be hard. We need per-runner caches because there's no way to do concurrent writes to the same cache (safely). So with test splitting by package into N buckets, we'd have to do some sort of consistent hashing/sharding/etc that is just not the sort of thing I want to do in Bash.

But also, I found that by fixing flakiness and adding test/build caching, module-level splitting (like we have in this PR) is plenty fast.

Explicitly pass the Go version

I declare the Go versions we use (N and N-1) at the top and explicitly feed them to every unit test. Caches are sensitive to any change in Go version, so we must keep this controlled. Previously, we would just specify 1.20.x to setup-go and it may or may not download the latest.

No `gotestsum` retries

A test results cache is basically like retries. IMO gotestsum retries are a hack. IMO if we consistently see a test flaking, we should disable it and fix it later. (I made a PR here. Probably bears further discussion.)

Additionally, I've discovered that the go test -run flag is not specially handled; go test -run '.*' . is a different cache key than go test ., even though they should be equivalent. gotestsum's retry mechanism uses -run, so we can't use them together effectively. When the first run (without -run) fails, none of its PASSes are cached. (This is maybe a bug with Go, maybe with gotestsum, IDK.)

Sidequest: `freeport` retries

I added a new function to freeport: RetryMustGetN, which is like GetN but retries. This was a big source of flakiness.

Sidequest: fix test results artifacts

Our invocations of reusable-unit{,split}.yml actually saved their test results over top of one another. The artifact namespace is shared for all jobs in the workflow (even called workflows).

PR Checklist

updated test coverage
external facing docs updated
appropriate backport labels added
not a security concern
changelog: I only touched CI and test code; nothing user-facing

huikang · 2023-08-02T17:11:57Z

Also, after push, the tests all need to wait for dev-build and can do no useful work until then.

Could you point to one or two examples in some tests to support this statement?

nfi-hashicorp · 2023-08-02T17:24:59Z

Also, after push, the tests all need to wait for dev-build and can do no useful work until then.

Could you point to one or two examples in some tests to support this statement?

All of the tests depend on the dev-build job in the workflow, so they can't start until it's finished. For example, latest run on main: https://github.com/hashicorp/consul/actions/runs/5739962589, you can see that it blocks:

In dev-build, it spends 25s uploading the artifact

Then later in each test job, a few seconds downloading it. Compare this to the build time, which is 33s. Not much savings to be had.

nfi-hashicorp · 2023-08-02T17:36:36Z

Noticing some cache misses due to testcache: github.com/hashicorp/consul/tlsutil: input file /home/runner/work/consul/consul/tlsutil/cert.pem: file used as input is too new; looking into it.

/tlsutil/cert.pem cache miss

nfi-hashicorp · 2023-08-02T21:02:52Z

There is something uncacheable happening with oss_root_amd64

Compare to any_root_386

investigate oss_root_amd64 caching

any_root_386 uses -short, so I guess it must be a long test

nfi-hashicorp · 2023-08-02T22:35:00Z

It may also be worth saving the GOMODCACHE. Even with cached binaries, I suspect Go still needs to download all the source. We should have gotestsum and our build of consul in cache, so all this time (almost a minute) must be downloading source?

Update: it saved about 20s between this and this re-run. The cost is a 612MB.

nfi-hashicorp · 2023-08-03T01:03:04Z

Hmm, I've discovered a surprising behavior. go test doesn't really cache at the test function level, it tests at the invocation level. -run is keyed literally, so regexes that select the same tests actually cache separately! Even equivalent regexes, like ^ and .*.

For example

cachetest % cat cache_test.go 
package test

import "testing"

func TestA(t *testing.T) {

}

func TestB(t *testing.T) {

}

func TestC(t *testing.T) {

}
cachetest % GODEBUG=gocachetest=1 GOCACHE=/tmp/foo go test -v .
testcache: github.com/hashicorp/consul/cachetest: test ID 4d57304767637a616a435375486550573334757a => fc9086925f29b4fb6f5b160239f09241e426a5c299b35e9698a96b74d6241439
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/fc/fc9086925f29b4fb6f5b160239f09241e426a5c299b35e9698a96b74d6241439-a: no such file or directory
testcache: github.com/hashicorp/consul/cachetest: test ID 3859515357797133445f754c39746b5844336c41 => cdf05723beb1e5f9d1474de78ff6355918f05f603dfef5814d7111caacac25d2
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/cd/cdf05723beb1e5f9d1474de78ff6355918f05f603dfef5814d7111caacac25d2-a: no such file or directory
=== RUN   TestA
--- PASS: TestA (0.00s)
=== RUN   TestB
--- PASS: TestB (0.00s)
=== RUN   TestC
--- PASS: TestC (0.00s)
PASS
ok      github.com/hashicorp/consul/cachetest   0.460s
testcache: github.com/hashicorp/consul/cachetest: save test ID fc9086925f29b4fb6f5b160239f09241e426a5c299b35e9698a96b74d6241439 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => 647d17e204bbd809d03a83e20e15b017fb6bf1babbb5148a035a23d1d6ce3452
testcache: github.com/hashicorp/consul/cachetest: save test ID cdf05723beb1e5f9d1474de78ff6355918f05f603dfef5814d7111caacac25d2 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => fd57189da362f7c3e41d812d175b51b81169f6c342e8dd33bfe43cfd74103757
cachetest % GODEBUG=gocachetest=1 GOCACHE=/tmp/foo go test -run '.*' -v .  
testcache: github.com/hashicorp/consul/cachetest: test ID 4d57304767637a616a435375486550573334757a => 40584c921d8b9af89bbed4c85a7b4c548252a4d7e6c17556b9f3ad7eca3b7442
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/40/40584c921d8b9af89bbed4c85a7b4c548252a4d7e6c17556b9f3ad7eca3b7442-a: no such file or directory
testcache: github.com/hashicorp/consul/cachetest: test ID 3859515357797133445f754c39746b5844336c41 => b6743758461c64234ec892693221c32ad05fc7f67192fc9dc3c04bb5c7f79e11
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/b6/b6743758461c64234ec892693221c32ad05fc7f67192fc9dc3c04bb5c7f79e11-a: no such file or directory
=== RUN   TestA
--- PASS: TestA (0.00s)
=== RUN   TestB
--- PASS: TestB (0.00s)
=== RUN   TestC
--- PASS: TestC (0.00s)
PASS
ok      github.com/hashicorp/consul/cachetest   0.155s
testcache: github.com/hashicorp/consul/cachetest: save test ID 40584c921d8b9af89bbed4c85a7b4c548252a4d7e6c17556b9f3ad7eca3b7442 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => 6bf89d65b0147026d24412adeb0b8f58b9e5f9b087ea2e559879c37a30b2c3c7
testcache: github.com/hashicorp/consul/cachetest: save test ID b6743758461c64234ec892693221c32ad05fc7f67192fc9dc3c04bb5c7f79e11 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => bcd5bc9bb4c714711a24e3e2df77c0d4185e17400abea766b5ecbf0ade44307e
cachetest % GODEBUG=gocachetest=1 GOCACHE=/tmp/foo go test -run 'Test.*' -v .
testcache: github.com/hashicorp/consul/cachetest: test ID 4d57304767637a616a435375486550573334757a => fbfd9fa1d1152f5c478e6bb0274ba944e1fa4c354452e78191e1ba2bf999b435
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/fb/fbfd9fa1d1152f5c478e6bb0274ba944e1fa4c354452e78191e1ba2bf999b435-a: no such file or directory
testcache: github.com/hashicorp/consul/cachetest: test ID 3859515357797133445f754c39746b5844336c41 => 8dd5185abd7e636035d8cb504056a731097128ddf3156efa7a98e6e6f0cfc7fe
testcache: github.com/hashicorp/consul/cachetest: input list not found: cache entry not found: open /tmp/foo/8d/8dd5185abd7e636035d8cb504056a731097128ddf3156efa7a98e6e6f0cfc7fe-a: no such file or directory
=== RUN   TestA
--- PASS: TestA (0.00s)
=== RUN   TestB
--- PASS: TestB (0.00s)
=== RUN   TestC
--- PASS: TestC (0.00s)
PASS
ok      github.com/hashicorp/consul/cachetest   0.119s
testcache: github.com/hashicorp/consul/cachetest: save test ID fbfd9fa1d1152f5c478e6bb0274ba944e1fa4c354452e78191e1ba2bf999b435 => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => 17e04a10db95742bbfaa9cfd7d34c412b0fd4ec7dde26260e9ca0bf82da42578
testcache: github.com/hashicorp/consul/cachetest: save test ID 8dd5185abd7e636035d8cb504056a731097128ddf3156efa7a98e6e6f0cfc7fe => input ID c2df4457862b4f5136498dde4586e3e5fce4229bd3f1228a9383e019288d6839 => cff587dad94536fcdc754b5b942dce86881323c92f004275515c5c83a16eeebd

To be fair, I think it says this on the tin, just surprising

Update: this totally precludes us from combining the Go test cache with gotestsum reruns, which use -run to rerun specific test funcs. Will update the caveats section.

nfi-hashicorp · 2023-08-04T03:20:55Z

I rebased onto this branch which moves retries inside the function in question. However, I think it was probably too blunt, because those flaky tests still flaked. We may need to capture FailNow and friends somehow (recover()?).

nfi-hashicorp · 2023-08-08T19:33:22Z

In light of #18365 (comment), I've gone back to just skipping the flaky tests and basing on main. Given that I don't have a great solution to deal with the flaky tests (short of y'know, actually fixing them), I'm going to put this back in draft

github-actions · 2023-10-08T01:29:24Z

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

…esults

github-actions · 2024-01-18T01:29:53Z

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

github-actions · 2024-02-29T01:25:07Z

Closing due to inactivity. If you feel this was a mistake or you wish to re-open at any time in the future, please leave a comment and it will be re-surfaced for the maintainers to review.

github-actions bot added the theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies label Jul 31, 2023

nfi-hashicorp changed the title ~~add freeport.RetryMustGetN and use it~~ cache unit test results [NET-4965] Jul 31, 2023

nfi-hashicorp added backport/1.15 This release series is no longer active on CE. Use backport/ent/1.15. backport/1.16 This release series is no longer active on CE. Use backport/ent/1.16. pr/no-changelog PR does not need a corresponding .changelog entry labels Jul 31, 2023

nfi-hashicorp force-pushed the nfi-cache-unit-test-results branch 7 times, most recently from 3dcb67b to 3285fcc Compare August 2, 2023 21:01

nfi-hashicorp force-pushed the nfi-cache-unit-test-results branch from 5ca2187 to 4f3aea1 Compare August 3, 2023 17:55

nfi-hashicorp changed the base branch from main to disable-flaky August 4, 2023 03:12

nfi-hashicorp added 9 commits August 8, 2023 12:31

skip tests that are >= 10% flaky on main

baefb11

test caching in go-tests.yml; no split

3ac8f96

no go mod download

010fd00

no LDFLAGS for us

2595397

fix jsonfile

4285599

calculate testcache hit rate

77d5c4e

fix tlsutil to actually use its tempdir

699d05b

fine, argparse

cc23d45

restore/save GOMODCACHE

959b054

nfi-hashicorp added 5 commits August 8, 2023 12:31

print cachekey_GOCACHE in summary

b404d29

include GOTESTFLAGS in cache key

c52a09e

pipefail

0989365

note for disable gotestsum reruns

723d815

fixup! pipefail

a3d8b22

nfi-hashicorp force-pushed the nfi-cache-unit-test-results branch from c656dbf to a3d8b22 Compare August 8, 2023 19:31

nfi-hashicorp changed the base branch from disable-flaky to main August 8, 2023 19:34

bump CI

cc63065

github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Oct 8, 2023

bump

5e5fd05

vercel bot deployed to Preview – consul November 2, 2023 23:37 View deployment

Merge remote-tracking branch 'origin/main' into nfi-cache-unit-test-r…

c479af2

…esults

nfi-hashicorp force-pushed the nfi-cache-unit-test-results branch from 8025b79 to c479af2 Compare November 2, 2023 23:41

nfi-hashicorp added 3 commits November 2, 2023 16:44

TEMP: rm lint, etc

3604c84

always hit rate

d813a68

fresh cache, bump

f9f872d

github-actions bot removed the meta/stale Automatically flagged for inactivity by stalebot label Nov 18, 2023

github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Jan 18, 2024

github-actions bot closed this Feb 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cache unit test results [NET-4965] #18333

cache unit test results [NET-4965] #18333

nfi-hashicorp commented Jul 31, 2023 •

edited

Loading

huikang commented Aug 2, 2023

nfi-hashicorp commented Aug 2, 2023

nfi-hashicorp commented Aug 2, 2023 •

edited

Loading

nfi-hashicorp commented Aug 2, 2023 •

edited

Loading

nfi-hashicorp commented Aug 2, 2023 •

edited

Loading

nfi-hashicorp commented Aug 3, 2023 •

edited

Loading

nfi-hashicorp commented Aug 4, 2023

nfi-hashicorp commented Aug 8, 2023

github-actions bot commented Oct 8, 2023

github-actions bot commented Jan 18, 2024

github-actions bot commented Feb 29, 2024

cache unit test results [NET-4965] #18333

cache unit test results [NET-4965] #18333

Conversation

nfi-hashicorp commented Jul 31, 2023 • edited Loading

Test result and build object caching

Pros

Cons

No shared dev build

No test splitting

Explicitly pass the Go version

No gotestsum retries

Sidequest: freeport retries

Sidequest: fix test results artifacts

PR Checklist

huikang commented Aug 2, 2023

nfi-hashicorp commented Aug 2, 2023

nfi-hashicorp commented Aug 2, 2023 • edited Loading

nfi-hashicorp commented Aug 2, 2023 • edited Loading

nfi-hashicorp commented Aug 2, 2023 • edited Loading

nfi-hashicorp commented Aug 3, 2023 • edited Loading

nfi-hashicorp commented Aug 4, 2023

nfi-hashicorp commented Aug 8, 2023

github-actions bot commented Oct 8, 2023

github-actions bot commented Jan 18, 2024

github-actions bot commented Feb 29, 2024

nfi-hashicorp commented Jul 31, 2023 •

edited

Loading

No `gotestsum` retries

Sidequest: `freeport` retries

nfi-hashicorp commented Aug 2, 2023 •

edited

Loading

nfi-hashicorp commented Aug 2, 2023 •

edited

Loading

nfi-hashicorp commented Aug 2, 2023 •

edited

Loading

nfi-hashicorp commented Aug 3, 2023 •

edited

Loading