Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query containing "uniq" causes lake service crash #5259

Closed
philrz opened this issue Sep 4, 2024 · 3 comments · Fixed by #5278
Closed

Query containing "uniq" causes lake service crash #5259

philrz opened this issue Sep 4, 2024 · 3 comments · Fixed by #5278
Labels
bug Something isn't working

Comments

@philrz
Copy link
Contributor

philrz commented Sep 4, 2024

tl;dr

After loading the zed-sample-data, this Zed program triggers a crash:

from "Zeek" | _path=="conn" | typeof(id) | uniq

Details

Repro is with Zed commit f8fb140. This problem was found by @jameskerr when using Zui.

To repro, first start a Zed lake service.

$ zed -version
Version: v1.17.0-47-gf8fb1406

$ zed -lake lake serve
{"level":"info","ts":1725472428.4225721,"logger":"core","msg":"Started","auth_enabled":false,"root":"file:///Users/phil/lake","version":"v1.17.0-47-gf8fb1406"}

Load the zed-sample-data, execute the query, and observe the stack trace. Sometimes it doesn't trigger on the first try, so I find looping the query saves time.

$ zed create -use Zeek
pool created: Zeek 2lcHiyLz4XwxJHkQ4Yshjq2M288
Switched to branch "main" on pool "Zeek"

$ zed load *.zng.gz
(35/35) 51.49MB/51.49MB 0B/s 100.00%
2lcHm1G7gkaoj8kUtLQuO6cacp3 committed

$ while true; do zed query 'from "Zeek" | _path=="conn" | typeof(id) | uniq'; done
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<<ERR bad type ID in type value: type ID too large for primitive: 99>>
panic: runtime error: index out of range [403] with length 230
goroutine 449 [running]:
runtime/debug.Stack()
	/usr/local/opt/[email protected]/libexec/src/runtime/debug/stack.go:24 +0x5e
github.com/brimdata/zed/runtime/sam/op.(*Catcher).Pull.func1()
	/Users/phil/work/zed/runtime/sam/op/catcher.go:25 +0x3d
panic({0x27840a0?, 0xc013a6a000?})
	/usr/local/opt/[email protected]/libexec/src/runtime/panic.go:914 +0x21f
github.com/brimdata/zed.(*Arena).offsetAndLength(...)
	/Users/phil/work/zed/arena.go:216
github.com/brimdata/zed.(*Arena).bytes_(0x0?, 0xc0018deed0?)
	/Users/phil/work/zed/arena.go:195 +0x40d
github.com/brimdata/zed.Value.Bytes({0x100000c00604c240, 0x2000001c00000193})
	/Users/phil/work/zed/value.go:232 +0x20a
github.com/brimdata/zed/runtime/sam/op/uniq.(*Op).appendUniq(0xc002852d00, {0x0?, 0x0, 0x0}, {0xc00008ae70?, 0x1010f25?})
	/Users/phil/work/zed/runtime/sam/op/uniq/uniq.go:53 +0x85
github.com/brimdata/zed/runtime/sam/op/uniq.(*Op).Pull(0xc002852d00, 0x40?)
	/Users/phil/work/zed/runtime/sam/op/uniq/uniq.go:82 +0x265
github.com/brimdata/zed/runtime/sam/op.(*Single).Pull(0xc0028bc5d0, 0xa0)
	/Users/phil/work/zed/runtime/sam/op/mux.go:120 +0x33
github.com/brimdata/zed/runtime/sam/op.(*Catcher).Pull(0x1?, 0x60?)
	/Users/phil/work/zed/runtime/sam/op/catcher.go:28 +0x62
github.com/brimdata/zed/runtime/exec.(*Query).Pull(0xc0001d9e40?, 0x0?)
	/Users/phil/work/zed/runtime/exec/query.go:59 +0x3e
github.com/brimdata/zed/service.handleQuery.func2()
	/Users/phil/work/zed/service/handlers.go:92 +0x42
created by github.com/brimdata/zed/service.handleQuery in goroutine 525
	/Users/phil/work/zed/service/handlers.go:90 +0x753

For whatever reason, it seems to crash more easily when the equivalent load/query is done within Zui similar to how @jameskerr originally found it. The crash also can cause the Zed lake service to exit in that case, making this a pretty bad bug.

@philrz philrz added the bug Something isn't working label Sep 4, 2024
@philrz
Copy link
Contributor Author

philrz commented Sep 11, 2024

A user reported in a community Slack thread what may have been another sighting of this issue when they were working in Zui Insiders. In their own words:

Lol I accidentally did two cut | uniq | fuse in a row and that seems to stop the service

I attempted to repro using the small sample.zng test data by entering the query cut | uniq | fuse | cut | uniq | fuse but was unsuccessful. However, recalling this bug, I just confirmed in 1.17.1-insiders.22 that if I loaded the zed-sample-data, I could repro a crash with just uniq. So the fact the user's pipeline used uniq leads me to guess he might be hitting the same problem.

Repro.mp4

FWIW, when I ran this variation against a local lake service at commit 7d8c575, I did get a different stack dump:

panic: runtime error: slice bounds out of range [:508169] with capacity 0

goroutine 613 [running]:
github.com/brimdata/zed.(*Arena).bytes_(0x1033229?, 0xc0021b6d20?)
	/Users/phil/work/zed/arena.go:191 +0x3dc
github.com/brimdata/zed.Value.Bytes({0x100000c007d50120, 0x400000270000118b})
	/Users/phil/work/zed/value.go:232 +0x20a
github.com/brimdata/zed/runtime/sam/expr.(*DotExpr).Eval(0xc00078bd40, {0x2b50a58?, 0xc00913ece0?}, {0x4e142e0?, 0xc0021b6d58?})
	/Users/phil/work/zed/runtime/sam/expr/dot.go:50 +0x21c
github.com/brimdata/zed/runtime/sam/expr.(*filter).Eval(0xc017a0fe18, {0x2b50a58?, 0xc00913ece0?}, {0xc0021b6d70?, 0x101e796?})
	/Users/phil/work/zed/runtime/sam/expr/filter.go:240 +0x31
github.com/brimdata/zed/runtime/sam/expr.EvalBool(0x16000?, {0x2b50a58, 0xc00913ece0}, {0x0?, 0xc0021b6e00?}, {0x2b4b1e0?, 0xc017a0fe18?})
	/Users/phil/work/zed/runtime/sam/expr/eval.go:71 +0x42
github.com/brimdata/zed/runtime/sam/expr.(*filterApplier).Eval(0xc017a0fe30, {0x2b50a58, 0xc00913ece0}, {0x0?, 0x1?})
	/Users/phil/work/zed/runtime/sam/expr/filter.go:260 +0x3e
github.com/brimdata/zed/runtime/sam/op.(*applier).Pull(0xc00078bd80, 0x88?)
	/Users/phil/work/zed/runtime/sam/op/apply.go:38 +0x28f
github.com/brimdata/zed/runtime/sam/op/groupby.(*Op).run(0xc00072be00)
	/Users/phil/work/zed/runtime/sam/op/groupby/groupby.go:211 +0xa6
created by github.com/brimdata/zed/runtime/sam/op/groupby.(*Op).Pull.func1 in goroutine 612
	/Users/phil/work/zed/runtime/sam/op/groupby/groupby.go:179 +0x66

@nwt
Copy link
Member

nwt commented Sep 16, 2024

Fixed in #5278.

@nwt nwt closed this as completed Sep 16, 2024
@philrz
Copy link
Contributor Author

philrz commented Sep 16, 2024

Verified in Zed commit 4a1ebf5.

Repeating the repro steps, the looping query that previously triggered the crash now runs as long as I leave it without triggering a crash.

$ zed -version
Version: v1.17.0-64-g4a1ebf5b

$ zed create -use Zeek
pool created: Zeek 2mAcf5r7RFPERXW6ZTvGY0eeZcQ
Switched to branch "main" on pool "Zeek"

$ zed load *.zng.gz
(35/35) 51.49MB/51.49MB 0B/s 100.00%
2mAcgEjjcGGt5fDmLUpzLGHAb9E committed

$ while true; do zed query 'from "Zeek" | _path=="conn" | typeof(id) | uniq'; done
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
<{orig_h:ip,orig_p:port=uint16,resp_h:ip,resp_p:port}>
[...ran 100+ times before I stopped it...]

I also confirmed that Zui Insiders 1.17.1-insiders.26 which has the benefit of the same Zed fixes can also run the plain uniq repeatedly on the sample data instead of crashing as I showed in the previous video.

Verify.mp4

Thanks @nwt!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants