-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upgrade sqlparser 0.47 -> 0.48 #11453
Conversation
This reverts commit 7be2263.
@alamb (I don't know who else to ask for help) I have no idea how to fix the tests. The issue seems to stem from sqllogictests + backtrace feature. minimal command: Adding the Does SQLParser 0.48 have some massive struct on the stack that needs to be boxed and isn't cleaned in debug mode? Why does it pass if backtrace feature isn't enabled or when sqllogictests are skipped? |
Thanks @MohamedAbdeen21 I don't know why the I think you are likely right that something in sqlparser has grown the stack space slightly so now we are hitting stack overflows. Maybe if we can find what exactly is on the stack when it overflows we can refactor datafusion (e.g. break a single large function into multiple smaller ones) to work around the issue |
I've been running some experiments for the past few hours. Here are some observations:
It doesn't look like it's a single struct/enum filling the stack, but multiple small structs/enums being copied, which kinda explains why the release build passes, but not the debug build. Maybe we've been slightly increasing the stack usage with every change and this PR caused it to finally overflow. I think we should temporarily increase the stack size to something > 2 MB (I'm thinking 2.5 MB) for now and try to find the problematic test in a follow-up, just to keep things moving. To replicate, change #[tokio::main]
#[cfg(not(target_family = "windows"))]
pub async fn main() -> Result<()> {
run_tests().await
} to #[cfg(not(target_family = "windows"))]
fn main() -> Result<()> {
tokio::runtime::Builder::new_multi_thread()
.thread_stack_size(2 * 1024 * 1024) // 2 MB
// .thread_stack_size(2 * 1024 * 1024 - 128 * 1024) // 1.9 MB
// .thread_stack_size(2 * 1024 * 1024 + 128 * 1024) // 2.1 MB
.enable_all()
.build()
.unwrap()
.block_on(run_tests())
} and run the tests Also, why do we spawn a thread in the windows main function that spawns tokio threads? Is there a reason why we don't spawn the tokio threads directly? #[cfg(target_family = "windows")]
pub fn main() {
// Tests from `tpch/tpch.slt` fail with stackoverflow with the default stack size.
thread::Builder::new()
.stack_size(2 * 1024 * 1024) // 2 MB
.spawn(move || {
tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap()
.block_on(async { run_tests().await })
.unwrap()
})
.unwrap()
.join()
.unwrap();
} |
I think Rust debug builds keep space on the stack for all local variables to make debugging easier, but that also makes for much larger stack frames in debug build
Yes, I think this is a likely explanation |
I think that would be ok
I don't know |
This reverts commit 4fc036a.
@alamb @MohamedAbdeen21 sorry for coming up late. |
Hi @comphead, the original issue was a tokio stack overflow when running tests with backtrace feature enabled. That only happens in debug builds and when sqllogictests are not skipped. The temporary fix I pushed is simply increasing tokio stack size from 2 MBs to 2.5 MBs. You can check my last comment for some observations. If possible, we want to avoid throwing memory at our problems and instead find the test that fills its stack and break it down. |
We try not to hardcode the stack size in the code, if possible please do like I see we removed the stack size recently from the yml, probably we need more investigation before this can be removed |
I can't see how that benefits us. Setting an env var just to read it once in a test, which is meant to be a (hopefully) temporary fix, seems ... unnecessary. Plus, other devs WILL run into this issue in their local envs whenever they use the full test command (with the backtrace feature), so hiding the variable in workflow env doesn't seem like a reasonable move. |
I ran the program under lldb and here is the stack trace on my M3 mac
|
Here is a smaller reproducer: cargo test --features=backtrace --test sqllogictests -- array.slt |
Shows the issue appears to be in the handling of this query select array_ndims(array_repeat(array_repeat(array_repeat(1, 3), 2), 1)), array_ndims([[[[[[[[[[[[[[[[[[[[[1]]]]]]]]]]]]]]]]]]]]]);
|
🤔 the comments suggest this is not a new problem: datafusion/datafusion/sql/src/expr/value.rs Lines 134 to 135 in 48a1754
|
What I suggest we do is:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @MohamedAbdeen21 -- I think this PR's code looks good to me.
The stack overflow is somewhat concerning. We should probably figure out how to handle that better (as a follow on PR).
I will wait to hear @comphead 's thoughts before merging this PR
Thanks @alamb for your analysis, I think this PR is good to go, but we need to think on recursive calls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @MohamedAbdeen21
I plan to file the follow on ticket / description and then merge this later today |
Filed #11499 to track looking into the stack overflow. Thanks everyone |
* upgrade sqlparser 0.47 -> 0.48 * clean imports and qualified imports * update df-cli cargo lock * fix trailing commas in slt tests * update slt tests results * restore rowsort in slt tests * fix slt tests * rerun CI * reset unchanged slt files * Revert "clean imports and qualified imports" This reverts commit 7be2263. * update non-windows systems stack size * update windows stack size * remove windows-only unused import * use same test main for all systems * Reapply "clean imports and qualified imports" This reverts commit 4fc036a.
* upgrade sqlparser 0.47 -> 0.48 * clean imports and qualified imports * update df-cli cargo lock * fix trailing commas in slt tests * update slt tests results * restore rowsort in slt tests * fix slt tests * rerun CI * reset unchanged slt files * Revert "clean imports and qualified imports" This reverts commit 7be2263. * update non-windows systems stack size * update windows stack size * remove windows-only unused import * use same test main for all systems * Reapply "clean imports and qualified imports" This reverts commit 4fc036a.
* upgrade sqlparser 0.47 -> 0.48 * clean imports and qualified imports * update df-cli cargo lock * fix trailing commas in slt tests * update slt tests results * restore rowsort in slt tests * fix slt tests * rerun CI * reset unchanged slt files * Revert "clean imports and qualified imports" This reverts commit 7be2263. * update non-windows systems stack size * update windows stack size * remove windows-only unused import * use same test main for all systems * Reapply "clean imports and qualified imports" This reverts commit 4fc036a.
Which issue does this PR close?
Closes #9949
Closes #11377
Rationale for this change
Recent version of sqlparser addresses some issues with DF
What changes are included in this PR?
Upgrade sqlparser version + cleaning a few imports.
Are these changes tested?
yes, with existing tests
Are there any user-facing changes?
slightly better errors.
BREAKING trailing commas in create statements were enabled by default for all dialects, that was a bug mistaken for a feature. Trailing commas in create statements will now raise an error, unless using a dialect that supports trailing commas (currently only duckdb does that)