Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: validate that either a ndjson or tsv file was given as input #569

Merged
merged 1 commit into from
Sep 12, 2024

Conversation

Taepper
Copy link
Collaborator

@Taepper Taepper commented Sep 12, 2024

resolves #564

Summary

We did not check that either a ndjson or tsv file is actually specified as the input.

The resulting duckdb error is not parsable for users. Instead we now return a proper error message, also hinting at the possible CLI arguments that can be given instead of amending the preprocessing_config

PR Checklist

  • All necessary documentation has been adapted or there is an issue to do so.
  • The implemented feature is covered by an appropriate test.

Copy link
Contributor

github-actions bot commented Sep 12, 2024

This is a preview of the changelog of the next release:

0.2.19 (2024-09-12)

Bug Fixes

  • validate that either a ndjson or tsv file was given as input (6870236)

@Taepper Taepper force-pushed the preprocessing-validate-fix branch 2 times, most recently from e0e2c94 to 747d59f Compare September 12, 2024 12:58
@Taepper Taepper merged commit 7bf2ef7 into main Sep 12, 2024
10 checks passed
@Taepper Taepper deleted the preprocessing-validate-fix branch September 12, 2024 13:20
Copy link
Contributor

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible that duckdb threw such a cryptic/mojibake error?

This looks a bit like out of bounds access, which isn't great.

Validating is good, but I wonder how the error could end up so mangled. Can you point me to the place where duckdb crashed/threw?

throw preprocessing::PreprocessingException(fmt::format(
"Cannot specify both a ndjsonInputFilename ('{}') and metadataFilename('{}').",
ndjson_input_filename.value().string(),
metadata_file.value().string()
));
}
if (!ndjson_input_filename.has_value() && !metadata_file.has_value()) {
throw preprocessing::PreprocessingException(fmt::format(
"Neither a ndjsonInputFilename ('{}') nor a metadataFilename ('{}') was specified as "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you could list the various ways in which they can be passed, but not super important (CLI, env, config file)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it after the above comment.

We have them described in the README. Although the part about default values and the configs therefore being optional is unfortunately no longer valid as of my refactoring here...

@corneliusroemer
Copy link
Contributor

corneliusroemer commented Sep 12, 2024

In the change log entry I suggest to:

  • link to the GitHub issue that reported the bug
  • link to the PR that fixed it (this one here)
  • mention that this was about preprocessing, I don't think that's clear from the given summary

@corneliusroemer
Copy link
Contributor

If there is a bug in the CLI option part, might make sense to add some basic tests here (E2E, once you get to it :) )

this stuff should be super easy to test for regressions

@Taepper
Copy link
Collaborator Author

Taepper commented Sep 12, 2024

Can you point me to the place where duckdb crashed/threw?

The actual part that was broken / where the gibberish was generated is an unchecked optional access here. Note that the -> symbol implicitly calls .value() on the optional.

I will add comments / additional action points to #562

@corneliusroemer
Copy link
Contributor

I see, and how come it produced mojibake? Because nullpointer access and C++ being unsafe language? I've only done stuff like that in Rust and there it would throw instead of producing gibberish.

@Taepper
Copy link
Collaborator Author

Taepper commented Sep 13, 2024

I also thought it would throw an exception (which would still puzzle the end-user) as is documented for the value function here.

But it turns out the shorthand syntax is just undefined behaviour :(

@fengelniederhammer
Copy link
Contributor

Probably this was the evil line?

buildMetadataTableFromFile(*preprocessing_config.getMetadataInputFilename());

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants