Use statistics in Faker CTAS #24585

Merged

nineinchnick
Member

@nineinchnick nineinchnick commented Dec 26, 2024

Description

Use statistics when using CREATE TABLE AS SELECT in the Faker connector to:

  • set the default_limit table property to the estimated number of rows from the source table
  • set the min and max column properties based on the statistics
  • detect high-cardinality integer columns and use sequences for them
  • detect low-cardinality columns and generate dictionaries to select values from
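The column-level decisions listed above could be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the `ColumnStats` class, the `choose_strategy` function, and both thresholds are hypothetical names and values chosen for the example, since the PR description does not state the real cut-offs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical summary of the statistics gathered for one source column."""
    is_integer: bool
    distinct_count: int
    min_value: Optional[int] = None
    max_value: Optional[int] = None


# Assumed thresholds for illustration only; the real values are not given in the PR.
SEQUENCE_RATIO = 0.98          # distinct/rows at or above this looks like a unique key
DICTIONARY_MAX_DISTINCT = 100  # at most this many distinct values looks like an enum


def choose_strategy(stats: ColumnStats, row_count: int) -> str:
    """Pick a generation strategy for one column from its statistics."""
    # High-cardinality integer column: generate a sequence.
    if row_count > 0 and stats.is_integer and stats.distinct_count / row_count >= SEQUENCE_RATIO:
        return "sequence"
    # Low-cardinality column: pick values from a generated dictionary.
    if 0 < stats.distinct_count <= DICTIONARY_MAX_DISTINCT:
        return "dictionary"
    # Otherwise constrain random values with min and max when both are known.
    if stats.min_value is not None and stats.max_value is not None:
        return "range"
    # No usable statistics: fall back to unconstrained random values.
    return "random"
```

Checking for sequences first means a small, fully unique integer column is treated as a key instead of a dictionary; a real implementation may order these checks differently.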

Additional context and related issues

Previous attempt #24098 was abandoned after #24147 was reported. This time we only use views for sequence columns, and if this is not very useful, we can avoid creating the views automatically. Or this could be yet another column property.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Faker
* Use statistics when using `CREATE TABLE AS SELECT` in the Faker connector. ({issue}`issuenumber`)

@nineinchnick
Member Author

@raunaqmorarka this is the last one, I promise :-)

@nineinchnick
Member Author

@raunaqmorarka and @losipiuk this is ready for a review. It's the last one about Faker, I don't have anything else planned for it.

@nineinchnick
Member Author

@raunaqmorarka @losipiuk a gentle reminder

@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from c40b9fa to c2f5400 Compare January 18, 2025 08:57
@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from c2f5400 to c7cf0bb Compare January 20, 2025 10:30
@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from c7cf0bb to 3d035e3 Compare January 20, 2025 14:43
@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from 68f71d4 to 254ad19 Compare January 28, 2025 15:46
@raunaqmorarka
Member

please squash the fixups

@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from 254ad19 to 626364a Compare January 28, 2025 19:33
@nineinchnick
Member Author

I messed up squashing the fixups yesterday, it was too late in the day. I'll clean up the commits and let you know when it's ready.

@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from 626364a to 9d7baf1 Compare January 29, 2025 21:02
When creating a table in the Faker connector from an existing table,
gather column statistics to determine range constraints and set them as
column properties.

When creating a table in the Faker connector from an existing table,
use column statistics to detect low-cardinality columns and generate
their values from a randomly generated set.

When creating tables in the Faker connector using CREATE TABLE AS
SELECT, use the NUMBER_OF_NON_NULL_VALUES column statistic to set the
null_probability column property.
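
One plausible way to turn a non-null count into a null probability is shown below. This is a sketch under the assumption that the probability is simply the fraction of null rows; the `null_probability` function name and the clamping behavior are illustrative choices, not taken from the connector source.

```python
def null_probability(row_count: int, non_null_count: int) -> float:
    """Estimate the share of NULL values in a column (assumed formula)."""
    # With no rows there is nothing to estimate; assume no nulls.
    if row_count <= 0:
        return 0.0
    fraction = 1.0 - non_null_count / row_count
    # Statistics can be approximate, so clamp the result to [0, 1].
    return min(1.0, max(0.0, fraction))
```

For example, a column with 75 non-null values out of 100 rows would get a null probability of 0.25 under this assumption.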
@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from 9d7baf1 to c44f57f Compare January 29, 2025 21:23
@raunaqmorarka raunaqmorarka merged commit 1140cb3 into trinodb:master Jan 30, 2025
17 checks passed
@github-actions github-actions bot added this to the 470 milestone Jan 30, 2025
@mosabua
Member

mosabua commented Jan 31, 2025

Should we somehow document this? I kinda think yes, assuming it affects the shape of the random data generated .. maybe in a specific section about CTAS in the SQL support section of the connector docs

@nineinchnick
Member Author

Yes. The usage section only shows `CREATE TABLE ... LIKE`, and we can add a paragraph about `CREATE TABLE AS SELECT` and explain what the difference is.
https://trino.io/docs/current/connector/faker.html#usage

@mosabua
Member

mosabua commented Jan 31, 2025

Will you send a PR @nineinchnick ? Also .. if you add a little section dedicated to CTAS we should link to it from the SQL support statement list, see also...

@nineinchnick nineinchnick deleted the faker-range-constraint-views branch February 1, 2025 09:26
4 participants