Using dictionary array to store low cardinality tag column #2948

Open
evenyag opened this issue Dec 18, 2023 · 0 comments
Labels
C-performance Category Performance
evenyag commented Dec 18, 2023

What type of enhancement is this?

Performance

What does the enhancement do?

Creating a repeated vector for each batch is costly, because the same string value has to be cloned many times to fill the batch.

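// Build a vector that repeats `value` for all `num_rows` rows,
// going through the repeated vector cache when one is available.
let vector = match cache_manager {
    Some(cache) => repeated_vector_with_cache(
        &column_schema.data_type,
        value,
        num_rows,
        cache,
    )?,
    None => new_repeated_vector(&column_schema.data_type, value, num_rows)?,
};
columns.push(vector);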

We might use dictionary arrays to hold tag values to reduce this overhead. Then we can remove the repeated vector cache.

What's more, dictionary arrays might speed up some binary operations while evaluating expressions in the query engine.
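
To make the trade-off concrete, here is a minimal sketch in plain Rust. It does not use the Arrow `DictionaryArray` API or any mito2 types: `DictColumn`, `repeated_plain`, `repeated_dict`, and `eq_scalar` are hypothetical names introduced only for illustration. It contrasts one string clone per row with one stored string plus one small integer key per row, and shows how a comparison against a literal could be evaluated once per distinct value.

```rust
/// Simplified stand-in for a dictionary-encoded string column
/// (hypothetical type, not the Arrow `DictionaryArray` API).
#[derive(Debug, PartialEq)]
struct DictColumn {
    /// Distinct values, each stored once.
    values: Vec<String>,
    /// One key per row, indexing into `values`.
    keys: Vec<u32>,
}

/// Plain representation: allocates a separate copy of `value` for every row.
fn repeated_plain(value: &str, num_rows: usize) -> Vec<String> {
    (0..num_rows).map(|_| value.to_string()).collect()
}

/// Dictionary representation: stores `value` once and a zero key per row.
fn repeated_dict(value: &str, num_rows: usize) -> DictColumn {
    DictColumn {
        values: vec![value.to_string()],
        keys: vec![0; num_rows],
    }
}

/// Evaluates `column == literal` once per distinct value, then expands
/// the result to rows through the keys.
fn eq_scalar(column: &DictColumn, literal: &str) -> Vec<bool> {
    let matches: Vec<bool> = column
        .values
        .iter()
        .map(|v| v.as_str() == literal)
        .collect();
    column.keys.iter().map(|&k| matches[k as usize]).collect()
}
```

With a single tag value per batch, `repeated_dict` performs one string allocation regardless of `num_rows`, and `eq_scalar` does one string comparison followed by cheap key lookups. Whether the query engine's kernels exploit this in practice is an assumption that would need to be measured.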

Implementation challenges

We might replace the schema of string tag columns with dictionary(string) during the optimization phase and tell the mito engine to return dictionary arrays.

For simplicity, we can map every string tag column to a dictionary(string) column without considering its cardinality.
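
The schema rewrite described above could look roughly like the following sketch. The `DataType` and `ColumnSchema` types and the `dictionary_encode_tags` function are hypothetical, simplified stand-ins written for this example; they are not the actual GreptimeDB or DataFusion types, and where the rewrite would hook into the optimizer is left open.

```rust
/// Simplified data type enum (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    String,
    Float64,
    TimestampMillisecond,
    /// Dictionary-encoded column with the given value type.
    Dictionary(Box<DataType>),
}

/// Simplified column schema (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct ColumnSchema {
    name: String,
    data_type: DataType,
    is_tag: bool,
}

/// Maps every string tag column to dictionary(string), regardless of
/// its cardinality. All other columns are returned unchanged.
fn dictionary_encode_tags(columns: &[ColumnSchema]) -> Vec<ColumnSchema> {
    columns
        .iter()
        .map(|col| {
            let mut col = col.clone();
            if col.is_tag && col.data_type == DataType::String {
                col.data_type = DataType::Dictionary(Box::new(DataType::String));
            }
            col
        })
        .collect()
}
```

Ignoring cardinality keeps the rule trivial to apply. The cost is that a high-cardinality tag would gain little from the encoding, so a later refinement might make the mapping conditional.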

evenyag added the C-performance Category Performance label Dec 18, 2023
evenyag added this to mito2 Dec 18, 2023
waynexia self-assigned this Dec 19, 2023
evenyag moved this to Todo in mito2 Dec 20, 2023
killme2008 added this to the v0.6 milestone Dec 27, 2023
fengjiachun modified the milestones: v0.6, v0.8 Feb 28, 2024