Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schema refactor: remove redundent fields and align types #2798

Open
1 task
waynexia opened this issue Nov 23, 2023 · 0 comments
Open
1 task

schema refactor: remove redundent fields and align types #2798

waynexia opened this issue Nov 23, 2023 · 0 comments
Labels
C-enhancement Category Enhancements

Comments

@waynexia
Copy link
Member

What type of enhancement is this?

Tech debt reduction

What does the enhancement do?

Important fields in Schema:

pub struct Schema {
    column_schemas: Vec<ColumnSchema>,
    name_to_index: HashMap<String, usize>,
    arrow_schema: Arc<ArrowSchema>,
    /// Index of the timestamp key column.
    ///
    /// Timestamp key column is the column holds the timestamp and forms part of
    /// the primary key. None means there is no timestamp key column.
    timestamp_index: Option<usize>,
    /// Version of the schema.
    ///
    /// Initial value is zero. The version should bump after altering schema.
    version: u32,
}

From this Schema definition, we don't have an easy way to know the semantic type of one field. Only TIME INDEX is recorded. And it's recorded twice, ColumnSchema also has this info.

pub struct ColumnSchema {
    pub name: String,
    pub data_type: ConcreteDataType,
    is_nullable: bool,
    is_time_index: bool,
    default_constraint: Option<ColumnDefaultConstraint>,
    metadata: Metadata,
}

Another missing info in Schema is ColumnId.

This works on querying, where the column might be temporary (e.g., the intermediate compute result) and doesn't have ColumnId and SemanticType. But if this is what Schema is for, then it contains too much unnecessary info. ArrowSchema is enough for the entire query processing phase.

The problem here is, we don't have a clear separation on which schema should be used in a phase. And we have to convert them from and into another.

We may need to be able to convert Schema to ArrowSchema for doing queries, by simply discarding unnecessary infos. And needn't to support converting ArrowSchema back to Schema, which seems to be a wrong requirement at present.

From my perspective, RegionMetadata is closer to what Schema should be. We may consider merging them into one:

pub struct RegionMetadata {
    /// Columns in the region. Has the same order as columns
    /// in [schema](RegionMetadata::schema).
    pub column_metadatas: Vec<ColumnMetadata>,
    /// Maintains an ordered list of primary keys
    pub primary_key: Vec<ColumnId>,

    /// Immutable and unique id of a region.
    pub region_id: RegionId,
    /// Current version of the region schema.
    ///
    /// The version starts from 0. Altering the schema bumps the version.
    pub schema_version: u64,
}

Tasks

  • Replace Schema with RegionMetadata and simplify RegionMetadata correspondingly.

Implementation challenges

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category Enhancements
Projects
None yet
Development

No branches or pull requests

2 participants