schema refactor: remove redundant fields and align types #2798

waynexia · 2023-11-23T06:47:48Z

What type of enhancement is this?

Tech debt reduction

What does the enhancement do?

Important fields in Schema:

pub struct Schema {
    column_schemas: Vec<ColumnSchema>,
    name_to_index: HashMap<String, usize>,
    arrow_schema: Arc<ArrowSchema>,
    /// Index of the timestamp key column.
    ///
    /// The timestamp key column is the column that holds the timestamp and
    /// forms part of the primary key. None means there is no timestamp key column.
    timestamp_index: Option<usize>,
    /// Version of the schema.
    ///
    /// Initial value is zero. The version should bump after altering schema.
    version: u32,
}

From this Schema definition, there is no easy way to tell the semantic type of a field. Only TIME INDEX is recorded, and it is recorded twice: ColumnSchema also carries this information.

pub struct ColumnSchema {
    pub name: String,
    pub data_type: ConcreteDataType,
    is_nullable: bool,
    is_time_index: bool,
    default_constraint: Option<ColumnDefaultConstraint>,
    metadata: Metadata,
}
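One way to remove the duplication is to keep the flag only on the column and derive the index on demand. The following is a minimal sketch under that assumption, not the current implementation: the structs are trimmed stand-ins that reuse the names above, and `timestamp_column` is a hypothetical helper.

```rust
/// Trimmed stand-in for `ColumnSchema`; only the fields needed for this sketch.
#[derive(Debug, Clone)]
pub struct ColumnSchema {
    pub name: String,
    pub is_time_index: bool,
}

/// Stand-in for `Schema` that does NOT store `timestamp_index` separately.
#[derive(Debug, Clone)]
pub struct Schema {
    column_schemas: Vec<ColumnSchema>,
}

impl Schema {
    pub fn new(column_schemas: Vec<ColumnSchema>) -> Self {
        Self { column_schemas }
    }

    /// Derives the time index position from the per-column flag,
    /// so the information lives in exactly one place.
    pub fn timestamp_index(&self) -> Option<usize> {
        self.column_schemas.iter().position(|c| c.is_time_index)
    }

    /// Hypothetical helper: returns the time index column, if any.
    pub fn timestamp_column(&self) -> Option<&ColumnSchema> {
        self.column_schemas.iter().find(|c| c.is_time_index)
    }
}
```

The lookup is a linear scan over the columns. If that turns out to matter, the index could still be cached, as long as it is computed from the flag in the constructor rather than passed in independently.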

Another piece of information missing from Schema is ColumnId.

This works for querying, where a column might be temporary (e.g., an intermediate computation result) and has neither a ColumnId nor a SemanticType. But if that is what Schema is for, then it contains too much unnecessary information: ArrowSchema is enough for the entire query processing phase.

The problem is that we don't have a clear separation of which schema should be used in which phase, so we have to convert back and forth between them.

We may need to convert Schema to ArrowSchema for queries, simply by discarding the unnecessary information. We need not support converting ArrowSchema back to Schema, which seems to be a misguided requirement at present.
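A rough sketch of such a one-way conversion, assuming hypothetical stand-in types: `StorageSchema` plays the role of Schema, `QuerySchema` plays the role of ArrowSchema, and `data_type` is a plain string rather than a real type enum. Only the lossy direction is implemented, so nothing can reconstruct storage metadata from a query schema.

```rust
/// Hypothetical storage-side column: carries everything the engine needs.
#[derive(Debug, Clone, PartialEq)]
pub struct StorageColumn {
    pub name: String,
    pub data_type: String, // stand-in for ConcreteDataType
    pub is_nullable: bool,
    pub is_time_index: bool,
    pub column_id: u32,
}

/// Hypothetical stand-in for `Schema`.
#[derive(Debug, Clone, PartialEq)]
pub struct StorageSchema {
    pub columns: Vec<StorageColumn>,
    pub version: u32,
}

/// Hypothetical query-side field: stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
pub struct QueryField {
    pub name: String,
    pub data_type: String,
    pub is_nullable: bool,
}

/// Hypothetical stand-in for `ArrowSchema`.
#[derive(Debug, Clone, PartialEq)]
pub struct QuerySchema {
    pub fields: Vec<QueryField>,
}

/// One-way, lossy conversion: the time index flag, column id, and
/// schema version are dropped. No reverse `From` impl is provided.
impl From<&StorageSchema> for QuerySchema {
    fn from(schema: &StorageSchema) -> Self {
        let fields = schema
            .columns
            .iter()
            .map(|c| QueryField {
                name: c.name.clone(),
                data_type: c.data_type.clone(),
                is_nullable: c.is_nullable,
            })
            .collect();
        QuerySchema { fields }
    }
}
```

Leaving out the reverse impl makes the compiler enforce the separation: code that only holds a query schema cannot pretend to have storage-level metadata.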

From my perspective, RegionMetadata is closer to what Schema should be. We may consider merging the two into one:

pub struct RegionMetadata {
    /// Columns in the region. Has the same order as columns
    /// in [schema](RegionMetadata::schema).
    pub column_metadatas: Vec<ColumnMetadata>,
    /// Maintains an ordered list of primary keys
    pub primary_key: Vec<ColumnId>,

    /// Immutable and unique id of a region.
    pub region_id: RegionId,
    /// Current version of the region schema.
    ///
    /// The version starts from 0. Altering the schema bumps the version.
    pub schema_version: u64,
}
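To illustrate how a merged structure could answer the questions Schema cannot today, here is a small sketch. The layout of `ColumnMetadata`, the `SemanticType` variants, the `MergedSchema` name, and all three lookup methods are assumptions made for illustration; the issue itself only names `ColumnId` and `SemanticType`.

```rust
pub type ColumnId = u32;

/// Assumed set of semantic types for a column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SemanticType {
    Tag,
    Field,
    Timestamp,
}

/// Assumed per-column metadata: name, semantic type, and stable id.
#[derive(Debug, Clone)]
pub struct ColumnMetadata {
    pub name: String,
    pub semantic_type: SemanticType,
    pub column_id: ColumnId,
}

/// Hypothetical merged schema, modeled on `RegionMetadata`.
#[derive(Debug, Clone)]
pub struct MergedSchema {
    pub column_metadatas: Vec<ColumnMetadata>,
    pub primary_key: Vec<ColumnId>,
    pub schema_version: u64,
}

impl MergedSchema {
    /// Finds a column by its id.
    pub fn column_by_id(&self, id: ColumnId) -> Option<&ColumnMetadata> {
        self.column_metadatas.iter().find(|c| c.column_id == id)
    }

    /// The time index is derived from the semantic type, not stored twice.
    pub fn time_index_column(&self) -> Option<&ColumnMetadata> {
        self.column_metadatas
            .iter()
            .find(|c| c.semantic_type == SemanticType::Timestamp)
    }

    /// Primary key columns in key order; ids without a column are skipped.
    pub fn primary_key_columns(&self) -> Vec<&ColumnMetadata> {
        self.primary_key
            .iter()
            .filter_map(|id| self.column_by_id(*id))
            .collect()
    }
}
```

With this shape, the semantic type and id of every column are available directly, and the time index is a single derived fact instead of two fields that must be kept in sync.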

Tasks

Replace Schema with RegionMetadata and simplify RegionMetadata correspondingly.

Implementation challenges

No response

waynexia added the C-enhancement Category Enhancements label Nov 23, 2023

evenyag assigned evenyag and waynexia Nov 23, 2023

github-actions bot unassigned evenyag and waynexia Mar 19, 2024

waynexia mentioned this issue Apr 22, 2024

fix: set is_time_index properly on updating physical table's schema #3770

Merged

3 tasks
