When using `use_columns` it still reads other columns #327

deanm0000 · 2025-01-22T15:36:02Z

Suppose we have an excel file that looks like this:

then run

from fastexcel import read_excel
read_excel(file_path).load_sheet(0, use_columns=["apple","banana"])

it raises with:

CalamineCellError: calamine cell error: #DIV/0!
Context:
    0: could not determine dtype for column blah

It should just ignore that column since it isn't in the use_columns list.

The work around is to tell it to make the other column a string

read_excel(file_path).load_sheet(0, use_columns=["apple","banana"], dtypes={"blah":"string"})

The text was updated successfully, but these errors were encountered:

lukapeschke · 2025-01-31T13:06:01Z

I looked into this, and we try to infer the dtype for every column (event the ones that aren't part of use_columns) because of the available_columns property: https://fastexcel.toucantoco.dev/fastexcel.html#ExcelSheet.available_columns

Now, I see several ways we could address this, but they're all breaking:

Remove the available_columns attribute
Compute available_columns lazily: store it as an Option internally and compute it when it is accessed for the first time. This would make loading a sheet slightly faster when use_columns is specified, but could cause fastexcel to throw exceptions outside of read_excel and load_{sheet,table}.
Change available_columns's type to list[ColumnInfo] | None and set it to None when use_columns is specified.

@deanm0000 I'm curious: do you ever use the available_columns property ?

@PrettyWood what are your thoughts on this ?

Also, since polars is one of the main consumers of fastexcel, I'd also be curious to know if @alexander-beedie has an opinion 🙂

alexander-beedie · 2025-01-31T13:16:57Z

Also, since polars is one of the main consumers of fastexcel, I'd also be curious to know if @alexander-beedie has an opinion 🙂

We don't use available_columns from the Polars side, but one way to deal with the issue while keeping everyone (reasonably!) happy might be to make it a method, rather than a property, and only do the work on demand.

Still technically breaking, but not badly, and makes it clearer that there may be a (small) bit of compute as it's a method rather than a property (can still cache the result if repeated calls are anticipated).

Give the method an "include_dtype" parameter, so if the caller needs the dtype, they can opt-in to the associated overhead - otherwise return the column names as strings.

def available_columns(include_dtype: bool=False) -> list[ColumnInfo] | list[str]:
    ...

This way, you're giving the caller more options, so the initial minor breakage is less of an issue (you aren't taking away something people may need, you're allowing them to opt-in to getting the additional dtype info if needed, but by default everyone gets more speed).

deanm0000 · 2025-01-31T16:58:30Z

do you ever use the available_columns property?
no

Back the the problem at hand, at a high level, in this function

fastexcel/src/types/python/excelsheet/column_info.rs

Line 345 in 0d4f087

pub(crate) fn build_available_columns_info<D: CalamineDataProvider>(

could we have it say something like If selected_columns is Selected (not sure how DynamicSelection works so maybe that too) then before returning do (Warning this is psuedo code off the fly):

available_columns.iter().filter(|item| selected_columns.contains(item)).collect()

wait. Is selected_columns here the same as the use_columns input or is it distinct? I'm assuming selected_columns==use_columns so if they're not then I don't know what selected_columns is.

In that way it doesn't need to be a breaking change to address the issue, I think.

lukapeschke · 2025-02-03T10:25:04Z

@deanm0000 what you're would be equivalent to solution 3: In the function you linked, selected_columns is equivalent to use_columns . If we filtered the contents of available_columns based on what's in use_columns, we would end up with available_columns containing the exact same information as selected_columns. This would still be a breaking change, since we would be returning less information than before to the user.

With that approach I'd rather have available_columns be None, to convey the message "since you specified the columns to use, there is no information available about the rest, only what you've specified"

I like the approach proposed by @alexander-beedie , as it only requires users to change a property access to a method call in their code. Since you're not using available_columns, I assume it would work for you as well @deanm0000 ?

lukapeschke added bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code labels Jan 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When using `use_columns` it still reads other columns #327

When using `use_columns` it still reads other columns #327

deanm0000 commented Jan 22, 2025

lukapeschke commented Jan 31, 2025

alexander-beedie commented Jan 31, 2025 •

edited

Loading

deanm0000 commented Jan 31, 2025

lukapeschke commented Feb 3, 2025

When using use_columns it still reads other columns #327

When using use_columns it still reads other columns #327

Comments

deanm0000 commented Jan 22, 2025

lukapeschke commented Jan 31, 2025

alexander-beedie commented Jan 31, 2025 • edited Loading

deanm0000 commented Jan 31, 2025

lukapeschke commented Feb 3, 2025

When using `use_columns` it still reads other columns #327

When using `use_columns` it still reads other columns #327

alexander-beedie commented Jan 31, 2025 •

edited

Loading