Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using use_columns it still reads other columns #327

Open
deanm0000 opened this issue Jan 22, 2025 · 4 comments
Open

When using use_columns it still reads other columns #327

deanm0000 opened this issue Jan 22, 2025 · 4 comments
Labels
bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code

Comments

@deanm0000
Copy link
Contributor

See here

Suppose we have an excel file that looks like this:

Image

then run

from fastexcel import read_excel
read_excel(file_path).load_sheet(0, use_columns=["apple","banana"])

it raises with:

CalamineCellError: calamine cell error: #DIV/0!
Context:
    0: could not determine dtype for column blah

It should just ignore that column since it isn't in the use_columns list.

The work around is to tell it to make the other column a string

read_excel(file_path).load_sheet(0, use_columns=["apple","banana"], dtypes={"blah":"string"})
@lukapeschke lukapeschke added bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code labels Jan 27, 2025
@lukapeschke
Copy link
Collaborator

I looked into this, and we try to infer the dtype for every column (event the ones that aren't part of use_columns) because of the available_columns property: https://fastexcel.toucantoco.dev/fastexcel.html#ExcelSheet.available_columns

Now, I see several ways we could address this, but they're all breaking:

  1. Remove the available_columns attribute
  2. Compute available_columns lazily: store it as an Option internally and compute it when it is accessed for the first time. This would make loading a sheet slightly faster when use_columns is specified, but could cause fastexcel to throw exceptions outside of read_excel and load_{sheet,table}.
  3. Change available_columns's type to list[ColumnInfo] | None and set it to None when use_columns is specified.

@deanm0000 I'm curious: do you ever use the available_columns property ?

@PrettyWood what are your thoughts on this ?

Also, since polars is one of the main consumers of fastexcel, I'd also be curious to know if @alexander-beedie has an opinion 🙂

@alexander-beedie
Copy link
Contributor

alexander-beedie commented Jan 31, 2025

Also, since polars is one of the main consumers of fastexcel, I'd also be curious to know if @alexander-beedie has an opinion 🙂

We don't use available_columns from the Polars side, but one way to deal with the issue while keeping everyone (reasonably!) happy might be to make it a method, rather than a property, and only do the work on demand.

Still technically breaking, but not badly, and makes it clearer that there may be a (small) bit of compute as it's a method rather than a property (can still cache the result if repeated calls are anticipated).

Give the method an "include_dtype" parameter, so if the caller needs the dtype, they can opt-in to the associated overhead - otherwise return the column names as strings.

def available_columns(include_dtype: bool=False) -> list[ColumnInfo] | list[str]:
    ...

This way, you're giving the caller more options, so the initial minor breakage is less of an issue (you aren't taking away something people may need, you're allowing them to opt-in to getting the additional dtype info if needed, but by default everyone gets more speed).

@deanm0000
Copy link
Contributor Author

do you ever use the available_columns property?
no

Back the the problem at hand, at a high level, in this function

pub(crate) fn build_available_columns_info<D: CalamineDataProvider>(

could we have it say something like If selected_columns is Selected (not sure how DynamicSelection works so maybe that too) then before returning do (Warning this is psuedo code off the fly):

available_columns.iter().filter(|item| selected_columns.contains(item)).collect()

wait. Is selected_columns here the same as the use_columns input or is it distinct? I'm assuming selected_columns==use_columns so if they're not then I don't know what selected_columns is.

In that way it doesn't need to be a breaking change to address the issue, I think.

@lukapeschke
Copy link
Collaborator

@deanm0000 what you're would be equivalent to solution 3: In the function you linked, selected_columns is equivalent to use_columns . If we filtered the contents of available_columns based on what's in use_columns, we would end up with available_columns containing the exact same information as selected_columns. This would still be a breaking change, since we would be returning less information than before to the user.

With that approach I'd rather have available_columns be None, to convey the message "since you specified the columns to use, there is no information available about the rest, only what you've specified"

I like the approach proposed by @alexander-beedie , as it only requires users to change a property access to a method call in their code. Since you're not using available_columns, I assume it would work for you as well @deanm0000 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code
Projects
None yet
Development

No branches or pull requests

3 participants