Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for sparse X #86

Merged
merged 11 commits into from
Aug 28, 2024
Merged

add support for sparse X #86

merged 11 commits into from
Aug 28, 2024

Conversation

apoorvalal
Copy link
Contributor

@apoorvalal apoorvalal commented Aug 16, 2024

Adds support for CSR matrix inputs for all learners [not supported elsewhere, e.g. this open issue on econml]. Turned out to be not too bad since Kfold, integer slicing all work as usual with CSR matrices.

I added a safe_len method in utils and replaced all instances of len(X) with safe_len(X) (hence the many files touched - turned out no major substantive changes were required). This also means that the Matrix class definition in typing now unions over array, dataframe, and csr_matrix.

Also added an example notebook that documents massive (~8-10x) speed gains from using sparse inputs. I can't think of any obvious side-effects, but I'd defer to you folks on whether there are edge cases I haven't considered here.

Checklist

  • Added a CHANGELOG.rst entry

@apoorvalal apoorvalal requested a review from kklein as a code owner August 16, 2024 22:57
@apoorvalal apoorvalal changed the title add support for csr matrix add support for sparse X Aug 16, 2024
@apoorvalal
Copy link
Contributor Author

For completeness, I also tried the same estimation routine with pandas.arrays.SparseArray returned by pd.get_dummies when sparse=True but was unreasonably slow [I terminated the job after ~5 mins]. LGBM may not even support it, but if so the failure was silent.

@apoorvalal apoorvalal force-pushed the sparseX branch 2 times, most recently from 8fb8e06 to 60bedf4 Compare August 27, 2024 20:08
@apoorvalal
Copy link
Contributor Author

apoorvalal commented Aug 27, 2024

docs build was previously failing due to mypy getting annoyed at cell magics, is now failing due to
image

happens only on the build system; local run of postinstall works fine.

will likely need some metadata tweaking; will try to take care of this later but feel free to change things around [you have write access to my fork and can make changes directly]

edit: fixed ; made the sparse problem quite small (which makes the runtime difference smaller, but that's nbd)

Copy link
Collaborator

@kklein kklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor nitpicks; looking great! :)

Would you mind adding the following to the top of changelog.rst?

0.11.0 (2024-09-xx)
-------------------

**New features**

* Add support for using ``scipy.sparse.csr_matrix`` as datastructure for covariates ``X``. 

docs/examples/example_sparse_inputs.ipynb Outdated Show resolved Hide resolved
docs/examples/example_estimating_ates.ipynb Outdated Show resolved Hide resolved
metalearners/_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@kklein kklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks a lot! :)

Feel free to squash and merge

@apoorvalal
Copy link
Contributor Author

don't think i have the privileges; would be great if you could.

image

thanks again; excited to use this.

@kklein kklein merged commit 0d1958c into Quantco:main Aug 28, 2024
12 checks passed
@kklein
Copy link
Collaborator

kklein commented Aug 28, 2024

We might add 1-2 changes at the beginning of next week and will probably release version 0.11.0 by end of next week. :)

@apoorvalal apoorvalal deleted the sparseX branch August 28, 2024 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants