A memory efficient implementation of the .mtx reading function #3389

Open: wants to merge 3 commits into base: main
10 changes: 8 additions & 2 deletions src/scanpy/datasets/_ebi_expression_atlas.py
@@ -67,13 +67,19 @@ def read_mtx_from_stream(stream: BinaryIO) -> sparse.csr_matrix:
     max_int32 = np.iinfo(np.int32).max
     coord_dtype = np.int64 if n > max_int32 or m > max_int32 else np.int32

-    data = pd.read_csv(
+    chunks = pd.read_csv(
         stream,
         sep=r"\s+",
         header=None,
         dtype={0: coord_dtype, 1: coord_dtype, 2: np.float32},
+        chunksize=1e7,
     )
-    mtx = sparse.csr_matrix((data[2], (data[1] - 1, data[0] - 1)), shape=(m, n))
+    mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))
Member

Suggested change
-    mtx = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))
+    mtx = sparse.csr_matrix((m, n), dtype=np.float64)
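
To illustrate the difference behind this suggestion, here is a small sketch (the shape values `m = 3`, `n = 4` are made up for the example). The placeholder form stores one explicit zero entry, while the shape-only constructor yields a matrix with no stored entries at all:

```python
import numpy as np
from scipy import sparse

m, n = 3, 4  # hypothetical shape, for illustration only

# Form used in the PR: one explicitly stored zero at position (0, 0).
placeholder = sparse.csr_matrix(([0], ([0], [0])), shape=(m, n))

# Suggested form: an empty matrix of the given shape and dtype.
empty = sparse.csr_matrix((m, n), dtype=np.float64)

# nnz counts stored entries, including explicit zeros.
print(placeholder.nnz)  # 1
print(empty.nnz)        # 0
```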

+    for data in chunks:
+        mtx_chunk = sparse.csr_matrix(
+            (data[2], (data[1] - 1, data[0] - 1)), shape=(m, n)
+        )
+        mtx = mtx + mtx_chunk
Comment on lines +78 to +82
Member

@flying-sheep, Jan 27, 2025

This is probably slightly slower than necessary, since we know the chunks don't overlap, yet + still has to handle actually summing up entries. But I imagine it could also be pretty well optimized, so if the following is not faster, please just add a comment instead explaining that + is well-enough optimized.

The way csr_matrix((data, (i, j)), [shape]) works is that it first creates a coo_matrix, then converts it to csr.
I think the best way is probably:

  1. build up data, i, j arrays in a loop
  2. create a csr_matrix from the final arrays as the last step
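
A minimal sketch of those two steps, not the PR's actual code: the function name `read_mtx_body_chunked` and its parameters are hypothetical, and it assumes the header has already been consumed so that `m` and `n` are known and the stream holds at least one entry line. It keeps the PR's column convention (column 1 minus 1 as row index, column 0 minus 1 as column index). Whether this is faster than repeated `+` would still need to be measured.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse


def read_mtx_body_chunked(stream, m, n, coord_dtype=np.int32, chunksize=10_000_000):
    """Hypothetical helper: read coordinate triplets in chunks, build CSR once."""
    rows, cols, vals = [], [], []
    chunks = pd.read_csv(
        stream,
        sep=r"\s+",
        header=None,
        dtype={0: coord_dtype, 1: coord_dtype, 2: np.float32},
        chunksize=chunksize,
    )
    # Step 1: collect the per-chunk arrays without building any matrix yet.
    for chunk in chunks:
        rows.append(chunk[1].to_numpy() - 1)
        cols.append(chunk[0].to_numpy() - 1)
        vals.append(chunk[2].to_numpy())
    # Step 2: concatenate and construct the CSR matrix a single time.
    i = np.concatenate(rows)
    j = np.concatenate(cols)
    data = np.concatenate(vals)
    return sparse.csr_matrix((data, (i, j)), shape=(m, n))


# Usage with a tiny in-memory body and a small chunk size to force two chunks.
body = io.BytesIO(b"1 2 3.0\n2 1 4.0\n3 2 5.0\n")
result = read_mtx_body_chunked(body, m=2, n=3, chunksize=2)
```

Note that the concatenated arrays hold all triplets in memory at once before conversion, so peak memory differs from the chunk-wise `+` approach; the trade-off depends on the input.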

     return mtx

