Polars IO plugin for reading compressed CSV/TSV files in a streaming fashion

ghuls/polars_streaming_csv_decompression

This plugin provides a way to read compressed CSV/TSV files in a streaming fashion for usage with Polars.

Currently, Polars decompresses compressed CSV/TSV files completely in memory before parsing them, both with pl.read_csv("file.csv.gz") and with pl.scan_csv("file.csv.gz"). This results in very high memory usage when reading large compressed CSV/TSV files (several GBs to 100s of GBs), as is common in, e.g., bioinformatics.

This plugin instead decompresses and parses the file in chunks, which results in much lower overall memory usage when reading large compressed CSV/TSV files.

As it is mainly intended for reading large compressed CSV/TSV files produced by bioinformatics tools, records are assumed to be separated by eol_char ("\n" by default), and fields are not expected to contain an embedded eol_char. The last record should also end with eol_char. If these conditions are not met, the data read from such files may be corrupt.
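To illustrate why these assumptions matter, here is a minimal sketch of one way to cut a decompressed byte stream into blocks that end on a record boundary. It is not the plugin's actual implementation; the function name iter_record_chunks and the chunk_size parameter are hypothetical. Because the split only looks for eol_char, a quoted field containing an embedded eol_char could be cut in half.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def iter_record_chunks(
    stream: BinaryIO, chunk_size: int = 1 << 20, eol_char: bytes = b"\n"
) -> Iterator[bytes]:
    """Yield byte blocks that each end on a record boundary (hypothetical helper)."""
    leftover = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        block = leftover + block
        # Cut after the last eol_char; the bytes after it are an incomplete record.
        cut = block.rfind(eol_char) + 1
        if cut == 0:
            # No record boundary in this block yet: keep accumulating.
            leftover = block
            continue
        yield block[:cut]
        leftover = block[cut:]
    if leftover:
        # The input did not end with eol_char: emit the trailing partial record as-is.
        yield leftover


# Usage: stream a small gzip-compressed CSV held in memory.
raw = b"a,b\n1,2\n3,4\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), "rb") as fh:
    chunks = list(iter_record_chunks(fh, chunk_size=5))
```

Each yielded block can be handed to a CSV parser on its own, so only one block needs to be held in memory at a time.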

It can also be used to decode CSV files with a character encoding other than UTF-8, and/or CSV files in which not all bytes can be decoded in that encoding. Compared with read_csv, the decoding requires less total memory.
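Chunk-wise decoding can be sketched with Python's standard incremental decoders. This is an illustration only, not the plugin's own code: the helper name decode_chunks and its lossy flag are hypothetical, and mapping "lossy" to Python's errors="replace" behaviour is an assumption.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(
    chunks: Iterable[bytes], encoding: str = "windows-1252", lossy: bool = False
) -> Iterator[str]:
    """Decode byte chunks one at a time (hypothetical helper)."""
    # Assumption: "lossy" means undecodable bytes become U+FFFD instead of raising.
    errors = "replace" if lossy else "strict"
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush any bytes the decoder is still holding at the end of the input.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Usage: byte 0x81 is undefined in Python's windows-1252 codec.
decoded = "".join(decode_chunks([b"caf\xe9,", b"\x81\n"], lossy=True))
```

Since only one chunk is decoded at a time, the whole file never has to exist in memory in both its original and its decoded form.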

Streaming decompression is handled by xopen, which supports the following compression formats and backends and automatically selects the best backend available on the system:

  • gzip (.gz)
  • bzip2 (.bz2)
  • xz (.xz)
  • Zstandard (.zst) (optional)
  • fallback to Python’s built-in functions (gzip.open, lzma.open, bz2.open) if none of the other methods can be used.

Installation

pip install git+https://github.com/ghuls/polars_streaming_csv_decompression.git

Usage

import polars as pl
import polars_streaming_csv_decompression

# Read compressed CSV file in a streaming fashion.
(
    polars_streaming_csv_decompression.streaming_csv(
        "my_big_file.csv.gz"
    )  # lazy, doesn't do a thing
    .select(
        ["a", "c"]
    )  # select only 2 columns (other columns will not be read)
    .filter(
        pl.col("a") > 10
    )  # the filter is pushed down the scan, so less data is read into memory
    .head(100)  # constrain number of returned results to 100
)


# Read CSV file with non-utf8 encoding in a streaming fashion.
(
    polars_streaming_csv_decompression.streaming_csv(
        "file_encoded_in_windows-1252.csv",
        encoding="windows-1252",
    )
    .head()
)

# Read CSV file with non-utf8 encoding where not all bytes can be decoded in a streaming fashion.
(
    polars_streaming_csv_decompression.streaming_csv(
        "file_encoded_in_windows-1252_but_not_all_bytes_can_be_decoded.csv",
        encoding="windows-1252-lossy",
    )
    .head()
)
