[python-package] Adding support for polars for input data #6204

detrin · 2023-11-22T09:22:05Z

Summary

I think the polars library is on track to replace pandas for the majority of use cases. It is already being adopted by the community. We use it internally at my company for new projects, and we try not to use pandas at all.

Motivation

Polars is blazingly fast, and its memory footprint is several times lower. With direct support, there would be no need to use extra memory to convert data into numpy or pandas before training in LightGBM.

Description

I would like the following to work, where data_train and data_test are instances of pl.DataFrame:

y_train = data_train[col_target]
y_test = data_test[col_target]
X_train = data_train.select(cols_pred)
X_test = data_test.select(cols_pred)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": {"l2", "l1"},
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "verbose": 0,
    "num_leaves": 42,
    "max_depth": 5,
    "num_iterations": 5000,
    "min_data_in_leaf": 500,
    "reg_alpha": 2, 
    "reg_lambda": 5,
}

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_eval,
    callbacks=[lgb.early_stopping(stopping_rounds=500)],
)

As of now, I have to convert the data into numpy arrays:

y_train = data_train[col_target].to_numpy()
y_test = data_test[col_target].to_numpy()
X_train = data_train.select(cols_pred).to_numpy()
X_test = data_test.select(cols_pred).to_numpy()


jameslamb · 2023-11-22T13:21:11Z

Thanks for using LightGBM and for taking the time for writing this up.

I support lightgbm taking on direct polars integration... that project has reached a level of stability and popularity that warrants it.

Are you interested in contributing this?

jmoralez · 2023-11-22T17:06:38Z

I believe this will be greatly simplified with the great work from @borchero (#6034, #6163), and it will come down to adding something like:

if isinstance(data, pl_DataFrame):
    __init_from_pyarrow_table(data.to_arrow(), ...)

so we should probably wait for his other PRs to be merged.

borchero · 2023-11-22T17:45:54Z

As a side note, passing data from polars without copying any data is the entire reason my PRs exist 😄 polars has a to_arrow() method on data frames and series which is zero-copy.

jameslamb · 2023-11-22T18:06:39Z

is the entire reason my PRs exist

amazing!!!

In the future, please share that context with us when proposing changes here. It helps us to make informed choices about how much time and energy to put into reviewing, and how to weigh the costs of the new addition against the added maintenance burden.

@borchero are you interested in adding polars support once the Arrow PRs are done? We'd love the help and we've really appreciated your high-quality contributions so far. I agree with @jmoralez that that's the right way to sequence this work.

borchero · 2023-11-22T18:23:05Z

In the future, please share that context with us when proposing changes here. It helps us to make informed choices about how much time and energy to put into reviewing, and how to weigh the costs of the new addition against the added maintenance burden.

Sure, will do next time 😄 since there already was an open Arrow PR, I didn't think about going into "motivation" too much 👀

are you interested in adding polars support once the Arrow PRs are done?

The plan to pass polars to Arrow in our company's application code would have simply been to call .to_arrow() in appropriate places. Do you think that there's much value in adding direct support for pl.DataFrame and pl.Series at the expense of an additional dependency and, hence, higher maintenance complexity?

we've really appreciated your high-quality contributions so far

🙏🏼🙏🏼

detrin · 2023-11-22T18:49:52Z

The plan to pass polars to Arrow in our company's application code would have simply been to call .to_arrow() in appropriate places. Do you think that there's much value in adding direct support for pl.DataFrame and pl.Series at the expense of an additional dependency and, hence, higher maintenance complexity?

The way I see it is that 2024 will be the year of polars adoption by major python ML packages. The easier you will make it for users to use it, the better the user experience will be overall.

I am glad to hear that this is being considered and my issue wasn't rejected at first glance.

detrin · 2023-11-22T18:55:30Z

On a different note, I tried to use LightGBM directly in rust https://github.com/detrin/lightgbm-rust and I will perhaps use it as a test use case. The pyarrow option is interesting; I will try it as well. Thanks @borchero, could you link the PR here?

jameslamb · 2023-11-22T19:09:42Z

Do you think that there's much value in adding direct support for pl.DataFrame and pl.Series at the expense of an additional dependency and, hence, higher maintenance complexity?

This is a great point @borchero.

Taking on pyarrow support directly definitely was worth it, as there were details like data types and handling contiguous vs. chunked arrays that added complexity, and therefore significant user benefit to having LightGBM handle that complexity internally in a way that best fits with how LightGBM works.

I'm not that familiar with polars, so I don't know if there are similar complexities that'd be worth pulling inside of LightGBM to make things easier for users.

If not and it's literally just .to_arrow(), then maybe just documentation + an informative error message suggesting the use of .to_arrow() would be enough?

I guess as a first test, I'd want to understand how .to_arrow() works... does it return a copy of the data, but in Arrow format? Or does polars use the Arrow format internally and does .to_arrow() just return a pointer to that data that bypasses any other polars-specific data structures?

Because if it's a copy... then having lightgbm do .to_arrow() internally would result in an unavoidable copy that could be avoided externally.

Consider something like this (pseudo-code, this won't run):

import polars as pl
import lightgbm as lgb

df = pl.read_csv("data.csv")
dtrain = lgb.Dataset(
    df[["feature_1", "feature_2"]],
    label=df["label"]
)

lgb.train(train_set=dtrain, ...)

If lightgbm does a .to_arrow() on that passed-in polars DataFrame internally, then throughout training you'll be holding df in memory and a copy in Arrow format created with .to_arrow().

I think that'd result in higher peak memory usage than instead doing something like the following and passing in Arrow data directly:

import polars as pl
import lightgbm as lgb

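# convert once, up front, so only the Arrow table is kept around
df = pl.read_csv("data.csv").to_arrow()
dtrain = lgb.Dataset(
    df.select(["feature_1", "feature_2"]),  # pyarrow Tables select columns with .select()
    label=df["label"]
)

lgb.train(params={"objective": "regression"}, train_set=dtrain)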

Other things that might justify adding support for directly passing polars DataFrames and series:

does polars have its own data types that are significantly different from pyarrow? e.g. does it have concepts from pandas like nullable dtypes or categoricals?
if .to_arrow() returns a copy... is there some other API in polars that provides a pointer to the start of the underlying raw data? So that LightGBM might be able to construct a lightgbm.Dataset from it without any copying on the Python side?

lightgbm's Python package doesn't really do any DataFrame aggregations, joins, filtering, etc. ... the types of operations that benefit from polars backend. So I think the main benefit would be something like "lots of users want to use polars, but it's difficult to know how to efficiently create a lightgbm.Dataset from a polars DataFrame".

ritchie46 · 2023-11-22T19:57:07Z

I guess as a first test, I'd want to understand how .to_arrow() works... does it return a copy of the data, but in Arrow format? Or does polars use the Arrow format internally and does .to_arrow()

Polars internally keeps memory according to the arrow memory format. When you call to_arrow we give a pointer according to that format to pyarrow and you can continue via pyarrow to move to pandas, pyarrow, or any other tool that consumes arrow.

Moving data in and out of polars via arrow is zero-copy.

Moving data in and out of polars via numpy can be zero-copy (it depends on the data type, null data and dimensions)

detrin · 2023-11-22T20:06:25Z

Moving data in and out of polars via numpy can be zero-copy (it depends on the data type, null data and dimensions)

Does this imply that LightGBM could potentially use it, even in my snippet above, without allocating any new memory on the heap?

borchero · 2023-11-22T22:26:33Z

@detrin not for your training data, i.e. not for polars data frames. Polars uses column-wise storage, i.e. each of your columns is represented by a contiguous chunk of memory (but data for each column is potentially in different locations of the heap).

The only interface that is currently available to pass data to LightGBM from Python is via NumPy (disregarding the option to pass data as files), which uses a single array (=single chunk of memory) to store your data and uses row-wise storage. This means that each row is represented by a contiguous chunk of memory and rows are furthermore concatenated such that you end up with a single array.

As you can see, Polars' data representation is quite different to the NumPy data representation and, thus, data needs to be copied.

Side-note: to not require two copies, you should call .to_numpy(order="c") on your Polars data frame, otherwise, you will end up with a single array (=single chunk of memory) with column-wise ordering as this is more efficient to generate. LightGBM will, however, not like this and copy data yet again.
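To make the layout point concrete, here is a small NumPy-only sketch. It does not use polars: `col_major` is only an illustrative stand-in for a column-ordered array like the one described above, and the variable names are made up for this example. It shows that getting a row-major array out of a column-major one requires a fresh copy.

```python
import numpy as np

# Column-major (Fortran-ordered) array, standing in for a column-oriented export.
col_major = np.asfortranarray(np.arange(6, dtype=np.float64).reshape(3, 2))

# Requesting a C-contiguous (row-major) version forces a copy of the data.
row_major = np.ascontiguousarray(col_major)

print(col_major.flags["F_CONTIGUOUS"], col_major.flags["C_CONTIGUOUS"])  # True False
print(row_major.flags["C_CONTIGUOUS"])  # True
print(np.shares_memory(col_major, row_major))  # False
```

The values are identical in both arrays; only the physical layout differs, which is why a consumer that insists on row-major input ends up paying for a second copy.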

The way to resolve this issue is to extend LightGBM's API, i.e. to allow other data formats to be passed from the Python API. Arrow is a natural choice since it is being used ever more and is the backing memory format for pandas. In fact, it allows you to pass data from any tool that provides data as Arrow to LightGBM without copying any data.

jameslamb · 2023-11-22T22:44:46Z

The only interface that is currently available to pass data to LightGBM from Python is via NumPy (disregarding the option to pass data as files),

This is not true.

The Python package supports all of these formats for raw data:

numpy arrays
pandas DataFrames
datatable DataFrames (h2o's DataFrame format)
scipy CSC and CSR sparse matrices
CSV, TSV, and LibSVM files
Python lists of lists

Start reading from here and you can see all those options:

LightGBM/python-package/lightgbm/basic.py

Line 2010 in 516bde9

# start construct data

jameslamb · 2023-11-22T22:50:32Z

Also for reference https://numpy.org/doc/1.21/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray.

It says

Data in new ndarrays is in the row-major (C) order, unless otherwise specified

"stored in a contiguous block of memory in row-major order" is not exactly the same as "row-wise", just wanted to add that link as it's a great reference for thinking about these concepts.
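As a rough illustration of what "row-major order" means physically, the following NumPy sketch prints the in-memory element order and the strides of the same 3x2 matrix stored both ways. The array names are invented for this example.

```python
import numpy as np

c_order = np.array([[0, 1], [2, 3], [4, 5]], dtype=np.int64, order="C")
f_order = np.asfortranarray(c_order)

# Same logical values, different physical order in memory
# (order="K" reads elements in the order they occur in memory).
print(c_order.ravel(order="K").tolist())  # [0, 1, 2, 3, 4, 5]
print(f_order.ravel(order="K").tolist())  # [0, 2, 4, 1, 3, 5]

# Strides: (bytes to the next row, bytes to the next column) for 8-byte integers.
print(c_order.strides)  # (16, 8)
print(f_order.strides)  # (8, 24)
```

In both cases the whole matrix is one contiguous block; what changes is whether stepping along a row or along a column moves to the adjacent element.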

borchero · 2023-11-22T22:53:53Z

The Python package supports all of these formats for raw data:

Ah sorry about the misunderstanding! I think I phrased this a little too freely. I meant data formats that are useful for passing a Polars dataframe. Pandas is essentially treated the same as NumPy but adds a few more metadata. The other options are largely irrelevant for this particular instance.

"stored in a contiguous block of memory in row-major order" is not exactly the same as "row-wise", just wanted to add that link as it's a great reference for thinking about these concepts.

Yep, thanks! Once one has read through the NumPy docs, one also understands the statement that "polars' to_numpy() method outputs Fortran-contiguous NumPy arrays by default" 😄

detrin · 2023-11-22T23:01:35Z

So, if I understand it correctly, as of now there is no way to pass data from polars to LightGBM without copying the data in memory.

For the project I am working on I might use CLI as a workaround.

borchero · 2023-11-22T23:14:43Z

So, if I understand it correctly, as of now there is no way to pass data from polars to LightGBM without copying the data in memory.

Yes, you will have (at least) one copy of your data in memory along with the LightGBM-internal representation of your data that is optimized for building the tree.

For the project I am working on I might use CLI as a workaround.

Potentially, a viable temporary alternative might also be to pass data via files (?)

detrin · 2023-11-22T23:56:09Z

Potentially, a viable temporary alternative might also be to pass data via files (?)

Is it possible directly in python? I could then output data into temp file and load it in python by LightGBM.

borchero · 2023-11-23T00:02:01Z

See @jameslamb's comment above for LightGBM's "API":

CSV, TSV, and LibSVM files

You could e.g. write your data to CSV. Obviously, this introduces some performance hit.

deadsoul44 · 2024-07-18T17:03:46Z

Shameless plug:
PerpetualBooster supports Polars input:
https://github.com/perpetual-ml/perpetual

borchero · 2024-07-21T20:08:50Z

@jameslamb I just thought again about adding documentation about how to pass polars data to LightGBM. Where do you think is the most appropriate place for this? I wouldn't want to add a note on polars support for a bunch of Dataset.__init__ parameters as well as all the set_{label,group,...} methods. The same applies to an informative error message that you suggest above.

I thought about adding a note on the Dataset class docs but it's very minimal so far... wdyt?

jameslamb · 2024-07-22T03:05:48Z

Hey thanks for reviving this @borchero .

A lot has changed since the initial discussion. There's now a 1.0 release of polars (and the corresponding commitment to more stability) and you've since added direct Arrow support to lightgbm's Python package.

I wonder... could we add direct, transparent support for polars inputs in lightgbm without adding a dependency on polars by just doing something like this?

def _is_polars(arr) -> bool:
    return "polars." in str(arr.__class__) and callable(getattr(arr, "to_arrow", None))

# ... in any method where LightGBM accepts raw data ...
if _is_polars(data):
    data = data.to_arrow()

Because if we did that, then we wouldn't need to document specific methods that have to be called on polars inputs. From users' perspective, lightgbm just supports polars. If, in the future, this little trick proves insufficient, we could then consider taking on an optional dependency on polars and handle that similar to how the pandas and datatable dependencies are handled.

jameslamb · 2024-07-22T03:06:47Z

This related discussion is also worth cross-linking: dmlc/xgboost#10554

MarcoGorelli · 2024-07-29T12:15:07Z

Hi @jameslamb,

Hope it's ok for me to jump in here - I contribute to pandas and Polars, and have fixed up several issues related to the interchange protocol mentioned in dmlc/xgboost#10452

The interchange protocol provides a standardised way of converting between dataframe libraries, but has several limitations which may affect you, so I recommend not using it:

no support for Series input
unsupported datatypes (e.g. Date, nested datatypes)
unreliable implementations: using it to convert to pandas is not recommended for pandas<2.0.2, and accessing the column buffers directly isn't recommended for pandas<3.0. My biggest criticism of the project is that implementations are tied to the dataframe libraries themselves, making updates to historical versions impossible

If all you need to do is convert to pyarrow, then I'd suggest you just do

import sys

if (pl := sys.modules.get('polars', None)) is not None and isinstance(data, pl.DataFrame):
    data = data.to_arrow()
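The reason this check needs no dependency on polars can be sketched as follows: if the caller really passed a polars DataFrame, polars must already be in `sys.modules`, so looking it up there never triggers an import. `is_polars_dataframe` is a made-up name for this illustration.

```python
import sys


def is_polars_dataframe(data) -> bool:
    # Look polars up without importing it; None means nobody has imported it yet,
    # in which case `data` cannot be a polars DataFrame.
    pl = sys.modules.get("polars")
    return pl is not None and isinstance(data, pl.DataFrame)


loaded_before = "polars" in sys.modules
print(is_polars_dataframe([1, 2, 3]))  # False
print(("polars" in sys.modules) == loaded_before)  # True: the check imported nothing
```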

If instead you need to perform dataframe operations in a library-agnostic manner, then Narwhals, an extremely lightweight compatibility layer between dataframe libraries which has zero dependencies, may be of interest to you (Altair recently adopted it for this purpose, see vega/altair#3452, as did scikit-lego)

I'd be interested to see how I could be of help here, as I'd love to see Polars support in LightGBM happen 😍 - if it may be useful to have a longer chat about how it would work, feel free to book some time in https://calendly.com/marcogorelli

kylebarron · 2024-07-29T13:55:04Z

I wonder... could we add direct, transparent support for polars inputs in lightgbm without adding a dependency on polars by just doing something like this?

@ritchie46 pointed out this discussion to me, and I wanted to highlight recent work around the Arrow PyCapsule Interface. It's a way for Python packages to exchange Arrow data safely without prior knowledge of each other. If the input object has an __arrow_c_stream__ method, then you can call it to get a PyCapsule containing an Arrow C Stream pointer. Recent versions of pyarrow have implemented this in their constructors. I recently added this protocol to Polars in pola-rs/polars#17676 and there's progress on getting wider ecosystem support.

You can use:

import pyarrow as pa

assert hasattr(input_obj, "__arrow_c_stream__")
table = pa.table(input_obj)
# pass table into existing API

Alternatively, this also presents an opportunity to access a stream of Arrow data without materializing it in memory all at once. You can use the following to only materialize a single Arrow batch at a time:

import pyarrow as pa

assert hasattr(input_obj, "__arrow_c_stream__")
reader = pa.RecordBatchReader.from_stream(input_obj)
for arrow_chunk in reader:
    pass  # do something with arrow_chunk

If the source is itself a stream (as opposed to a "Table" construct where multiple batches are already materialized in memory), then you can import very large Arrow data while only materializing a single batch in memory at a time.

The PyCapsule Interface could also let you remove the pyarrow dependency if you so desired.
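As a sketch of the streaming variant, the function below counts rows one batch at a time. `count_rows_streaming` is a hypothetical name, pyarrow is imported lazily, and the snippet assumes a pyarrow release recent enough to provide `RecordBatchReader.from_stream`.

```python
def count_rows_streaming(obj) -> int:
    """Hypothetical helper: count rows of any Arrow-stream-capable object batch by batch."""
    if not hasattr(obj, "__arrow_c_stream__"):
        raise TypeError(
            f"{type(obj).__name__} does not implement the Arrow PyCapsule stream interface"
        )
    # Lazy import: only needed once we know the input speaks the protocol.
    import pyarrow as pa

    reader = pa.RecordBatchReader.from_stream(obj)
    # Only one record batch needs to be materialized at a time.
    return sum(batch.num_rows for batch in reader)
```

Passing a `pyarrow.Table` or, per the comment above, a recent polars DataFrame should both work here, since the function only relies on the `__arrow_c_stream__` method and never checks the concrete type.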

jameslamb · 2024-08-21T04:58:39Z

Thank you both for the information! I'm definitely supportive of having polars input "just work" in the Python package, assuming we can do it with a manageable amount of new complexity and maintenance burden.

Both approaches mentioned above (@MarcoGorelli's and @kylebarron 's) look interesting for us to pursue here as a first step.

Hopefully someone will be able to look into this soon (maybe @borchero or @jmoralez are interested?).

We'd also welcome contributions from anyone involved in this thread who has the time and interest, and would be happy to talk through specifics over a draft PR.

We have a lot more work to do here than hands to do it... we haven't even added support for pyarrow columns in pandas>=2.0 yet 😅 (#5739 (comment)).

For a bit more context.... lightgbm mostly does not need to do any dataframe operations on user-provided dataframes.

The main place where that does happen with pandas is in handling of pandas categorical columns (where we want to encode them as integer arrays in a way that LightGBM understands, but then also be able to recover the mapping from categories to their integer representations later at scoring time).

If you want to see what I mean by that, some of the relevant code is here:

LightGBM/python-package/lightgbm/basic.py

Line 807 in d67ecf9

def _data_from_pandas(

LightGBM/python-package/lightgbm/basic.py

Line 855 in d67ecf9

def _dump_pandas_categorical(

LightGBM/python-package/lightgbm/basic.py

Line 867 in d67ecf9

def _load_pandas_categorical(
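To illustrate the kind of round trip described here (this uses only public pandas APIs and is not LightGBM's internal code), the sketch below encodes a categorical column as integer codes at training time, stores the category list, and reuses it at scoring time. The column and variable names are invented.

```python
import pandas as pd

# Training time: remember the category order and encode as integer codes.
train = pd.DataFrame({"color": pd.Categorical(["red", "blue", "red"])})
saved_categories = list(train["color"].cat.categories)
train_codes = train["color"].cat.codes.tolist()
print(saved_categories)  # ['blue', 'red']
print(train_codes)  # [1, 0, 1]

# Scoring time: align new data to the saved categories before encoding.
score = pd.DataFrame({"color": pd.Categorical(["blue", "green", "red"])})
aligned = score["color"].cat.set_categories(saved_categories)
score_codes = aligned.cat.codes.tolist()
print(score_codes)  # [0, -1, 1]
```

The category "green" was never seen in training, so it becomes missing and gets the code -1; "blue" and "red" keep the same codes they had at training time.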

MarcoGorelli · 2024-08-21T07:39:32Z

Thanks for your response 🙏 !

Quick note on the two approaches: Narwhals exposes the PyCapsule interface (narwhals-dev/narwhals#786), and includes a fallback to PyArrow (thanks Kyle for having suggested it! 🙌 ) which arguably makes it easier to use than looking for '__arrow_c_stream__' on native objects and then manually implementing a fallback. So it doesn't have to be "one or the other" - in fact, my understanding is that VegaFusion are looking at using both of them together:

Narwhals for slicing and dicing dataframes/series in Python-land, inspecting schemas and column names
PyCapsule Interface to agnostically access the underlying data from low-level languages (I think Rust in their case)

lcrmorin · 2024-10-18T12:18:07Z

Some positive news regarding polars adoption: Kaggle has made a push in that direction. The API in some visible competitions accepts both pandas and polars DataFrames as the solution (see here - in that case the pandas version seems bugged, making polars the only viable solution).

From what I understand, polars inputs are not directly supported yet. What would be the best alternative to make it work (avoiding duplicate copies in memory, with categorical support)?

Ic3fr0g · 2024-11-20T04:25:43Z

@borchero what's the current set of blockers on polars integration? How can I help?

borchero · 2024-11-26T21:12:00Z

I only now got the opportunity to think about @kylebarron's and @MarcoGorelli's proposals in more detail. I think the premise of PyCapsule is very appealing, esp. since it seems to have official support from Apache Arrow. I would personally prefer this over a solution involving narwhals (esp. as narwhals -- even though it markets itself as "lightweight" -- is much more heavyweight in comparison).

Given my -- so far limited -- understanding of PyCapsule, I would propose the following next steps:

1. Accept any objects implementing the PyCapsule interface as input for the methods currently accepting pyarrow objects. All of these objects can be converted into the corresponding pyarrow objects (i.e. a Table or [Chunked]Array) pretty easily. This would implicitly introduce polars support without introducing a direct dependency on polars (nice!).
2. We discuss how to add support for polars categoricals (which essentially maps to Arrow dictionaries). We wouldn't need to change anything in the Python bindings here (apart from dtype checks) but only adjust the C++ code to correctly interpret categorical columns in Arrow arrays.
3. [Optional] Get rid of the dependency on pyarrow by directly passing the PyCapsule objects to the C++ backend. I'm not entirely sure what this entails though, it seems like the C code then requires a dependency on Python.h. Afaict this would considerably increase the complexity of the compilation.

@Ic3fr0g if you want to help with this support, I would be very happy to review a PR implementing (1) :)

@lcrmorin you can already use LightGBM with polars today. Just convert your polars df into a pyarrow.Table via to_arrow which you can pass to LightGBM. Support for categorical inputs has not yet been implemented (see point 2 above).
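A minimal sketch of that workaround, for numeric columns only: `train_from_polars` is a hypothetical helper (not part of lightgbm), and it assumes a LightGBM release with Arrow input support plus pyarrow and cffi installed. The imports are done lazily inside the function purely so the definition itself has no hard dependencies.

```python
def train_from_polars(df, feature_cols, label_col, params, num_boost_round=10):
    """Hypothetical helper: train a Booster from a polars DataFrame via Arrow."""
    # Assumes a LightGBM version that accepts pyarrow Tables/Arrays as input.
    import lightgbm as lgb

    features = df.select(feature_cols).to_arrow()  # pyarrow.Table
    label = df.get_column(label_col).to_arrow()  # pyarrow.Array
    dtrain = lgb.Dataset(features, label=label)
    return lgb.train(params, dtrain, num_boost_round=num_boost_round)
```

A call might look like `train_from_polars(df, ["feature_1", "feature_2"], "label", {"objective": "regression"})`, where the column names are placeholders. Categorical columns are left out on purpose, since they are not handled on this path yet.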

jmoralez · 2024-11-26T21:24:56Z

Can you elaborate on how narwhals is much more heavyweight? The wheel is 225kB and it doesn't have any dependencies.

Your proposal seems to focus specifically on arrow, so we'd have two implementations to convert dataframes to the required LightGBM format, one for pandas and one for arrow. I think narwhals would allow us to have a single one.

adjust the C++ code to correctly interpret categorical columns in Arrow arrays

This seems way more complex than what we do right now for pandas, which is mapping categorical values to their integer codes. I think it would be a lot easier doing that for arrow as well.

MarcoGorelli · 2024-11-26T21:29:53Z

Thanks @borchero for looking into this! Personally I'm much more excited about Polars support than about spreading Narwhals, so I'm really happy that this might be happening

@jmoralez I think it depends on what someone does with dataframes - if all someone needs to do is to convert to PyArrow table, then I agree that Narwhals is more heavyweight than just doing

if hasattr(df_user, '__arrow_c_stream__'):
    df_pyarrow = pa.table(df_user)

The use case for Narwhals is more for when you want to do everything with the user's input library without converting to PyArrow / pandas - for example, when working from Python. But if LightGBM has the chance to operate on the underlying data anyway from C++, then there's probably no need for Narwhals here

borchero · 2024-11-26T21:51:38Z

Can you elaborate on how narwhals is much more heavyweight? The wheel is 225kB and it doesn't have any dependencies.

I don't think it's "heavyweight" in any general way. As @MarcoGorelli already pointed out though -- according to my proposal -- we'd only need a way to export data to PyArrow for which we don't need any additional dependency :)

Your proposal seems to focus specifically on arrow, so we'd have two implementations to convert dataframes to the required LightGBM format, one for pandas and one for arrow. I think narwhals would allow us to have a single one.

According to apache/arrow#39195 (comment), newer versions of pandas also have support for Arrow PyCapsule (at least for data frames) which would allow us to unify conversion for both polars and pandas data frames. [Maybe this would also easily enable us to support pandas pyarrow columns?] Of course, we'd still have to keep around the old implementation as long as we want to support older versions of pandas -- this might not be required when using narwhals (?).

This seems way more complex than what we do right now for pandas, which is mapping categorical values to their integer codes. I think it would be a lot easier doing that for arrow as well.

Iirc, a categorical column in Arrow consists of an array with (u)int32s along with a dictionary mapping these integers to the categories. The simplest implementation (which would be rather trivial) would be to simply take the array of (u)int32s, no conversion necessary. One caveat is that the categories might be mapped differently in two equal arrays (e.g. array a = [1, 2] might have a mapping 1 -> foo, 2 -> bar while array b = [2, 1] might have mapping 1 -> bar, 2 -> foo). I don't know if the same issue arises with pandas categoricals or how we treat the issue there. We'd either have to (1) sort the categories and actually apply the mapping or (2) require the user to ensure a proper mapping. The latter is a little annoying for polars.Categorical at least as the mapping depends on a global string cache (while it is no issue for polars.Enum).
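The caveat and option (1) can be sketched in plain Python, with lists standing in for Arrow dictionary arrays. Indices here are 0-based positions into the dictionary, the helper names `decode` and `reencode` are made up, and nulls are ignored for brevity.

```python
def decode(indices, dictionary):
    """Map dictionary indices back to their category values."""
    return [dictionary[i] for i in indices]


def reencode(indices, dictionary, categories):
    """Re-express indices against a shared, ordered category list."""
    position = {cat: code for code, cat in enumerate(categories)}
    return [position[dictionary[i]] for i in indices]


# Two columns holding the same values, but with different dictionaries.
a_indices, a_dict = [0, 1], ["foo", "bar"]
b_indices, b_dict = [1, 0], ["bar", "foo"]
print(decode(a_indices, a_dict) == decode(b_indices, b_dict))  # True
print(a_indices == b_indices)  # False: raw indices disagree

# Option (1): sort the categories and apply the mapping.
shared = sorted(set(a_dict) | set(b_dict))
print(reencode(a_indices, a_dict, shared))  # [1, 0]
print(reencode(b_indices, b_dict, shared))  # [1, 0]
```

Taking the raw indices as-is would treat the two columns as different data, while re-encoding against the sorted categories makes them agree, at the cost of touching every value once.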

MarcoGorelli · 2024-11-27T11:03:47Z

According to apache/arrow#39195 (comment), newer versions of pandas also have support for Arrow PyCapsule (at least for data frames) which would allow us to unify conversion for both polars and pandas data frames. [Maybe this would also easily enable us to support pandas pyarrow columns?]

pandas relies on PyArrow to export the PyCapsule Interface, so even if you were to access the underlying data directly in C++, it would still mean requiring PyArrow as a dependency. Which sounds fine to me, just checking this is understood

Of course, we'd still have to keep around the old implementation as long as we want to support older versions of pandas -- this might not be required when using narwhals (?).

True, but you could just copy this part from Narwhals and this one and do

import sys

def convert_to_pyarrow_table(native_frame):
    try:
        import pyarrow as pa  # ignore-banned-import
    except ModuleNotFoundError as exc:
        msg = "PyArrow>=14.0.0 is required"
        raise ModuleNotFoundError(msg) from exc
    # compare (major, minor) only; stands in for Narwhals' parse_version helper
    if tuple(int(part) for part in pa.__version__.split(".")[:2]) < (14, 0):
        msg = "PyArrow>=14.0.0 is required"
        raise ModuleNotFoundError(msg) from None

    if not hasattr(native_frame, "__arrow_c_stream__"):
        if (pd := sys.modules.get('pandas', None)) is not None and isinstance(native_frame, pd.DataFrame):
            # Keep supporting old pandas versions which didn't export the PyCapsule Interface
            return pa.Table.from_pandas(native_frame)
        msg = f"Given object of type {type(native_frame)} does not support PyCapsule Interface"
        raise TypeError(msg)

    return pa.table(native_frame)

If you then already have logic to deal with PyArrow Tables directly, then this should allow for supporting any dataframe library which implements the PyCapsule Interface, as well as simplifying (I think?) how you support pandas

The downside would be requiring PyArrow, and it's your decision whether or not that's acceptable (my impression is that the upsides outweigh the downsides here)

kylebarron · 2024-11-27T16:47:55Z

I'm not as familiar with usage of PyCapsules from C/Cython (I do all my Python binding from Rust), but from my perspective it seems best to directly access the Arrow stream from C, in a very similar way to how you access pyarrow data now.

I'm not entirely sure what this entails though, it seems like the C code then requires a dependency on Python.h. Afaict this would considerably increase the complexity of the compilation.

I don't think your C code would need Python.h. Your C code would receive normal Arrow data. It's just that once your underlying C code has finished, your bindings will need to call the release callback on the capsule.

borchero · 2024-11-27T19:25:49Z

I don't think your C code would need Python.h. Your C code would receive normal Arrow data. It's just that once your underlying C code has finished, your bindings will need to call the release callback on the capsule.

This sounds interesting. Unfortunately, I'm lacking a little experience with Python/C bindings and the setup in this repository, in particular. Specifically, I have no idea where I could add code to have the "bindings" call the release callback (this would have to be C code as far as I understand it).

kylebarron · 2024-11-27T19:50:12Z

It looks like existing support for Arrow is in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py. In particular

LightGBM/python-package/lightgbm/basic.py

Lines 418 to 443 in 83c0ff3

    
def _export_arrow_to_c(data: pa_Table) -> _ArrowCArray:
    """Export an Arrow type to its C representation."""
    # Obtain objects to export
    if isinstance(data, pa_Array):
        export_objects = [data]
    elif isinstance(data, pa_ChunkedArray):
        export_objects = data.chunks
    elif isinstance(data, pa_Table):
        export_objects = data.to_batches()
    else:
        raise ValueError(f"data of type '{type(data)}' cannot be exported to Arrow")

    # Prepare export
    chunks = arrow_cffi.new("struct ArrowArray[]", len(export_objects))
    schema = arrow_cffi.new("struct ArrowSchema*")

    # Export all objects
    for i, obj in enumerate(export_objects):
        chunk_ptr = int(arrow_cffi.cast("uintptr_t", arrow_cffi.addressof(chunks[i])))
        if i == 0:
            schema_ptr = int(arrow_cffi.cast("uintptr_t", schema))
            obj._export_to_c(chunk_ptr, schema_ptr)
        else:
            obj._export_to_c(chunk_ptr)

    return _ArrowCArray(len(chunks), chunks, schema)

using cffi. This is what I'm referring to as your "binding code" because it's the glue between the Python environment and the underlying C code. Then you pass that _ArrowCArray somewhere on to be interpreted later.

This binding code could be updated to directly access a stream according to the C Stream interface. You just have to call .release() whenever you're done using the data, as shown in the C example

vyasr · 2024-12-10T18:02:34Z

I don't think your C code would need Python.h. Your C code would receive normal Arrow data. It's just that once your underlying C code has finished, your bindings will need to call the release callback on the capsule.

This interpretation is generally correct. Your bindings code would be responsible for extracting the underlying pointers and handing them to your C code, at which point you would be free to operate on the underlying pointer in pure C. What this boils down to at the bindings level is some sort of call to PyCapsule_GetPointer. The bindings code will need access to Python.h, depending on how the bindings are constructed (e.g. Cython via direct PyCapsule API reference or pybind11 via its own object wrappers). I don't know offhand how to do that with cffi or swig, though, which are what I see some of in this repository.
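For readers who want to see what a `PyCapsule_GetPointer` call looks like without compiling anything, here is a CPython-only sketch using the standard-library `ctypes` module. It is not how this repository's cffi bindings work, and the capsule wraps a dummy integer rather than a real Arrow stream struct; only the capsule name `arrow_array_stream` is taken from the Arrow PyCapsule convention.

```python
import ctypes

# Declare the two CPython C-API functions we call through ctypes.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Dummy payload standing in for a C struct; must stay alive while the capsule is used.
payload = ctypes.c_int64(42)
name = b"arrow_array_stream"  # the capsule keeps a pointer to this name, so keep it alive too

# Producer side: wrap the raw address in a named capsule (no destructor here).
capsule = PyCapsule_New(ctypes.addressof(payload), name, None)

# Consumer side: recover the raw address, which could then be handed to C code.
address = PyCapsule_GetPointer(capsule, name)
value = ctypes.cast(address, ctypes.POINTER(ctypes.c_int64)).contents.value
print(address == ctypes.addressof(payload))  # True
print(value)  # 42
```

With a real Arrow capsule, the recovered address would point at an `ArrowArrayStream` struct, and the consumer would still be responsible for calling its release callback once done.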

jameslamb added the feature request label Nov 22, 2023

jameslamb mentioned this issue Nov 22, 2023

Feature Requests & Voting Hub #2302

Open

jameslamb changed the title ~~Adding support for polars for input data~~ [python-package] Adding support for polars for input data Nov 22, 2023

jameslamb mentioned this issue Dec 4, 2023

Create Dataset from Arrow format #3369

Closed

kylebarron mentioned this issue Jul 29, 2024

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

8 tasks

kklein mentioned this issue Aug 29, 2024

Add support for polars Quantco/metalearners#88

Open

jameslamb mentioned this issue Oct 5, 2024

[RFC] [python-package] remove h2o datatable support? #6662

Open
