This document concerns the design for the future high-level external memory interface for XGBoost. The closest existing examples are data loaders in deep learning libraries, and there's no standardized interface. I want to tap into your wisdom and experience to build a new and general interface.
Background
In the last few months, we have significantly improved the external memory implementation, and it's inching toward becoming production-ready. The improvement is an essential milestone for XGBoost. Memory capacity has been a pain point for users for a long time. Not being a stochastic algorithm, XGBoost has to load all data into the main memory for efficient model training. This constraint restricts how much data a user can feed into the training algorithm.
With the help of the external memory version of XGBoost, we can now fetch data from a different (presumably larger) memory location, be it host memory for GPU or persistent memory for CPU. During training, XGBoost constructs a cache in the external memory and fetches the data from this cache. To build this cache, users first define a custom iterator that loads raw data in batches. For a full example, please see https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py .
The iterator is considered the native interface. We want to build something at a higher level that can be composed with other libraries and downstream projects. With this new input schema, we are getting closer to what deep learning libraries do, like the data loader from PyTorch. However, since XGBoost is mainly built upon the idea that it can access all the data during training, existing interfaces are biased toward the scikit-learn-like `fit` method, where a user specifies all data either as a dataframe or as an array-like object.
The mismatch between the two interfaces means scikit-learn utilities like cross-validation and calibration are no longer applicable. As a result, we need to design a new interface or paradigm for batch-based algorithms.
Due to the flexible nature of DL architectures, there's no standard way to perform these wrapper tasks, and hence, it's unlikely that we can simply adopt the tools from the DL community. In addition, due to the size of modern-day DL models, cross-validation is not a particularly important topic for DL training.
Features
I want to list a set of features the new interface might possess. I will move from XGBoost-specific feature requests to more general wishes. This list is up for discussion and is not a hard requirement. You are more than welcome to make any changes.
We should be able to design a cross-validation routine for this interface for both standard regression and time series (rolling windows). In addition, this CV procedure is likely to be part of the XGBoost package instead of a general routine for other libraries, as we can make specialized optimizations like sharing histogram bins between folds. The CV procedure doesn't have to be as general as the scikit-learn version. For example, we can require the user to pre-shuffle the input and simply load portions sequentially.
```python
class XGBClassifierCV(Base):
    ...
```
The interface should be easy for downstream libraries to use. For example, if we want to define a doubly robust or an X-learner causal inference estimator with XGBoost, there should be a canonical way to do it.
The interface should be designed with distributed computing in mind. Since distributed frameworks like Dask and Spark manage partitions, we would like to reduce the framework-specific glue code for training and CV.
We should impose as few constraints as possible. Data loading is highly flexible. The data might come from S3 buckets, and the loading might be done internally using GPUDirect. The data may be generated on the fly, which is not unheard of with time series, as users might generate the rolling windows in their custom pipeline. We should allow all potential solutions.
Bindings
I don't have a strong preference for any particular language. Any design paradigm is welcome, and we can generalize it to other languages.