[FLINK-31010] Add Transformer and Estimator for GBTClassifier and GBTRegressor #210

Fanoid · 2023-02-10T09:03:55Z

What is the purpose of the change

Add Transformer and Estimator for GBTClassifier and GBTRegressor.

Details about features compared to SparkML's implementation are as follows:

Implemented in this PR: basic binary classification and regression (squared loss only).
Implemented and not supported in SparkML: 2nd-order approximation of the loss function as impurity (this is an important feature supported by XGBoost and LightGBM [1]).
Not implemented yet, but parameters added: early stopping with validation set, encoding with leaf id, and weight columns.
Not implemented yet: classification threshold, absolute loss for regressor, feature importance, and 1st-order gradient.
Not expected to be supported: maxMemoryInMB, cacheNodeIds, and checkpointInterval.

[1] https://xgboost.readthedocs.io/en/stable/tutorials/model.html#the-structure-score
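
For readers unfamiliar with the structure score referenced in [1], the sketch below shows how first- and second-order statistics of the loss can drive split selection. It follows the formulas in the XGBoost tutorial rather than this PR's code; the class name and the regularization terms `lambda` and `gamma` are assumptions made for illustration.

```java
/** Illustrative only: second-order split scoring as described in the XGBoost tutorial [1]. */
final class SecondOrderSplitSketch {

    /** Gradient of log loss w.r.t. the raw score, for label y in {0, 1} and probability p. */
    static double logLossGradient(double y, double p) {
        return p - y;
    }

    /** Hessian (second derivative) of log loss w.r.t. the raw score. */
    static double logLossHessian(double p) {
        return p * (1.0 - p);
    }

    /** Optimal leaf weight: w* = -G / (H + lambda). */
    static double leafWeight(double sumGrad, double sumHess, double lambda) {
        return -sumGrad / (sumHess + lambda);
    }

    /** Per-node score term: G^2 / (H + lambda). */
    static double score(double sumGrad, double sumHess, double lambda) {
        return sumGrad * sumGrad / (sumHess + lambda);
    }

    /** Gain = 1/2 * (score(left) + score(right) - score(parent)) - gamma. */
    static double splitGain(
            double gl, double hl, double gr, double hr, double lambda, double gamma) {
        double parent = score(gl + gr, hl + hr, lambda);
        return 0.5 * (score(gl, hl, lambda) + score(gr, hr, lambda) - parent) - gamma;
    }
}
```

A split is worth taking only when `splitGain` is positive; a 1st-order variant would ignore the Hessian sums, which is why the two are listed as separate features above.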

Brief change log

Add implementation of gradient-boosting trees.
Add Transformer and Estimator for GBTClassifier and GBTRegressor.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): yes
The public API, i.e., is any changed class annotated with @Public(Evolving): no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? JavaDocs

lindong28

Thanks for the PR. Left some comments below.

flink-ml-lib/src/main/java/org/apache/flink/ml/classification/gbtclassifier/GBTClassifier.java

flink-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/GBTModelParams.java

flink-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/loss/LogLoss.java

flink-ml-lib/src/main/java/org/apache/flink/ml/classification/gbtclassifier/GBTClassifier.java

flink-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/GBTRunner.java

flink-ml-lib/src/main/java/org/apache/flink/ml/common/param/HasMinInfoGain.java

...-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/datastorage/IterationSharedStorage.java

flink-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/defs/Split.java

Fanoid · 2023-03-01T08:23:46Z

Hi @lindong28, thanks for your valuable comments. I've updated the PR based on the comments and offline discussions. Please take a look.

lindong28

Thanks for the update! Left some comments below.

flink-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/BaseGBTParams.java

flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorageBody.java

lindong28 · 2023-03-06T06:21:21Z

...k-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/operators/CalcLocalSplitsOperator.java

+    private transient SharedStorageContext sharedStorageContext;
+
+    public CalcLocalSplitsOperator() {
+        sharedStorageAccessorID = getClass().getSimpleName() + "-" + UUID.randomUUID();


Would it be simpler and more reliable to use StreamOperator#getOperatorID as the accessor ID?

Operators in a given operator graph are guaranteed to have different operatorIDs. A given operator is guaranteed to have the same operatorID after the job is restarted as long as the job graph is the same. And users can manually specify the operatorID for operators in a job so that the operatorID will be the same even if the job graph is changed.

If we can re-use the operatorID as the accessorID, maybe we can remove the method SharedStorageStreamOperator#getSharedStorageAccessorID.

Fanoid · 2023-03-08

Actually, I've tried StreamOperator#getOperatorID before.

However, the operator ID cannot be obtained before execution, so we are unable to specify the owner map of shared data items when building the graph. Without the owner map, it is difficult to control access at runtime.

lindong28 · 2023-03-10

It seems that there is a way to remove sharedStorageAccessorID and still pass all tests. We can discuss the code change offline.

...ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorageContextImpl.java

flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorage.java

lindong28 · 2023-03-06T14:58:20Z

flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorage.java

+    }
+
+    static class Reader<T> {
+        protected final Tuple3<StorageID, Integer, String> t;


Can we use a name more readable than t (e.g. itemId)?

lindong28 · 2023-03-06T14:59:51Z

flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorage.java

+            Preconditions.checkState(owners.get(t).equals(ownerId));
+        }
+
+        void set(T value) {


Would it be a bit more consistent with the existing ListStateWithCache#update to name this method update(...)?

...ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorageContextImpl.java

flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/ItemDescriptor.java

lindong28 · 2023-03-10T00:53:49Z

.../src/main/java/org/apache/flink/ml/common/gbt/operators/CacheDataCalcLocalHistsOperator.java

+                OperatorStateUtils.getUniqueElement(histBuilderState, HIST_BUILDER_STATE_NAME)
+                        .orElse(null);
+
+        sharedStorageContext.initializeState(this, getRuntimeContext(), context);


Instead of passing the ownerMap to SharedStorageContextImpl and using it to determine which ItemDescriptors are owned by this operator, would it be more straightforward to have this operator pass the list of ItemDescriptors it owns directly to this method?

Then we should be able to simplify the code by removing e.g. SharedStorageStreamOperator#getSharedStorageAccessorID and SharedStorageContextImpl#setOwnerMap.

lindong28 · 2023-03-10T01:16:44Z

...ml-core/src/main/java/org/apache/flink/ml/common/sharedstorage/SharedStorageContextImpl.java

+/** Default implementation of {@link SharedStorageContext} using {@link SharedStorage}. */
+@SuppressWarnings("rawtypes")
+class SharedStorageContextImpl implements SharedStorageContext, Serializable {
+    private final StorageID storageID;


Instead of keeping the storageID here, an alternative approach is to generate the storageID once during the graph-building phase and pass it to all operators. Then each operator can pass the storageID to initializeState.

The storageID can be passed to operators via either the constructor or the setGlobalJobParameters/getGlobalJobParameters of ExecutionConfig.

If we can do this, we might be able to simplify the code and remove e.g. SharedStorageWrapper.
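
One way to picture the constructor-based variant of this suggestion is sketched below in plain Java, without any Flink dependency. All names here (`StorageIdSketch`, `SketchOperator`, `buildGraph`) are hypothetical and do not come from the PR; the sketch only shows an ID being generated once at graph-building time and handed to every operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch: one storage ID is created at graph-building time and shared by all operators. */
final class StorageIdSketch {

    /** Stand-in for an operator that receives the storage ID through its constructor. */
    static final class SketchOperator {
        private final String name;
        private final String storageId;

        SketchOperator(String name, String storageId) {
            this.name = name;
            this.storageId = storageId;
        }

        /** Stand-in for initializeState: the operator forwards the ID it was built with. */
        String initializeState() {
            return name + " attached to storage " + storageId;
        }

        String storageId() {
            return storageId;
        }
    }

    /** Generates the ID once and passes the same value to every operator in the "graph". */
    static List<SketchOperator> buildGraph(String... operatorNames) {
        String storageId = UUID.randomUUID().toString();
        List<SketchOperator> operators = new ArrayList<>();
        for (String name : operatorNames) {
            operators.add(new SketchOperator(name, storageId));
        }
        return operators;
    }
}
```

Because every operator already holds the ID, no wrapper is needed to inject it later, which is the simplification the comment is hinting at.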

zhipeng93

Thanks for the PR. I left some comments about the SharedObjects infra here.

flink-ml-lib/src/main/java/org/apache/flink/ml/common/gbt/GBTRunner.java

zhipeng93 · 2023-05-25T06:48:23Z

...core/src/main/java/org/apache/flink/ml/common/sharedobjects/SharedObjectsStreamOperator.java

+package org.apache.flink.ml.common.sharedobjects;
+
+/** Interface for all operators that need to access the shared objects. */
+public interface SharedObjectsStreamOperator {


As I understand, this PR tries to provide an infrastructure for sharing objects among multiple Flink operators through Java static variables.

To achieve this, it employs one specific Flink operator as the writer of each shared object and other operators as the readers. Based on this, the GBDT implementation relies on Flink events to guarantee the read/write order of each object.

However, can you explain which other machine learning algorithms would use SharedObjects in the future? And is there a general way for developers to guarantee that the order of reads/writes is correct? If a reader of an object changes the value of that object, does it still follow the assumption of SharedObjects?

There is another possible solution [1] that we put all the computation logic into one operator (i.e., WorkerOperator) and all the computation logic into another operator (i.e., ServerOperator). In this case, we would not need shared objects anymore. Let's have a thorough comparison between these two options.

[1] #237

Fanoid · 2023-05-26

Thanks for your valuable comments. You mentioned several issues in your comments, and I will answer them one by one.

Q: "can you explain some other machine learning algorithms that would use SharedObjects in the future?"

One common pattern where SharedObjects can be used is when the same datasets are needed in operators both before and after a reduce operator. Here I list some algorithms (correct me if I made a mistake):
‒ All distributed algorithms based on decision trees: both the model and the training data are required during node splitting after reducing the intermediate data.
‒ Second-order gradient optimizers, e.g. Newton, L-BFGS: they require the old gradient data after the reduce operation.
‒ Algorithms that have two rounds of (unmergeable) reduce operations in each iteration: GBDT, ALS, GMM, and LDA.
‒ Evaluation metrics calculated every few rounds of iterations: the metrics have to be calculated with a reduce after the model is updated.

Q: "is there a general way that developers can guarantee the order of read/writes is correct?"

A simple approach is to connect the reader and writer with an additional dummy stream. Readers read the data only after receiving elements from the stream.
This approach assumes that reads and writes are not interleaved, which should be true in most algorithms. If not (please give some examples), multi-threading techniques like atomics can be used.
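
The dummy-stream approach can be illustrated with plain Java threads, where a `BlockingQueue` stands in for the extra stream between writer and reader. This is a sketch under stated assumptions rather than the PR's implementation: the static map, the key name, and the class `ReadAfterSignalSketch` are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: a reader touches a shared object only after a signal from the writer. */
final class ReadAfterSignalSketch {
    /** Stand-in for JVM-static shared objects visible to operators in the same process. */
    private static final Map<String, double[]> SHARED = new ConcurrentHashMap<>();

    /** Runs one writer and one reader; returns what the reader observed. */
    static double[] run() {
        BlockingQueue<String> signal = new LinkedBlockingQueue<>();
        double[][] result = new double[1][];

        Thread writer = new Thread(() -> {
            SHARED.put("gradients", new double[] {0.1, 0.2});
            signal.add("gradients"); // emit an element on the "dummy stream" after writing
        });
        Thread reader = new Thread(() -> {
            try {
                String key = signal.take(); // block until the writer announces the item
                result[0] = SHARED.get(key);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start(); // started first on purpose: it still waits for the signal
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result[0];
    }
}
```

The reader is started before the writer, yet it cannot observe a missing value because it blocks on the signal; this is the ordering guarantee the dummy stream is meant to provide.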

Q: "If a reader of an object changes the value of that object, does it still follow the assumption of SharedObjects?"

I think this is a common issue in Java, and one solution is to give readers a deep clone of the object, which can be expensive. Balancing efficiency and safety, I chose efficiency.
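
The trade-off can be made concrete with a small plain-Java sketch. The holder class and its method names are hypothetical, not the PR's API: `getShared` hands out the stored reference (cheap, but a reader can mutate it), while `getCopy` pays for a clone on every read.

```java
/** Hypothetical holder contrasting a shared reference with a defensive copy. */
final class SharedArrayHolderSketch {
    private final double[] values;

    SharedArrayHolderSketch(double[] values) {
        this.values = values;
    }

    /** No copy: O(1), but writes made by the reader are visible to everyone. */
    double[] getShared() {
        return values;
    }

    /** Defensive copy: O(n) per read, reader mutations stay local. */
    double[] getCopy() {
        return values.clone();
    }

    /** Returns the first stored element, for inspecting the holder's own state. */
    double first() {
        return values[0];
    }
}
```

For a `double[]`, `clone()` is already a full copy; objects with nested mutable fields would need a recursive copy, which is where the cost mentioned above grows.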

#237 proposes a brilliant solution about abstracting iterative computations inspired by Parameter Server. I believe both solutions can work well in many algorithms/scenarios.

Before comparing two solutions, there are two facts about PS infra I must emphasize:

One fact is that the PS infra is built on the DataStream APIs, which means there will be no performance improvement compared to implementing with the raw DataStream APIs. So we mainly discuss its usability with respect to developers.

The other fact is that the functionality currently shown in #237 cannot fully meet the requirements of the GBDT implementation. MessageType, model format, reduce logic of messages, etc. are all fixed/hard-coded with respect to gradient-based algorithms. The usability will drop significantly if the GBDT implementation is forced to use the current APIs.

Therefore, to make a reasonable comparison between the two solutions, I assume an extended version of the current PS infra which supports POJO message types and POJO model data, user-defined reduce functions, etc. Here are my thoughts under this assumption:

Framework Intrusiveness

Using PS infra means developers cannot use DataStream APIs in iterations anymore. As a result, there are cases that PS infra cannot implement:

side outputs: evaluation result streams; prediction and model streams in online cases.

partition/join/coGroup of training data sets: AUC calculation after model update, ALS, SimRank.

SharedObjects, in contrast, augments the DataStream APIs and places no extra limitation on developers.

The intrusiveness also affects how operators can be observed while the job is running: in PS infra, multiple computations are merged into one operator, so per-computation information such as in/out stats and checkpoint status is no longer visible separately. This decreases usability for both developers and end users.

Applicable scenarios

Besides the inapplicable cases mentioned above, PS infra cannot work in non-iteration cases, whereas SharedObjects can. One possible case is to improve consecutive joins with the same dataset by avoiding an extra copy of the dataset.

Learning curve

PS infra provides a whole set of concepts and interfaces such as Message, ModelUpdater, ProcessStage, TrainingUtils, etc., which are not related to the existing DataStream API and have a steeper learning curve.

SharedObjects provides two interfaces, SharedObjectsUtils and SharedObjectsContext, and can be developed directly based on the existing DataStream API code, making it easier for developers to accept.

Overall, I think both solutions can coexist because they are at different levels of APIs and have no conflicts. What do you think, @zhipeng93?

Fanoid force-pushed the FLINK-31010 branch from 5c4a3ec to b07b579 Compare February 13, 2023 12:00

lindong28 marked this pull request as ready for review February 21, 2023 13:07

lindong28 reviewed Feb 22, 2023

Fanoid force-pushed the FLINK-31010 branch from 3720845 to e7b4e17 Compare March 1, 2023 06:52

lindong28 reviewed Mar 6, 2023

lindong28 reviewed Mar 10, 2023

Fanoid force-pushed the FLINK-31010 branch from a92eae9 to 3a1f57a Compare May 12, 2023 03:47

zhipeng93 reviewed May 25, 2023

Fanoid mentioned this pull request May 26, 2023

[Flink-27826] Support training very high dimensional logistic regression #237

Closed

Fanoid added 20 commits August 9, 2023 10:30

Add preprocess for GBT algorithms

cde4b9c

Add training and prediction

7d561c3

Support checkpoint for operator states

ee76f33

Add GBTClassifier

69e0774

Add GBTRegressor

2c43fa6

Fixing some missing Javadoc comment.

2fe1f72

[NO MERGE] Ad-hoc fix of KBinsDiscretizer

dbe0009

Fix tests according to ad-hoc fix of KBinsDiscretizer.

638ebee

Change LocalState datastream to JVM static memory

34797f0

Change features storage in BinnedInstance

4a76eda

[NO MERGE] Ignore GBT operators to pass Python completeness tests.

4b97064

Improve feature splitter.

c7156a4

Improve hist builder when no feature subsampling.

5525b85

Add optimized serializer for double arrays.

8932392

Fixed checkpoint problem.

74fb657

Rewrite shared storage.

99061bc

Replace inputCols and featuresCol with featuresCols.

ad8d4a7

Fix cases when reading a shared item earlier than its initialization.

b0309e4

Improve GBT params.

9d3f10f

Remove unused HasLossType.

1f592eb

Fanoid added 25 commits August 9, 2023 10:30

Improve loss func.

bd79176

Refactor params, BoostingStrategy, and Distributor.

fdaa732

Remove GBTRunnerTest.

f236703

Only call ListStateWithCache#update just before snapshot.

b21a23b

Refine some TODOs.

768820a

Improve javadoc for GBTClassifier and GBTRegressor.

132cca8

Improve categorical feature splitter by ignoring less frequent categories.

546591e

Change PredGradHess to double[].

5562e80

Improve Histogram to remove scattering.

01c4211

Add eclipse collection jars to uber jar

7f73aa1

Support output feature importance

075262d

Update setModelData to support featureImportanceTable.

28cdd28

Remove duplicated model data

bc6465d

Fix save/load for feature importance.

d4f1dd5

Fix get label when type is not double.

d8c71fc

Reduce computation for nodes with max depth.

adf3115

[NO MERGE] Ad-hoc fix for NaN values in KBinsDiscretizer

a105e35

Fix out-of-bound exception for nodeFeaturePairs

1006812

Change to streaming processing from histogram building to global splitting

ad646c4

Simplify APIs in SharedStorageContext

513ed96

Fix after merging master

e709a35

Fix SharedStorageUtilsTest

c109c76

Rename shared storage to shared objects and change according to comments

5313d51

Update codes according to comments.

e098994

Refactor share objects infra to resolve challenges.

233b885

Fanoid force-pushed the FLINK-31010 branch from 39b28a9 to 233b885 Compare August 9, 2023 02:49

Fanoid added 2 commits August 9, 2023 10:54

Remove unused files.

4db52f3

Fix imports.

0109bb8
