xgboost-comprehensive with bagging aggregation #2554
Conversation
Hi @yan-gao-GY, here is my general review regarding the client and dataset. Here is my recommendation:

```python
# main
from dataset import instantiate_partitioner, train_test_split
from flwr_datasets import FederatedDataset

partitioner = instantiate_partitioner(
    partitioner_type=partitioner_type, num_partitions=num_partitions
)
# Alternatively, not `partitioner_type` but `node_id_to_samples_correlation` or just `correlation`
fds = FederatedDataset(dataset="jxie/higgs", partitioners={"train": partitioner})
partition = fds.load_partition(idx=partition_id, split="train")
partition.set_format("numpy")

# `split_rate` is not an informative keyword to me; I'd stick to e.g. `test_size` or `test_fraction`
# I'd also drop the size returns, but I think that's more a personal choice
train_data, valid_data = train_test_split(partition, test_size=test_size, seed=SEED)

# I'd rename `_reformat_data`, but it'd serve the same purpose
train_dmatrix = transform_dataset_to_dmatrix(train_data)
valid_dmatrix = transform_dataset_to_dmatrix(valid_data)
```
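For reference, a minimal sketch of the two helpers this snippet assumes (the names `train_test_split` and `transform_dataset_to_dmatrix`, and the `inputs`/`label` column names, follow the suggestion above and are illustrative, not the final API):

```python
import xgboost as xgb
from datasets import Dataset


def train_test_split(partition: Dataset, test_size: float, seed: int):
    """Split a client partition into train/validation sets (sketch)."""
    split = partition.train_test_split(test_size=test_size, seed=seed)
    return split["train"], split["test"]


def transform_dataset_to_dmatrix(data) -> xgb.DMatrix:
    """Convert a NumPy-formatted HF dataset into an xgboost DMatrix (sketch).

    Column names are an assumption here; adjust to the dataset schema.
    """
    x = data["inputs"]
    y = data["label"]
    return xgb.DMatrix(x, label=y)
```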
Also, I'd add the train and valid data as parameters to `FlowerClient` and then reference them via `self`.
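Something along these lines (a sketch; the constructor signature is illustrative, not the final code):

```python
import flwr as fl
import xgboost as xgb


class FlowerClient(fl.client.Client):
    def __init__(
        self,
        train_dmatrix: xgb.DMatrix,
        valid_dmatrix: xgb.DMatrix,
        num_train: int,
        num_val: int,
    ):
        # Store the pre-built DMatrix objects so fit/evaluate can
        # reference self.train_dmatrix / self.valid_dmatrix directly
        # instead of re-loading the data each round.
        self.train_dmatrix = train_dmatrix
        self.valid_dmatrix = valid_dmatrix
        self.num_train = num_train
        self.num_val = num_val
```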
@adam-narozniak Thanks a lot for your suggestion! I think it makes sense. I'll make the changes later.
Also, one more thing: let's make all the comments start with a capital letter. (I know that we don't necessarily even do full type hints in the examples, but let's keep it consistent across the project.)
This is inconsistent with the `pyproject.toml`.
```python
partition = fds.load_partition(idx=partition_id, split="train")
partition.set_format("numpy")

if args.centralised_eval:
```
Just one more question. In the case of centralised eval, each of the (federated) nodes also uses the centralised dataset for the federated evaluation. Is that intended, or is it controlled by the server?
Whether to do centralised eval is controlled by the server with `--centralised_eval`. If centralised eval is not enabled, the user can still choose to use either the centralised test set or a client test set (split from the client's training data) for client evaluation. E.g., running `client.py --centralised_eval` will enable client evaluation on the centralised test set.
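A sketch of how the client-side flag might select the evaluation data (the flag name follows the discussion; the data-loading calls are assumptions and may differ across flwr-datasets versions):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--centralised_eval",
    action="store_true",
    help="Evaluate on the centralised test set instead of a local split.",
)
args = parser.parse_args()

if args.centralised_eval:
    # Use the full centralised test split for client evaluation
    # (`load_full` is the assumed FederatedDataset helper here).
    valid_data = fds.load_full("test")
else:
    # Hold out a fraction of the client's own partition.
    train_data, valid_data = train_test_split(partition, test_size=0.2, seed=SEED)
```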
Co-authored-by: Daniel J. Beutel <[email protected]>
Issue
There is no easy-to-use XGBoost example with Flower.
Description
eXtreme Gradient Boosting (XGBoost) is a robust and interpretable gradient-boosted decision tree (GBDT) method. Given the robustness and efficiency of XGBoost, combining it with federated learning offers a promising solution for model training with data privacy protection.
Proposal
This example demonstrates how to perform federated XGBoost training within Flower using the `xgboost` package on the HIGGS dataset. A tree-based bagging method is used for aggregation on the server.
> **Warning**
> Note that this example uses `SizePartitioner` for FL data partitioning, so this PR should be merged after fds-size-partitioner.
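For context, a rough sketch of the bagging aggregation idea: each round, clients boost new trees on their local data, and the server appends those trees to the global booster. This is purely illustrative (it handles one client and one new tree, and skips iteration bookkeeping); the strategy in this PR is the authoritative implementation. The key paths follow xgboost's JSON model schema, which is an assumption here.

```python
import json


def aggregate_bagging(global_model: bytes, client_model: bytes) -> bytes:
    """Append a client's newest tree to the global ensemble (sketch)."""
    glob = json.loads(bytearray(global_model))
    clnt = json.loads(bytearray(client_model))

    trees = glob["learner"]["gradient_booster"]["model"]["trees"]
    new_tree = clnt["learner"]["gradient_booster"]["model"]["trees"][-1]

    # Re-index the appended tree and grow the global ensemble.
    new_tree["id"] = len(trees)
    trees.append(new_tree)
    glob["learner"]["gradient_booster"]["model"]["tree_info"].append(0)
    glob["learner"]["gradient_booster"]["model"]["gbtree_model_param"][
        "num_trees"
    ] = str(len(trees))

    return bytes(json.dumps(glob), "utf-8")
```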