[R-package] predict() breaks when using a Dataset stored in a file #4034

Closed
j-kreis opened this issue Mar 1, 2021 · 9 comments · Fixed by #4545

@j-kreis

j-kreis commented Mar 1, 2021

Description

On Windows, R crashes without an error message when calling lgb.Dataset.save.
On Linux, I am able to save the dataset, but predict() cannot find the saved dataset.

Reproducible example

For the Windows bug (the example given by lightgbm::lgb.Dataset.save)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
train_file =  tempfile(fileext = ".bin")
lgb.Dataset.save(dtrain, train_file)

For the Linux bug (Example given by lightgbm::lgb.load + predict using a file as input)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)

test_file =  file.path(getwd(), "test.bin")
lgb.Dataset.save(dtest, test_file)

params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- lgb.train(params = params, data = dtrain, nrounds = 5L, 
                   valids = valids, learning_rate = 1.0, 
                   early_stopping_rounds = 3L)
model_file <- tempfile(fileext = ".txt")
lgb.save(model, model_file)
load_booster <- lgb.load(filename = model_file)
model_string <- model$save_model_to_string(NULL) # saves best iteration
load_booster_from_str <- lgb.load(model_str = model_string)

model$predict(test_file)

The error:

Error in lgb.call(fun_name = "LGBM_BoosterPredictForFile_R", ret = NULL,  : 
  [LightGBM] [Fatal] Data file ��?��V doesn't exist.

Environment info

LightGBM version or commit hash:

lightgbm_3.1.1

Command(s) you used to install LightGBM

install.packages('lightgbm')
@jameslamb
Collaborator

Thanks very much for using {lightgbm} and for the excellent bug report! It's possible that the Windows part of this is related to another not-yet-solved issue (#4007), but I'm not sure yet.

For the Linux example, could you try changing uses of tempfile() to permanent files like file.path(getwd(), "model.txt") and let me know if that fixes it? That would confirm whether the problem you're facing is specific to the use of tempfiles. It would also help if you could provide the specific logs / error messages that you've summarized as "lgb.predict can not find saved dataset".

It will be another day or two before I'm able to look at this in depth, apologies.

@j-kreis
Author

j-kreis commented Mar 4, 2021

Thanks for the quick response!! The description above now uses a permanent file and shows the error message, which is still there after updating the example.

@jameslamb
Collaborator

jameslamb commented Mar 11, 2021

I started looking into this tonight. I think the two issues might be unrelated, but I'm not sure yet, so it's ok to leave them here as one thing for now.

I was able to reproduce the "Data file doesn't exist" bug on my Mac, with slightly simpler sample code.

library(lightgbm)

# set up training data
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

# set up scoring data
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(
    dataset = dtrain
    , data = test$data
    , label = test$label
)

test_file <- file.path(getwd(), "test.bin")
if (file.exists(test_file)) {
    file.remove(test_file)
}
lgb.Dataset.save(
    dataset = dtest
    , fname = test_file
)

model <- lgb.train(
    params = list(
        objective = "regression"
        , metric = "l2"
    )
    , data = dtrain
    , nrounds = 5L
    , learning_rate = 1.0
)

model$predict(test_file)

I saw this behavior on {lightgbm} 3.1.1 and on the latest commit of master (8d0669f).

@jameslamb
Collaborator

@ticarki sorry for the delay in getting back to you.

For the Windows half of this issue, I'm confident now that it's the same as #4045. I just submitted a fix for that issue (#4155). I tried your Windows example above on the branch for #4155 and no longer see a crash. Could you please try it out? You can follow the steps at #4045 (comment) to install from that feature branch.

I haven't tested yet if the problem you saw on Linux is related. I suspect that it isn't. So for now, I'm going to change the name of this issue to just describe that problem. Let me know if you disagree with how I've rephrased the title.

@jameslamb jameslamb changed the title from "[R-package] Issues with saving and reading Datasets" to "[R-package] predict() breaks when using a Dataset stored in a file" on Apr 2, 2021
@jameslamb
Collaborator

I haven't looked at this again, yet. Some of the recent changes made as part of #3016 MIGHT end up fixing this.

If no one else does it sooner, I'll come back and try to reproduce this after #3016 is complete.

@jameslamb
Collaborator

Ok, I came back to look at this tonight. I think that now, thanks to #4252, the reproducible examples above will produce a more informative error message.

Error in predictor$predict(data = data, start_iteration = start_iteration, :
[LightGBM] [Fatal] Unknown format of training data.

I realize now that the examples are trying to predict on a saved LightGBM Dataset. I don't think that is supported.

As @shiyu1994 said in #4210 (comment):

"Once the model is trained, currently we don't have any support to use the trained model to evaluate a constructed Dataset."

I believe that LGBM_BoosterPredictForFile (the underlying function from LightGBM's C++ library) currently supports only the TSV, CSV, and LibSVM formats:

auto parser = std::unique_ptr<Parser>(Parser::CreateParser(data_filename, header, boosting_->MaxFeatureIdx() + 1, label_idx,

LightGBM/src/io/parser.cpp

Lines 232 to 239 in f831808

Parser* Parser::CreateParser(const char* filename, bool header, int num_features, int label_idx, bool precise_float_parser) {
  const int n_read_line = 32;
  auto lines = ReadKLineFromFile(filename, header, n_read_line);
  int num_col = 0;
  DataType type = GetDataType(filename, header, lines, &num_col);
  if (type == DataType::INVALID) {
    Log::Fatal("Unknown format of training data.");
  }

DataType GetDataType(const char* filename, bool header,

@shiyu1994 am I right about that? If I am, I can update the documentation to clarify the supported file types.


@ticarki if you want to get predictions from a trained model and want to do that on data stored in a file, you'll have to use raw data in one of those formats for now.

Adding this to the end of the code from #4034 (comment) worked for me.

test_csv <- file.path(getwd(), "test.csv")
write.table(
    x = as.matrix(test$data)
    , file = test_csv
    , row.names = FALSE
    , col.names = FALSE
    , sep = ","
)
preds_from_file <- model$predict(test_csv, header = FALSE)
preds_in_mem <- model$predict(test$data)
identical(preds_from_file, preds_in_mem)
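Since file-based prediction expects delimited text (TSV, CSV, or LibSVM), it can help to check a file before passing it to predict(). The following is only a rough sketch using base R: looks_like_text_file() is a hypothetical helper, not part of {lightgbm}, and the NUL-byte test is a heuristic assumption rather than the detection logic LightGBM itself uses.

```r
# Hypothetical helper (not part of {lightgbm}): heuristic check that a file
# looks like delimited text rather than a binary file.
looks_like_text_file <- function(path, n_bytes = 4096L) {
    stopifnot(file.exists(path))
    # read the first few raw bytes without interpreting them as text
    head_bytes <- readBin(path, what = "raw", n = n_bytes)
    # treat empty files and files containing NUL bytes as "not text"
    length(head_bytes) > 0L && !any(head_bytes == as.raw(0L))
}

# a small CSV written the same way as in the workaround above
csv_file <- tempfile(fileext = ".csv")
write.table(
    x = matrix(c(0, 1, 1, 0), nrow = 2L)
    , file = csv_file
    , row.names = FALSE
    , col.names = FALSE
    , sep = ","
)

# a stand-in binary file (raw doubles), used here only for illustration
bin_file <- tempfile(fileext = ".bin")
writeBin(c(0, 1, 1, 0), bin_file)

is_csv_text <- looks_like_text_file(csv_file)
is_bin_text <- looks_like_text_file(bin_file)
```

A check like this could run before model$predict(some_file) to fail early with a clearer message. It cannot tell whether a text file is actually valid CSV, TSV, or LibSVM.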

@shiyu1994
Collaborator

@jameslamb Yes. Currently, a Dataset loaded from a binary file (or the binary file itself) cannot be used as input to the prediction methods. However, regarding this quote:

"As @shiyu1994 said in #4210 (comment): Once the model is trained, currently we don't have any support to use the trained model to evaluate a constructed Dataset."

This claim is wrong, as pointed out by @StrikerRUS in #4210 (comment). We can use the eval method to evaluate a constructed Dataset with a trained Booster.

@jameslamb
Collaborator

In #4545, I've proposed some documentation changes and an error message change to try to make it a bit clearer that only text files are supported in predict().

For anyone finding this issue, you can try the following sample code with the R package to evaluate a constructed Dataset stored in a file.

library(lightgbm)

# set up training data
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

# set up scoring data
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(
    dataset = dtrain
    , data = test$data
    , label = test$label
)
dtest$construct()

test_file <- file.path(getwd(), "test.bin")
if (file.exists(test_file)) {
    file.remove(test_file)
}
lgb.Dataset.save(
    dataset = dtest
    , fname = test_file
)

model <- lgb.train(
    params = list(
        objective = "regression"
        , metric = "l2"
        , learning_rate = 1.0
    )
    , data = dtrain
    , nrounds = 5L
)

# evaluate constructed dataset
model$eval(
    data = lgb.Dataset(
        data = test_file
    )$construct()
    , name = "test_set"
)
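Note that eval() reports metric values for the Dataset, not per-row predictions. For the "l2" metric used in this example (LightGBM's name for mean squared error), the reported value can be cross-checked by hand when in-memory predictions and labels are available. A minimal base R sketch, where preds and labels are made-up numbers standing in for real model output:

```r
# Hypothetical values for illustration only; in practice these would come
# from something like model$predict(test$data) and test$label.
preds <- c(0.9, 0.2, 0.4, 1.0)
labels <- c(1, 0, 0, 1)

# mean squared error, which is what the "l2" metric measures
l2_by_hand <- mean((preds - labels)^2)
```

If the hand-computed value and the value reported by eval() disagree noticeably, that usually points to a mismatch between the data used for prediction and the data stored in the Dataset file.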

jameslamb added a commit that referenced this issue Aug 25, 2021
…ed Datasets (fixes #4034) (#4545)

* documentation changes

* add list of supported formats to error message

* add unit tests

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* update per review comments

* make references consistent

Co-authored-by: Nikita Titov <[email protected]>
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023