Auto-generate model training and C# code for a Multi-class Classification task (GitHub Issues classification scenario)

In this example you will automatically train a model and generate the related C# code simply by providing a dataset (the GitHub .NET Framework issues dataset in this case) to the ML.NET CLI tool.

Note: This CLI example is related to the GitHub issues classification ML.NET sample but in this case the C# code is auto-generated by the CLI tool. You don't need to start coding in C# from scratch.

What is the ML.NET CLI (Command-line Interface)

The ML.NET CLI (command-line interface) is a tool you can run from any command prompt (Windows, macOS, or Linux) to generate good-quality ML.NET models and C# source code based on training datasets you provide.

The ML.NET CLI is part of ML.NET. Its main purpose is to "democratize" ML.NET for .NET developers who are learning it, making it very simple to generate a good-quality ML.NET model (a serialized model .zip file) plus the sample C# code to run/score that model. In addition, the C# code used to create/train that model is also generated for you, so you can investigate which algorithm and settings were used for that generated "best model".

Run the CLI command to generate the ML model and C# code for the GitHub .NET Framework issues dataset

From a command prompt (PowerShell, Bash, or CMD), move to the 'Multiclass Classification CLI sample' folder:

> cd <YOUR_PATH>samples/CLI/MulticlassClassification_CLI

Now run the following ML.NET CLI command:

> mlnet auto-train --task multiclass-classification --dataset corefx-issues-train.tsv --label-column-name Area --max-exploration-time 300

You should see command execution output similar to the following:

(Screenshot: CLI running)

This process performs multiple training explorations, trying multiple trainers/algorithms and multiple hyper-parameters with a different combination of settings for each model.

IMPORTANT: Note that in this case the CLI explores multiple trainings, looking for "best models", for only 5 minutes. That is enough when you are just learning how to use the CLI and the generated C# code for the model. But when trying to optimize the model to achieve high quality, you might need to run the CLI 'auto-train' command for many more minutes, or even hours, depending on the size of the dataset.

As a rule of thumb, a high-quality model might need hundreds of iterations (hundreds of models explored automatically by the CLI).

When the command finishes the training explorations, you get a summary like the following:

(Screenshot: CLI training summary)

To understand the 'quality metrics', read this doc: Model evaluation metrics in ML.NET.
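
If you later want to compute that same kind of quality metric yourself with the ML.NET API, the following is a minimal sketch (an illustration, not part of the CLI output). It assumes you already have an MLContext (mlContext), a trained model (trainedModel), and a test IDataView (testDataView), and that 'Area' is the label column, as in this dataset:

IDataView predictions = trainedModel.Transform(testDataView);
var metrics = mlContext.MulticlassClassification.Evaluate(predictions, labelColumnName: "Area");
Console.WriteLine($"MacroAccuracy: {metrics.MacroAccuracy:0.###}");
Console.WriteLine($"MicroAccuracy: {metrics.MicroAccuracy:0.###}");
Console.WriteLine($"LogLoss:       {metrics.LogLoss:0.###}");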

That command generates the following assets in a new folder (if no --name parameter was specified, its name is 'SampleMulticlassClassification'):

  • A serialized "best model" (MLModel.zip) ready to use.
  • Sample C# code to run/score that generated model (To make predictions in your end-user apps with that model).
  • Sample C# code with the training code used to generate that model (For learning purposes or direct training with the API).

The first two assets (.ZIP file model and C# code to run that model) can directly be used in your end-user apps (ASP.NET Core web app, services, desktop app, etc.) to make predictions with that generated ML model.
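
For illustration, here is a minimal sketch of how the generated model could be consumed from a console app. The class and property names (ModelInput, ModelOutput, Title, Description, Area) and the model path are assumptions based on this sample's dataset and output folder, not the exact code generated by the CLI:

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelInput
{
    // Property names assumed to match the dataset columns.
    public string Area { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
}

public class ModelOutput
{
    [ColumnName("PredictedLabel")]
    public string Prediction { get; set; }
    public float[] Score { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();

        // Load the serialized "best model" produced by the CLI.
        ITransformer model = mlContext.Model.Load("SampleMulticlassClassification/MLModel.zip", out var inputSchema);

        // Create a prediction engine for single predictions
        // (not thread-safe; web apps should use a PredictionEnginePool instead).
        var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);

        var issue = new ModelInput
        {
            Title = "WebSocket connections are closed unexpectedly under load",
            Description = "When many concurrent clients are connected, WebSocket connections start dropping with no error message."
        };

        ModelOutput result = engine.Predict(issue);
        Console.WriteLine($"Predicted Area: {result.Prediction}");
    }
}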

The third asset, the training code, shows you what ML.NET API code was used by the CLI to train the generated model, so you can investigate which specific trainer/algorithm and hyper-parameters were selected by the CLI.

Go ahead and explore the generated C# projects' code and compare it with the GitHub issues classification ML.NET sample in this repo. The accuracy and performance of the model generated by the CLI should be better than those of the sample in the repo, which uses simpler ML.NET code with no additional hyper-parameters.

For instance, the configuration for one of the trainers used in the GitHub issues classification ML.NET sample (SdcaMaximumEntropy) is simplified to make ML.NET easier to learn (but might not produce the most optimal model), so it looks like the following code, with no hyper-parameters:

var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features");

On the other hand, after 1 hour of exploration time with the CLI, the selected algorithm/trainer (LightGbm) was configured with the following code, which includes quite a few hyper-parameters, all generated for you:

var trainer = mlContext.MulticlassClassification.Trainers.LightGbm(new LightGbmMulticlassTrainer.Options()
{
    NumberOfIterations = 150,
    LearningRate = 0.1254156f,
    NumberOfLeaves = 9,
    MinimumExampleCountPerLeaf = 20,
    UseCategoricalSplit = false,
    HandleMissingValue = false,
    MinimumExampleCountPerGroup = 100,
    MaximumCategoricalSplitPointCount = 64,
    CategoricalSmoothing = 20,
    L2CategoricalRegularization = 0.1,
    UseSoftmax = true,
    Booster = new GradientBooster.Options()
    {
        L2Regularization = 0.5,
        L1Regularization = 1
    },
    LabelColumnName = "Area",
    FeatureColumnName = "Features"
});
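
For context, the following is a minimal sketch (not the CLI's exact generated code) of how a trainer like the one above typically plugs into a full training pipeline for this dataset. The featurization column names (TitleFeaturized, DescriptionFeaturized) and the data-loading step are assumptions; only the Area/Title/Description columns come from the dataset itself:

// Load the training data; the input class would need [LoadColumn] attributes
// matching the TSV column order for this to work as-is.
IDataView trainingDataView = mlContext.Data.LoadFromTextFile<ModelInput>(
    "corefx-issues-train.tsv", hasHeader: true, separatorChar: '\t');

// Featurize the text columns, concatenate them into "Features",
// map the "Area" label to a key, train, and map the predicted key back to its value.
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Area")
    .Append(mlContext.Transforms.Text.FeaturizeText("TitleFeaturized", "Title"))
    .Append(mlContext.Transforms.Text.FeaturizeText("DescriptionFeaturized", "Description"))
    .Append(mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
    .Append(trainer)  // e.g. the LightGbm trainer configured above
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

ITransformer trainedModel = pipeline.Fit(trainingDataView);
mlContext.Model.Save(trainedModel, trainingDataView.Schema, "MLModel.zip");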

If you run the CLI for a longer time, exploring additional algorithms/trainers, the selected algorithm configuration would probably change and improve.

Finding those hyper-parameters by yourself would be a very long and tedious trial-and-error process. With the CLI and AutoML, this is greatly simplified for you.

Next steps: Use your own dataset for creating models for your own scenarios

You can generate the assets explained above from your own datasets without writing the code yourself, which improves your productivity even if you already know ML.NET. Try your own dataset with the CLI!

See also

Step-by-step CLI tutorial, getting started from scratch (note that the tutorial focuses on a binary-classification ML task, not a multiclass classification task, but the CLI commands are pretty similar):

Tutorial: Auto generate a binary classifier using the CLI