trafo (version 0.1.5, CHANGELOG) is a tiny random forest library written in C11. Most likely this isn't what you are looking for, but feel free to copy/fork/use or have fun finding bugs.
Features and Limitations

- Tiny: the compiled library, `libtrafo.so`, is less than 50 kB.
- Parallel processing: tree construction as well as prediction can run in parallel using OpenMP.
- Sort once: features are only sorted once; bookkeeping maintains this property throughout the tree construction.
- Splitting criterion: either Gini impurity or entropy can be used when splitting nodes (see the sketch after this list).
- Feature importance: approximations are included at almost no extra cost. The more proper (?) feature permutation algorithm is simple to add on top of the library if needed.
- Command line interface: the binary `trafo_cli` can be used to test the library on tsv/csv-formatted data, although the parser is very basic.
- Supports integer labels and floating point features.
- Does not impute missing features.
- Very little functionality besides the basics. See `trafo_cli.c` for one way to add k-fold cross validation on top of the library.
- Parameters include: the number of trees, the fraction of samples per tree, and the number of features per tree.
- Internally, features are double precision floats and labels are uint32. In 99% of all applications the combination of single precision and uint16 would probably be better; that is on the todo list.
- Only smoke tested ...
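To make the splitting criterion concrete: for a node with class counts `count[k]`, Gini impurity is 1 - Σ p_k² and entropy is -Σ p_k log₂ p_k, where p_k is the fraction of samples of class k. The sketch below is for illustration only and is not taken from trafo's internals:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Gini impurity: 1 - sum_k p_k^2 over the class fractions p_k. */
static double gini_impurity(const uint32_t *count, uint32_t n_class, uint32_t n_total)
{
    double g = 1.0;
    for (uint32_t k = 0; k < n_class; k++) {
        double p = (double) count[k] / (double) n_total;
        g -= p * p;
    }
    return g;
}

/* Entropy: - sum_k p_k * log2(p_k), with the convention 0*log2(0) = 0. */
static double entropy(const uint32_t *count, uint32_t n_class, uint32_t n_total)
{
    double h = 0.0;
    for (uint32_t k = 0; k < n_class; k++) {
        if (count[k] == 0) continue;
        double p = (double) count[k] / (double) n_total;
        h -= p * log2(p);
    }
    return h;
}

int main(void)
{
    /* 10 samples on one side of a candidate split: 7 of class 0, 3 of class 1 */
    uint32_t count[2] = {7, 3};
    printf("gini = %f, entropy = %f\n",
           gini_impurity(count, 2, 10), entropy(count, 2, 10));
    return 0;
}
```

In a typical random forest, a candidate split is scored by the weighted sum of this value over the two child nodes, and the split with the lowest score is preferred. (Link with `-lm` if you compile the sketch.)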
Load a classifier and apply it to some data
```c
#include <trafo.h>

trf * T = trafo_load("classifier.trafo");
uint32_t * class = trafo_predict(T, features, NULL, n_features);
trafo_free(T);
```
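If ground-truth labels are available for the samples being classified, a quick sanity check is to compare the returned class IDs against them. This is a minimal sketch; `labels` and `n_sample` are assumed to come from the surrounding program and are not part of the trafo API:

```c
#include <stddef.h>
#include <stdint.h>

/* Fraction of predictions that match the reference labels.
 * predicted: one class ID per sample, as returned by trafo_predict.
 * labels:    ground truth of the same length. */
static double accuracy(const uint32_t * predicted,
                       const uint32_t * labels,
                       size_t n_sample)
{
    size_t n_correct = 0;
    for (size_t i = 0; i < n_sample; i++) {
        n_correct += (predicted[i] == labels[i]);
    }
    return (double) n_correct / (double) n_sample;
}
```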
Train a classifier on labelled data and save it to disk:
```c
#include <trafo.h>

// Basic configuration via a struct.
trafo_settings C = {0};

// Required: Describe the data
C.n_sample = n_sample;
C.n_feature = n_feature;
C.F_row_major = F; // Input: Features
C.label = L;       // Input: Labels

// Optional: Algorithm settings
C.n_tree = n_tree;
// .. and possibly more.

// Fitting / Training
trf * T = trafo_fit(&C);

// Use it ...

// Save to disk
trafo_save(T, "classifier.trafo");
trafo_free(T); // And done
```
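As the CLI output further down suggests ("Features provided in row major format"), the feature matrix is laid out one sample per row, i.e. `F[i*n_feature + j]` holds feature j of sample i. The helper below is a hypothetical illustration of packing two per-sample feature arrays, `x` and `y`, into that layout; the `double` type follows the internal-types note above and is an assumption, not a statement about the exact API:

```c
#include <stddef.h>

/* Pack two per-sample feature arrays, x and y, into a flat row major
 * array F with F[i*n_feature + j] = feature j of sample i.
 * Hypothetical helper, shown only to illustrate the layout. */
static void pack_row_major(double * F, size_t n_sample,
                           const double * x, const double * y)
{
    const size_t n_feature = 2;
    for (size_t i = 0; i < n_sample; i++) {
        F[i * n_feature + 0] = x[i];
        F[i * n_feature + 1] = y[i];
    }
}
```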
See `trafo.h` for the full API. For more examples, look in `trafo_cli.c`. A minimal program to check that the library was installed properly can be found in `test/minimal_example.c`.
The random forest implementation in scikit-learn is denoted skl in the tables. Reported timings (in seconds) and memory usage (in kB) are averages over 25 runs. System: 8-core AMD 3700X running Ubuntu 24. Compiler: GCC 13.2.0.
Python timings and memory measurements
```python
# Benchmarking fitting/training
mem0 = get_peak_memory()
t1 = time.perf_counter()
clf = clf.fit(X, Y)
t2 = time.perf_counter()
mem1 = get_peak_memory()  # custom function parsing VmHWM from /proc/self/status

# Benchmarking classification
t3 = time.perf_counter()
P = clf.predict(X)
t4 = time.perf_counter()

# and the differences were used
t_train = t2 - t1
t_predict = t4 - t3
mem = mem1 - mem0
```
See test/benchmark/
for the full code.
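On the C side, `trafo_cli` prints comparable numbers (VmPeak/VmHWM, see the example output below). For reference, such a peak-memory readout can be obtained on Linux by parsing `/proc/self/status`; this is an illustrative sketch, not the code used by the benchmarks or by `trafo_cli`:

```c
#include <stdio.h>
#include <string.h>

/* Return the peak resident set size (VmHWM) in kB, or -1 on failure.
 * Linux only: parses /proc/self/status. */
static long get_peak_memory_kb(void)
{
    FILE * f = fopen("/proc/self/status", "r");
    if (f == NULL) {
        return -1;
    }
    char line[256];
    long kb = -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "VmHWM:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}
```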
Datasets:
Name | Samples | Features | Classes |
---|---|---|---|
iris | 150 | 5 | 3 |
digits | 1797 | 64 | 10 |
wine | 178 | 13 | 3 |
breast_cancer | 569 | 30 | 2 |
diabetes | 442 | 10 | 347* |
rand5_100 | 100000 | 100 | 2 |
rand6_10 | 1000000 | 10 | 2 |
(*) In the case of the diabetes dataset, some of the classes have no examples.
The scikit-learn package was configured by:
```python
clf = RandomForestClassifier(n_estimators=1)
clf.n_jobs = -1
clf.bootstrap = False
clf.max_features = X.shape[1]
clf.min_samples_split = 2
```
Detailed scikit-learn settings:

```python
{
    'bootstrap': False,
    'ccp_alpha': 0.0,
    'class_weight': None,
    'criterion': 'gini',
    'max_depth': None,
    'max_features': 10,
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'monotonic_cst': None,
    'n_estimators': 1,
    'n_jobs': -1,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False
}
```
Results:
dataset | method | t_train_avg | t_train_std | t_predict_avg | t_predict_std | mem_fit_kb |
---|---|---|---|---|---|---|
breast_cancer | skl | 0.0143084 | 0.000207006 | 0.00047848 | 5.32249e-05 | 1464 |
breast_cancer | trafo | 0.0027546 | 0.00218933 | 0.00087928 | 0.00132134 | 532 |
diabetes | skl | 0.0142767 | 0.000250696 | 0.00163288 | 5.35689e-05 | 3860 |
diabetes | trafo | 0.0160097 | 0.00231795 | 0.000682 | 0.00124117 | 291 |
digits | skl | 0.0246463 | 0.00018163 | 0.0008926 | 6.99823e-05 | 2088 |
digits | trafo | 0.012642 | 0.0014166 | 0.00043184 | 1.41921e-05 | 2069 |
iris | skl | 0.0142313 | 0.000250566 | 0.0003468 | 4.67076e-05 | 1520 |
iris | trafo | 0.00218152 | 0.00394059 | 0.0006894 | 0.00123656 | 46 |
wine | skl | 0.0142528 | 0.000175615 | 0.00035784 | 6.46583e-05 | 1448 |
wine | trafo | 0.00149696 | 0.00264111 | 0.0007174 | 0.00122883 | 189 |
Please note that trafo only uses a single thread when training a single tree. The diabetes dataset stands out as the only case where trafo does not measure better than skl: its training time is slightly higher.
For this test, skl was run by:
```python
clf = RandomForestClassifier(n_estimators=100)
clf.n_jobs = -1
clf.min_samples_split = 2
```
dataset | method | t_train_avg | t_train_std | t_predict_avg | t_predict_std | mem_fit_kb |
---|---|---|---|---|---|---|
breast_cancer | skl | 0.147509 | 0.0176639 | 0.0160279 | 0.00345635 | 3338 |
breast_cancer | trafo | 0.00583736 | 0.00237732 | 0.00140824 | 0.00170441 | 916 |
diabetes | skl | 0.119047 | 0.00666571 | 0.0244884 | 0.000727805 | 109763 |
diabetes | trafo | 0.0453076 | 0.00452934 | 0.00221952 | 0.00145607 | 727 |
digits | skl | 0.146164 | 0.0199758 | 0.02487 | 0.00208467 | 11023 |
digits | trafo | 0.018972 | 0.00157042 | 0.00170804 | 0.00140434 | 4533 |
iris | skl | 0.135705 | 0.00816796 | 0.0145676 | 0.0003384 | 2698 |
iris | trafo | 0.00215564 | 0.00246629 | 0.00067504 | 0.00117672 | 332 |
wine | skl | 0.133156 | 0.00978021 | 0.0164046 | 0.00383235 | 2734 |
wine | trafo | 0.00505332 | 0.00480287 | 0.00145852 | 0.00237973 | 307 |
In summary: scikit-learn takes up to 50 times longer to train, up to 28 times longer to predict, and uses up to 144 times as much memory as trafo.
Most likely the wrapping layer between Python and the C code adds considerable overhead for skl on these small datasets. Here are results for the two larger datasets as well:
dataset | method | t_train_avg | t_train_std | t_predict_avg | t_predict_std | mem_fit_kb |
---|---|---|---|---|---|---|
rand5_100 | skl | 6.8721 | 0.125042 | 0.218464 | 0.011936 | 347415 |
rand5_100 | trafo | 1.92905 | 0.0493303 | 0.155707 | 0.0169644 | 272928 |
rand6_10 | skl | 39.3289 | 1.0509 | 6.75931 | 0.211933 | 4206167 |
rand6_10 | trafo | 13.8396 | 0.492196 | 4.15566 | 0.220318 | 1615217 |
On these bigger datasets, skl uses about 3x the time for training, about 1.5x the time for predictions, and somewhere around 2x the memory.
Use cmake with the `CMakeLists.txt` file; something like this should do:
```
mkdir build
cd build
cmake ..
make
sudo make install
```
Then just add `-ltrafo` to the linker flags of your project.
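For example, assuming the default install locations, a single-file program can be built with something like `gcc myprog.c -ltrafo -o myprog`; adjust the include and library paths if the library was installed elsewhere.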
The command line program `trafo_cli` has the iris dataset built in and will perform some tests when called without any arguments. For the available command line arguments, use `--help`.
Example output -- command line arguments
```
$ trafo_cli --help
Usage:
--train file.tsv
table to train on
--cout file.trf
Write the classifier to disk
--ntree n
number of trees in the forest
--predict file.tsv
table of points to classify
--model file.trf
classifier to use
--classcol name
specify the name of the column that contains the class/label
--tree_samples n
Fraction of samples per tree (to override default)
--tree_features n
Number of features per tree
--min_leaf_size
How small a node can be before it is automatically turned into a leaf
--verbose n
Set verbosity level
--entropy
Split on entropy instead of Gini impurity
--xfold n
Perform n-fold cross validation
```
Example: 10-fold cross validation

```
$ trafo_cli --xfold 10 --train file.csv
```
Example output -- training on wine.tsv
```
$ trafo_cli --version
trafo_cli version 0.1.3
$ trafo_cli --train wine.tsv --ntree 1 --entropy
Reading from wine.tsv
columns: 14, data rows: 178
Found feature column "class"
Features provided in row major format (to be transposed)
Label array provided
Number of features: 13
Number of samples: 178
Number of trees: 1
Fraction of samples per tree: 1.00
Features per tree: 13
min_samples_leaf: 1
Largest label id: 2
Splitting criterion: Entropy
trafo: Forest training took 0.0020 s
Feature importance*:
# 0 : 2.1 %
# 1 : 0.0 %
# 2 : 2.1 %
# 3 : 0.0 %
# 4 : 0.0 %
# 5 : 0.0 %
# 6 : 42.1 %
# 7 : 0.0 %
# 8 : 0.0 %
# 9 : 21.5 %
# 10 : 0.0 %
# 11 : 0.0 %
# 12 : 32.2 %
VmPeak: 770996 (kb) VmHWM: 1332 (kb)
$ trafo_cli --train wine.tsv --ntree 200 --entropy
Reading from wine.tsv
columns: 14, data rows: 178
Found feature column "class"
Features provided in row major format (to be transposed)
Label array provided
Number of features: 13
Number of samples: 178
Number of trees: 200
Fraction of samples per tree: 0.63
Features per tree: 4
min_samples_leaf: 1
Largest label id: 2
Splitting criterion: Entropy
trafo: Forest training took 0.0071 s
Feature importance*:
# 0 : 9.8 %
# 1 : 4.8 %
# 2 : 3.0 %
# 3 : 3.5 %
# 4 : 4.3 %
# 5 : 6.0 %
# 6 : 17.0 %
# 7 : 1.9 %
# 8 : 3.1 %
# 9 : 10.2 %
# 10 : 9.2 %
# 11 : 13.5 %
# 12 : 13.7 %
VmPeak: 978908 (kb) VmHWM: 2872 (kb)
```
On the todo list:

- Single precision features / uint16 labels option for reduced memory usage.
- Proper benchmarks, also with entropy as the partitioning criterion.
- Say something about the precision / predictive power of this library vs other implementations:
  - Python: scikit-learn
  - R: randomForest
  - MATLAB: TreeBagger