
Commit

Update elephants tutorial
piotrjurkiewicz committed Sep 24, 2024
1 parent 8af24de commit b964bfd
Showing 1 changed file with 30 additions and 15 deletions.
45 changes: 30 additions & 15 deletions docs/elephants_tutorial.rst
@@ -69,6 +69,20 @@ The `skl.train_classifiers` module provides an example script for training class
.. code-block:: shell-session
(venv) user@host:~/flow-models$ python3 flow_models/elephants/skl/train_classifiers.py --help
usage: train_classifiers.py [-h] [-O OUTPUT] [--seed SEED] [--fork] [--jobs JOBS] directory
Trains and evaluates sklearn classifier models to classify elephant flows.
positional arguments:
directory binary flow records directory
options:
-h, --help show this help message and exit
-O OUTPUT, --output OUTPUT
results output directory
--seed SEED seed
--fork fork to subprocess for each simulation
--jobs JOBS maximum number of simultaneous subprocesses
The compulsory argument ``directory`` is a path to a directory containing binary flow records. In our case, this will be ``data/agh_2015061019_IPv4_anon/sorted``.
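
For example, a complete invocation could look like this (the output directory ``results/elephants`` is an arbitrary name chosen for illustration):

.. code-block:: shell-session

    (venv) user@host:~/flow-models$ python3 flow_models/elephants/skl/train_classifiers.py \
        -O results/elephants data/agh_2015061019_IPv4_anon/sorted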

@@ -81,7 +95,7 @@ When ``fork`` parameter is not specified, all models will be trained and evaluat
.. note::
The ``fork`` option has to be used carefully, especially in environments with limited memory. Some models can use a significant amount of memory during training. Therefore, it needs to be ensured that the combined memory requirements of multiple jobs running in parallel do not exceed the available memory.

Many ``sklearn`` algorithms have built-in internal parallelization and are able to utilize all cores on the machine anyway. This can be enabled by passing ``{'n_jobs': -1}`` in the model parameters. In such cases, the gain from using `fork` is limited to the evaluation phase, which is performed on a single core.
Many ``sklearn`` algorithms have built-in internal parallelization and are able to utilize all cores on the machine anyway. This can be enabled by passing ``{'n_jobs': -1}`` in the model parameters. In such cases, the gain from using ``fork`` is limited to the evaluation phase, which is performed on a single core.
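
For example, a hypothetical ``algos`` entry enabling this behavior could look as follows (the classifiers and parameters actually used in the script may differ):

.. code-block:: python

    from sklearn.ensemble import RandomForestClassifier

    algos = [
        # -1 asks scikit-learn to use all available cores for this model
        (RandomForestClassifier, {'n_jobs': -1}),
    ]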

Exploring the script code
=========================
@@ -103,8 +117,8 @@ Within the `train_classifiers` script, the machine learning algorithms used for
Each element in the ``algos`` list is a 2-tuple containing:

1. A scikit-learn ``Classifier`` class.
2. A dictionary of parameters to be passed to the classifier during initialization.
- A scikit-learn ``Classifier`` class.
- A dictionary of parameters to be passed to the classifier during initialization.

The algorithms are trained and evaluated sequentially, in the order they appear in the list.
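
A minimal sketch of how such a list is typically consumed (illustrative only, not the script's actual code):

.. code-block:: python

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    algos = [
        (DecisionTreeClassifier, {}),
        (RandomForestClassifier, {'n_estimators': 20}),
    ]

    for clf_class, clf_params in algos:
        clf = clf_class(**clf_params)   # instantiate with the given parameters
        # ... train on the prepared features and labels, then evaluate ...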

@@ -118,8 +132,8 @@ The ``data_par`` dictionary controls how many flow records are used for training
# data_par = {'skip': 0, 'count': 1000000}
data_par = {}
- ``skip``: Defines the number of initial flow records to skip from the dataset.
- ``count``: Specifies the number of flow records to process after the skipped ones.
- ``skip`` - Defines the number of initial flow records to skip from the dataset.
- ``count`` - Specifies the number of flow records to process after the skipped ones.

By adjusting these parameters, you can focus the training on a specific subset of the data.
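
For instance, a hypothetical configuration that skips the first 500000 flow records and trains on the next 250000 would be:

.. code-block:: python

    data_par = {'skip': 500000, 'count': 250000}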

@@ -134,8 +148,9 @@ The ``prep_params`` list defines different combinations of input data preparatio
prep_params = [{}, {'bits': True}, {'octets': True}]
Each dictionary in ``prep_params`` specifies a unique way to preprocess the input data. Training and evaluation are performed for all the combinations listed. The options include:
- ``bits``: Transforms each component of the 5-tuple (source IP, destination IP, source port, destination port, protocol) into individual bits, treating them as separate features.
- ``octets``: Splits any 5-tuple field longer than 8 bits into separate byte features.

- ``bits`` - Transforms each component of the 5-tuple (source IP, destination IP, source port, destination port, protocol) into individual bits, treating them as separate features.
- ``octets`` - Splits any 5-tuple field longer than 8 bits into separate byte features.

When no flags are set, the 5-tuple fields are used as 32-bit integers corresponding to the features ``(source IP, destination IP, source port, destination port, protocol)``.
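
As a rough illustration of what the ``bits`` and ``octets`` transformations mean for a single field (this is not the library's actual preprocessing code), a 32-bit source IP address could be expanded as follows:

.. code-block:: python

    import numpy as np

    src_ip = np.array([0xC0A80001], dtype=np.uint32)   # 192.168.0.1 as a 32-bit integer

    # 'bits': one binary feature per bit -> 32 features per address
    bit_features = ((src_ip[:, None] >> np.arange(31, -1, -1)) & 1).astype(np.uint8)

    # 'octets': one feature per byte -> 4 features in the range 0-255
    # (byte order shown assumes a little-endian host)
    octet_features = src_ip.view(np.uint8).reshape(-1, 4)[:, ::-1]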

@@ -149,8 +164,8 @@ The ``modes`` list controls which evaluation modes are activated:
# modes = ['train', 'test']
modes = ['test']
- ``test`` mode: Evaluates the model on a test dataset that does not overlap with the training data.
- ``train`` mode: Evaluates the model on the same data used for training. This is useful for diagnosing whether the model is learning effectively or merely memorizing the training data (i.e., overfitting).
- ``test`` mode - Evaluates the model on a test dataset that does not overlap with the training data.
- ``train`` mode - Evaluates the model on the same data used for training. This is useful for diagnosing whether the model is learning effectively or merely memorizing the training data (i.e., overfitting).

Flow labeling: ``train_decision``
---------------------------------
@@ -164,8 +179,8 @@ In binary classification, the model's output is a binary decision (0/1). In our
The ``train_decision`` array holds the labels for each flow in the training dataset. The ``prepare_decision`` function generates these labels based on the flow sizes, aiming to achieve the desired traffic coverage in the training dataset. It does so by sorting the flows in descending order of size and labeling the largest flows as elephants until the specified traffic coverage is reached. Alternatively, training labels can be generated using a size threshold, by applying a boolean comparison directly to the flow size array.
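
A minimal sketch of this coverage-based labeling logic (the actual ``prepare_decision`` implementation may differ, and the coverage and threshold values below are arbitrary examples):

.. code-block:: python

    import numpy as np

    def coverage_labels(octets, coverage=0.8):
        """Label the largest flows as elephants until they carry
        the requested fraction of the total traffic."""
        order = np.argsort(octets)[::-1]                 # flow indices, largest first
        share = np.cumsum(octets[order]) / octets.sum()  # cumulative traffic share
        n_elephants = np.searchsorted(share, coverage) + 1
        labels = np.zeros(len(octets), dtype=bool)
        labels[order[:n_elephants]] = True
        return labels

    # Alternative: a plain size threshold (1 MB here is an arbitrary example)
    # train_decision = train_octets > 1000000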

Flow selection using ``idx``
----------------------------
Dataset shrinking: ``idx``
--------------------------

.. code-block:: python
@@ -174,8 +189,8 @@
The ``idx`` variable allows further limiting the dataset used for training. The ``top_idx`` function retrieves the indices of the largest flows within the dataset. This function can be used to shrink the training dataset, for instance, by selecting 5% of the largest flows and an additional 5% of randomly selected smaller flows. Such a reduction can greatly decrease training time with only a minor impact on the model's accuracy. To use all flows in the training dataset, simply pass ``Ellipsis`` as ``idx``.
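
The selection described above could be sketched roughly as follows (``top_idx`` itself may be implemented differently; the fractions and seed are example values):

.. code-block:: python

    import numpy as np

    def shrink_idx(octets, top_fraction=0.05, random_fraction=0.05, seed=0):
        """Indices of the largest flows plus a random sample of the remaining flows."""
        rng = np.random.default_rng(seed)
        order = np.argsort(octets)[::-1]        # flow indices, largest first
        n_top = int(len(octets) * top_fraction)
        n_random = int(len(octets) * random_fraction)
        top = order[:n_top]
        rest = rng.choice(order[n_top:], size=n_random, replace=False)
        return np.concatenate([top, rest])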

Handling class imbalance with ``sample_weight``
-----------------------------------------------
Handling class imbalance: ``sample_weight``
-------------------------------------------

.. code-block:: python
@@ -187,7 +202,7 @@ Handling class imbalance with ``sample_weight``
Class imbalance between elephant and mouse flows poses a challenge for machine learning models, often leading to reduced classification accuracy. The ``sample_weight`` parameter can help address this imbalance during model training.

One approach, outlined in the commented section ``Balanced sample weights``, involves normalizing the sample weights so that the sum of the weights for both classes is the same. This balances the number of samples from each class used for training.
One approach, outlined in the commented section *Balanced sample weights*, involves normalizing the sample weights so that the sum of the weights for both classes is the same. This balances the number of samples from each class used for training.

However, in practice, we found that using the square root of the flow size (in bytes) as the sample weight provides better results. This method adjusts the weight of each sample according to the flow size, favoring larger flows while still considering the smaller ones.
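
Both weighting schemes could be sketched as follows (the arrays below are toy stand-ins for the script's training data):

.. code-block:: python

    import numpy as np

    # Toy stand-ins: per-flow sizes in bytes and boolean elephant labels.
    train_octets = np.array([120.0, 840.0, 15000000.0, 64000000.0])
    train_decision = train_octets > 1000000      # illustrative threshold labels

    # Balanced weights: give both classes the same total weight.
    balanced_weights = np.ones(len(train_octets))
    balanced_weights[train_decision] = (~train_decision).sum() / train_decision.sum()

    # Size-based weights, as described above: square root of the flow size.
    sample_weight = np.sqrt(train_octets)

    # Either array can then be passed to fit() for classifiers that support it, e.g.:
    # clf.fit(train_features, train_decision, sample_weight=sample_weight)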

@@ -217,7 +232,7 @@ For the purpose of this tutorial, we will run the experiment with the following
prep_params = [{}, {'bits': True}, {'octets': True}]
modes = ['test']
To speed up the process, we will use the following setting: ``idx = top_idx(train_octets, 0.1)``.
To speed up the process, we will use the following setting: ``idx = top_idx(train_octets, 0.1)``

On a server equipped with 2x Intel Xeon Silver 4114 CPUs running at 2.20GHz (40 logical cores), completing the training and evaluation with the selected models takes approximately 5 hours when running in single-process mode (without the ``--fork`` option). During this time, the peak memory usage by the process is about 8 GB.

