- Introduction
- Data Collection
- Handling Class Imbalance
- Data Cleaning and Preprocessing
- Feature Engineering
- Model Training
- Model Evaluation
- Model Ensembling
- Implementation and Usage
- Conclusions and Future Work
- Acknowledgements
- License
Predicting hard disk drive (HDD) failures is crucial for maintaining data integrity and minimizing downtime in data centers. This project aims to develop a binary classification model using machine learning algorithms, primarily focusing on Random Forest and XGBoost, to predict the probability of HDD failures based on SMART (Self-Monitoring, Analysis, and Reporting Technology) data and other key features such as brand and storage capacity.
The dataset used in this project is provided by Backblaze, which offers detailed information about the HDDs deployed in their data centers. The data includes daily reports aggregated quarterly, containing:
- Date: When the data was recorded (yyyy-mm-dd).
- Serial Number: Manufacturer-assigned serial number.
- Model: Manufacturer-assigned model number.
- Capacity: Storage capacity in bytes.
- Failure: Binary indicator (0 for operational, 1 for failure).
- SMART Attributes: 124 columns representing raw and normalized values across 62 different SMART statistics.
- In this README, I will provide a quick overview of the steps followed to achieve the best machine learning model for this project. For a more detailed explanation or to review the code used, please refer to the accompanying Jupyter notebook (
.ipynb
). - The RandomForest model is not included here as its file size exceeds 25MB. If anyone would like to access it, they can either run the training code provided or request it from me via email or LinkedIn.
- There is also a 10-minute video in Spanish explaining this project. However, please note that the values shown in the video are different, as it was based on the initial trial. The models and code in this repository are from the second version, with improved values. Project Explanation Video (Spanish)
-
Download and Extract Data:
- Download ZIP files containing data from 2018 to the first two quarters of 2024.
- Extract the CSV files into a single directory for easier access.
-
Inspect Columns:
- Given the extensive number of columns, display them in a horizontal format for better readability.
The dataset exhibits a significant class imbalance, as selecting a random day's CSV file for training would show that only about 5-10 HDDs fail each day, out of a large number of operational drives. This means that on any given day, failures are relatively rare, leading to a highly imbalanced dataset.
- Random Sampling: Examine 5 random CSV files to determine the number of HDD failures each day.
- Data Cleaning: Remove non-informative
.csv
files that contain metadata to prevent errors during processing.
- Script Development: Create a script to iterate through all CSV files, extracting only the rows where HDDs failed.
- Dataset Consolidation: Combine these failure records with the latest CSV from June 30 to form a more balanced dataset.
- Final Dataset: Merge
CSV_fallos.csv
with the June 30th CSV to createdataset.csv
. - Shuffle and Relocate: Shuffle the combined dataset and move it to the root directory for ease of access.
Result: Approximately 17,000 failed HDDs (~5.54% of the total dataset).
A comprehensive dataset requires thorough cleaning to ensure reliability and effectiveness in training the machine learning models.
Eliminate columns that do not contribute to predicting HDD failures:
-
Non-Predictive Columns:
serial_number
datacenter
cluster_id
vault_id
pod_id
pod_slot_num
is_legacy_format
date
-
SMART Raw Values: Remove unnormalized SMART attributes, retaining only the normalized values for consistency and to reduce manufacturer dependency.
- Missing Values Analysis:
- Evaluate the percentage of missing values for each normalized SMART attribute.
- Retain attributes with ≤5-10% missing data.
- Discard rows with any missing values in the selected SMART attributes to avoid introducing noise.
Selected SMART Attributes (13 total):
- smart_1: Read error rate.
- smart_3: Spin-up time.
- smart_4: Start/stop count.
- smart_5: Reallocated sectors count.
- smart_7: Seek error rate.
- smart_9: Power-on hours.
- smart_10: Spin retry count.
- smart_12: Power cycle count.
- smart_192: Power-off retract count.
- smart_193: Load cycle count.
- smart_197: Current pending sector count.
- smart_198: Offline uncorrectable sector count.
- smart_199: UltraDMA CRC error count.
- Issue: The
capacity_bytes
column contains large values that could disproportionately influence the model. - Solution:
- Convert
capacity_bytes
to terabytes by dividing by 10^12 and rounding to two decimal places. - Create a new column
capacity_terabytes
and remove the originalcapacity_bytes
.
- Convert
Enhancing the dataset with meaningful features can significantly improve model performance.
The model
column represents the HDD model, which could provide insights into the longevity of different HDD models. However, using the model directly has drawbacks:
-
Pros:
- Potentially better survival rates by model.
-
Cons:
- Limited utility for predicting HDD models not present in the training set.
Solution: Implement a hybrid approach by extracting the manufacturer
from the model
using a Large Language Model (LLM) like Llama 3.1 70B (thanks to Groq for its free tier, which made this possible). This allows the model to generalize better across different HDDs by focusing on the manufacturer, enabling a broader scope than using specific model details.
- Approach:
- Use the LLM to map each
model
to its correspondingmanufacturer
. - Store mappings in a dictionary to avoid redundant API calls and improve efficiency.
- Assign a generic manufacturer label for unrecognized models.
- Use the LLM to map each
Note: This last point was ultimately unnecessary, as the LLM (Llama 3.1 70B) successfully extracted the manufacturer for all HDDs.
Convert the manufacturer
categorical data into a numerical format to make it usable by machine learning algorithms.
- One-Hot Encoding:
- Transform each manufacturer into a binary column.
- Advantages:
- Avoids imposing any order on manufacturers.
- Treats each manufacturer as an independent category.
- Chose this method for its flexibility, allowing unknown manufacturers to be represented by setting all columns to 0.
With a cleaned and preprocessed dataset, the next step is to train machine learning models to predict HDD failures.
-
Dataset Split:
- 90% for training (~254,700 samples).
- 10% for testing (~28,300 samples).
- Ensures class proportion is maintained in both sets.
-
Hyperparameter Tuning:
- GridSearchCV:
n_estimators
: [100, 200, 300]max_depth
: [10, 20, 30]min_samples_split
,min_samples_leaf
: Optimized for generalization.max_features
: ['sqrt', 'log2']
- GridSearchCV:
-
Objective: Maximize the F1-Score for the minority class to balance precision and recall.
-
Dataset Split:
- Consistent with Random Forest for fair comparison.
-
Hyperparameter Tuning:
- GridSearchCV:
n_estimators
: [100, 200, 300]max_depth
: [3, 6, 10]learning_rate
: [0.01, 0.1, 0.2]subsample
,colsample_bytree
: To enhance model robustness.
- GridSearchCV:
-
Objective: Utilize gradient boosting to handle complex patterns and class imbalance effectively.
Evaluating the performance of trained models is critical to understanding their effectiveness and suitability for deployment.
Visualizes the performance of both models by showing the true positives, true negatives, false positives, and false negatives.
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
No Failure | 0.98 | 0.99 | 0.99 | 27960 |
Failure | 0.88 | 0.70 | 0.78 | 1630 |
Accuracy | 0.98 | 29590 | ||
Macro Avg | 0.93 | 0.85 | 0.89 | 29590 |
Weighted Avg | 0.98 | 0.98 | 0.98 | 29590 |
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
No Failure | 0.98 | 0.99 | 0.99 | 27960 |
Failure | 0.88 | 0.70 | 0.78 | 1630 |
Accuracy | 0.98 | 29590 | ||
Macro Avg | 0.93 | 0.85 | 0.89 | 29590 |
Weighted Avg | 0.98 | 0.98 | 0.98 | 29590 |
Analyzes which features are most influential in predicting HDD failures for each model.
-
Random Forest:
- Focuses on
smart_9
(Power-on hours).
- Focuses on
-
XGBoost:
- Prioritizes
smart_198
(Offline uncorrectable sector count),smart_199
(UltraDMA CRC error count), andsmart_5
(Reallocated sector count).
- Prioritizes
This indicates that Random Forest emphasizes the age of the HDD, while XGBoost focuses more on physical failure indicators, despite both having almost the same performance.
Adjusts the decision threshold from the default 50% to optimize the F1-Score and recall.
-
Approach:
- Test thresholds from 10% to 90% in 0.5% increments.
- Select the threshold that maximizes the F1-Score, with a preference for higher recall in case of ties.
-
Results: Both models achieved a one-point increase in their F1-Score, with improvements in recall coming at a slight cost to precision.
-
Random Forest: Showed a slightly higher recall, making it marginally better at identifying failures.
-
XGBoost: Exhibited a slightly higher precision, indicating it was marginally more accurate in its positive predictions.
Each model thus demonstrated a very slight advantage in either recall (Random Forest) or precision (XGBoost), highlighting their respective strengths.
Combining the strengths of both Random Forest and XGBoost models through weighted ensemble stacking to enhance prediction performance.
-
Method:
- Assign weights to each model’s predictions.
- Experiment with various weight combinations and thresholds.
- Identify the best combinations to maximize F1-Score, precision, and recall.
-
Best Ensemble Combinations:
- Maximize F1-Score: XGBoost (0.47), Random Forest (0.53), Threshold: 37%
- Maximize Precision: XGBoost (0.34), Random Forest (0.66), Threshold: 90%
- Maximize Recall: XGBoost (0.23), Random Forest (0.77), Threshold: 10%
These combinations provide a balanced approach to handling different aspects of model performance, allowing for a more robust final prediction system.
The final implementation leverages the ensemble of both models with optimized weights and thresholds to predict HDD failures. The process involves:
- Selecting Disks: Randomly select HDDs from the test set, ensuring at least one has failed for testing purposes.
- Applying Ensemble Logic:
- F1-Oriented: High accuracy in predicting failures.
- Precision-Oriented: Identifies critical failures with high precision.
- Recall-Oriented: Maximizes the detection of actual failures, albeit with some false positives.
- Decision Making:
- Utilize a series of conditional checks to determine the final prediction based on the ensemble outputs.
- Provide actionable alerts (prints) based on the combination of model predictions.
- Ensemble Approach: Combining Random Forest and XGBoost leverages their individual strengths, resulting in a more robust prediction system.
- Threshold Optimization: Adjusting decision thresholds significantly improves the model’s ability to balance precision and recall.
- Feature Selection: Focusing on critical SMART attributes enhances model performance by reducing noise.
This project highlights the effectiveness of machine learning in predicting HDD failures, setting a solid foundation for future, more sophisticated predictive maintenance systems.
As seen in the previous step, the combination of weights and thresholds for the two trained models performed surprisingly well—far better than I anticipated.
Initially, I aimed to select a single model, most likely XGBoost, as it's known for reliable performance in similar contexts. My plan was to classify HDDs based on the model's confidence in predicting failure, using labels like 'minor issue, monitor' or 'critical alert, backup recommended.'
However, after plotting the feature importance for each model and observing their different prioritizations, I reconsidered and explored the idea of "blending" the models. I researched various methods for this and discovered Weighted Ensemble Stacking with threshold optimization, which is effective for combining model outputs with adjusted importance. The final result, as seen above, achieves a balanced performance, which is impressive for my first ML project at this scale.
Finally, this project shows promising potential. By focusing on manufacturers rather than specific models and using one-hot encoding, the model could operate with all manufacturer columns set to zero and still perform reliably—placing greater emphasis on SMART attributes. In essence, this is a universal model that can generalize across any HDD, making it versatile for broader applications.
- Backblaze: For providing the comprehensive HDD failure dataset. You can check it here.
- Groq: For facilitating access to Llama 3.1 70B for manufacturer extraction.
This project is licensed under the Apache2 License.
Thank you for taking the time to explore my project! Feel free to reach out with any questions or suggestions :)