From 4dbd422a8cd98b0a44fbbddd70b7e7a2b98841de Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?M=C3=A9d=C3=A9ric=20Hurier=20=28Fmind=29?=
Date: Sun, 28 Apr 2024 21:40:57 +0200
Subject: [PATCH] section 2 review

---
 docs/2. Prototyping/2.2. Configs.md  |  3 ++-
 docs/2. Prototyping/2.4. Analysis.md | 42 ++++++++++++++++++++++++++++++++++++++++++++++
 docs/2. Prototyping/2.5. Modeling.md | 21 ++++++++++++++++++++-
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/docs/2. Prototyping/2.2. Configs.md b/docs/2. Prototyping/2.2. Configs.md
index a8f247c..09e2d51 100644
--- a/docs/2. Prototyping/2.2. Configs.md
+++ b/docs/2. Prototyping/2.2. Configs.md
@@ -29,9 +29,10 @@ PARAM_GRID = {
 
 Incorporating configs into your projects is a reflection of best practices in software development. This approach ensures your code remains:
 
-- **Flexible**: Facilitating effortless adaptation to different datasets or experimental scenarios.
+- **Flexible**: Facilitating effortless adaptation and change across different datasets or experimental scenarios.
 - **Easy to Maintain**: Streamlining the process of making updates or modifications without needing to delve deep into the core logic.
 - **User-Friendly**: Providing a straightforward means for users to tweak the notebook's functionality to their specific requirements without extensive coding interventions.
+- **Free of [hard coding](https://en.wikipedia.org/wiki/Hard_coding) and [magic numbers](https://en.wikipedia.org/wiki/Magic_number_(programming))**: Naming and documenting key variables in your notebook makes them understandable and reviewable by others.
 
 Effectively, configurations act as a universal "remote control" for your code, offering an accessible interface for fine-tuning its behavior.
 
diff --git a/docs/2. Prototyping/2.4. Analysis.md b/docs/2. Prototyping/2.4. Analysis.md
index c86ab1d..da356b4 100644
--- a/docs/2. Prototyping/2.4. Analysis.md
+++ b/docs/2. Prototyping/2.4. Analysis.md
@@ -62,6 +62,48 @@ profile.to_widgets()
 
 While automated EDA tools like ydata-profiling can offer a quick and broad overview of the dataset, they are not a complete substitute for manual EDA. Human intuition and expertise are crucial for asking the right questions, interpreting the results, and making informed decisions on how to proceed with the analysis. Therefore, automated EDA should be viewed as a complement to, rather than a replacement for, traditional exploratory data analysis methods.
 
+## How can you handle missing values in datasets?
+
+Handling missing values in datasets is crucial for maintaining data integrity. Here are the most common methods, with a short code sketch at the end of this section:
+
+1. **Remove Data**: Delete rows with missing values, especially if the missing data is minimal.
+2. **Impute Values**: Replace missing values with a statistical substitute like the mean, median, or mode, or use predictive modeling.
+3. **Indicator Variables**: Create new columns that indicate whether data is missing, which can be a useful signal for some models.
+
+[MissingNo](https://github.com/ResidentMario/missingno) is a tool for visualizing missing data in Python. To use it:
+
+1. **Install MissingNo**: `pip install missingno`
+2. **Import and Use**:
+```python
+import missingno as msno
+import pandas as pd
+
+data = pd.read_csv('your_data.csv')
+msno.matrix(data)  # Visual matrix of missing data
+msno.bar(data)  # Bar chart of non-missing values
+```
+
+These visualizations help identify patterns and distributions of missing data, aiding in effective preprocessing decisions.
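+
+As a complement, here is a minimal sketch of the three methods above using pandas (the file name and the `age` column are hypothetical placeholders):
+
+```python
+import pandas as pd
+
+data = pd.read_csv('your_data.csv')  # hypothetical file, as in the example above
+
+# 1. Remove data: drop rows where any value is missing
+cleaned = data.dropna()
+
+# 3. Indicator variables: flag missing entries (do this before imputing)
+data['age_missing'] = data['age'].isna()
+
+# 2. Impute values: replace missing entries with the column median
+data['age'] = data['age'].fillna(data['age'].median())
+```
+
+When imputing for a machine learning model, compute the replacement statistics on the training split only to avoid leaking information from the test set.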
+
 ## Analysis additional resources
 
 - [Example from the MLOps Python Package](https://github.com/fmind/mlops-python-package/blob/main/notebooks/prototype.ipynb)
diff --git a/docs/2. Prototyping/2.5. Modeling.md b/docs/2. Prototyping/2.5. Modeling.md
index f941ef9..fb948ef 100644
--- a/docs/2. Prototyping/2.5. Modeling.md
+++ b/docs/2. Prototyping/2.5. Modeling.md
@@ -38,7 +38,26 @@ draft = pipeline.Pipeline(
 
 Implementing pipelines in your machine learning projects offers several key advantages:
 
-- **Prevents Data Leakage**: By ensuring data preprocessing steps are applied correctly during model training and validation, pipelines help maintain the integrity of your data.
+- **Prevents Data Leakage During Preprocessing**: By fitting preprocessing steps on the training data only and merely applying them to validation data, pipelines keep information from the validation set out of model training, as illustrated by the sketch below.
 - **Simplifies Cross-Validation and Hyperparameter Tuning**: Pipelines facilitate the application of transformations to data subsets appropriately during procedures like cross-validation, ensuring accurate and reliable model evaluation.
 - **Ensures Consistency**: Pipelines guarantee that the same preprocessing steps are executed in both the model training and inference phases, promoting consistency and reliability in your ML workflow.
 
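+For instance, here is a minimal sketch (hypothetical and independent from the draft pipeline above) of how a pipeline prevents leakage during cross-validation:
+
+```python
+from sklearn.datasets import load_iris
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import cross_val_score
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler
+
+X, y = load_iris(return_X_y=True)  # Iris is only a stand-in dataset
+
+# The scaler is re-fitted on each training fold only,
+# so no statistics leak from the validation folds.
+model = Pipeline([
+    ("scaler", StandardScaler()),
+    ("classifier", LogisticRegression()),
+])
+print(cross_val_score(model, X, y, cv=5).mean())
+```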