
Commit: section 2 review

fmind committed Apr 28, 2024
1 parent 30141aa commit 4dbd422
Showing 3 changed files with 26 additions and 2 deletions.
3 changes: 2 additions & 1 deletion docs/2. Prototyping/2.2. Configs.md
@@ -29,9 +29,10 @@ PARAM_GRID = {

Incorporating configs into your projects is a reflection of best practices in software development. This approach ensures your code remains:

- - **Flexible**: Facilitating effortless adaptation to different datasets or experimental scenarios.
+ - **Flexible**: Facilitating effortless adaptations and changes to different datasets or experimental scenarios.
- **Easy to Maintain**: Streamlining the process of making updates or modifications without needing to delve deep into the core logic.
- **User-Friendly**: Providing a straightforward means for users to tweak the notebook's functionality to their specific requirements without extensive coding interventions.
+ - **Free of [hard coding](https://en.wikipedia.org/wiki/Hard_coding) and [magic numbers](https://en.wikipedia.org/wiki/Magic_number_(programming))**: Name and document key variables in your notebook to make them understandable and reviewable by others.

Effectively, configurations act as a universal "remote control" for your code, offering an accessible interface for fine-tuning its behavior.
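
As a minimal sketch of such a config cell (assuming the scikit-learn grid-search workflow this file builds toward; only `PARAM_GRID` appears in the surrounding context, and every other name and value here is illustrative):

```python
# Config cell: named, documented constants instead of magic numbers in the logic.
RANDOM_STATE = 42          # seed for reproducible splits and model fits
TEST_SIZE = 0.2            # fraction of the data held out for evaluation
TARGET_COLUMN = "target"   # label column; adapt to your dataset
PARAM_GRID = {             # hyperparameter search space for the model step
    "model__max_depth": [3, 5, 7],
    "model__n_estimators": [100, 200],
}
```

Changing an experiment then means editing this one cell rather than hunting through the notebook's logic.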

23 changes: 23 additions & 0 deletions docs/2. Prototyping/2.4. Analysis.md
@@ -62,6 +62,29 @@ profile.to_widgets()

While automated EDA tools like ydata-profiling can offer a quick and broad overview of the dataset, they are not a complete substitute for manual EDA. Human intuition and expertise are crucial for asking the right questions, interpreting the results, and making informed decisions on how to proceed with the analysis. Therefore, automated EDA should be viewed as a complement to, rather than a replacement for, traditional exploratory data analysis methods.

## How can you handle missing values in datasets?

Handling missing values in datasets is crucial for maintaining data integrity. Here are common methods, illustrated in the pandas sketch after this list:

1. **Remove Data**: Delete rows with missing values, especially if the missing data is minimal.
2. **Impute Values**: Replace missing values with a statistical substitute like mean, median, or mode, or use predictive modeling.
3. **Indicator Variables**: Create new columns to indicate data is missing, which can be useful for some models.
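
A minimal pandas sketch of the three methods (the `age` column and `your_data.csv` file are illustrative placeholders, not part of the original lesson):

```python
import pandas as pd

data = pd.read_csv("your_data.csv")

# 1. Remove: drop rows with any missing value (sensible when few are affected).
cleaned = data.dropna()

# 3. Indicator: flag missingness *before* imputing, or the signal is lost.
data["age_was_missing"] = data["age"].isna()

# 2. Impute: replace missing values in a numeric column with its median.
data["age"] = data["age"].fillna(data["age"].median())
```

Note that the indicator column must be created before imputation, which is why steps 2 and 3 are swapped in the code.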

[MissingNo](https://github.com/ResidentMario/missingno) is a tool for visualizing missing data in Python. To use it:

1. **Install MissingNo**: `pip install missingno`
2. **Import and Use**:
```python
import missingno as msno
import pandas as pd

data = pd.read_csv('your_data.csv')
msno.matrix(data) # Visual matrix of missing data
msno.bar(data) # Bar chart of non-missing values
```

These visualizations help identify patterns and distributions of missing data, aiding in effective preprocessing decisions.
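
Beyond the matrix and bar views, missingno also offers correlation-oriented plots that support this pattern hunting (continuing from the `msno` and `data` names in the snippet above):

```python
msno.heatmap(data)     # Nullity correlation: does one column's absence predict another's?
msno.dendrogram(data)  # Hierarchical clustering of columns by missingness pattern
```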

## Analysis additional resources

- [Example from the MLOps Python Package](https://github.com/fmind/mlops-python-package/blob/main/notebooks/prototype.ipynb)
2 changes: 1 addition & 1 deletion docs/2. Prototyping/2.5. Modeling.md
@@ -38,7 +38,7 @@ draft = pipeline.Pipeline(

Implementing pipelines in your machine learning projects offers several key advantages, illustrated in the sketch after this list:

- - **Prevents Data Leakage**: By ensuring data preprocessing steps are applied correctly during model training and validation, pipelines help maintain the integrity of your data.
+ - **Prevents Data Leakage During Preprocessing**: By ensuring data preprocessing steps are applied correctly during model training and validation, pipelines help maintain the integrity of your data.
- **Simplifies Cross-Validation and Hyperparameter Tuning**: Pipelines facilitate the application of transformations to data subsets appropriately during procedures like cross-validation, ensuring accurate and reliable model evaluation.
- **Ensures Consistency**: Pipelines guarantee that the same preprocessing steps are executed in both the model training and inference phases, promoting consistency and reliability in your ML workflow.
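
A minimal sketch of such a pipeline, assuming scikit-learn (the surrounding context shows only `draft = pipeline.Pipeline(`; the scaler and classifier steps here are illustrative, not the lesson's actual pipeline):

```python
from sklearn import pipeline, preprocessing
from sklearn.ensemble import RandomForestClassifier

# Chaining preprocessing and model means that during cross-validation the
# scaler is fit on training folds only, so no statistics leak from held-out data.
draft = pipeline.Pipeline(
    steps=[
        ("scaler", preprocessing.StandardScaler()),
        ("model", RandomForestClassifier(random_state=42)),
    ]
)

# At inference time the same object applies identical preprocessing:
# draft.fit(X_train, y_train); predictions = draft.predict(X_test)
```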

