2. review
fmind committed Apr 3, 2024
1 parent a5c063d commit 978826c
Showing 7 changed files with 33 additions and 33 deletions.
10 changes: 5 additions & 5 deletions docs/2. Prototyping/2.0. Notebooks.md
@@ -4,11 +4,11 @@

A Python notebook, often referred to simply as a "notebook," is an interactive computing environment that allows users to combine executable code, rich text, visuals, and other multimedia resources in a single document. This tool is invaluable for data analysis, machine learning projects, documentation, and educational purposes, among others. Notebooks are structured in a cell-based format, where each cell can contain either code or text. When code cells are executed, the output is displayed directly beneath them, facilitating a seamless integration of code and content.

-## Where can I learn how to use notebooks?
+## Where can you learn how to use notebooks?

Learning how to use notebooks is straightforward, thanks to a plethora of online resources. Beginners can start with the official documentation of popular notebook applications like Jupyter (Jupyter Documentation) or Google Colab. For more interactive learning, platforms such as Coursera, Udacity, and edX offer courses specifically tailored to using Python notebooks for data science and machine learning projects. YouTube channels dedicated to data science and Python programming also frequently cover notebooks, providing valuable tips and tutorials for both beginners and advanced users.

-## Why should I use a notebook for prototyping?
+## Why should you use a notebook for prototyping?

Notebooks offer an unparalleled environment for prototyping due to their unique blend of features:

@@ -20,7 +20,7 @@ In addition, the narrative structure of notebooks supports a logical flow of ide

As an alternative to notebooks, consider using the [Python Interactive Window](https://code.visualstudio.com/docs/python/jupyter-support-py) in Visual Studio Code or other text editors. These environments combine the interactivity and productivity benefits of notebooks with the robustness and feature set of an integrated development environment (IDE), such as source control integration, advanced editing tools, and a wide range of extensions for additional functionality.
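
For instance, the Interactive Window treats `# %%` comments in a plain `.py` file as cell markers, so you keep a regular script while still executing it cell by cell (a minimal sketch with toy data):

```python
# %% Load a small dataset (each "# %%" comment starts a new runnable cell)
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})  # toy data for illustration

# %% Inspect it interactively, just like a notebook cell
df.describe()
```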

-## Can I use my notebook in production instead of creating a Python package?
+## Can you use your notebook in production instead of creating a Python package?

Using notebooks in the early stages of development offers many advantages; however, they are not well-suited for production environments due to several limitations:

@@ -31,6 +31,6 @@ Using notebooks in the early stages of development offers many advantages; howev

For these reasons, it is advisable to transition from notebooks to structured Python packages for production. Doing so enables better software development practices, such as unit testing, continuous integration, and deployment, thereby enhancing code quality and maintainability.

-## Do I need to review this chapter even if I know how to use notebooks?
+## Do you need to review this chapter even if you know how to use notebooks?

-Yes, even seasoned users can benefit from reviewing this chapter. It introduces advanced techniques, new features, and tools that you may not know about. Furthermore, the chapter emphasizes structuring notebooks effectively and applying best practices to improve readability, collaboration, and overall efficiency.
+Even seasoned users can benefit from reviewing this chapter. It introduces advanced techniques, new features, and tools that you may not know about. Furthermore, the chapter emphasizes structuring notebooks effectively and applying best practices to improve readability, collaboration, and overall efficiency.
8 changes: 4 additions & 4 deletions docs/2. Prototyping/2.1. Imports.md
@@ -18,7 +18,7 @@ import pandas as pd # External library module
from my_project import my_module # Internal project module
```

-## Which packages do I need for my project?
+## Which packages do you need for your project?

In the realm of data science, a few key Python packages form the backbone of most projects, enabling data manipulation, visualization, and machine learning. Essential packages include:

@@ -36,7 +36,7 @@ poetry add pandas numpy matplotlib scikit-learn plotly

This command tells poetry to download and install these packages, along with their dependencies, into your project environment, ensuring version compatibility and easy package management.

-## How should I organize my imports to facilitate my work?
+## How should you organize your imports to facilitate your work?

Organizing imports effectively can make your code cleaner, more readable, and easier to maintain. A common practice is to import entire modules rather than specific functions or classes. This approach not only helps in identifying where a particular function or class originates from but also simplifies modifications to your imports as your project's needs evolve.
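
Here is a minimal sketch of this module-level style, grouping imports by origin and reusing the `my_project` module from the example above:

```python
# Standard library modules
import os

# External library modules
import numpy as np
import pandas as pd

# Internal project modules (from the example above)
from my_project import my_module

# Referencing functions through their module keeps their origin obvious
frame = pd.DataFrame({"value": np.arange(3)})
print(os.getcwd(), frame.shape)
```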

@@ -57,7 +57,7 @@ Importing entire modules (`import pandas as pd`) is generally recommended for cl

## Are there any side effects when importing modules in Python?

-Yes, importing a module in Python executes all the top-level code in that module, which can lead to side effects. These effects can be both intentional and unintentional. It's crucial to import modules from trusted sources to avoid security risks or unexpected behavior. Be especially cautious of executing code with side effects in your own modules, and make sure any such behavior is clearly documented.
+Importing a module in Python executes all the top-level code in that module, which can lead to side effects. These effects can be both intentional and unintentional. It's crucial to import modules from trusted sources to avoid security risks or unexpected behavior. Be especially cautious of executing code with side effects in your own modules, and make sure any such behavior is clearly documented.

Consider this cautionary example:

@@ -71,7 +71,7 @@ os.system("rm -rf /") # This command is extremely dangerous!
import lib # Importing lib.py could lead to data loss
```
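
To keep intentional side effects from running at import time, a common convention is to place them behind a `__main__` guard (a minimal sketch with a hypothetical cleanup function):

```python
# lib.py (hypothetical module)
def cleanup(path: str) -> None:
    """Remove temporary files under the given path."""
    print(f"Cleaning {path}...")

if __name__ == "__main__":
    # Runs only when the file is executed directly (`python lib.py`), not on `import lib`
    cleanup("/tmp/my_project")
```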

-## What should I do if packages cannot be imported from my notebook?
+## What should you do if packages cannot be imported from your notebook?

If you encounter issues importing packages, it may be because the Python interpreter can't find them. This problem is common when using virtual environments. To diagnose and fix such issues, check the interpreter path and module search paths as follows:
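
For instance, a quick way to see which interpreter and search paths the notebook is actually using (a minimal sketch):

```python
import sys

print(sys.executable)  # the Python interpreter running this notebook kernel
print(sys.path)        # directories searched for modules; your project environment should appear here

# If a package is missing, `%pip install <package>` installs it into this same environment
```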

10 changes: 5 additions & 5 deletions docs/2. Prototyping/2.2. Configs.md
@@ -25,7 +25,7 @@ PARAM_GRID = {
}
```

-## Why should I create configs?
+## Why should you create configs?

Incorporating configs into your projects is a reflection of best practices in software development. This approach ensures your code remains:

@@ -35,7 +35,7 @@ Incorporating configs into your projects is a reflection of best practices in so

Effectively, configurations act as a universal "remote control" for your code, offering an accessible interface for fine-tuning its behavior.

-## Which configs can I provide out of the box?
+## Which configs can you provide out of the box?

When it comes to data science projects, several common configurations are frequently utilized, including:

@@ -54,7 +54,7 @@ TEST_SIZE = 0.2
RANDOM_STATE = 0
```

-## How should I organize the configs in my notebook?
+## How should you organize the configs in your notebook?

A logical and functional organization of your configurations can significantly enhance the readability and maintainability of your code. Grouping configs based on their purpose or domain of application is advisable:
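
As a sketch, a notebook's configs could be grouped into clearly labeled sections (the names and values below are illustrative):

```python
# Paths
TRAIN_PATH = "data/train.parquet"
TEST_PATH = "data/test.parquet"

# Dataset splitting
TEST_SIZE = 0.2
RANDOM_STATE = 0

# Model hyper-parameters
PARAM_GRID = {
    "regressor__max_depth": [5, 10, 15],
    "regressor__n_estimators": [50, 100, 200],
}
```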

@@ -95,7 +95,7 @@ pd.options.display.max_columns = None
sklearn.set_config(transform_output="pandas")
```

-## Why do I need to pass options?
+## Why do you need to pass options?

Library defaults may not always cater to your specific needs or the demands of your project. For instance:

@@ -104,7 +104,7 @@ Library defaults may not always cater to your specific needs or the demands of y

Adjusting these options helps tailor the working environment to better fit your workflow and analytical needs, ensuring that outputs are both informative and visually accessible.

-## How should I configure library options?
+## How should you configure library options?

To optimize your working environment, consider customizing the settings of key libraries according to your project's needs. Here are some guidelines:
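
As a sketch, assuming pandas, plotly express, and scikit-learn are the libraries in play (as elsewhere in this chapter):

```python
import pandas as pd
import plotly.express as px
import sklearn

pd.options.display.max_columns = None          # show every column of wide frames
pd.options.display.max_rows = 100              # cap printed rows to keep outputs readable
px.defaults.template = "plotly_white"          # consistent styling for all figures
sklearn.set_config(transform_output="pandas")  # transformers return DataFrames instead of arrays
```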

10 changes: 5 additions & 5 deletions docs/2. Prototyping/2.3. Datasets.md
@@ -28,7 +28,7 @@ When working with datasets in pandas, several key properties enable you to quick

These attributes and methods are invaluable for initial data exploration and integrity checks, facilitating a deeper understanding of the dataset's characteristics.

-## Which file format should I use?
+## Which file format should you use?

Choosing the right file format for your dataset is crucial, as it affects the efficiency of data storage, access, and processing. Consider the following criteria when selecting a file format:

@@ -45,7 +45,7 @@ Choosing the right file format for your dataset is crucial, as it affects the ef
- **Dense** formats store every data point explicitly.
- **Sparse** formats only store non-zero values, which can be more efficient for datasets with many missing values.

-## How can I optimize the dataset loading process?
+## How can you optimize the dataset loading process?

Optimizing the dataset loading process involves several strategies:
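
As an illustration of such strategies (a sketch; the file paths and column names are hypothetical), loading only the columns you need and declaring dtypes up front already goes a long way:

```python
import pandas as pd

# Read only the required columns and declare dtypes to reduce memory usage
df = pd.read_csv(
    "data/train.csv",
    usecols=["store", "item", "sales"],
    dtype={"store": "category", "item": "category", "sales": "float32"},
)

# Columnar formats such as Parquet also support column pruning on load
df = pd.read_parquet("data/train.parquet", columns=["store", "item", "sales"])
```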

@@ -62,7 +62,7 @@ For large datasets, pandas might not be sufficient. Consider alternative librari

The **[Ibis project](https://ibis-project.org/)** unifies these alternatives under a common interface, allowing seamless transition between different backends based on the scale of your data and computational resources (e.g., using pandas for small datasets on a laptop and Spark for big data on clusters).

-## Why do I need to split my dataset into 'X' and 'y'?
+## Why do you need to split your dataset into 'X' and 'y'?

In supervised learning, the convention is to split the dataset into features (`X`) and the target variable (`y`). This separation is crucial because it delineates the input variables that the model uses to learn from the output variable it aims to predict. Structuring your data this way makes it clear to both the machine learning algorithms and the developers what the inputs and outputs of the models should be.

@@ -75,7 +75,7 @@ X, y = train.drop('target', axis='columns'), train['target']

This practice lays the groundwork for model training and evaluation, ensuring that the algorithms have a clear understanding of the data they are working with.

-## Why should I split my dataset further into train/test sets?
+## Why should you split your dataset further into train/test sets?

Splitting your dataset into training and testing sets is essential for accurately evaluating the performance of your machine learning models. This approach allows you to:

@@ -92,7 +92,7 @@ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_

It's crucial to manage potential issues like data leakage, class imbalances, and the temporal nature of data to ensure the reliability of your model evaluations.

-## Do I need to shuffle my dataset prior to splitting it into train/test sets?
+## Do you need to shuffle your dataset prior to splitting it into train/test sets?

Whether to shuffle your dataset before splitting it into training and testing sets depends on the nature of your problem. For time-sensitive data, such as time series, shuffling could disrupt the temporal sequence, leading to misleading training data and inaccurate models. In such cases, maintaining the chronological order is critical.
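
For time-ordered data, you can obtain a chronological split by disabling shuffling (a minimal sketch, assuming the `X` and `y` variables from the previous sections):

```python
from sklearn.model_selection import train_test_split

# shuffle=False keeps rows in order, so the test set is the most recent 20% of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
```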

6 changes: 3 additions & 3 deletions docs/2. Prototyping/2.4. Analysis.md
@@ -6,7 +6,7 @@ Exploratory Data Analysis (EDA) is a critical step in the data analysis process

EDA is a flexible, data-driven approach that allows for a more in-depth understanding of the data before making any assumptions. It serves as a foundation for formulating hypotheses, defining a more targeted analysis, and selecting appropriate models and algorithms for machine learning projects.

-## How can I use pandas to analyze my data?
+## How can you use pandas to analyze your data?

Pandas is an essential tool for EDA in Python, offering a wide array of functions to quickly slice, dice, and summarize your data. To begin analyzing your dataset with pandas, you can use the following methods:
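
A few of these methods in action (a minimal sketch on toy data):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 32, None], "city": ["Paris", "Lyon", "Paris"]})  # toy data

df.head()                   # first rows, to eyeball the raw values
df.info()                   # column dtypes and non-null counts
df.describe(include="all")  # summary statistics for numeric and categorical columns
df.isna().sum()             # missing values per column
```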

@@ -28,7 +28,7 @@ df.describe(include='all')

These functions allow you to quickly assess the quality and characteristics of your data, facilitating the identification of areas that may require further investigation or preprocessing.

-## How can I visualize patterns in my dataset?
+## How can you visualize patterns in your dataset?

Visualizing patterns in your dataset is pivotal for EDA, as it helps in recognizing underlying structures, trends, and outliers that might not be apparent from the raw data alone. Python offers a wealth of libraries for data visualization, including:
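
For example, plotly express can render a pairwise scatter matrix in a single call (a sketch using its built-in iris demo dataset):

```python
import plotly.express as px

df = px.data.iris()  # small demo dataset shipped with plotly

# One figure showing every pairwise relationship, colored by the species label
fig = px.scatter_matrix(
    df, dimensions=["sepal_width", "sepal_length", "petal_width"], color="species"
)
fig.show()
```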

@@ -51,7 +51,7 @@ This method enables the rapid exploration of pairwise relationships within a dat

## Is there a way to automate EDA?

-Yes, there are libraries designed to automate the EDA process, significantly reducing the time and effort required to understand a dataset. One such library is **pandas-profiling**, which generates comprehensive reports from a pandas DataFrame, providing insights into the distribution of each variable, correlations, missing values, and much more.
+There are libraries designed to automate the EDA process, significantly reducing the time and effort required to understand a dataset. One such library is **pandas-profiling**, which generates comprehensive reports from a pandas DataFrame, providing insights into the distribution of each variable, correlations, missing values, and much more.

Example with pandas-profiling:
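
A minimal sketch of the typical usage, with `df` being the DataFrame under analysis:

```python
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="EDA Report")
profile.to_file("eda_report.html")  # or profile.to_notebook_iframe() to render inside the notebook
```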

12 changes: 6 additions & 6 deletions docs/2. Prototyping/2.5. Modeling.md
@@ -34,7 +34,7 @@ draft = pipeline.Pipeline(
)
```

-## Why do I need to use a pipeline?
+## Why do you need to use a pipeline?

Implementing pipelines in your machine learning projects offers several key advantages:

@@ -44,7 +44,7 @@ Implementing pipelines in your machine learning projects offers several key adva

Pipelines thus represent an essential tool in the machine learning toolkit, streamlining the model development process and enhancing model performance and evaluation.

-## Why do I need to process inputs by type?
+## Why do you need to process inputs by type?

Different data types typically require distinct preprocessing steps to prepare them effectively for machine learning models:
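
A common way to express this per-type processing in scikit-learn is a `ColumnTransformer` (a sketch; the column names are hypothetical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale numerical columns, one-hot encode categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "channel"]),
    ]
)
```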

@@ -91,7 +91,7 @@ draft = pipeline.Pipeline(
)
```

-## How can I change the pipeline hyper-parameters?
+## How can you change the pipeline hyper-parameters?

Adjusting hyper-parameters within a scikit-learn pipeline can be achieved using the `set_params` method or by directly accessing parameters via the double underscore (`__`) notation. This flexibility allows you to fine-tune your model directly within the pipeline structure.

@@ -111,7 +111,7 @@ pipeline = Pipeline([
pipeline.set_params(regressor__n_estimators=100, regressor__max_depth=10)
```

-## Why do I need to perform a grid search with my pipeline?
+## Why do you need to perform a grid search with your pipeline?

Conducting a grid search over a pipeline is crucial for identifying the optimal combination of model hyper-parameters. This exhaustive search evaluates various parameter combinations across your dataset, using cross-validation to ensure robust assessment of model performance.

@@ -137,7 +137,7 @@ search = GridSearchCV(
search.fit(inputs_train, targets_train)
```

-## Why do I need to perform cross-validation with my pipeline?
+## Why do you need to perform cross-validation with your pipeline?

Cross-validation is a fundamental technique in the validation process of machine learning models, enabling you to assess how well your model is likely to perform on unseen data. By integrating cross-validation into your pipeline, you can ensure a thorough evaluation of your model's performance, mitigating the risk of overfitting and underfitting.
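
For instance, a quick scored cross-validation of a full pipeline (a self-contained sketch on synthetic data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # toy data

pipeline = Pipeline([("scaler", StandardScaler()), ("regressor", RandomForestRegressor())])

# 5-fold cross-validation: every step of the pipeline is refit on each training fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean(), scores.std())
```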

@@ -153,7 +153,7 @@ Here’s a breakdown of how you can control the cross-validation behavior throug

- **Iterable**: An iterable yielding train/test splits as arrays of indices directly specifies the data partitions for each fold. This option offers maximum flexibility, allowing for completely custom splits based on external logic or considerations (e.g., predefined groups or stratifications not captured by the standard splitters).

-## Do I need to retrain my pipeline? Should I use the full dataset?
+## Do you need to retrain your pipeline? Should you use the full dataset?

After identifying the best model and hyper-parameters through grid search and cross-validation, it's common practice to retrain your model on the entire dataset. This approach allows you to leverage all available data, maximizing the model's learning and potentially enhancing its performance when making predictions on new, unseen data.
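
In practice, this amounts to refitting the winning pipeline on everything (a sketch, assuming `search` is the fitted `GridSearchCV` from the previous section and `X`/`y` the full dataset):

```python
# Reuse the best pipeline found by the grid search, then fit it on all available data
final_model = search.best_estimator_
final_model.fit(X, y)
```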

10 changes: 5 additions & 5 deletions docs/2. Prototyping/2.6. Evaluations.md
@@ -4,13 +4,13 @@

Evaluation is a fundamental step in the machine learning workflow that involves assessing a model's predictions to ensure its reliability and accuracy before deployment. It acts as a quality assurance mechanism, providing insights into the model's performance through various means such as error metrics, graphical representations (like validation curves), and more. This step is crucial for verifying that the model performs as expected and is suitable for real-world applications.

-## Why should I evaluate my pipeline?
+## Why should you evaluate your pipeline?

Machine learning models can sometimes behave in unpredictable ways due to their inherent complexity. By evaluating your training pipeline, you can uncover issues like data leakage, which undermines the model's ability to generalize to unseen data. Rigorous evaluation builds trust and credibility, ensuring that the model's performance is genuinely robust and not just a result of overfitting or other biases.

For more insights on data leakage, explore this link: [Data Leakage in Machine Learning](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

-## How can I generate predictions with my pipeline?
+## How can you generate predictions with your pipeline?

To generate predictions using your machine learning pipeline, employ the hold-out dataset (test set). This approach ensures that the predictions are made on data that the model has not seen during training, providing a fair assessment of its generalization capability. Here's how you can do it:
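
A minimal sketch, assuming `final_model` is your fitted pipeline and `X_test`/`y_test` the hold-out split from the previous chapter:

```python
from sklearn.metrics import mean_squared_error

predictions = final_model.predict(X_test)       # predictions on rows never seen during training
print(mean_squared_error(y_test, predictions))  # a simple error metric on the hold-out set
```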

@@ -30,7 +30,7 @@ results = results.sort_values(by="rank_test_score")
results.head()
```

-## What do I need to evaluate in my pipeline?
+## What do you need to evaluate in your pipeline?

Evaluating your training pipeline encompasses several key areas:

@@ -102,7 +102,7 @@ print(importances.shape)
importances.head()
```

-## How can I ensure my pipeline was trained on enough data?
+## How can you ensure your pipeline was trained on enough data?

Employing a learning curve analysis helps you understand the relationship between the amount of training data and model performance. Continue adding diverse data until the model's performance stabilizes, indicating an optimal data volume has been reached.

@@ -123,7 +123,7 @@ learning = pd.DataFrame(
px.line(learning, x="train_size", y=["mean_test_score", "mean_train_score"], title="Learning Curve")
```

-## How can I ensure my pipeline captures the right level of complexity?
+## How can you ensure your pipeline captures the right level of complexity?

To balance complexity and performance, use validation curves to see how changes in a model parameter (like depth) affect its performance. Adjust complexity to improve performance without causing overfitting.
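
For example, scikit-learn's `validation_curve` scores a model over a range of one parameter, which you can then plot (a self-contained sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # toy data

# Score the model for several tree depths to locate the under/over-fitting zone
param_range = np.array([2, 5, 10, 20])
train_scores, test_scores = validation_curve(
    RandomForestRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=param_range, cv=5,
)
print(train_scores.mean(axis=1), test_scores.mean(axis=1))
```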

