Commit

Review code structure
fmind committed Mar 21, 2024
1 parent 8c0d1c3 commit 193c43a
Showing 58 changed files with 137 additions and 117 deletions.
Original file line number Diff line number Diff line change
@@ -1,15 +1,17 @@
# The MLOPS template course
# 0.0 Course

## In a few words


## Intended Audience

## Prerequisite knowledge


## How to read?


## Technology


Copyright
Copyright
1 change: 1 addition & 0 deletions docs/0. Overview/0.1. Projects.md
@@ -0,0 +1 @@
# 0.1 Projects
@@ -1,4 +1,4 @@
# 0.1. Data
# 0.2. Datasets

Data is often referred to as the fuel for Machine Learning, and although this course focuses on MLOps, it's crucial to have access to data to fully grasp the various concepts and technologies involved.

@@ -11,13 +11,13 @@ Briefly we can note the following data types:
### Structured Data
Structured data adheres to a predefined model, making it easier to search and organize.

* *Tabular*:
Perhaps the most common type of data, where data is organized in rows and columns.
* Columns are homogeneous in type
* Typically CSV files and relational databases
* *Time Series*:
A sequence of data points collected or recorded at successive points in time, usually at consistent intervals.
* Characterized by temporal order: the sequence of observations is crucial, and changing the order can alter the meaning or interpretation of the data.
* Typically financial and energy data
* *Geospatial*:
Data representing a specific location or geographic area on Earth
@@ -39,7 +39,7 @@ Unstructured data does not follow a predefined model, making it more complex to
* Characterized by a high number of unique words, contextual meaning, and ambiguity
* *Multimedia*:
Refers to image, audio, and video data
* Challenging due to high dimensionality, large file sizes, and the complexity of extracting meaningful patterns.

### Semi Structured Data

@@ -50,7 +50,7 @@ Examples are XML and JSON files.

## Which data should I use?

The question of which dataset to use is common, and honestly, the best dataset is the one you're most familiar with.
While the vast array of data types and their diverse applications might seem overwhelming, it's important to remember that many MLOps concepts are universal and can be applied across different domains.

We will look into the specificities of certain types of applications later in the course. For now, we offer two options for getting started.
@@ -1,4 +1,4 @@
# Why not specific tools?
# 0.3. Platforms

Databricks, metaflow ...

3 changes: 3 additions & 0 deletions docs/0. Overview/0.4. Mentoring.md
@@ -0,0 +1,3 @@
# 0.4. Mentoring

Mentoring
1 change: 1 addition & 0 deletions docs/0. Overview/0.5. Assistants.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# 0.5. Assistants
1 change: 1 addition & 0 deletions docs/0. Overview/0.6. Resources.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# 0.6. Resources
1 change: 1 addition & 0 deletions docs/0. Overview/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# 0. Overview
1 change: 1 addition & 0 deletions docs/1. Initializing/1.0. System.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# 1.0. System
@@ -1,4 +1,4 @@
# 1.2. pyenv
# 1.1. pyenv

## What is pyenv?

1 change: 1 addition & 0 deletions docs/1. Initializing/1.2. Python.md
@@ -0,0 +1 @@
# 1.2. Python
@@ -1,4 +1,4 @@
# 1.3. poetry
# 1.3. Poetry

# 1.3. poetry

File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions docs/1. Initializing/1.6. VS Code.md
@@ -0,0 +1 @@
# 1.6. Visual Studio Code
1 change: 0 additions & 1 deletion docs/1. Initializing/1.7. VS Code.md

This file was deleted.

@@ -1,4 +1,4 @@
# Initialization
# 1. Initializing

This section introduces many basic concepts common to all software projects, which also apply to MLOps.

@@ -1,4 +1,4 @@
# 2.0. Notebook
# 2.0. Notebooks

## What is a notebook?

62 changes: 61 additions & 1 deletion docs/2. Prototyping/2.2. Configs.md
@@ -74,4 +74,64 @@ How to load/transform datasets ...
## Pipelines

How to define/run model pipelines ...
```
```

## What are options?

Options in a data science environment, such as a Jupyter notebook, are configurations that tailor the behavior and appearance of libraries like pandas, matplotlib, and scikit-learn. These options allow you to control aspects like display settings and output formats.

Example of options in a notebook:

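```python
import pandas as pd
from sklearn import set_config

# Pandas: show all rows and columns (None removes the limit)
pd.options.display.max_rows = None
pd.options.display.max_columns = None
# Scikit-learn: return pandas DataFrames from transformers
set_config(transform_output="pandas")
```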

## Why do I need to pass options?

Default settings of libraries may not always align with your specific needs. For example:
- Pandas may hide some columns or rows by default, limiting the visibility of data.
- Matplotlib's default figure sizes might be too small for detailed analysis.

Adjusting these options ensures your environment is optimized for your workflow.

## How should I configure Pandas options?

Pandas offers a variety of options for customizing data display. Check the [Pandas Options and Settings documentation](https://pandas.pydata.org/docs/user_guide/options.html) for a comprehensive guide.

```python
import pandas as pd

# Display all rows and columns (None removes the limit)
pd.options.display.max_rows = None
pd.options.display.max_columns = None
# Do not truncate the content of wide columns
pd.options.display.max_colwidth = None
```

## How should I configure matplotlib options?

Matplotlib's appearance can be customized as per your requirements. Refer to the [Matplotlib Customizing Guide](https://matplotlib.org/stable/users/explain/customizing.html) for detailed options.

```python
import matplotlib.pyplot as plt

# Set default figure size
plt.rcParams['figure.figsize'] = (20, 10)
```

## How should I configure scikit-learn options?

Scikit-learn provides configurations to modify how outputs are displayed or handled. The [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.set_config.html#sklearn.set_config) outlines these options.

```python
import sklearn

# Return pandas DataFrames instead of NumPy arrays from transformers
sklearn.set_config(transform_output='pandas')
```

Setting these options at the beginning of your notebook ensures a consistent and tailored working environment throughout your analysis.
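
When an option should only apply to a single cell rather than the whole notebook, pandas also provides the `pd.option_context` context manager. The snippet below is a minimal sketch of that alternative, not something prescribed by this page; the `df` frame and the limit of 5 rows are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical demo frame, only used to illustrate the display option.
df = pd.DataFrame({"value": range(100)})

# Remember the current setting so we can check that it is restored afterwards.
default_max_rows = pd.get_option("display.max_rows")

# Inside the block, pandas renders at most 5 rows; the previous value is restored on exit.
with pd.option_context("display.max_rows", 5):
    inside_max_rows = pd.get_option("display.max_rows")
    print(df)

restored_max_rows = pd.get_option("display.max_rows")
print(inside_max_rows, restored_max_rows == default_max_rows)
```

Global options remain the simpler choice for settings you want everywhere; the context manager is mostly useful for one-off previews of large frames.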
@@ -1,4 +1,4 @@
# 2.4. Datasets
# 2.3. Datasets

## What are datasets?

@@ -33,15 +33,6 @@ Selecting a file format for your dataset involves considering several factors:
- **Dense**: Every data point is stored (e.g., CSV, Parquet).
- **Sparse**: Only non-zero values are stored, useful for data with many empty values (e.g., SciPy sparse matrices).
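
To make the dense/sparse distinction concrete, here is a small sketch using NumPy and SciPy. The matrix shape and the two non-zero values are hypothetical, and CSR is just one of several sparse formats SciPy offers.

```python
import numpy as np
from scipy import sparse

# Hypothetical matrix that is almost entirely zeros.
dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[500, 2] = 7.0

# The CSR format keeps only the non-zero entries and their positions.
compact = sparse.csr_matrix(dense)

print(dense.size)   # number of cells stored by the dense array
print(compact.nnz)  # number of values explicitly stored by the sparse matrix

# Converting back yields the same content as the dense original.
roundtrip = compact.toarray()
```

The trade-off is that sparse structures save memory only when most values are zero, and not every library accepts them as input.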

## How can I explore my dataset content?

Pandas is a popular tool for exploring datasets in Python. Common methods include:
- `.info()`: Overview of types, non-null values, and memory usage.
- `.shape`: Dimensions of the dataframe.
- `.describe()`: Descriptive statistics.

For visual exploration, libraries like [plotly.express](https://plotly.com/python/plotly-express/), [matplotlib](https://matplotlib.org/), [seaborn](https://seaborn.pydata.org/), and [ydata-profiling](https://github.com/ydataai/ydata-profiling) are useful.

## How can I optimize the dataset loading process?

To improve dataset loading and handling:
61 changes: 0 additions & 61 deletions docs/2. Prototyping/2.3. Options.md

This file was deleted.

12 changes: 12 additions & 0 deletions docs/2. Prototyping/2.4. Analysis.md
@@ -0,0 +1,12 @@
# 2.4. Analysis

## pandas profiling

## How can I explore my dataset content?

Pandas is a popular tool for exploring datasets in Python. Common methods include:
- `.info()`: Overview of types, non-null values, and memory usage.
- `.shape`: Dimensions of the dataframe.
- `.describe()`: Descriptive statistics.

For visual exploration, libraries like [plotly.express](https://plotly.com/python/plotly-express/), [matplotlib](https://matplotlib.org/), [seaborn](https://seaborn.pydata.org/), and [ydata-profiling](https://github.com/ydataai/ydata-profiling) are useful.
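
The three pandas methods listed above can be tried on any dataframe. The following is a minimal sketch; the `df` dataset, its column names, and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Nice", None],
        "temperature": [21.5, 19.0, 24.3, 18.2],
    }
)

# Prints column types, non-null counts, and memory usage.
df.info()

# Dimensions of the dataframe as (rows, columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Descriptive statistics; by default only numeric columns are summarized.
summary = df.describe()
print(summary)
```

Passing `include="all"` to `describe()` also summarizes non-numeric columns such as `city`.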
@@ -1,4 +1,4 @@
# 2.5. Pipelines
# 2.5. Modeling

## What are pipelines?

2 changes: 1 addition & 1 deletion docs/2. Prototyping/2.6. Evaluations.md
@@ -1,4 +1,4 @@
# 2.6. Evaluations
# 2.6. Evaluation

## What is an evaluation?

1 change: 1 addition & 0 deletions docs/2. Prototyping/index.md
@@ -0,0 +1 @@
# 2. Prototyping
@@ -1,4 +1,4 @@
# 3.2. Functions
# 3.3. Paradigms

## What is a function?

1 change: 1 addition & 0 deletions docs/3. Refactoring/index.md
@@ -0,0 +1 @@
# 3. Refactoring
1 change: 0 additions & 1 deletion docs/4. Validating/4.0. Checkers.md

This file was deleted.

@@ -1,4 +1,4 @@
# 4.1. Typing
# 4.0. Typing

## What is Typing in Python?

@@ -30,4 +30,6 @@ The importance of typing in Python projects, particularly large-scale or complex
5. **Integrate with CI/CD Pipelines**: Incorporate mypy checks into your continuous integration/continuous deployment workflows to automatically catch type issues before they make it to production.
6. **Team Guidelines**: Establish team guidelines on how and when to use type annotations to maintain consistency across the codebase.
7. **Regular Reviews**: Regularly review the type annotations in your code, especially after major refactoring or updates to Python’s typing module, to ensure they remain accurate and useful.
8. **Leverage Advanced Features**: Explore advanced features of mypy, such as type inference, generic types, and custom type definitions, to handle more complex typing scenarios.
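
As a rough illustration of the generic types and custom type definitions mentioned in the last item, here is a short sketch that a checker such as mypy can verify. The names `first`, `Scores`, and `best_model` are hypothetical and not part of the course code.

```python
from typing import Sequence, TypeVar

# Generic type variable: the checker tracks the element type through the call.
T = TypeVar("T")


def first(items: Sequence[T]) -> T:
    """Return the first element, keeping its type visible to the type checker."""
    return items[0]


# Custom type alias (requires Python 3.9+ for the built-in generic dict).
Scores = dict[str, float]


def best_model(scores: Scores) -> str:
    """Return the name of the model with the highest score."""
    return max(scores, key=lambda name: scores[name])


print(first([3, 1, 2]))
print(best_model({"baseline": 0.71, "forest": 0.84}))
```

With these annotations, a call such as `first([1, 2]).upper()` should be reported by the type checker, since the inferred return type is `int`.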

TODO: Pandera, Pydantic
@@ -1,4 +1,4 @@
# 4.2. Linting
# 4.1. Linting

## What is Linting in Python?

@@ -1,4 +1,4 @@
# 4.3. Tests
# 4.2. Testing

## What are Tests in Python?

@@ -1,4 +1,4 @@
# 4.4. Logging
# 4.3. Logging

## What is Logging in Python?

3 changes: 3 additions & 0 deletions docs/4. Validating/4.4. Security.md
@@ -0,0 +1,3 @@
# 4.4. Security

ruff, bandit
1 change: 1 addition & 0 deletions docs/4. Validating/index.md
@@ -0,0 +1 @@
# 4.0 Validating
2 changes: 1 addition & 1 deletion docs/5. Refining/5.0. Patterns.md
@@ -1,3 +1,3 @@
# 5.6. Security
# 5.0. Patterns

Pydantic
3 changes: 3 additions & 0 deletions docs/5. Refining/5.4. Containers.md
@@ -0,0 +1,3 @@
# 5.4. Containers

Docker
5 changes: 0 additions & 5 deletions docs/5. Refining/5.4. Versions.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/5. Refining/5.5. Containers.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/5. Refining/5.5. Experiments.md
@@ -0,0 +1 @@
# 5.5. Experiments
1 change: 1 addition & 0 deletions docs/5. Refining/5.6. Model Registries.md
@@ -0,0 +1 @@
# 5.6. Model Registries
1 change: 0 additions & 1 deletion docs/5. Refining/5.6. Security.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/5. Refining/index.md
@@ -0,0 +1 @@
# 5. Refining
1 change: 1 addition & 0 deletions docs/6. Collaborating/6.0. Repository.md
@@ -0,0 +1 @@
# 6.0. Repository