section 0 done
fmind committed Apr 25, 2024
1 parent 1006f17 commit 8b437a1
Showing 10 changed files with 55 additions and 47 deletions.
22 changes: 11 additions & 11 deletions docs/0. Overview/0.0. Course.md
@@ -4,17 +4,17 @@

Welcome to our comprehensive course designed to elevate your Python programming from basic notebooks to crafting a sophisticated, production-grade AI/ML codebase. Throughout this journey, you will learn:

-- How to build and deploy production-worthy software artifacts.
+- How to build and deploy production-grade software artifacts.
- Transitioning from prototyping in notebooks to developing structured Python packages.
- Enhancing code reliability and maintenance through linting and testing tools.
- Streamlining repetitive tasks using automation, both locally and via CI/CD pipelines.
- Adopting best practices to develop a versatile and resilient AI/ML codebase.

## Is there a fee for this course?

-We are delighted to offer this course at no cost, under the Creative Commons Attribution 4.0 International license. This means you can adapt, share, and even use the content for commercial purposes, provided you attribute the original authors.
+We offer this course at no cost, under the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/deed.en). This means you can adapt, share, and even use the content for commercial purposes, provided you attribute the original authors.

-Additionally, for those seeking a deeper understanding, we provide extra support options, including personal mentoring sessions and access to online assistance.
+Additionally, for those seeking a deeper understanding, we provide extra support options, including [personal mentoring sessions](./0.4. Mentoring.md) and access to [online assistance](./0.5. Assistants.md).

## Why enroll in this course?

@@ -34,17 +34,17 @@ To get the most out of this course, you should have:

The course is divided into six in-depth chapters, each focusing on different facets of coding and project management skills:

-1. **Initialization**: Equip yourself with the necessary tools and platforms for your development environment.
-2. **Prototyping**: Begin with notebooks to dive into data science projects and pinpoint viable solutions.
-3. **Refactoring**: Transform your prototype into a neatly organized Python package, complete with scripts, configurations, and documentation.
-4. **Validation**: Adopt practices like typing, linting, testing, and logging to refine code quality.
-5. **Refinement**: Leverage advanced software development techniques and tools to polish your project.
-6. **Collaboration**: Foster a productive team environment for effective contributions and communication.
+1. **[Initializing](../../1. Initializing/)**: Equip yourself with the necessary tools and platforms for your development environment.
+2. **[Prototyping](../../2. Prototyping/)**: Begin with notebooks to dive into data science projects and pinpoint viable solutions.
+3. **[Refactoring](../../3. Refactoring/)**: Transform your prototype into a neatly organized Python package, complete with scripts, configurations, and documentation.
+4. **[Validating](../../4. Validating/)**: Adopt practices like typing, linting, testing, and logging to refine code quality.
+5. **[Refining](../../5. Refining/)**: Leverage advanced software development techniques and tools to polish your project.
+6. **[Sharing](../../6. Sharing/)**: Foster a productive team environment for effective contributions and communication.

## What's beyond the scope of this course?

-While this course provides a solid grounding in managing AI/ML projects, it does not delve into specific MLOps platforms like SageMaker, Vertex AI, Azure ML, or Databricks as online courses already cover these end-to-end platforms. Instead, this course focuses on core principles and practices that are universally applicable, whether you're working on-premise, cloud-based, or in a hybrid setting.
+While this course provides a solid grounding in managing AI/ML projects, it does not delve into specific MLOps platforms like [SageMaker](https://aws.amazon.com/sagemaker/), [Vertex AI](https://cloud.google.com/vertex-ai/), [Azure ML](https://azure.microsoft.com/en-us/products/machine-learning), or [Databricks](https://www.databricks.com/) as vendor courses already cover these end-to-end platforms. Instead, this course focuses on core principles and practices that are universally applicable, whether you're working on-premise, cloud-based, or in a hybrid setting.

## How much time do you need to complete this course?

-The time required to complete this course varies based on your prior experience and familiarity with the covered tools and practices. If you're already comfortable with tools like Git or VS Code, you may progress faster. The course philosophy encourages incremental improvement—"make it done, make it right, make it fast"—urging you to begin with a functional project version and steadily refine it for better quality and efficiency.
+The time required to complete this course varies based on your prior experience and familiarity with the covered tools and practices. If you're already comfortable with tools like [Git](https://git-scm.com/) or [VS Code](https://code.visualstudio.com/), you may progress faster. The course philosophy encourages incremental improvement following the "make it done, make it right, make it fast" mantra, encouraging you to begin with a functional project version and steadily refine it for better quality and efficiency.
12 changes: 6 additions & 6 deletions docs/0. Overview/0.1. Projects.md
@@ -2,9 +2,9 @@

## What is the default learning project?

-The cornerstone project of this course involves a forecasting task using the Bike Sharing Dataset. The objective is to predict the number of bike rentals based on variables like date and time, weather conditions, and past rental data.
+The cornerstone project of this course involves a forecasting task using the [Bike Sharing Demand dataset](https://www.kaggle.com/c/bike-sharing-demand). The objective is to predict the number of bike rentals based on variables like date and time, weather conditions, and past rental data. A [reference implementation](https://github.com/fmind/mlops-python-package) is provided to fall back on if needed.

-Forecasting is a critical skill with wide-ranging applications in academia and industry, utilizing diverse machine learning techniques. This project introduces challenges such as managing data subsets to prevent data leakage, where future information could wrongly influence past predictions. Through tackling this project, you'll gain hands-on experience in structuring MLOps projects effectively, offering a solid foundation for your learning journey.
+[Forecasting](https://en.wikipedia.org/wiki/Forecasting) is a critical skill with wide-ranging applications in academia and industry, utilizing diverse machine learning techniques. This project introduces challenges such as managing data subsets to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)), where future information could wrongly influence past predictions. Through tackling this project, you'll gain hands-on experience in structuring MLOps projects effectively, offering a solid foundation for your learning journey.
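To make the leakage point concrete, here is a minimal sketch of a time-aware split, using made-up hourly data rather than the real dataset (the column names merely mirror the Bike Sharing Demand schema):

```python
import pandas as pd

# Made-up hourly rental counts standing in for the real dataset.
data = pd.DataFrame({
    "datetime": pd.date_range("2011-01-01", periods=100, freq="h"),
    "count": range(100),
})

# Split on time, not at random: shuffling rows would let future
# observations leak into the training set.
cutoff = data["datetime"].iloc[int(len(data) * 0.8)]
train = data[data["datetime"] < cutoff]
test = data[data["datetime"] >= cutoff]

# Every training timestamp precedes every test timestamp.
assert train["datetime"].max() < test["datetime"].min()
```

The contrast with a shuffled random split is exactly the kind of design decision this project forces you to think about.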

## Is it possible to select a personal project instead?

@@ -14,12 +14,12 @@ Absolutely! We encourage you to dive into a project that resonates with you pers

Looking for inspiration? There are several online platforms offering data science challenges, complete with datasets and clearly defined objectives:

-- **Kaggle**: A hub for data scientists worldwide, Kaggle provides the tools and community support needed to pursue your data science aspirations.
-- **DrivenData**: Hosts competitions where data scientists can address significant societal challenges through innovative predictive modeling.
-- **DataCamp**: Offers real-world data science competitions, allowing participants to hone their skills, win accolades, and present their solutions.
+- **[Kaggle](https://www.kaggle.com/)**: A hub for data scientists worldwide, Kaggle provides the tools and community support needed to pursue your data science aspirations.
+- **[DrivenData](https://www.drivendata.org/)**: Hosts competitions where data scientists can address significant societal challenges through innovative predictive modeling.
+- **[DataCamp](https://www.datacamp.com/)**: Offers real-world data science competitions, allowing participants to hone their skills, win accolades, and present their solutions.

## Can you work on a Large Language Model (LLM) project?

-Working on projects centered around Large Language Models (LLM) and generative AI does hold similarities with predictive ML projects, particularly in the areas of model management and code structuring. However, LLM projects also present distinct challenges. Evaluating LLMs can be more intricate, sometimes necessitating the use of external LLMs for thorough testing. Additionally, the training and fine-tuning of LLMs typically demand specific hardware, like high-memory GPUs, and adhere to different methodologies compared to conventional ML tasks.
+Working on projects centered around [Large Language Models (LLM)](https://en.wikipedia.org/wiki/Large_language_model) and [Generative AI](https://en.wikipedia.org/wiki/Generative_artificial_intelligence) does hold similarities with predictive ML projects, particularly in the areas of model management and code structuring. However, LLM projects also present distinct challenges. Evaluating LLMs can be more intricate, sometimes necessitating the use of external LLMs for thorough testing. Additionally, the training and fine-tuning of LLMs typically demand specific hardware, like high-memory GPUs, and adhere to different methodologies compared to conventional ML tasks.

Therefore, we recommend starting with a predictive ML project to get acquainted with fundamental MLOps practices. These core skills will then be easier to adapt and apply to LLM projects, easing the progression to these more specialized areas.
16 changes: 9 additions & 7 deletions docs/0. Overview/0.2. Datasets.md
@@ -2,18 +2,18 @@

## What is a dataset?

-A dataset is a meticulously organized collection of data, serving as the cornerstone for any AI or Machine Learning (ML) initiative. The structure of a dataset might vary, yet its pivotal role in shaping the scope, capabilities, and challenges of a project remains undisputed. Data preparation, involving substantial cleaning and exploration of raw data, often consumes the lion's share of a Machine Learning engineer's efforts, adhering to the adage that 80% of the work pertains to data processing, leaving only 20% for modeling. This preparation phase is crucial, setting the stage for the subsequent modeling efforts.
+A [dataset](https://en.wikipedia.org/wiki/Data_set) is an organized collection of data, serving as the cornerstone for any AI or Machine Learning (ML) initiative. The structure of a dataset might vary, yet its pivotal role in shaping the scope, capabilities, and challenges of a project remains undisputed. Data preparation, involving substantial cleaning and exploration of raw data, often consumes the lion's share of a Machine Learning engineer's efforts, [adhering to the adage that 80% of the work pertains to data processing, leaving only 20% for modeling](https://www.kaggle.com/discussions/questions-and-answers/268748). Yet, this preparation phase is crucial, setting the stage for the subsequent modeling efforts.

-**The impact of a dataset's quality and size on the outcomes of a model is profound, frequently surpassing the effects of model adjustments**.
+**The impact of a dataset's quality and size on the outcomes of a model is profound, frequently surpassing the effects of model adjustments. This impact is embedded into the ["Garbage in, garbage out"](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out) concept related to data science projects.**

## When is the dataset used?

Datasets are integral throughout the AI/ML project lifecycle, playing pivotal roles in various stages:

-- **Exploration**: This phase involves delving into the dataset to unearth insights, study variable relationships, and discern patterns that could influence future predictions.
-- **Data Processing**: At this stage, the focus is on crafting features that encapsulate the predictive essence of the data and on partitioning the dataset effectively to gear up for the modeling phase.
-- **Model Tuning**: Here, the objective is to refine the model's hyperparameters through strategies like cross-validation to bolster the model's generalization capability.
-- **Model Evaluation**: This final step entails evaluating the model's performance on unseen data and identifying areas for potential improvement.
+- **[Exploration](https://en.wikipedia.org/wiki/Data_exploration)**: This phase involves delving into the dataset to find insights, study variable relationships, and discern patterns that could influence future predictions.
+- **[Data Processing](https://en.wikipedia.org/wiki/Data_processing)**: At this stage, the focus is on crafting features that encapsulate the predictive essence of the data and on partitioning the dataset effectively to gear up for the modeling phase.
+- **[Model Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization)**: Here, the objective is to refine the model's hyperparameters through strategies like cross-validation to bolster the model's generalization capability.
+- **[Model Evaluation](https://en.wikipedia.org/wiki/Evaluation)**: This final step entails evaluating the model's performance on unseen data and identifying areas for potential improvement.
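The tuning and evaluation stages above can be sketched as follows, using synthetic data and an arbitrary parameter grid purely for illustration: hyperparameters are selected by cross-validation on the training portion, and only then is the model scored on held-out data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a prepared feature matrix and target.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Keep an evaluation set apart from everything used for tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated grid search over a small (arbitrary) hyperparameter grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, None]},
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the selected model once, on data it has never seen.
print(search.best_params_, search.score(X_test, y_test))
```

The key discipline is the order of operations: the test set plays no role until the hyperparameters are fixed.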

## What are the types of datasets?

@@ -41,4 +41,6 @@ Positioned between structured and unstructured data, semi-structured data does n

## Which dataset should you use?

-The decision on which dataset to use often boils down to a balance between familiarity and exploration of new data. A simple rule of thumb is to opt for the dataset you are most acquainted with. Regardless of the diversity in data types and applications, the core principles of MLOps are applicable across various domains. Therefore, starting with a well-understood dataset allows you to concentrate on honing your MLOps skills rather than untangling the complexities of an unfamiliar dataset.
+The decision on which dataset to use often boils down to a balance between familiarity and exploration of new data. A simple rule of thumb is to opt for the dataset you are most acquainted with. Regardless of the diversity in data types and applications, the core principles of MLOps are applicable across various domains. Therefore, starting with a well-understood dataset allows you to concentrate on honing your MLOps skills rather than untangling the complexities of an unfamiliar dataset.
+
+As mentioned in the [previous section](./0.1. Projects.md), the course uses the [Bike Sharing Demand dataset](https://www.kaggle.com/c/bike-sharing-demand/data) by default. You are free to use any other datasets, either for personal or professional purposes.
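As a small taste of working with that default dataset, this sketch derives calendar features from a `datetime` column, shown on a tiny hand-made frame rather than the downloaded `train.csv` (the `datetime` and `count` column names come from the competition's data page; the values are invented):

```python
import pandas as pd

# Tiny hand-made frame with the same columns as the Kaggle train.csv.
data = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00", "2011-01-01 01:00"]),
    "count": [16, 40],
})

# Typical first features for this task: calendar fields from the timestamp.
data["hour"] = data["datetime"].dt.hour
data["weekday"] = data["datetime"].dt.dayofweek

print(data[["hour", "weekday", "count"]])
```

Features like these are usually where the exploration phase of the project begins.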