0.2 done
fmind committed Mar 23, 2024
1 parent 010e698 commit f45e012
Showing 2 changed files with 25 additions and 56 deletions.
2 changes: 1 addition & 1 deletion docs/0. Overview/0.1. Projects.md
@@ -16,7 +16,7 @@ If you're looking for inspiration, several platforms host data science competiti

- Kaggle: The largest global data science community, offering tools and resources to achieve your data science ambitions.
- DrivenData: Hosts competitions for data scientists to address the world's biggest challenges with innovative predictive models.
i- DataCamp: Offers real-world data science competitions to hone your skills and win prizes, showcasing your notebooks.
- DataCamp: Offers real-world data science competitions to hone your skills and win prizes, showcasing your notebooks.

## Can I work on an LLM project?

79 changes: 24 additions & 55 deletions docs/0. Overview/0.2. Datasets.md
@@ -1,75 +1,44 @@
# 0.2. Datasets

Data is often referred to as the fuel for Machine Learning, and although this course focuses on MLOps, it's crucial to have access to data to fully grasp the various concepts and technologies involved.

## What is a dataset?

## Why do I need a dataset?

## What are the types of dataset?

When mentionning data, the first point is perhaps what are we talking about. When exploiting the model, data will be required at every step and will take many forms, be stored on different supports and will have different properties.
A dataset is a structured collection of data, essential for initiating any AI/ML project. It can vary in form and structure but is crucial for understanding the project's scope, strengths, and limitations. A proficient ML engineer often dedicates more time to data processing than to modeling, in line with the often-cited 80-20 rule of thumb (roughly 80% of the effort on data preparation, 20% on modeling). It's common for industrial projects to start with raw datasets that require thorough cleaning and exploration before proceeding to the modeling phase.

Briefly we can note the following data types:
Datasets are the backbone of machine learning, playing a pivotal role at every project stage. The quality and volume of your data significantly influence the performance of your models, often more than fine-tuning the models themselves.

### Structured Data
Structured data adheres to a predefined model, making it easier to search and organize.

* *Tabular*:
Perhaps the most common type of data, where data is organize in rows and columns.
* Column are homogeneous in terms of types
* Typically CSV files and Relational database
* *Time Series*:
Sequence of data points collected or recorded at successive points in time, usually at consistent intervals.
* Characterized by temporal order, meaning the sequence of observations is crucial, and changing the order can alter the meaning or interpretation of the data.
* Typically, financial data and energy
* *Geospatial*:
Data representing a specific location or geographic area on earth
* Can be represented as latitude, longitude or other type of geographic indicators
* Useful to anlyse and visual spatial relationship and patterns
* Typically handled by Geographic Information System (GIS)
* *Graph*
Data are organised around vertices and edges that connects them representing entities and relationships between them.
* useful to model complex networks and many real workd syste,s
* graph can be directed, undirected , weighted, multiple, cyclic, acyclic

### Unstructured Data
## When is the dataset used?

Unstructured data does not follow a predefined model, making it more complex to process.
Datasets are integral throughout the lifecycle of an AI/ML project, including:

* *Text*:
The most common form of unstructured data corresponding to written content.
* can be simple strings to entire book
* characterized by high number of unique words, contextual meaning and ambiguity
* *Multimedia*:
Refer to picture, sound, video data
* challenging due to the high dimensionality, large file sizes, and the complexity of extracting meaningful patterns.
- Exploration: Understanding your data, examining the relationships between input variables and the target prediction (correlations, etc.).
- Data Processing: Crafting optimal features to encapsulate the predictive context and meaningfully splitting your data.
- Model Tuning: Optimizing model hyperparameters with cross-validation to ensure generalizability.
- Model Evaluation: Determining your model's performance on unseen data and identifying areas of improvement.
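
The splitting and cross-validation steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's implementation: the helper names `train_test_split` and `k_fold_indices`, the 20% test ratio, and the toy `rows` list are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


# Hypothetical dataset: 100 rows identified by their index.
rows = list(range(100))

# Hold out a test set for the final evaluation on unseen data.
train, test = train_test_split(rows)

# Cross-validation folds are built from the training portion only,
# so the test set never influences hyperparameter tuning.
folds = list(k_fold_indices(len(train), k=5))
```

Keeping the test set out of the folds is the design choice that matters here: tuning uses only the training rows, and evaluation uses rows the model has never seen.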

### Semi Structured Data

Data that does not conform to a rigid data model like structured data, but it does contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields, making it easier to parse than unstructured data.
Examples are XML and JSON files.
## What are the types of dataset?

## Which dataset should I use?
Datasets can be broadly classified into structured, unstructured, and semi-structured categories:

The question of which dataset to use is common, and honestly, the best dataset is the one you're most familiar with.
While the vast array of data types and their diverse applications might seem overwhelming, it's important to remember that many MLOps concepts are universal and can be applied across different domains.
### Structured Data

We will look into the specificities of certain types of applications later in the course. For now, we offer two options for getting started.
This data conforms to a predefined schema, facilitating easier organization and retrieval.

### Option 1: Use your own dataset (recommended)
- Tabular Data: Organized in rows and columns, with each column being type-homogeneous. Commonly found in CSV files and relational databases.
- Time Series Data: Data points collected over time at consistent intervals. The sequence of data is vital, as altering it can change the dataset's meaning. Often used in financial and energy data analysis.
- Geospatial Data: Represents specific locations or geographical areas on Earth, crucial for analyzing spatial relationships and patterns, and commonly managed with Geographic Information Systems (GIS).
- Graph Data: Comprises vertices and edges, representing entities and their interrelations. Graph data can model complex networks and systems, and may be directed, undirected, weighted, or cyclic.
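
To make the four structured forms above more concrete, here is a small sketch of how each might be held in plain Python. All values (the CSV content, timestamps, coordinates, and graph) are invented for illustration, and real projects would typically rely on dedicated libraries instead.

```python
import csv
import io
from datetime import datetime, timedelta

# Tabular: rows and columns, each column holding a single type.
csv_text = "id,price\n1,10.5\n2,12.0\n"
table = [
    {"id": int(row["id"]), "price": float(row["price"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Time series: observations ordered by timestamp at a fixed interval.
start = datetime(2024, 1, 1)
series = [(start + timedelta(hours=i), value) for i, value in enumerate([3.0, 3.5, 4.1])]

# Geospatial: a location given as (latitude, longitude) in decimal degrees.
locations = {"Paris": (48.8566, 2.3522), "Luxembourg": (49.6116, 6.1319)}

# Graph: a directed, weighted graph stored as an adjacency mapping.
graph = {"A": {"B": 1.0, "C": 2.5}, "B": {"C": 0.5}, "C": {}}
edges = [(src, dst, w) for src, nbrs in graph.items() for dst, w in nbrs.items()]
```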

Whenever possible, applying what you learn to your own dataset is highly recommended. This approach has several advantages:
### Unstructured Data

* Relevance: Working with your data means the insights and models you develop are immediately applicable to your projects or interests.
* Familiarity: You are likely more familiar with the specificties of your own data, which will ease the analysis and troubleshooting.
* Customization: It allows you to tailor the MLOps processes and solutions directly to your context.
Unstructured data lacks a predefined format, making it more challenging to process and interpret.

### Option 2: Use a dataset from Kaggle
- Text: Ranges from simple strings to entire books, characterized by a vast vocabulary and inherent ambiguities.
- Multimedia: Includes images, audio, and video. The complexity lies in their high dimensionality, file size, and the difficulty of extracting meaningful insights.
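
As a rough illustration of the vocabulary and ambiguity points for text, the sketch below tokenizes a made-up sentence pair with a deliberately naive regular expression. The sample text and the tokenization rule are assumptions for the example, not a recommended preprocessing pipeline.

```python
import re
from collections import Counter

# Hypothetical sample: "bank" appears twice with two different meanings.
text = "The bank raised rates. We sat on the river bank."

# Naive tokenization: lowercase, then keep runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", text.lower())

# The vocabulary maps each unique token to its number of occurrences.
vocabulary = Counter(tokens)
```

Both occurrences of "bank" collapse into a single vocabulary entry, although one refers to a financial institution and the other to a riverside. Counting tokens alone cannot separate the two senses, which is one reason text is harder to process than tabular data.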

If you don't have your own dataset or are looking for a new challenge, we recommend starting with a well-documented and widely-used public dataset. For this course, we suggest the following dataset from Kaggle:
### Semi Structured Data

- Name: House Prices - Advanced Regression Techniques
- Source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
Though not adhering to a strict data model like structured data, semi-structured data includes markers or tags to delineate semantic elements, simplifying parsing. Examples include XML and JSON files.
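
The following sketch shows how the tags and markers in semi-structured data make parsing straightforward, using Python's standard `json` and `xml.etree.ElementTree` modules. The `house` record and its fields are hypothetical and chosen only for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record, expressed once as JSON and once as XML.
json_text = '{"house": {"id": 1, "rooms": 3, "tags": ["garden", "garage"]}}'
xml_text = '<house id="1"><rooms>3</rooms><tag>garden</tag><tag>garage</tag></house>'

# JSON keys delimit the fields, so parsing yields nested dicts and lists.
house_json = json.loads(json_text)["house"]

# XML tags and attributes play the same role, but values arrive as strings.
house_xml = ET.fromstring(xml_text)
record = {
    "id": int(house_xml.get("id")),
    "rooms": int(house_xml.findtext("rooms")),
    "tags": [tag.text for tag in house_xml.findall("tag")],
}
```

Neither format enforces a fixed schema here: a second record could add or omit fields and both parsers would still accept it, which is what distinguishes semi-structured from structured data.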

## Which dataset should I use?

Selecting the right dataset is a common question, and the simple answer is: the best dataset is the one you're most familiar with. Despite the array of data types and applications, the core concepts of MLOps are universally applicable across different domains.
