Notes on tools, knowledge, and concepts for Data Engineering and some related roles (DevOps, DA, ...)
Data Engineering is a set of operations aimed at creating interfaces and mechanisms for the flow of and access to data. It is the field responsible for the architecture that moves data from one place to another, and for preparing data and making it available to data scientists for analysis.
- Data Pipeline: A data pipeline is a series of data processing elements connected in series, where the output of one element is the input of the next. Elements can include sources, processors, sinks, and storage.
- ETL: ETL stands for Extract, Transform, Load. It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
- Data Lake: A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning.
- Data Warehouse: A data warehouse is a system that stores data from multiple sources and transforms it into a format that analysts and business intelligence tools can use to perform complex queries and analysis.
- Data Mart: A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance.
- Data Lakehouse: A data lakehouse is a new data management paradigm that combines the best of data warehouses and data lakes. It provides the scalability and flexibility of a data lake with the performance and reliability of a data warehouse.
- Data Modeling: Data modeling is the process of creating a data model for an information system by applying formal data modeling techniques.
- Data Integration: Data integration is the process of combining data from different sources into a single, unified view.
- Data Transformation: Data transformation is the process of converting data from one format or structure into another format or structure.
- Data Ingestion: Data ingestion is the process of bringing data from external sources into a data storage system.
- Data Analysis: Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
A fairly comprehensive list of Data Engineering tools is collected in this repo:
- Orchestration
- Data Lake
- AWS S3 - Cloud storage
- Google Cloud Storage - Cloud storage
- Azure Data Lake Storage - Cloud storage
- MinIO - Open-source object storage
- HDFS - Hadoop Distributed File System
- Data Warehouse
- AWS Redshift - Cloud data warehouse
- Google BigQuery - Cloud data warehouse
- Azure Synapse Analytics - Cloud data warehouse
- Snowflake - Cloud data warehouse
- PostgreSQL - Open-source relational database
- Databricks - Unified data analytics platform
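Object stores like S3, GCS, and MinIO have no real directories, but data lakes conventionally encode partitions into the object keys (Hive-style `column=value` segments). A small, stdlib-only sketch of that layout, using a temporary local folder as a stand-in for a bucket (paths and field names are hypothetical):

```python
import json, tempfile
from pathlib import Path

# A temporary directory stands in for an object-store bucket.
lake = Path(tempfile.mkdtemp())

# Hypothetical raw events to land in the lake.
events = [{"id": 1, "ts": "2024-01-01"}, {"id": 2, "ts": "2024-01-02"}]

for e in events:
    # Hive-style partition segment (dt=...) embedded in the key.
    part = lake / "raw" / "events" / f"dt={e['ts']}"
    part.mkdir(parents=True, exist_ok=True)
    (part / f"{e['id']}.json").write_text(json.dumps(e))

print(sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*.json")))
# → ['raw/events/dt=2024-01-01/1.json', 'raw/events/dt=2024-01-02/2.json']
```

Query engines such as BigQuery, Redshift Spectrum, or Spark can prune partitions from keys laid out this way, so a filter on `dt` only reads the matching prefixes.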
- Vietnamese blogs & channels
- De Manejar
- Long Nguyen
- Data Guy Story
- ToiLaDuyet
- 200Labs