Merge branch 'development' of https://github.com/v3io/tutorials
adiso75 committed Jun 24, 2020
2 parents abe0bc6 + 5074de8 commit ee44820
Showing 3 changed files with 479 additions and 0 deletions.
Binary file added assets/images/pipeline-diagram.JPG
290 changes: 290 additions & 0 deletions data-ingestion-and-preparation/README.ipynb
@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data ingestion and preparation overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Overview](#gs-overview)\n",
"- [Basic flow](#basic)\n",
"- [Reading from external database](#databases)\n",
"- [Working with Spark](#spaek)\n",
"- [Working with streaming](#streaming)\n",
"- [Running SQL on Iguazio](#sql)\n",
"- [Managing Parquet files](#parquet)\n",
"- [Accessing Iguazio key value and time series](#frames)\n",
"- [Getting data from AWS S3](#s3)\n",
"- [Running dataframes on GPUs via Nvidia cuDF](#gpu)\n",
"- [Running distributed Python with Dask](#dask)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"gs-overview\"></a>\n",
"## Overview\n",
"This notebook provides an overview of the various options for collecting, storing and manipulating data in Iguazio with references to built-in <br> sample notebooks. The platform address the various requirements for manipulating data in each of the data science lifeccyle steps <br>\n",
"Thus, provides a wide set of tools for handling data using various fraemworks. <br> \n",
"\n",
"Working with data in the various stepts of the pipeline may require different tools and frameworks for each step especially when it comes to the difference <br>\n",
"mechanism needed when you are in your research phase vs an operational phase where you need to build an inferene production pipeline <br>\n",
"You may use different tools and technics within those various steps. <br>\n",
"In the notebook you'll find short sections explaining the various solutions with links to more detailed notebooks<br>\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<br><img src=\"../assets/images/pipeline-diagram.JPG\" alt=\"pipeline-diagram\" width=\"1000\"/><br>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"basic\"></a>\n",
"\n",
"## Basic flow\n",
"\n",
"In this notebook you’ll go through basic scenarios of getting and manipulating data in Iguazio along with explaining some core concepts of working with data on the system. <br> \n",
"Thie notebook covers how to get a csv file from an S3 bucket, convert it into a key value table using Spark, Run a SQL query and convert it into a parquet file <br>\n",
"[**basic tutorial**](getting-started-basic.ipynb) \n"
]
},
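{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration, here is a minimal sketch of that flow. It assumes a running Spark session, that the sample CSV from the [S3 section](#s3) below was copied into the \"users\" container, and that an `ISIN` column exists to serve as the table key; the paths, the key column, and the platform-specific NoSQL data-source format string are assumptions, not fixed values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: paths, the key column, and the NoSQL format string are assumptions\n",
"from pyspark.sql import SparkSession\n",
"\n",
"spark = SparkSession.builder.appName(\"basic-flow\").getOrCreate()\n",
"\n",
"# Read the CSV file into a Spark DataFrame\n",
"df = spark.read.csv(\"v3io://users/admin/examples/stocks.csv\", header=True, inferSchema=True)\n",
"\n",
"# Write it as a key-value (NoSQL) table, keyed on the assumed ISIN column\n",
"df.write.format(\"io.iguaz.v3io.spark.sql.kv\").option(\"key\", \"ISIN\") \\\n",
"    .save(\"v3io://users/admin/examples/stocks_kv\")\n",
"\n",
"# Run a SQL query over the data\n",
"df.createOrReplaceTempView(\"stocks\")\n",
"spark.sql(\"SELECT COUNT(*) FROM stocks\").show()\n",
"\n",
"# Convert the data to a Parquet file\n",
"df.write.parquet(\"v3io://users/admin/examples/stocks_parquet\")"
]
},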
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"databases\"></a>\n",
"## Reading data from external databases\n",
"\n",
"Here are several examples of how to fetch data from external databases\n",
"\n",
"### Using Spark over JDBC\n",
"Spark SQL includes a data source that can read data from other databases using Java database connectivity (JDBC).<br>\n",
"The results are returned as a Spark DataFrame that can easily be processed in Spark SQL or joined with other data sources. <br>\n",
"In this notebooks we have several examples of getting data from various databases such as MySQL , Oracle, Postgress and others <br>\n",
"[**Spark over JDBC**](spark-jdbc.ipynb)\n",
"\n",
"### Using SQL Alchemy \n",
"In this notebook users can learn how to work with SQLAlchemy which is the Python SQL toolkit and Object Relational Mapper <br>\n",
"that gives application developers the full power and flexibility of SQL and then read the data into dataframes and work with Python data frame on the dataset <br>\n",
"[**Using SQLAlchemy**](read-external-db.ipynb)"
]
},
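{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, here is a hedged sketch of a Spark-over-JDBC read; the connection URL, table name, and credentials are placeholders, and a running Spark session is assumed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the URL, table, and credentials below are placeholders\n",
"df = spark.read.format(\"jdbc\") \\\n",
"    .option(\"url\", \"jdbc:mysql://my-db-host:3306/mydb\") \\\n",
"    .option(\"dbtable\", \"mytable\") \\\n",
"    .option(\"user\", \"myuser\") \\\n",
"    .option(\"password\", \"mypassword\") \\\n",
"    .load()\n",
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And a similar sketch with SQLAlchemy and pandas; the connection string and table name are placeholders, and the matching database driver (e.g., PyMySQL) must be installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the connection string and table name are placeholders\n",
"import pandas as pd\n",
"from sqlalchemy import create_engine\n",
"\n",
"engine = create_engine(\"mysql+pymysql://myuser:mypassword@my-db-host:3306/mydb\")\n",
"df = pd.read_sql(\"SELECT * FROM mytable LIMIT 10\", engine)\n",
"df.head()"
]
},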
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"spark\"></a>\n",
"\n",
"## Working with Spark\n",
"\n",
"Spark is a built-in service in Iguazio and it uses for both analyzing and manipulating a large set of data in a distributed way as well as for analyzing streams using its Spark streaming module <br>\n",
"Spark lets you query structured data inside Spark programs by using either SQL or a familiar DataFrame API.<br>\n",
"DataFrames and SQL provide a common way to access a variety of data sources. <br>\n",
"In this notebook, you'll learn how to use Spark SQL and DataFrames to access objects, tables, and unstructured data that persist in the data containers of the Iguazio data platform. <br>\n",
"The platform's Spark drivers implement the data-source API and support predicate push down: <br>\n",
"the queries are passed to the platform's data store, which returns only the relevant data. <br>\n",
"This allow accelerated and high-speed access from Spark to data stored in the platform.<br>\n",
"[**Working with Spark**](spark-sql-analytics.ipynb)\n",
"\n",
"### Ingesting data using Spark\n",
"Spark can be useful for ingesting data into the system for both batch or micro batch processing <br>\n",
"https://www.iguazio.com/docs/tutorials/latest-release/getting-started/data-ingestion-w-spark-qs/\n"
]
},
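{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch of the pattern described above, the following reads a NoSQL table into a Spark DataFrame and queries it with Spark SQL; the table path, the filter value, and the data-source format string are assumptions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the path, format string, and filter value are assumptions\n",
"from pyspark.sql import SparkSession\n",
"\n",
"spark = SparkSession.builder.appName(\"spark-sql-example\").getOrCreate()\n",
"\n",
"# With predicate push-down, the WHERE filter is evaluated by the data store,\n",
"# which returns only the relevant rows to Spark\n",
"df = spark.read.format(\"io.iguaz.v3io.spark.sql.kv\") \\\n",
"    .load(\"v3io://users/admin/examples/stocks_kv\")\n",
"df.createOrReplaceTempView(\"stocks\")\n",
"spark.sql(\"SELECT * FROM stocks WHERE ISIN = 'DE0005190003'\").show()"
]
},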
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"streaming\"></a>\n",
"\n",
"## Working with Streaming\n",
"\n",
"### Using Nuclio to get data from common streaming engines\n",
"Iguazio has a serverless function (AKA Nuclio) that provides a mechanism to analyze and process real time events from various streaming engines. <br> \n",
"The following streaming frameworks are being supported:<br>\n",
"Kafaka, kinesis, Azure event hub, Iguazio stream, RabbitMQ, MQTT <br>\n",
"Using Nuclio functions for getting and analyzing streaming data in real time is a very common practice when building a real time pipeline. <br>\n",
"Streaming could be: images/videos, telemetry data , financial transactions, web clicks , sensors data etc.. <br> \n",
"Along with reading the stream events, customers can implement their own logic within the nuclio function to manipulate or enrich the data and prepare it for the next step in the pipeline.<br>\n",
"The Nuclio serverless function can sustain high load of workloads with very low latencies ,thus it makes it very useful for building an event driven applications with strict latency requirements.<br>\n",
"Here you can find additional information:<br>\n",
"https://www.iguazio.com/docs/intro/latest-release/serverless/\n",
"\n",
"### Iguazio built-in streaming (V3IO) \n",
"Iguazio has its own built-in streaming engine called V3IO stream. This stream is often being used in real time pipeline for writing into a queue. <br> \n",
"it can also serve as a streaming engine in a similar way to kafka or kinesis so users don't need to use an external one. <br>\n",
"In this notebook you’ll find an example of writing a nuclio function that uses Iguazio stream: <br>\n",
"[**Reading from V3IO stream**](../demos/real-time-user-segmentation.ipynb)\n",
"\n",
"### Spark streaming\n",
"Need an example\n"
]
},
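{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of the Nuclio pattern described above, here is a minimal sketch of a handler that enriches each incoming stream event; the JSON event body and the enrichment logic are assumptions made for the example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: assumes each event carries a JSON body\n",
"import json\n",
"\n",
"def handler(context, event):\n",
"    record = json.loads(event.body)\n",
"    # Enrich the event before passing it to the next step in the pipeline\n",
"    record[\"processed\"] = True\n",
"    context.logger.info(f\"processed event: {record}\")\n",
"    return json.dumps(record)"
]
},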
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"sql\"></a>\n",
"\n",
"## Runnig SQL queries on Iguazio data layer\n",
"\n",
"Users can run SQL queries to fetch data that resides on Iguazio key value and Parquet files. There are two options to run SQL query: <br>\n",
"\n",
"### Running full ANSII SQL queries\n",
"Iguazio has a built-in service called Presto. Presto is an open source that provides distributed framework for running SQL queries. <br>\n",
"To run a query using Presto from your notebook all you need to do is to use %SQL magic command and then write the query. <br>\n",
"Those queries will be running as distributed queries across Iguazio application nodes <br>\n",
"In this basic tutorial you can find an example for running such query using the %SQL magic command <br>\n",
"[**basic tutorial**](getting-started-basic.ipynb) <br>\n",
"\n",
"Note that for running queries on Parquet table you need to work with Hive tables. In this notebook you can find an example <br> \n",
"of a script that takes a csv file and convert it to a hive table. <br>\n",
"[**Working with Hive tables**](csv-to-hive.ipynb) <br>\n",
"\n",
"### Using Spark SQL\n",
"Spark is a built-in service in running in the platfrom and users can run SQL queries using Spark SQL <br>\n",
"[**Running SQL with Spark**](spark-sql-analytics.ipynb)\n",
"\n",
"\n",
"### Running SQL from a Nuclio function \n",
"In some cases users need to run a SQL query as part an event driven application. In this notebook you'll see an example of <br>\n",
"running a SQL query as part of a nuclio function <br>\n",
"[**Running SQL from Nuclio function**](nuclio-read-via-presto.ipynb)\n"
]
},
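{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a Presto query via the %sql magic might look like the following sketch; the catalog/container/table path is an assumption based on the key-value table created in the basic-flow example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the table path is an assumption\n",
"%sql SELECT * FROM v3io.users.\"admin/examples/stocks_kv\" LIMIT 10"
]
},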
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"parquet\"></a>\n",
"\n",
"## Managing Parquet files\n",
"\n",
"Parquet is a columnar storage format that provides high-density high-performance file organization. The following section demonstrates how to create and write data to a Parquet table in the Iguazio Data Science Platform and read its contents. For information about <br> reading and writing Parquet files from Python applications view the notebook below: <br>\n",
"[**Managing Parquet files**](parquet-read-write.ipynb) <br>\n",
"\n",
"Once you bring Parquet files into the platform you may want to create hive tables and run SQL queries on those tables <br>\n",
"This notebook covers the way to do it: <br>\n",
"[**Convert Parquet to hive**](parquet-to-hive.ipynb)\n"
]
},
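{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of writing and reading a Parquet file with pandas; the file path is illustrative, and the pyarrow or fastparquet engine must be installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the file path is illustrative\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\"symbol\": [\"AAPL\", \"GOOG\"], \"price\": [135.5, 1250.0]})\n",
"df.to_parquet(\"/User/examples/stocks.parquet\")\n",
"\n",
"df2 = pd.read_parquet(\"/User/examples/stocks.parquet\")\n",
"df2.head()"
]
},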
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"frames\"></a>\n",
"\n",
"## Accessing Iguazio's key value and Time series data using a native library (Frames)\n",
"\n",
"Iguazio has a multi model data layer that store data as key value and time series format. Users can access the data using various tools and APIs such as Spark, SQL or pandas dataframe and it can also access the data using a build-in framework called Frames. \n",
"Frames is a multi-model open-source data-access library, developed by Iguazio, which provides a unified high-performance DataFrame API for working with data in the data store of the Iguazio Data Science Platform. Frames currently supports the NoSQL (key/value) and time-series (TSDB) data models via its kv and tsdb backends.\n",
"This notebook explain how to work with Frames: <br>\n",
"[**Working with Frames**](frames.ipynb)\n"
]
},
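{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the Frames kv backend, assuming the platform's framesd service is reachable at its default in-cluster address and that a \"users\" container exists; the service address, container, and table path are assumptions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the service address, container, and table path are assumptions\n",
"import pandas as pd\n",
"import v3io_frames as v3f\n",
"\n",
"client = v3f.Client(\"framesd:8081\", container=\"users\")\n",
"\n",
"# The kv backend uses the DataFrame index as the table's primary key\n",
"df = pd.DataFrame({\"symbol\": [\"AAPL\", \"GOOG\"], \"price\": [135.5, 1250.0]}).set_index(\"symbol\")\n",
"client.write(\"kv\", table=\"examples/stocks_kv_frames\", dfs=df)\n",
"\n",
"out = client.read(\"kv\", table=\"examples/stocks_kv_frames\")\n",
"out.head()"
]
},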
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"s3\"></a>\n",
"\n",
"## Getting data from AWS S3\n",
"\n",
"The simple way for getting data from S3 is just by using a curl command sending a request to the AWS S3 bucket\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"CSV_PATH=\"/User/examples/stocks.csv\"\n",
"curl -L \"iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv\" > ${CSV_PATH}"
]
},
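{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once downloaded, the file can be loaded into a pandas DataFrame for inspection:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the downloaded CSV into pandas for a quick look\n",
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"/User/examples/stocks.csv\")\n",
"df.head()"
]
},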
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Future – need to add a reference to the XCP tool and how to load bulk of data from S3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"dask\"></a>\n",
"\n",
"## Running distributed python code with Dask\n",
"\n",
"The dask frameworks enabling users to parallelize internal systems as not all computations fit into a dataframe. Dask exposes lower-level APIs letting you build custom systems for in-house applications. This helps parallelize python processes and dramatically accelerate their performance <br>\n",
"[**Configuring a dask cluster**](dask-cluster.ipynb)\n"
]
},
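{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of parallelizing work with Dask; it starts a local cluster for simplicity (on the platform you would typically pass the scheduler address of the cluster configured in the notebook above), and the CSV path is illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: uses a local cluster; the CSV path is illustrative\n",
"import dask.dataframe as dd\n",
"from dask.distributed import Client\n",
"\n",
"client = Client()  # pass a scheduler address here to use a remote Dask cluster\n",
"\n",
"# Read the CSV in parallel and compute a summary across the workers\n",
"ddf = dd.read_csv(\"/User/examples/stocks.csv\")\n",
"ddf.describe().compute()"
]
},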
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"gpu\"></a>\n",
"\n",
"## Running dataframes on GPUs via Nvidia cuDF"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iguazio has deployed the Nvidia Rapids library as part of its solution enabling users to easily access GPU resources. <br>\n",
"cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. <br>\n",
"cuDF provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming. <br>\n",
"This notebook provide a sample example of using cuDF <br>\n",
"[**cudf vs pandas dataframe**](gpu-cudf-vs-pd.ipynb)\n"
]
},
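{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the pandas-like cuDF API; it requires a GPU with the RAPIDS cuDF package installed, and the CSV path and column names are assumptions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: requires a GPU; the path and column names are assumptions\n",
"import cudf\n",
"\n",
"gdf = cudf.read_csv(\"/User/examples/stocks.csv\")\n",
"# Familiar pandas-style operations, executed on the GPU\n",
"gdf.groupby(\"ISIN\")[\"EndPrice\"].mean().head()"
]
},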
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}