[BLOG] Add ibis-python-data-analysis-productivity-framework #655
Open: MarsBarLee wants to merge 8 commits into develop from ibis-python-data-analysis-productivity-framework (+250 −0)
Commits:
- e9da251 Add files (MarsBarLee)
- 1085fe8 Update description and links (MarsBarLee)
- acd9403 Update link formatting (MarsBarLee)
- 4053b5d Update link formatting (MarsBarLee)
- 2875adb Update hero and feature image (MarsBarLee)
- 79bf5c4 Update file paths (MarsBarLee)
- 8babdff Update image file paths (MarsBarLee)
- 8a47592 Add line break (MarsBarLee)
248 changes: 248 additions & 0 deletions
apps/labs/posts/ibis-python-data-analysis-productivity-framework.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,248 @@ | ||
--- | ||
title: "Ibis: Python data analysis productivity framework" | ||
author: ivan-ogasawara | ||
published: July 9, 2019 | ||
description: 'Over the last few months, OmniSci and Quansight worked together to add an OmniSciDB (formerly MapD Core) backend to Ibis. The implementation of this new backend also resulted in new expressions/operators in Ibis core, such as geospatial data types and operations, trigonometric operations, and some statistical operations.' | ||
category: [PyData ecosystem] | ||
featuredImage: | ||
src: /posts/ibis-python-data-analysis-productivity-framework/blog_feature_var2.svg | ||
alt: 'An illustration of a brown and a white hand coming towards each other to pass a business card with the logo of Quansight Labs.' | ||
hero: | ||
imageSrc: /posts/ibis-python-data-analysis-productivity-framework/blog_hero_var1.svg | ||
imageAlt: 'An illustration of a brown hand holding up a microphone, with some graphical elements highlighting the top of the microphone.' | ||
--- | ||
|
||
Ibis is a data analysis library that provides a pandas-like API. | ||
Operations such as filtering, adding columns, and applying math | ||
functions work in a `lazy` mode: they are only recorded in memory as an | ||
expression and are not executed right away. When you ask for the result | ||
of the expression you created, Ibis compiles it and sends a request to | ||
the remote server (a remote storage and execution system such as a | ||
Hadoop component or a SQL database). Its goal is to simplify analytical | ||
workflows and make you more productive. | ||
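
To make the `lazy` idea more concrete, here is a minimal sketch of deferred evaluation. It is not how Ibis is implemented: `LazyQuery` and its methods are hypothetical names, Python's built-in `sqlite3` stands in for the remote server, and the sample rows are made up. Each method only records what was asked for; SQL is built when `compile()` is called and nothing touches the database until `execute()`.

```python
import sqlite3


class LazyQuery:
    """Toy deferred query (hypothetical, not the Ibis API)."""

    def __init__(self, table, columns=("*",), predicates=(), limit=None):
        self.table = table
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)
        self.limit = limit

    def select(self, *columns):
        # Only records the requested columns; no query is sent.
        return LazyQuery(self.table, columns, self.predicates, self.limit)

    def filter(self, predicate):
        # Only records the predicate; no query is sent.
        return LazyQuery(
            self.table, self.columns, self.predicates + (predicate,), self.limit
        )

    def head(self, n):
        # Only records the row limit; no query is sent.
        return LazyQuery(self.table, self.columns, self.predicates, n)

    def compile(self):
        # Turn the recorded operations into a single SQL string.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.limit is not None:
            sql += f" LIMIT {self.limit}"
        return sql

    def execute(self, connection):
        # The database is contacted only here.
        return connection.execute(self.compile()).fetchall()


# In-memory database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zipcodes (zip TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO zipcodes VALUES (?, ?)",
    [("10001", 21102), ("94105", 5846), ("60601", 11110)],
)

expr = LazyQuery("zipcodes").select("zip").filter("population > 10000").head(2)
sql_text = expr.compile()   # still no database access
rows = expr.execute(conn)   # the query runs now
print(sql_text)
print(rows)
```

The point of the sketch is only the ordering of events: building `expr` is cheap and local, and the work happens once, on the server side, when the result is requested.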
|
||
Ibis was created by [Wes McKinney](https://github.com/wesm) and is | ||
mainly maintained by [Phillip Cloud](https://github.com/cpcloud) and | ||
[Krisztián Szűcs](https://github.com/kszucs). Also, recently, I was | ||
invited to become a maintainer of the Ibis repository! | ||
|
||
Maybe you are thinking: "Why should I use Ibis?" Well, if you have | ||
any of the following issues, you should probably consider using Ibis in | ||
your analytical workflow! | ||
|
||
- if you need to get data from a SQL database but you don't know much | ||
about SQL \... | ||
- if you create SQL statements manually using strings and have a lot of | ||
`IF`s in your code that compose specific parts of your SQL code | ||
(this can be pretty hard to maintain and makes your code | ||
pretty ugly) \... | ||
- if you need to handle large volumes of data \... | ||
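
For a picture of the second point, this is roughly what hand-built SQL with a pile of `if`s tends to look like. The function name, parameters, and column names (`build_query`, `min_population`, `state`) are hypothetical and exist only for this illustration.

```python
def build_query(table, columns=None, min_population=None, state=None, limit=None):
    """Compose a SQL string by hand; every optional part needs its own `if`."""
    sql = "SELECT " + (", ".join(columns) if columns else "*") + " FROM " + table

    conditions = []
    if min_population is not None:
        conditions.append(f"population >= {min_population}")
    if state is not None:
        # Interpolating values into SQL like this is also unsafe
        # (SQL injection) when the value comes from user input.
        conditions.append(f"state = '{state}'")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)

    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql


plain = build_query("zipcodes")
filtered = build_query("zipcodes", columns=["zip"], min_population=10000, limit=5)
print(plain)
print(filtered)
```

Every new optional filter adds another branch and another chance to misplace a `WHERE` or an `AND`, which is the maintenance burden an expression API is meant to remove.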
|
||
If you want to learn more about Ibis, take a look at this | ||
tutorial: | ||
|
||
- [https://docs.ibis-project.org/tutorial.html](https://docs.ibis-project.org/tutorial.html) | ||
|
||
Do you want to watch an interesting video about Ibis? Check this out: | ||
|
||
- [https://www.youtube.com/embed/8Tzh42mQjrw?start=1625](https://www.youtube.com/embed/8Tzh42mQjrw?start=1625) | ||
|
||
**Now, let's check out some of the work done here at Quansight over the | ||
last few months!** | ||
|
||
Over the last few months, **OmniSci** and **Quansight** worked | ||
together to add a backend to Ibis for **OmniSciDB** (formerly MapD | ||
Core)! In a few words, OmniSciDB is an in-memory, column store, SQL | ||
relational database designed from the ground up to run on GPUs. If you | ||
don't know this amazing database yet, I invite you to [check it | ||
out](https://omnisci.com). | ||
|
||
The implementation of this new backend also resulted in the creation of | ||
new expressions/operators in Ibis core, such as: | ||
|
||
- GeoSpatial data types and operations | ||
- Trigonometric operations | ||
- Some statistical operations | ||
|
||
First, let's connect to an *OmniSciDB* instance and play with these new features! | ||
|
||
``` python | ||
# install the dependencies if you need to! | ||
# !conda install -y ibis-framework=1.1.0 pyarrow pymapd vega geopandas geoalchemy2 shapely matplotlib --force-reinstall | ||
``` | ||
|
||
``` python | ||
import ibis | ||
from matplotlib import pyplot as plt | ||
|
||
print('ibis:', ibis.__version__) | ||
``` | ||
|
||
ibis: 1.2.0+7.g3afa8b0d | ||
|
||
``` python | ||
# metis.mapd.com is used in some OmniSci docs, | ||
# but you may want to install your own OmniSciDB instance; | ||
# see the installation section at | ||
# https://www.omnisci.com/docs/latest/ | ||
# you may also want to check the omniscidb-cpu conda package: | ||
# conda install -c conda-forge omniscidb-cpu | ||
# if you need any help, feel free to open an issue at | ||
# https://github.com/conda-forge/omniscidb-cpu-feedstock/ | ||
omniscidb_cli = ibis.mapd.connect( | ||
host='metis.mapd.com', | ||
user='mapd', | ||
password='HyperInteractive', | ||
port=443, | ||
database='mapd', | ||
protocol='https' | ||
) | ||
``` | ||
|
||
### GeoSpatial features | ||
|
||
Do you need to handle geospatial data in an easier way? | ||
|
||
Let's take a look inside the `zipcodes_2017` table! | ||
|
||
Well, the `omniscidb` backend doesn't currently support `geopandas` | ||
output, so let's use a workaround for that! It should be implemented in | ||
the `omniscidb` backend soon! (see: | ||
[gist-code](https://gist.githubusercontent.com/xmnlab/587dd1bde44850f3117a1087ed3f0f28/raw/0750400db90cf97319a91aa514648c31ad4ace45/omniscidb_geopandas_output.py)) | ||
|
||
``` python | ||
gist_url = 'https://gist.githubusercontent.com/xmnlab/587dd1bde44850f3117a1087ed3f0f28/raw/0750400db90cf97319a91aa514648c31ad4ace45/omniscidb_geopandas_output.py' | ||
!wget {gist_url} -O omniscidb_geopandas_output.py | ||
``` | ||
|
||
--2019-07-05 11:31:57-- [https://gist.githubusercontent.com/xmnlab/587dd1bde44850f3117a1087ed3f0f28/raw/0750400 | ||
db90cf97319a91aa514648c31ad4ace45/omniscidb_geopandas_output.py](https://gist.githubusercontent.com/xmnlab/587dd1bde44850f3117a1087ed3f0f28/raw/0750400db90cf97319a91aa514648c31ad4ace45/omniscidb_geopandas_output.py) | ||
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.48.133 | ||
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)[151.101.48.133]:443... connected. | ||
HTTP request sent, awaiting response... 200 OK | ||
Length: 1874 (1.8K) [text/plain] | ||
Saving to: “omniscidb_geopandas_output.py” | ||
|
||
omniscidb_geopandas 100%[===================>] 1.83K --.-KB/s in 0s | ||
|
||
2019-07-05 11:31:57 (70.1 MB/s) - “omniscidb_geopandas_output.py” saved [1874/1874] | ||
|
||
``` python | ||
# workaround to use geopandas output | ||
from omniscidb_geopandas_output import enable_geopandas_output | ||
enable_geopandas_output(omniscidb_cli) | ||
``` | ||
|
||
``` python | ||
t = omniscidb_cli.table('zipcodes_2017') | ||
display(t) | ||
``` | ||
|
||
![A DatabaseTable with its data types](/posts/ibis-python-data-analysis-productivity-framework/a0a51ad71e1a32140f3e47e71145e6a67d061750.png) | ||
|
||
``` python | ||
print('# rows:', t.count().execute()) | ||
``` | ||
|
||
# rows: 33144 | ||
|
||
This table has \~33k rows. For this example, let's use just the first | ||
1k rows. | ||
|
||
``` python | ||
expr = t[t.omnisci_geo].head(1000) | ||
df = expr.execute() | ||
``` | ||
|
||
Instead of fetching all the rows from the database and then taking the | ||
first 1000 from them, Ibis prepares a SQL statement that requests just | ||
the first 1000 rows! This reduces memory consumption to just the data you need! | ||
|
||
This is the query Ibis will send to the database: | ||
|
||
``` python | ||
print(expr.compile()) | ||
``` | ||
|
||
SELECT "omnisci_geo" | ||
FROM zipcodes_2017 | ||
LIMIT 1000 | ||
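
To see why pushing the `LIMIT` into the query matters, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for the remote database. That is an assumption for illustration only: OmniSciDB is not involved, and the table is filled with generated values, sized to match the row count printed above.

```python
import sqlite3

# In-memory stand-in for the remote table (generated values, 33144 rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zipcodes_2017 (zip INTEGER)")
conn.executemany(
    "INSERT INTO zipcodes_2017 VALUES (?)", [(i,) for i in range(33144)]
)

# Naive approach: transfer every row, then slice on the client side.
all_rows = conn.execute("SELECT zip FROM zipcodes_2017").fetchall()
first_naive = all_rows[:1000]

# Pushed-down approach: the database returns only the rows we asked for.
first_pushed = conn.execute("SELECT zip FROM zipcodes_2017 LIMIT 1000").fetchall()

print("rows transferred (naive):", len(all_rows))
print("rows transferred (LIMIT):", len(first_pushed))
```

Both approaches end up with 1000 rows in hand, but only the second one avoids moving the whole table to the client first.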
|
||
Of course, reading geospatial data as text wouldn't be very useful, so | ||
let's plot the result! | ||
|
||
**Remember: we are using geopandas here!** | ||
|
||
``` python | ||
# let's add some custom style :) | ||
style_kwds = { | ||
'linewidth': 2, | ||
'markersize': 2, | ||
'facecolor': 'red', | ||
'edgecolor': 'red' | ||
} | ||
|
||
df['omnisci_geo'].iloc[::3].plot(**style_kwds) | ||
plt.show() | ||
``` | ||
|
||
![A scatterplot graph clustered in the upper right corner. It is clustered from -120 to 80 on the X axis and from 40 to 60 on the Y axis.](/posts/ibis-python-data-analysis-productivity-framework/e62b7c1311b137ea2d1bfd6e7715369df26b2570.png) | ||
|
||
### Trigonometric operations | ||
|
||
Currently, the OmniSciDB backend supports the following trigonometric | ||
operations: `acos`, `asin`, `atan`, `atan2`, `cos`, `cot`, `sin`, `tan`. | ||
|
||
Let's check an example using a `sin` operation over `rowid` from | ||
`zipcodes_2017`. | ||
|
||
``` python | ||
# if you want to use a SQL statement, try the `sql` method! | ||
t = omniscidb_cli.sql('select rowid from zipcodes_2017') | ||
|
||
expr = t[t.rowid, t.rowid.sin().name('rowid_sin')].sort_by('rowid').head(100) | ||
expr.execute().rowid_sin.plot() | ||
plt.show() | ||
``` | ||
|
||
![A sine wave plot that repeats many times, with high and low points at 1.00 and -1.00. The X axis is 0 to 80.](/posts/ibis-python-data-analysis-productivity-framework/70f0a567ee713d1392bab6d8fef07bbe9777c033.png) | ||
|
||
### Some statistical operations | ||
|
||
The OmniSciDB Ibis backend also implements some statistical operations, | ||
such as: `Correlation (corr)`, `Standard Deviation (stddev)`, | ||
`Variance (var)` and `Covariance (cov)`. | ||
|
||
Let's check a pretty simple example: is there any correlation in | ||
this dataset between `per capita income` and `education`? | ||
|
||
``` python | ||
t = omniscidb_cli.table('demo_vote_clean') | ||
# remove some conflicting fields: 'TYPE', 'NAME', 'COUNTY' | ||
fields = [name for name in t.schema().names if name not in ('TYPE', 'NAME', 'COUNTY')] | ||
t = t[fields].distinct() | ||
t.PerCapitaIncome.corr(t.Education).execute() | ||
``` | ||
|
||
0.7212061029308654 | ||
|
||
The result `~0.72` means that `Per Capita Income` and `Education` have a | ||
positive correlation in this dataset. | ||
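
For readers who want to see what such a number measures, here is a plain-Python sketch of the Pearson correlation coefficient. It assumes that `corr` computes the usual Pearson coefficient, which this post does not confirm; the `pearson_corr` helper and the sample values are made up for illustration and are not taken from `demo_vote_clean`.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up sample values, only to exercise the function.
income = [20.0, 25.0, 30.0, 35.0, 40.0]
education = [10.0, 14.0, 13.0, 18.0, 17.0]

r = pearson_corr(income, education)
print(round(r, 4))
```

The coefficient always lies between -1 and 1: values near 1 mean the two columns tend to rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is no linear relationship.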
|
||
### Conclusions | ||
|
||
Ibis is a cool library that can help you in your data analysis tasks. If | ||
you already use pandas, it will be pretty easy to add Ibis to your | ||
workflow! | ||
|
||
So \... | ||
|
||
- Are you excited to use Ibis? [Try it out | ||
now](https://docs.ibis-project.org/getting-started.html)! | ||
- Have you already used Ibis? Reach out to me, | ||
[[email protected]]([email protected]), and share your experience! | ||
- Are you interested in contributing to Ibis? Check the [good first | ||
issues](https://github.com/ibis-project/ibis/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) | ||
label on GitHub! | ||
- Do you want to add new features and want to fund Ibis? Contact us at | ||
[[email protected]]([email protected]) |
Binary file added (+29.5 KB): ...ta-analysis-productivity-framework/70f0a567ee713d1392bab6d8fef07bbe9777c033.png
Binary file added (+15.7 KB): ...ta-analysis-productivity-framework/a0a51ad71e1a32140f3e47e71145e6a67d061750.png
1 change: 1 addition & 0 deletions: ...ic/posts/ibis-python-data-analysis-productivity-framework/blog_feature_var2.svg
1 change: 1 addition & 0 deletions: ...ublic/posts/ibis-python-data-analysis-productivity-framework/blog_hero_var1.svg
Binary file added (+8.3 KB): ...ta-analysis-productivity-framework/e62b7c1311b137ea2d1bfd6e7715369df26b2570.png
Our MDX compiler does not recognize indented text as a code block, only fenced text.