From c8989d941bef026a001c06a90ae85cde44c367ed Mon Sep 17 00:00:00 2001
From: Matthew Rocklin
Date: Fri, 15 Dec 2023 09:12:03 -0800
Subject: [PATCH] Update dataframe page.

---
 docs/source/dataframe-api.rst     |   4 +-
 docs/source/dataframe-create.rst  |   8 +-
 docs/source/dataframe-design.rst  |   4 +-
 docs/source/dataframe-extra.rst   |  14 ++
 docs/source/dataframe-groupby.rst |   4 +-
 docs/source/dataframe.rst         | 343 ++++++++++++++----------------
 6 files changed, 187 insertions(+), 190 deletions(-)
 create mode 100644 docs/source/dataframe-extra.rst

diff --git a/docs/source/dataframe-api.rst b/docs/source/dataframe-api.rst
index 73723fc2c8d..cbaa4107647 100644
--- a/docs/source/dataframe-api.rst
+++ b/docs/source/dataframe-api.rst
@@ -1,5 +1,5 @@
-API
----
+Dask DataFrame API
+==================
 
 .. currentmodule:: dask.dataframe
 
diff --git a/docs/source/dataframe-create.rst b/docs/source/dataframe-create.rst
index 811953f75ff..c2cabf2cbb5 100644
--- a/docs/source/dataframe-create.rst
+++ b/docs/source/dataframe-create.rst
@@ -1,5 +1,5 @@
-Create and Store Dask DataFrames
-================================
+Load and Save Data with Dask DataFrames
+=======================================
 
 .. meta::
    :description: Learn how to create DataFrames and store them. Create a Dask DataFrame from various data storage formats like CSV, HDF, Apache Parquet, and others.
@@ -67,7 +67,7 @@ Read from CSV
 
 You can use :func:`read_csv` to read one or more CSV files into a Dask
 DataFrame. It supports loading multiple files at once using globstrings:
- 
+
 .. code-block:: python
 
     >>> df = dd.read_csv('myfiles.*.csv')
@@ -76,7 +76,7 @@ You can break up a single large file with the ``blocksize`` parameter:
 
 .. code-block:: python
 
-    >>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks
+    >>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks
 
 Changing the ``blocksize`` parameter will change the number of partitions (see the
 explanation on :ref:`partitions `). A good rule of thumb when working with
diff --git a/docs/source/dataframe-design.rst b/docs/source/dataframe-design.rst
index 48a63bfbc2a..95046456e2e 100644
--- a/docs/source/dataframe-design.rst
+++ b/docs/source/dataframe-design.rst
@@ -1,7 +1,7 @@
 .. _dataframe.design:
 
-Internal Design
-===============
+Dask DataFrame Design
+=====================
 
 Dask DataFrames coordinate many Pandas DataFrames/Series arranged along an
 index. We define a Dask DataFrame object with the following components:
diff --git a/docs/source/dataframe-extra.rst b/docs/source/dataframe-extra.rst
new file mode 100644
index 00000000000..41126dd9092
--- /dev/null
+++ b/docs/source/dataframe-extra.rst
@@ -0,0 +1,14 @@
+Additional Information
+======================
+
+.. toctree::
+   :maxdepth: 1
+
+   Parquet <dataframe-parquet.rst>
+   Indexing <dataframe-indexing.rst>
+   SQL <dataframe-sql.rst>
+   Join Performance <dataframe-joins.rst>
+   Shuffling Performance <dataframe-groupby.rst>
+   dataframe-categoricals.rst
+   Extend <dataframe-extend.rst>
+   Hive Partitioning <dataframe-hive.rst>
diff --git a/docs/source/dataframe-groupby.rst b/docs/source/dataframe-groupby.rst
index 5b81f90ac34..03261b3ad36 100644
--- a/docs/source/dataframe-groupby.rst
+++ b/docs/source/dataframe-groupby.rst
@@ -1,5 +1,5 @@
-Shuffling for GroupBy and Join
-==============================
+Shuffling Performance
+=====================
 
 .. currentmodule:: dask.dataframe
 
diff --git a/docs/source/dataframe.rst b/docs/source/dataframe.rst
index 1278411c633..d2a3610c59f 100644
--- a/docs/source/dataframe.rst
+++ b/docs/source/dataframe.rst
@@ -8,198 +8,181 @@ Dask DataFrame
    :maxdepth: 1
    :hidden:
 
-   dataframe-create.rst
+   Load and Save Data <dataframe-create.rst>
+   Internal Design <dataframe-design.rst>
    Best Practices <dataframe-best-practices.rst>
-   dataframe-design.rst
-   dataframe-groupby.rst
-   dataframe-joins.rst
-   dataframe-indexing.rst
-   dataframe-categoricals.rst
-   dataframe-extend.rst
-   dataframe-parquet.rst
-   dataframe-hive.rst
-   dataframe-sql.rst
-   dataframe-api.rst
-
-A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas
-DataFrames, split along the index. These pandas DataFrames may live on disk
-for larger-than-memory computing on a single machine, or on many different
-machines in a cluster. One Dask DataFrame operation triggers many operations
-on the constituent pandas DataFrames.
+   API <dataframe-api.rst>
+   dataframe-extra.rst
 
-.. raw:: html
-
+.. grid:: 1 1 2 2
 
-Examples
---------
+    .. grid-item::
+        :columns: 12 12 8 8
 
-Visit https://examples.dask.org/dataframe.html to see and run examples using
-Dask DataFrame.
+        Dask DataFrame helps you process large tabular data by parallelizing pandas,
+        either on your laptop for larger-than-memory computing, or on a distributed
+        cluster of computers.
 
-.. image:: images/dask-dataframe.svg
-   :alt: Column of four squares collectively labeled as a Dask DataFrame with a single constituent square labeled as a pandas DataFrame.
-   :width: 45%
-   :align: right
+        - **Just pandas:** Dask DataFrames are just many pandas DataFrames.
+          The API is the same. The execution is the same.
+        - **Large scale:** Works on 100 GiB on a laptop, or 100 TiB on a cluster.
+        - **Easy to use:** Pure Python, easy to set up and debug.
 
-Design
-------
+    .. grid-item::
+        :columns: 12 12 4 4
+
+        .. image:: images/dask-dataframe.svg
+            :alt: Column of four squares collectively labeled as a Dask DataFrame with a single constituent square labeled as a pandas DataFrame.
 
 Dask DataFrames coordinate many pandas DataFrames/Series arranged along the
 index. A Dask DataFrame is partitioned *row-wise*, grouping rows by index value
 for efficiency. These pandas objects may live on disk or on other machines.
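+
+For example, you can inspect how a DataFrame is partitioned. This is a small
+sketch whose outputs are illustrative; it assumes a Dask DataFrame ``df`` with
+known divisions has already been created:
+
+.. code-block:: python
+
+    >>> df.npartitions                        # number of pandas DataFrames backing df
+    4
+    >>> df.divisions                          # index values at the partition boundaries
+    (0, 25, 50, 75, 99)
+    >>> type(df.partitions[0].compute())      # each partition is an ordinary pandas DataFrame
+    <class 'pandas.core.frame.DataFrame'>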
 
-Dask DataFrame copies the pandas DataFrame API
-----------------------------------------------
-
-Because the ``dask.DataFrame`` application programming interface (API) is a
-subset of the ``pd.DataFrame`` API, it should be familiar to pandas users.
-There are some slight alterations due to the parallel nature of Dask:
-
-.. grid:: 2
-
-    .. grid-item-card:: Dask DataFrame API
-
-        .. code-block:: python
-
-            >>> import dask.dataframe as dd
-            >>> df = dd.read_csv('2014-*.csv')
-            >>> df.head()
-               x  y
-            0  1  a
-            1  2  b
-            2  3  c
-            3  4  a
-            4  5  b
-            5  6  c
-
-            >>> df2 = df[df.y == 'a'].x + 1
-            >>> df2.compute()
-            0    2
-            3    5
-            Name: x, dtype: int64
-
-    .. grid-item-card:: pandas DataFrame API
-
-        .. code-block:: python
-
-            >>> import pandas as pd
-            >>> df = pd.read_csv('2014-1.csv')
-            >>> df.head()
-               x  y
-            0  1  a
-            1  2  b
-            2  3  c
-            3  4  a
-            4  5  b
-            5  6  c
-
-            >>> df2 = df[df.y == 'a'].x + 1
-            >>> df2
-            0    2
-            3    5
-            Name: x, dtype: int64
+From Pandas to Dask
+-------------------
+
+Dask DataFrame copies the pandas API, and so should be familiar to most users.
+
+.. tab-set::
+
+    .. tab-item:: Load Data
+
+        pandas and Dask have the same API, so switching from one to the other
+        is easy:
+
+        .. grid:: 1 1 2 2
+
+            .. grid-item::
+
+                .. code-block:: python
+
+                    >>> import pandas as pd
+
+                    >>> df = pd.read_parquet('s3://mybucket/myfile.parquet')
+                    >>> df.head()
+                    0  1  a
+                    1  2  b
+                    2  3  c
+
+            .. grid-item::
+
+                .. code-block:: python
+
+                    >>> import dask.dataframe as dd
+
+                    >>> df = dd.read_parquet('s3://mybucket/myfile.*.parquet')
+                    >>> df.head()
+                    0  1  a
+                    1  2  b
+                    2  3  c
+
+    .. tab-item:: Data Processing
+
+        Dask runs pandas operations in parallel. Dask is lazy; when you want an
+        in-memory result, add ``.compute()``.
+
+        .. grid:: 1 1 2 2
+
+            .. grid-item::
+
+                .. code-block:: python
+
+                    >>> import pandas as pd
+
+                    >>> df = df[df.value >= 0]
+                    >>> joined = df.merge(other, on="account")
+                    >>> result = joined.groupby("account").value.mean()
+
+                    >>> result
+                    alice 123
+                    bob 456
+
+            .. grid-item::
+
+                .. code-block:: python
+
+                    >>> import dask.dataframe as dd
+
+                    >>> df = df[df.value >= 0]
+                    >>> joined = df.merge(other, on="account")
+                    >>> result = joined.groupby("account").value.mean()
+
+                    >>> result.compute()
+                    alice 123
+                    bob 456
+
+    .. tab-item:: Machine Learning
+
+        Machine learning libraries often have Dask submodules that
+        expect Dask DataFrames and operate in parallel:
+
+        .. grid:: 1 1 2 2
+
+            .. grid-item::
+
+                .. code-block:: python
+
+                    >>> import pandas as pd
+                    >>> import xgboost
+                    >>> from sklearn.model_selection import train_test_split
+
+                    >>> X_train, X_test, y_train, y_test = train_test_split(
+                    ...     X, y, test_size=0.2,
+                    ... )
+                    >>> dtrain = xgboost.DMatrix(X_train, label=y_train)
+
+                    >>> xgboost.train(params, dtrain, 100)
+
+            .. grid-item::
+
+                .. code-block:: python
+
+                    >>> import dask.dataframe as dd
+                    >>> import xgboost.dask
+                    >>> from dask_ml.model_selection import train_test_split
+
+                    >>> X_train, X_test, y_train, y_test = train_test_split(
+                    ...     X, y, test_size=0.2,
+                    ... )
+                    >>> dtrain = xgboost.dask.DaskDMatrix(client, X_train, y_train)
+
+                    >>> xgboost.dask.train(client, params, dtrain, 100)
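+
+You can also convert explicitly between the two libraries. The snippet below
+is a minimal sketch and assumes an existing in-memory pandas DataFrame named
+``pdf``:
+
+.. code-block:: python
+
+    >>> import dask.dataframe as dd
+
+    >>> df = dd.from_pandas(pdf, npartitions=4)   # pandas -> Dask
+    >>> pdf2 = df.compute()                       # Dask -> pandas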
 
 As with all Dask collections, you trigger computation by calling the
-``.compute()`` method.
-
-Common Uses and Anti-Uses
--------------------------
-
-Dask DataFrame is used in situations where pandas is commonly needed, usually when
-pandas fails due to data size or computation speed:
-
-- Manipulating large datasets, even when those datasets don't fit in memory
-- Accelerating long computations by using many cores
-- Distributed computing on large datasets with standard pandas operations like
-  groupby, join, and time series computations
-
-Dask DataFrame may not be the best choice in the following situations:
-
-* If your dataset fits comfortably into RAM on your laptop, then you may be
-  better off just using pandas. There may be simpler ways to improve
-  performance than through parallelism
-* If your dataset doesn't fit neatly into the pandas tabular model, then you
-  might find more use in :doc:`dask.bag ` or :doc:`dask.array `
-* If you need functions that are not implemented in Dask DataFrame, then you
-  might want to look at :doc:`dask.delayed ` which offers more
-  flexibility
-* If you need a proper database with all that databases offer you might prefer
-  something like Postgres_
-
-.. _Postgres: https://www.postgresql.org/
-
-
-Scope
------
-
-Dask DataFrame covers a well-used portion of the pandas API.
-The following class of computations works well:
-
-* Trivially parallelizable operations (fast):
-   * Element-wise operations: ``df.x + df.y``, ``df * df``
-   * Row-wise selections: ``df[df.x > 0]``
-   * Loc: ``df.loc[4.0:10.5]``
-   * Common aggregations: ``df.x.max()``, ``df.max()``
-   * Is in: ``df[df.x.isin([1, 2, 3])]``
-   * Date time/string accessors: ``df.timestamp.month``
-* Cleverly parallelizable operations (fast):
-   * groupby-aggregate (with common aggregations): ``df.groupby(df.x).y.max()``,
-     ``df.groupby('x').min()`` (see :ref:`dataframe.groupby.aggregate`)
-   * groupby-apply on index: ``df.groupby(['idx', 'x']).apply(myfunc)``, where
-     ``idx`` is the index level name
-   * value_counts: ``df.x.value_counts()``
-   * Drop duplicates: ``df.x.drop_duplicates()``
-   * Join on index: ``dd.merge(df1, df2, left_index=True, right_index=True)``
-     or ``dd.merge(df1, df2, on=['idx', 'x'])`` where ``idx`` is the index
-     name for both ``df1`` and ``df2``
-   * Join with pandas DataFrames: ``dd.merge(df1, df2, on='id')``
-   * Element-wise operations with different partitions / divisions: ``df1.x + df2.y``
-   * Date time resampling: ``df.resample(...)``
-   * Rolling averages: ``df.rolling(...)``
-   * Pearson's correlation: ``df[['col1', 'col2']].corr()``
-* Operations requiring a shuffle (slow-ish, unless on index, see :doc:`dataframe-groupby`)
-   * Set index: ``df.set_index(df.x)``
-   * groupby-apply not on index (with anything): ``df.groupby(df.x).apply(myfunc)``
-   * Join not on the index: ``dd.merge(df1, df2, on='name')``
-
-However, Dask DataFrame does not implement the entire pandas interface. Users
-expecting this will be disappointed. Notably, Dask DataFrame has the following
-limitations:
-
-1. Setting a new index from an unsorted column is expensive
-2. Many operations like groupby-apply and join on unsorted columns require
-   setting the index, which as mentioned above, is expensive
-3. The pandas API is very large. Dask DataFrame does not attempt to implement
-   many pandas features or any of the more exotic data structures like NDFrames
-4. Operations that were slow on pandas, like iterating through row-by-row,
-   remain slow on Dask DataFrame
-
-See the :doc:`DataFrame API documentation` for a more extensive list.
-
-
-Execution
----------
-
-By default, Dask DataFrame uses the :ref:`multi-threaded scheduler `.
-This exposes some parallelism when pandas or the underlying NumPy operations
-release the global interpreter lock (GIL). Generally, pandas is more GIL
-bound than NumPy, so multi-core speed-ups are not as pronounced for
-Dask DataFrame as they are for Dask Array. This is particularly true
-for string-heavy Python DataFrames, as Python strings are GIL bound.
-
-There has been recent work on changing the underlying representation
-of pandas string data types to be backed by
-`PyArrow Buffers `_, which
-should release the GIL, however, this work is still considered experimental.
-
-When dealing with text data, you may see speedups by switching to the
-:doc:`distributed scheduler ` either on a cluster or
-single machine.
+``.compute()`` method or persist data in distributed memory with the
+``.persist()`` method.
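+
+For example (a minimal sketch, assuming a Dask DataFrame ``df`` with a numeric
+``value`` column; the output shown is illustrative):
+
+.. code-block:: python
+
+    >>> df = df.persist()             # start computing and keep results in memory
+    >>> df.value.mean().compute()     # block until a concrete number comes back
+    42.5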
+
+When not to use Dask DataFrames
+-------------------------------
+
+Dask DataFrames are often used when ...
+
+1. Your data is too big
+2. Your computation is too slow and other techniques don't work
+
+You should probably stick to just using pandas if ...
+
+1. Your data is small
+2. Your computation is fast (subsecond)
+3. There are simpler ways to accelerate your computation, like avoiding
+   ``.apply`` or Python for loops and using a built-in pandas method instead
+   (see the sketch below)
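+
+As an illustration of that last point, compare the two approaches in this
+hypothetical pandas-only sketch:
+
+.. code-block:: python
+
+    >>> import pandas as pd
+
+    >>> df = pd.DataFrame({"value": range(1_000_000)})
+
+    >>> slow = df["value"].apply(lambda x: x + 1)   # slow: calls Python once per row
+    >>> fast = df["value"] + 1                      # fast: one vectorized operation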
+
+Examples
+--------
+
+Visit https://examples.dask.org/dataframe.html to see and run examples using
+Dask DataFrame.
+
+.. raw:: html
+