From 07309e13ca2f225ce93ba879ba8e2d73293a1529 Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Tue, 5 Dec 2023 13:03:38 -0800 Subject: [PATCH] Update landing page This rips out most of the previous content. It focuses less on Dask's internal design / intent / architecture / goals, and more on how to use the project. I think (hope?) that the design also comes across a little more cleanly. --- docs/source/index.rst | 350 +++++++++++++++++++++++++++++++----------- 1 file changed, 261 insertions(+), 89 deletions(-) diff --git a/docs/source/index.rst b/docs/source/index.rst index a31b63e50a4..1a86eb7de04 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -2,132 +2,304 @@ Dask ==== -*Dask is a flexible library for parallel computing in Python.* +*Dask is a Python library for parallel and distributed computing.* Dask is ... -Dask is composed of two parts: +- **Easy** to use and set up (it's just a Python library) +- **Powerful** at providing scale, and unlocking complex algorithms + scale up +- and **Fun** 🎉 -1. **Dynamic task scheduling** optimized for computation. This is similar to - *Airflow, Luigi, Celery, or Make*, but optimized for interactive - computational workloads. -2. **"Big Data" collections** like parallel arrays, dataframes, and lists that - extend common interfaces like *NumPy, Pandas, or Python iterators* to - larger-than-memory or distributed environments. These parallel collections - run on top of dynamic task schedulers. +How to Use Dask +--------------- -Dask emphasizes the following virtues: +Dask provides several APIs. Choose one that works best for you: -* **Familiar**: Provides parallelized NumPy array and Pandas DataFrame objects -* **Flexible**: Provides a task scheduling interface for more custom workloads - and integration with other projects. -* **Native**: Enables distributed computing in pure Python with access to - the PyData stack. -* **Fast**: Operates with low overhead, low latency, and minimal serialization - necessary for fast numerical algorithms -* **Scales up**: Runs resiliently on clusters with 1000s of cores -* **Scales down**: Trivial to set up and run on a laptop in a single process -* **Responsive**: Designed with interactive computing in mind, it provides rapid - feedback and diagnostics to aid humans +.. tab-set:: -| + .. tab-item:: Tasks -.. figure:: images/dask-overview.svg - :alt: Dask is composed of three parts. "Collections" create "Task Graphs" which are then sent to the "Scheduler" for execution. There are two types of schedulers that are described in more detail below. - :align: center + Dask Futures parallelize arbitrary for-loop style Python code, + providing: - High level collections are used to generate task graphs which can be executed by schedulers on a single machine or a cluster. + - **Flexible** tooling allowing you to construct custom + pipelines and workflows + - **Powerful** scaling techniques, processing several thousand + tasks per second + - **Responsive** feedback allowing for intuitive execution, + and helpful dashboards -| + Dask futures form the foundation for other Dask work -See the `dask.distributed documentation (separate website) -`_ for more technical information -on Dask's distributed scheduler. + Learn more at :bdg-link-primary:`Futures Documentation ` -Familiar user interface ------------------------ + .. grid:: 1 1 2 2 -**Dask DataFrame** mimics Pandas - :doc:`documentation ` + .. grid-item:: + :columns: 12 12 7 7 -.. code-block:: python + .. code-block:: python - import pandas as pd import dask.dataframe as dd - df = pd.read_csv('2015-01-01.csv') df = dd.read_csv('2015-*-*.csv') - df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute() + from dask.distributed import LocalCluster + client = LocalCluster().get_client() -**Dask Array** mimics NumPy - :doc:`documentation ` + # Submit work to happen in parallel + results = [] + for filename in filenames: + data = client.submit(load, filename) + result = client.submit(process, data) + results.append(result) -.. code-block:: python + # Gather results back to local computer + results = client.gather(results) - import numpy as np import dask.array as da - f = h5py.File('myfile.hdf5') f = h5py.File('myfile.hdf5') - x = np.array(f['/small-data']) x = da.from_array(f['/big-data'], - chunks=(1000, 1000)) - x - x.mean(axis=1) x - x.mean(axis=1).compute() + .. grid-item:: + :columns: 12 12 5 5 -**Dask Bag** mimics iterators, Toolz, and PySpark - :doc:`documentation ` + .. figure:: images/futures-graph.png + :align: center -.. code-block:: python + .. tab-item:: DataFrames - import dask.bag as db - b = db.read_text('2015-*-*.json.gz').map(json.loads) - b.pluck('name').frequencies().topk(10, lambda pair: pair[1]).compute() + Dask Dataframes parallelize the popular pandas library, providing: -**Dask Delayed** mimics for loops and wraps custom code - :doc:`documentation ` + - **Larger-than-memory** execution for single machines, allowing you + to process data that is larger than your available RAM + - **Parallel** execution for faster processing + - **Distributed** computation for terabyte-sized datasets -.. code-block:: python + Dask Dataframes are similar in this regard to Apache Spark, but use the + familiar pandas API and memory model. One Dask dataframe is simply a + collection of pandas dataframes on different computers. - from dask import delayed - L = [] - for fn in filenames: # Use for loops to build up computation - data = delayed(load)(fn) # Delay execution of function - L.append(delayed(process)(data)) # Build connections between variables + Learn more at :bdg-link-primary:`DataFrame Documentation ` - result = delayed(summarize)(L) - result.compute() + .. grid:: 1 1 2 2 -The **concurrent.futures** interface provides general submission of custom -tasks: - :doc:`documentation ` + .. grid-item:: + :columns: 12 12 7 7 -.. code-block:: python + .. code-block:: python - from dask.distributed import Client - client = Client('scheduler:port') + import dask.dataframe as dd - futures = [] - for fn in filenames: - future = client.submit(load, fn) - futures.append(future) + # Read large datasets in parallel + df = dd.read_parquet("s3://mybucket/data.*.parquet") + df = df[df.value < 0] + result = df.groupby(df.name).amount.mean() - summary = client.submit(summarize, futures) - summary.result() + result = result.compute() # Compute to get pandas result + result.plot() + .. grid-item:: + :columns: 12 12 5 5 -Scales from laptops to clusters -------------------------------- + .. figure:: images/dask-dataframe.svg + :align: center -Dask is convenient on a laptop. It :doc:`installs ` trivially with -``conda`` or ``pip`` and extends the size of convenient datasets from "fits in -memory" to "fits on disk". -Dask can scale to a cluster of 100s of machines. It is resilient, elastic, data -local, and low latency. For more information, see the documentation about the -`distributed scheduler`_. + .. tab-item:: Arrays -This ease of transition between single-machine to moderate cluster enables -users to both start simple and grow when necessary. + Dask Arrays parallelize the popular NumPy library, providing: + - **Larger-than-memory** execution for single machines, allowing you + to process data that is larger than your available RAM + - **Parallel** execution for faster processing + - **Distributed** computation for terabyte-sized datasets -Complex Algorithms + Dask Arrays allow scientists and researchers to perform intuitive and + sophisticated operations on large datasets but use the + familiar NumPy API and memory model. One Dask array is simply a + collection of numpy arrays on different computers. + + Learn more at :bdg-link-primary:`Array Documentation ` + + .. grid:: 1 1 2 2 + + .. grid-item:: + + .. code-block:: python + + import dask.array as da + + x = da.random.random((10000, 10000)) + y = (x + x.T) - x.mean(axis=1) + + z = y.var(axis=0).compute() + + .. grid-item:: + :columns: 12 12 5 5 + + .. figure:: images/dask-array.svg + :align: center + + Xarray wraps Dask array and is a popular downstream project, providing + labeled axes and simultaneously tracking many Dask arrays together, + resulting in more intuitive analyses. Xarray is popular and accounts + for the majority of Dask array use today especially within geospatial + and imaging communities. + + Learn more at :bdg-link-primary:`Xarray Documentation ` + + .. grid:: 1 1 2 2 + + .. grid-item:: + + .. code-block:: python + + import xarray as xr + + ds = xr.open_mfdataset("data/*.nc") + da.groupby('time.month').mean('time').compute() + + + .. grid-item:: + :columns: 12 12 5 5 + + .. figure:: https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png + :align: center + +How to Install Dask +------------------- + +Installing Dask is easy with ``pip`` or ``conda`` + +Learn more at :bdg-link-primary:`Install Documentation ` + +.. tab-set:: + + .. tab-item:: pip + + .. code-block:: + + pip install dask[complete] + + .. tab-item:: conda + + .. code-block:: + + conda install dask + +How to Deploy Dask ------------------ -Dask represents parallel computations with :doc:`task graphs`. These -directed acyclic graphs may have arbitrary structure, which enables both -developers and users the freedom to build sophisticated algorithms and to -handle messy situations not easily managed by the ``map/filter/groupby`` -paradigm common in most data engineering frameworks. +You can then use Dask on a single machine, or deploy it on distributed hardware + +Learn more at :bdg-link-primary:`Deploy Documentation ` + +.. tab-set:: + + .. tab-item:: Local + + Dask can set itself up easily in your Python session if you create a + ``LocalCluster`` object, which sets everything up for you. + + .. code-block:: python + + from dask.distributed import LocalCluster + cluster = LocalCluster() + client = cluster.get_client() + + # Normal Dask work ... + + Alternatively, you can skip this part, and Dask will operate within a + thread pool contained entirely with your local process. + + .. tab-item:: Kubernetes + + The `dask-kubernetes project `_ provides + a Dask Kubernetes Operator. + + .. code-block:: python + + from dask_kubernetes.operator import KubeCluster + cluster = KubeCluster( + name="my-dask-cluster", + image='ghcr.io/dask/dask:latest' + ) + cluster.scale(10) + + Learn more at :bdg-link-primary:`Dask Kubernetes Documentation ` + + .. tab-item:: HPC + + The `dask-jobqueue project `_ interfaces + with popular job submission projects, like SLURM, PBS, SGE, LSF, + Torque, Condor, and others. + + + .. code-block:: python + + from dask_jobqueue import SLURMCluster + + cluster = SLURMCluster() + cluster.scale(jobs=10) + + + You can also deploy Dask with MPI + + .. code-block:: python + + # myscript.py + from dask_mpi import initialize + initialize() + + from dask.distributed import Client + client = Client() # Connect this local process to remote workers + + .. code-block:: + + $ mpirun -np 4 python myscript.py + + Learn more at :bdg-link-primary:`Dask Jobqueue Documentation ` and the :bdg-link-primary:`Dask MPI Documentation `. + + .. tab-item:: Cloud + + The `dask-cloudprovider project `_ interfaces + with popular cloud platforms like AWS, GCP, Azure, and Digital Ocean. + + .. code-block:: python + + from dask_cloudprovider.aws import FargateCluster + cluster = FargateCluster( + # Cluster manager specific config kwargs + ) + + Learn more at :bdg-link-primary:`Dask CloudProvider Documentation ` + + .. tab-item:: Cloud-SaaS + + Several companies offer commercial Dask products. These are not open + source, but tend to be easier to set up and use, safer, cheaper, etc.. + Here is an incomplete list. + + - `**Coiled** `_ provides a standalone Dask deployment product that works + in AWS and GCP. Coiled notably employs most of the active Dask + maintainers today. + + Learn more at :bdg-link-primary:`Coiled ` + + - `**Saturn Cloud** `_ provides Dask as part of their hosted platform + including Jupyter and other products. + + Learn more at :bdg-link-primary:`Saturn Cloud ` + +Learn with Examples +------------------- + +Dask use is widespread, across all industries and scales. Dask is used +anywhere Python is used and people experience pain due to large scale data, or +intense computing. + +You can learn more about Dask applications at the following sources: + +- `Dask Examples `_ +- `Dask YouTube Channel `_ + +Additionally, we encourage you to look through the reference documentation on +this website related to the API that most closely matches your application. -We originally needed this complexity to build complex algorithms for -n-dimensional arrays but have found it to be equally valuable when dealing with -messy situations in everyday problems. +Dask was designed to be **easy to use** and **powerful**. We hope that it's +able to help you have fun with your work. .. toctree:: :maxdepth: 1 @@ -143,7 +315,7 @@ messy situations in everyday problems. .. toctree:: :maxdepth: 1 :hidden: - :caption: Fundamentals + :caption: API array.rst bag.rst