adding advanced indexing

xarray-contrib · Jul 1, 2024 · 473ac50 · 473ac50
1 parent fd4e240
commit 473ac50
Showing 1 changed file with 119 additions and 42 deletions.
diff --git a/intermediate/indexing/advanced-indexing.ipynb b/intermediate/indexing/advanced-indexing.ipynb
@@ -8,7 +8,8 @@
     "\n",
     "## Learning Objectives\n",
     "\n",
-    "* Orthogonal vs. Pointwise Indexing"
+    "* Orthogonal vs. Pointwise (Vectorized) Indexing\n",
+    "* Pointwise indexing in Xarray to extract data at a collection of points."
    ]
   },
   {
@@ -17,18 +18,9 @@
    "source": [
     "## Overview\n",
     "\n",
-    "In the previous notebooks, we learned basic forms of indexing with Xarray (positional and name based dimensions, integer and label based indexing), datetime Indexing, and nearest neighbor lookups. Xarray positional indexing deviates from the NumPy when indexing with multiple arrays like `arr[[0, 1], [0, 1]]`.\n",
+    "In the previous notebooks, we learned basic forms of indexing with Xarray, including positional and label-based indexing, datetime indexing, and nearest neighbor lookups. We also learned that indexing an Xarray DataArray directly works (mostly) like it does for NumPy arrays; however, Xarray indexing behvaior deviates from NumPy when using multiple arrays for indexing, like `arr[[0, 1], [0, 1]]`.\n",
     "\n",
-    "In this tutorial we learn about this difference and how to do vectorized/pointwise indexing using Xarray.\n",
-    "\n",
-    "For this notebook, first, we should learn about orthogonal (i.e. outer) and vectorized (i.e. pointwise) indexing concepts. \n",
-    "\n",
-    "* *Orthogonal* or *outer* indexing allows for indexing along each dimension independently, treating the indexers as one-dimensional arrays. The principle of outer or orthogonal indexing is that the result mirrors the effect of independently indexing along each dimension with integer or boolean arrays, treating both the indexed and indexing arrays as one-dimensional. This method of indexing is analogous to vector indexing in programming languages like MATLAB, Fortran, and R, where each indexer component *independently* selects along its corresponding dimension. This is the default behavior in Xarray.\n",
-    "\n",
-    "* *Vectorized* indexing is a more general form of indexing that allows for arbitrary combinations of indexing arrays. This method of indexing is analogous to the broadcasting rules in NumPy, where the dimensions of the indexers are aligned and the result is determined by the shape of the indexers. This is the default behavior in NumPy.  \n",
-    "\n",
-    "\n",
-    "We can better understand this with an example: "
+    "To better understand this difference, let's take a look at an example of 2D 5x5 array:"
    ]
   },
   {
@@ -63,6 +55,13 @@
     "da"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let's see how the indexing behavior is different between NumPy array and Xarray DataArray when indexing with mutliple arrays:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -85,7 +84,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The image below summarizes the difference between orthogonal and vectorized indexing for a 2D 5x5 array. \n",
+    "The image below summarizes the difference between vectorized and orthogonal indexing for a 2D 5x5 NumPy array and Xarray DataArray:\n",
     "\n",
     "\n",
     "\n",
@@ -96,12 +95,21 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    " Point-wise indexing, shown on the left, selects specific elements at given coordinates, resulting in an array of those individual elements. In the example shown, the indices `[0, 2, 4]`, `[0, 2, 4]` select the elements at positions (0, 0), (2, 2), and (4, 4), resulting in the values `[1, 13, 25]`. This is shown in NumPy indexing example. \n",
+    "**Pointwise** or **Vectorized indexing**, shown on the left, selects specific elements at given coordinates, resulting in an array of those individual elements. In the example shown, the indices `[0, 2, 4]`, `[0, 2, 4]` select the elements at positions `(0, 0)`, `(2, 2)`, and `(4, 4)`, resulting in the values `[1, 13, 25]`. This is the default behavior of NumPy arrays.\n",
+    "  \n",
+    "In contrast, **orthogonal indexing** uses the same indices to select entire rows and columns, forming a cross-product of the specified indices. This method results in sub-arrays that include all combinations of the selected rows and columns. The example demonstrates this by selecting rows 0, 2, and 4 and columns 0, 2, and 4, resulting in a subarray containing `[[1, 3, 5], [11, 13, 15], [21, 23, 25]]`. This is Xarray DataArray's default behavior.\n",
+    " \n",
+    "The output of vectorized indexing is a `1D array`,  while the output of orthogonal indexing is a `3x3` array.   \n",
     "\n",
     "\n",
-    " In contrast, **orthogonal indexing** uses the same indices to select entire rows and columns, forming a cross-product of the specified indices. This method results in subarrays that include all combinations of the selected rows and columns. The example demonstrates this by selecting rows 0, 2, and 4 and columns 0, 2, and 4, resulting in a subarray containing `[[1, 3, 5], [11, 13, 15], [21, 23, 25]]`. This is shown in Xarray indexing example.\n",
-    " \n",
-    "  The output of orthogonal indexing is a 3x3 array, while the output of vectorized indexing is a 1D array."
+    ":::{tip}  To Summarize: \n",
+    "\n",
+    "- *Pointwise* or *vectorized* indexing is a more general form of indexing that allows for arbitrary combinations of indexing arrays. This method of indexing is analogous to the broadcasting rules in NumPy, where the dimensions of the indexers are aligned and the result is determined by the shape of the indexers. This is the default behavior in NumPy.\n",
+    "\n",
+    "- *Orthogonal* or *outer* indexing allows for indexing along each dimension independently, treating the indexers as one-dimensional arrays. The principle of outer or orthogonal indexing is that the result mirrors the effect of independently indexing along each dimension with integer or boolean arrays, treating both the indexed and indexing arrays as one-dimensional. This method of indexing is analogous to vector indexing in programming languages like MATLAB, Fortran, and R, where each indexer component independently selects along its corresponding dimension. This is the default behavior in Xarray.\n",
+    "\n",
+    "\n",
+    ":::"
    ]
   },
   {
@@ -110,7 +118,9 @@
    "source": [
     "## Orthogonal Indexing in Xarray\n",
     "\n",
-    "If you only provide integers, slices, or unlabeled arrays (array without dimension names, such as `np.ndarray`, `list`, but not `DataArray()`) indexing can be understood as orthogonally (i.e. along independent axes, instead of using NumPy’s broadcasting rules to vectorize indexers). In the example above we saw this behavior, but let's see it in action with a real dataset."
+    "As explained earlier, when you use only integers, slices, or unlabeled arrays (arrays without dimension names, such as `np.ndarray` or `list`, but not `DataArray`) to index an `Xarray DataArray`, Xarray interprets these indexers orthogonally. This means it indexes along independent axes, rather than using NumPy's broadcasting rules to vectorize the indexers. \n",
+    "\n",
+    "In the example above we saw this behavior, but let's see this behavior in action with a real dataset. Here we’ll use `air temperature` data from the National Center for Environmental Prediction:"
    ]
   },
   {
@@ -122,7 +132,6 @@
    "outputs": [],
    "source": [
     "import numpy as np\n",
-    "import pandas as pd\n",
     "import xarray as xr\n",
     "\n",
     "\n",
@@ -148,7 +157,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "👆 please notice how the output if the indexing example above resulted in an array of 3x4. "
+    "👆 please notice how the output of the indexing example above resulted in an array of size `3x4`"
    ]
   },
   {
@@ -178,7 +187,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "In the above example, you can see how the output shape is `time` x `lats` x `lons`. "
+    "In the above example, you can see how the output shape is `time` x `lats` x `lons`.\n",
+    "\n",
+    "```{attention}\n",
+    "Please note that slices or sequences/arrays without named-dimensions are treated as if they have the same dimension which is indexed along.\n",
+    "```\n",
+    "\n",
+    "For example:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "da_air.sel(lat=[20, 30, 40], lon=target_lon, method=\"nearest\")"
    ]
   },
   {
@@ -188,19 +213,62 @@
    },
    "source": [
     "\n",
-    "But what if we would like to find the information from the nearest grid cell to a collection of specified points (for example, weather stations or tower data)?\n",
+    "But what if we'd like to find the nearest climate model grid cell to a collection of specified points (for example observation sites, or weather stations)?\n",
     "\n",
     "## Vectorized or Pointwise Indexing in Xarray\n",
     "\n",
     "Like NumPy and pandas, Xarray supports indexing many array elements at once in a *vectorized* manner. \n",
     "\n",
-    "**Vectorized indexing** or **Pointwise Indexing** using `DataArrays()` can be used to extract information from the nearest grid cells of interest, for example, the nearest climate model grid cells to a collection of specified weather station latitudes and longitudes.\n",
+    "**Vectorized indexing** or **Pointwise Indexing** using `DataArrays()` can be used to extract information from the nearest grid cells of interest, for example, the nearest climate model grid cells to a collection of specified observation tower data latitudes and longitudes.\n",
     "\n",
     "```{hint}\n",
-    "To trigger vectorized indexing behavior, you will need to provide the selection dimensions with a different name than the original dimensions. This dimension name will be used in the output array.\n",
+    "To trigger vectorized indexing behavior, you will need to provide the selection dimensions with a new **shared** output dimension name. This means that the dimensions of both indexers must be the same, and the output will have the same dimension name as the indexers.\n",
     "```\n",
     "\n",
-    "In the example below, the selections of the closest latitude and longitude are renamed to an output dimension named `points`:"
+    "Let's see how this works with an example. A researcher wants to find the nearest climate model grid cell to a collection of observation sites. She has the latitude and longitude of the observation sites as following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "obs_lats = [31.81,\n",
+    "            41.26,\n",
+    "            22.59,\n",
+    "            44.47,\n",
+    "            28.57]\n",
+    "\n",
+    "obs_lons = [200.16,\n",
+    "            201.57,\n",
+    "            305.54,\n",
+    "            210.56,\n",
+    "            226.59]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the researcher use the lists to index the DataArray, they will get the orthogonal indexing behavior, which is not what they wants."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "da_air.sel(lat=obs_lats, lon=obs_lats, method=\"nearest\") # -- orthogonal indexing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To trigger the pointwise indexing, they need to create DataArray objects with the same dimension name, and then use them to index the DataArray. \n",
+    "For example, the code below first create DataArray objects for the latitude and longitude of the observation sites using a shared dimension name `points`, and then use them to index the DataArray `air_temperature`:"
    ]
   },
   {
@@ -211,8 +279,8 @@
    },
    "outputs": [],
    "source": [
-    "# Define target latitude and longitude (where weather stations might be)\n",
-    "lat_points = xr.DataArray([31, 41, 42, 42], dims=\"points\")\n",
+    "## latitudes of weather stations with a dimension of \"points\"\n",
+    "lat_points = xr.DataArray(obs_lats, dims=\"points\")\n",
     "lat_points"
    ]
   },
@@ -224,7 +292,8 @@
    },
    "outputs": [],
    "source": [
-    "lon_points = xr.DataArray([200, 201, 202, 205], dims=\"points\")\n",
+    "## longitudes of weather stations with a dimension of \"points\"\n",
+    "lon_points = xr.DataArray(obs_lons, dims=\"points\")\n",
     "lon_points"
    ]
   },
@@ -243,7 +312,7 @@
    },
    "outputs": [],
    "source": [
-    "da_air.sel(lat=lat_points, lon=lon_points, method=\"nearest\")"
+    "da_air.sel(lat=lat_points, lon=lon_points, method=\"nearest\") # -- pointwise indexing"
    ]
   },
   {
@@ -268,34 +337,29 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "```{attention}\n",
-    "Please note that slices or sequences/arrays without named-dimensions are treated as if they have the same dimension which is indexed along.\n",
-    "```\n",
-    "\n",
-    "For example:"
+    "Now, let's plot the data for all stations."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "tags": []
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "da_air.sel(lat=[20, 30, 40], lon=lon_points, method=\"nearest\")"
+    "da_air.sel(lat=lat_points, lon=lon_points, method=\"nearest\").plot(x='time', hue='points');\n",
+    "\n"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Excersises\n",
+    "## Exercises\n",
     "\n",
     "::::{admonition} Exercise\n",
     ":class: tip\n",
     "\n",
-    "In the simple 2D 5x5 Xarray data array above, select the sub-array containing (0,0),(2,2),(4,4) : \n",
+    "In the simple 2D 5x5 Xarray data array above, select the sub-array containing (0,0),(2,2),(4,4):\n",
     "\n",
     ":::{admonition} Solution\n",
     ":class: dropdown\n",
@@ -319,11 +383,23 @@
    "source": [
     "## Additional Resources\n",
     "\n",
-    "- [Xarray Docs - Indexing and Selecting Data](https://docs.xarray.dev/en/stable/indexing.html)\n"
+    "- [Xarray Docs - Indexing and Selecting Data](https://docs.xarray.dev/en/stable/indexing.html)\n",
+    "\n",
+    "\n",
+    ":::{seealso}\n",
+    "- [NumPy Fancy Indexing](https://numpy.org/doc/stable/reference/arrays.indexing.html#fancy-indexing)\n",
+    "- [Xarray Indexing](indexing.md)\n",
+    "\n",
+    ":::\n"
    ]
   }
  ],
  "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
@@ -333,7 +409,8 @@
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3"
+   "pygments_lexer": "ipython3",
+   "version": "3.9.0"
   },
   "toc": {
    "base_numbering": 1,