From 89188d78134ce11e1c31d5d2e98f56db52c50b3b Mon Sep 17 00:00:00 2001 From: richard-rogers <93153899+richard-rogers@users.noreply.github.com> Date: Wed, 4 Dec 2024 04:54:16 +0000 Subject: [PATCH] Rename Pyspark dataframe columns with . example (#1588) ## Description Adds example of renaming columns in a Pyspark dataframe with names containing periods. ## Related Closes whylabs/whylogs#1532 - [ ] I have reviewed the [Guidelines for Contributing](CONTRIBUTING.md) and the [Code of Conduct](CODE_OF_CONDUCT.md). --- .../integrations/Pyspark_Profiling.ipynb | 1771 ++++++++++------- 1 file changed, 1086 insertions(+), 685 deletions(-) diff --git a/python/examples/integrations/Pyspark_Profiling.ipynb b/python/examples/integrations/Pyspark_Profiling.ipynb index 8473ea2fcc..b929a6ad38 100644 --- a/python/examples/integrations/Pyspark_Profiling.ipynb +++ b/python/examples/integrations/Pyspark_Profiling.ipynb @@ -1,690 +1,1091 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", - ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling) to leverage the power of whylogs and WhyLabs together!*" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# PySpark Integration\n", - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/Pyspark_Profiling.ipynb)\n", - "\n", - "\n", - "Hi! Perhaps you're already feeling confident with our library, but you really wish there was an easy way to plug our profiling into your existing PySpark jobs. Well, glad you've made it here, because this is what we are going to cover in this example notebook 😃\n", - "\n", - "If you wish to have other insights on how to use whylogs, feel free to check our [other existing examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples), as they might be extremely useful!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Installing the extra dependency\n", - "\n", - "As we want to enable users to have exactly what they need to use from whylogs, the `pyspark` integration comes as an extra dependency. In order to have it available, simply uncomment and run the following cell:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "# Note: you may need to restart the kernel to use updated packages.\n", - "%pip install 'whylogs[spark]'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initializing a SparkSession\n", - "\n", - "Here we will initialize a SparkSession. I'm also setting the `pyarrow` execution config, because it makes our methods even more performant. \n", - "\n", - ">**IMPORTANT**: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from pyspark.sql import SparkSession" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()\n", - "arrow_config_key = \"spark.sql.execution.arrow.pyspark.enabled\"\n", - "spark.conf.set(arrow_config_key, \"true\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Reading the data\n", - "\n", - "For the sake of simplicity (and computational efforts, so you can run this notebook from your local machine), we will read the Wine Quality dataset, available in this URL: \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "from pyspark import SparkFiles\n", - "\n", - "data_url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n", - "spark.sparkContext.addFile(data_url)" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], - "source": [ - "spark_dataframe = spark.read.option(\"delimiter\", \";\").option(\"inferSchema\", \"true\").csv(SparkFiles.get(\"winequality-red.csv\"), header=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-RECORD 0----------------------\n", - " fixed acidity | 7.4 \n", - " volatile acidity | 0.7 \n", - " citric acid | 0.0 \n", - " residual sugar | 1.9 \n", - " chlorides | 0.076 \n", - " free sulfur dioxide | 11.0 \n", - " total sulfur dioxide | 34.0 \n", - " density | 0.9978 \n", - " pH | 3.51 \n", - " sulphates | 0.56 \n", - " alcohol | 9.4 \n", - " quality | 5 \n", - "only showing top 1 row\n", - "\n" - ] - } - ], - "source": [ - "spark_dataframe.show(n=1, vertical=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "root\n", - " |-- fixed acidity: double (nullable = true)\n", - " |-- volatile acidity: double (nullable = true)\n", - " |-- citric acid: double (nullable = true)\n", - " |-- residual sugar: double (nullable = true)\n", - " |-- chlorides: double (nullable = true)\n", - " |-- free sulfur dioxide: double (nullable = true)\n", - " |-- total sulfur dioxide: double (nullable = true)\n", - " |-- density: double (nullable = true)\n", - " |-- pH: double (nullable = true)\n", - " |-- sulphates: double (nullable = true)\n", - " |-- alcohol: double (nullable = true)\n", - " |-- quality: integer (nullable = true)\n", - "\n" - ] - } - ], - "source": [ - "spark_dataframe.printSchema()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Profiling the data with whylogs\n", - "\n", - "Now that we have a Spark DataFrame in place, let's see how easy it is to profile our data with whylogs." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], - "source": [ - "from whylogs.api.pyspark.experimental import collect_column_profile_views\n", - "\n", - "column_views_dict = collect_column_profile_views(spark_dataframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Yeap. It's done. It is **that** easy.\n", - "\n", - "But what do we get with a `column_views_dict`? " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'alcohol': , 'chlorides': , 'citric acid': , 'density': , 'fixed acidity': , 'free sulfur dioxide': , 'pH': , 'quality': , 'residual sugar': , 'sulphates': , 'total sulfur dioxide': , 'volatile acidity': }\n" - ] - } - ], - "source": [ - "print(column_views_dict)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It is a dictionary with one `ColumnProfileView` object per column in your dataset. 
And we can inspect some of the metrics on each one of them, such as the counts for a given column" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(1599, 1599)" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "column_views_dict[\"density\"].get_metric(\"counts\").n.value, spark_dataframe.count()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Or their `mean` value:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.9967466791744841" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "column_views_dict[\"density\"].get_metric(\"distribution\").mean.value" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And now let's check how accurate whylogs did store that `mean` calculation." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------+\n", - "| avg(density)|\n", - "+------------------+\n", - "|0.9967466791744831|\n", - "+------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.functions import mean\n", - "spark_dataframe.select(mean(\"density\")).show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It is not the literal exact value, but it gets really close, right? That is because we are not extracting the exact information, but we are also **not sampling** the data. `whylogs` will look at **every data point** and *statistically* decide wether or not that data point is relevant to the final calculation. \n", - "\n", - "Is it just me or this is extremely powerful? Yes, it is.\n", - "\n", - "> \"Cool! But what can I do with a bunch of `ColumnProfileView`'s from my Dataset? I want to see everything together\n", - "\n", - "Well, you've come to the right place, because we will inspect the next method that does just that!" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "from whylogs.api.pyspark.experimental import collect_dataset_profile_view\n", - "\n", - "dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Yes, that easy. You now have a `DatasetProfileView`. As you might have seen from other example notebooks in our repo, you can turn this *lightweight* object into a pandas DataFrame, and visualize all the important metrics that we've profiled, like this:" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectcardinality/estcardinality/upper_1cardinality/lower_1...distribution/mindistribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90typeints/maxints/minfrequent_items/frequent_strings
column
alcohol159900159900065.00001065.00325665.000000...8.400009.300009.500010.2000011.1000012.00000SummaryType.COLUMNNaNNaNNaN
chlorides1599001599000153.000058153.007697153.000000...0.012000.060000.07000.079000.091000.10900SummaryType.COLUMNNaNNaNNaN
citric acid159900159900080.00001680.00401080.000000...0.000000.010000.09000.260000.430000.53000SummaryType.COLUMNNaNNaNNaN
density1599001599000439.557368445.310933433.943761...0.990070.994510.99560.996750.997860.99914SummaryType.COLUMNNaNNaNNaN
fixed acidity159900159900096.00002396.00481696.000000...4.600006.600007.10007.900009.2000010.70000SummaryType.COLUMNNaNNaNNaN
\n", - "

5 rows × 24 columns

\n", - "
" + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "pJfss1JOJc4I" + }, + "source": [ + ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", + ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling) to leverage the power of whylogs and WhyLabs together!*" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dFWwDRveJc4M" + }, + "source": [ + "# PySpark Integration\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/Pyspark_Profiling.ipynb)\n", + "\n", + "\n", + "Hi! Perhaps you're already feeling confident with our library, but you really wish there was an easy way to plug our profiling into your existing PySpark jobs. Well, glad you've made it here, because this is what we are going to cover in this example notebook 😃\n", + "\n", + "If you wish to have other insights on how to use whylogs, feel free to check our [other existing examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples), as they might be extremely useful!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4xQXK66aJc4M" + }, + "source": [ + "## Installing the extra dependency\n", + "\n", + "As we want to enable users to have exactly what they need to use from whylogs, the `pyspark` integration comes as an extra dependency. In order to have it available, simply uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TpxG9uOMJc4N" + }, + "outputs": [], + "source": [ + "# Note: you may need to restart the kernel to use updated packages.\n", + "%pip install 'whylogs[spark]'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mpRGR0TCJc4O" + }, + "source": [ + "## Initializing a SparkSession\n", + "\n", + "Here we will initialize a SparkSession. I'm also setting the `pyarrow` execution config, because it makes our methods even more performant.\n", + "\n", + ">**IMPORTANT**: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "tMAPYoEeJc4P" + }, + "outputs": [], + "source": [ + "from pyspark.sql import SparkSession" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "Fb_rCVoPJc4P" + }, + "outputs": [], + "source": [ + "spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()\n", + "arrow_config_key = \"spark.sql.execution.arrow.pyspark.enabled\"\n", + "spark.conf.set(arrow_config_key, \"true\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mSExY4vGJc4P" + }, + "source": [ + "## Reading the data\n", + "\n", + "For the sake of simplicity (and computational efforts, so you can run this notebook from your local machine), we will read the Wine Quality dataset, available in this URL: \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "vgmlP2S2Jc4P" + }, + "outputs": [], + "source": [ + "from pyspark import SparkFiles\n", + "\n", + "data_url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n", + "spark.sparkContext.addFile(data_url)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "KzNoJLBCJc4Q" + }, + "outputs": [], + "source": [ + "spark_dataframe = spark.read.option(\"delimiter\", \";\").option(\"inferSchema\", \"true\").csv(SparkFiles.get(\"winequality-red.csv\"), header=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "D_Il3a7_Jc4R", + "outputId": "7d74162d-6eb7-446e-9e38-83f61d0e49e9", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "-RECORD 0----------------------\n", + " fixed acidity | 7.4 \n", + " volatile acidity | 0.7 \n", + " citric acid | 0.0 \n", + " residual sugar | 1.9 \n", + " chlorides | 0.076 \n", + " free sulfur dioxide | 11.0 \n", + " total sulfur dioxide | 34.0 \n", + " density | 0.9978 \n", + " pH | 3.51 \n", + " sulphates | 0.56 \n", + " alcohol | 9.4 \n", + " quality | 5 \n", + "only showing top 1 row\n", + "\n" + ] + } + ], + "source": [ + "spark_dataframe.show(n=1, vertical=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "y_Q85tjAJc4R", + "outputId": "442f78d2-6999-450d-ef74-976d9845dc99", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "root\n", + " |-- fixed acidity: double (nullable = true)\n", + " |-- volatile acidity: double (nullable = true)\n", + " |-- citric acid: double (nullable = true)\n", + " |-- residual sugar: double (nullable = true)\n", + " |-- chlorides: double (nullable = true)\n", + " |-- free sulfur dioxide: double (nullable = true)\n", + " |-- total sulfur dioxide: double (nullable = true)\n", + " |-- density: double (nullable = true)\n", + " |-- pH: double (nullable = true)\n", + " |-- sulphates: double (nullable = true)\n", + " |-- alcohol: double (nullable = true)\n", + " |-- quality: integer (nullable = true)\n", + "\n" + ] + } ], - "text/plain": [ - " counts/n counts/null types/integral types/fractional \\\n", - "column \n", - "alcohol 1599 0 0 1599 \n", - "chlorides 1599 0 0 1599 \n", - "citric acid 1599 0 0 1599 \n", - "density 1599 0 0 1599 \n", - "fixed acidity 1599 0 0 1599 \n", - "\n", - " types/boolean types/string types/object cardinality/est \\\n", - "column \n", - "alcohol 0 0 0 65.000010 \n", - "chlorides 0 0 0 153.000058 \n", - "citric acid 0 0 0 80.000016 \n", - "density 0 0 0 439.557368 \n", - "fixed acidity 0 0 0 96.000023 \n", - "\n", - " cardinality/upper_1 cardinality/lower_1 ... \\\n", - "column ... \n", - "alcohol 65.003256 65.000000 ... \n", - "chlorides 153.007697 153.000000 ... \n", - "citric acid 80.004010 80.000000 ... \n", - "density 445.310933 433.943761 ... \n", - "fixed acidity 96.004816 96.000000 ... 
\n", - "\n", - " distribution/min distribution/q_10 distribution/q_25 \\\n", - "column \n", - "alcohol 8.40000 9.30000 9.5000 \n", - "chlorides 0.01200 0.06000 0.0700 \n", - "citric acid 0.00000 0.01000 0.0900 \n", - "density 0.99007 0.99451 0.9956 \n", - "fixed acidity 4.60000 6.60000 7.1000 \n", - "\n", - " distribution/median distribution/q_75 distribution/q_90 \\\n", - "column \n", - "alcohol 10.20000 11.10000 12.00000 \n", - "chlorides 0.07900 0.09100 0.10900 \n", - "citric acid 0.26000 0.43000 0.53000 \n", - "density 0.99675 0.99786 0.99914 \n", - "fixed acidity 7.90000 9.20000 10.70000 \n", - "\n", - " type ints/max ints/min \\\n", - "column \n", - "alcohol SummaryType.COLUMN NaN NaN \n", - "chlorides SummaryType.COLUMN NaN NaN \n", - "citric acid SummaryType.COLUMN NaN NaN \n", - "density SummaryType.COLUMN NaN NaN \n", - "fixed acidity SummaryType.COLUMN NaN NaN \n", - "\n", - " frequent_items/frequent_strings \n", - "column \n", - "alcohol NaN \n", - "chlorides NaN \n", - "citric acid NaN \n", - "density NaN \n", - "fixed acidity NaN \n", - "\n", - "[5 rows x 24 columns]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" + "source": [ + "spark_dataframe.printSchema()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SsWuKLSbJc4R" + }, + "source": [ + "## Profiling the data with whylogs\n", + "\n", + "Now that we have a Spark DataFrame in place, let's see how easy it is to profile our data with whylogs." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "aP4jgNpXJc4S" + }, + "outputs": [], + "source": [ + "from whylogs.api.pyspark.experimental import collect_column_profile_views\n", + "\n", + "column_views_dict = collect_column_profile_views(spark_dataframe)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CVYL1xEgJc4S" + }, + "source": [ + "Yeap. It's done. It is **that** easy.\n", + "\n", + "But what do we get with a `column_views_dict`?" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "ADl0tMuGJc4S", + "outputId": "c78887c7-d6a8-4420-b9d2-f4311808c8f0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'alcohol': , 'chlorides': , 'citric acid': , 'density': , 'fixed acidity': , 'free sulfur dioxide': , 'pH': , 'quality': , 'residual sugar': , 'sulphates': , 'total sulfur dioxide': , 'volatile acidity': }\n" + ] + } + ], + "source": [ + "print(column_views_dict)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SA4t1YMoJc4T" + }, + "source": [ + "It is a dictionary with one `ColumnProfileView` object per column in your dataset. 
And we can inspect some of the metrics on each one of them, such as the counts for a given column" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "pROki397Jc4T", + "outputId": "84a163ad-8dce-4209-d2a3-06a0c1c76cb8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1599, 1599)" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ], + "source": [ + "column_views_dict[\"density\"].get_metric(\"counts\").n.value, spark_dataframe.count()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R231V7EhJc4T" + }, + "source": [ + "Or their `mean` value:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "VzcsqEplJc4T", + "outputId": "a70602d8-d364-48f4-8de9-af5526455919", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9967466791744841" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ], + "source": [ + "column_views_dict[\"density\"].get_metric(\"distribution\").mean.value" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GsCgZLMXJc4U" + }, + "source": [ + "And now let's check how accurate whylogs did store that `mean` calculation." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "RkUkf58wJc4U", + "outputId": "bb565093-a05e-42fc-f802-4b3f5c879b72", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "+------------------+\n", + "| avg(density)|\n", + "+------------------+\n", + "|0.9967466791744831|\n", + "+------------------+\n", + "\n" + ] + } + ], + "source": [ + "from pyspark.sql.functions import mean\n", + "spark_dataframe.select(mean(\"density\")).show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o_NspRo4Jc4U" + }, + "source": [ + "It is not the literal exact value, but it gets really close, right? That is because we are not extracting the exact information, but we are also **not sampling** the data. `whylogs` will look at **every data point** and *statistically* decide wether or not that data point is relevant to the final calculation.\n", + "\n", + "Is it just me or this is extremely powerful? Yes, it is.\n", + "\n", + "> \"Cool! But what can I do with a bunch of `ColumnProfileView`'s from my Dataset? I want to see everything together\n", + "\n", + "Well, you've come to the right place, because we will inspect the next method that does just that!" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "19nYwCzxJc4U" + }, + "outputs": [], + "source": [ + "from whylogs.api.pyspark.experimental import collect_dataset_profile_view\n", + "\n", + "dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xI6yLa-DJc4V" + }, + "source": [ + "Yes, that easy. You now have a `DatasetProfileView`. 
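If you prefer to keep working column by column, the dataset-level view can hand back the same per-column views. A small sketch, assuming `DatasetProfileView.get_column` is available in your installed whylogs version (it is not shown elsewhere in this notebook, so double-check against your release):

```python
# Hedged sketch: fetch a single column's profile from the dataset-level view
# and read the same distribution metric used earlier in this notebook.
# Assumes dataset_profile_view.get_column exists in your whylogs version.
density_view = dataset_profile_view.get_column("density")
print(density_view.get_metric("distribution").mean.value)
```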
As you might have seen from other example notebooks in our repo, you can turn this *lightweight* object into a pandas DataFrame, and visualize all the important metrics that we've profiled, like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "XZGmQnGDJc4V", + "outputId": "93c00a89-8de1-4155-f5dc-80d29b37d58e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 316 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " cardinality/est cardinality/lower_1 cardinality/upper_1 \\\n", + "column \n", + "alcohol 65.000010 65.000000 65.003256 \n", + "chlorides 153.000058 153.000000 153.007697 \n", + "citric acid 80.000016 80.000000 80.004010 \n", + "density 439.557368 433.943761 445.310933 \n", + "fixed acidity 96.000023 96.000000 96.004816 \n", + "\n", + " counts/inf counts/n counts/nan counts/null counts/true \\\n", + "column \n", + "alcohol 0 1599 0 0 0 \n", + "chlorides 0 1599 0 0 0 \n", + "citric acid 0 1599 0 0 0 \n", + "density 0 1599 0 0 0 \n", + "fixed acidity 0 1599 0 0 0 \n", + "\n", + " distribution/max distribution/mean ... type \\\n", + "column ... \n", + "alcohol 14.90000 10.422983 ... SummaryType.COLUMN \n", + "chlorides 0.61100 0.087467 ... SummaryType.COLUMN \n", + "citric acid 1.00000 0.270976 ... SummaryType.COLUMN \n", + "density 1.00369 0.996747 ... SummaryType.COLUMN \n", + "fixed acidity 15.90000 8.319637 ... SummaryType.COLUMN \n", + "\n", + " types/boolean types/fractional types/integral types/object \\\n", + "column \n", + "alcohol 0 1599 0 0 \n", + "chlorides 0 1599 0 0 \n", + "citric acid 0 1599 0 0 \n", + "density 0 1599 0 0 \n", + "fixed acidity 0 1599 0 0 \n", + "\n", + " types/string types/tensor frequent_items/frequent_strings \\\n", + "column \n", + "alcohol 0 0 NaN \n", + "chlorides 0 0 NaN \n", + "citric acid 0 0 NaN \n", + "density 0 0 NaN \n", + "fixed acidity 0 0 NaN \n", + "\n", + " ints/max ints/min \n", + "column \n", + "alcohol NaN NaN \n", + "chlorides NaN NaN \n", + "citric acid NaN NaN \n", + "density NaN NaN \n", + "fixed acidity NaN NaN \n", + "\n", + "[5 rows x 32 columns]" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
cardinality/estcardinality/lower_1cardinality/upper_1counts/infcounts/ncounts/nancounts/nullcounts/truedistribution/maxdistribution/mean...typetypes/booleantypes/fractionaltypes/integraltypes/objecttypes/stringtypes/tensorfrequent_items/frequent_stringsints/maxints/min
column
alcohol65.00001065.00000065.0032560159900014.9000010.422983...SummaryType.COLUMN015990000NaNNaNNaN
chlorides153.000058153.000000153.007697015990000.611000.087467...SummaryType.COLUMN015990000NaNNaNNaN
citric acid80.00001680.00000080.004010015990001.000000.270976...SummaryType.COLUMN015990000NaNNaNNaN
density439.557368433.943761445.310933015990001.003690.996747...SummaryType.COLUMN015990000NaNNaNNaN
fixed acidity96.00002396.00000096.0048160159900015.900008.319637...SummaryType.COLUMN015990000NaNNaNNaN
\n", + "

5 rows × 32 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe" + } + }, + "metadata": {}, + "execution_count": 14 + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "dataset_profile_view.to_pandas().head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4DsdTSFMJc4V" + }, + "source": [ + "## Persisting as a file\n", + "\n", + "After collecting profiles, it is a good practice to store them as a file. This will allow you to later on read them back, merge with future profiles and track how is your data behaving along the way." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uv5LsfqTJc4V" + }, + "outputs": [], + "source": [ + "dataset_profile_view.write(path=\"my_super_awesome_profile.bin\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvcl73JsJc4W" + }, + "source": [ + "And that's it, you have just written a profile generated with spark to your local environment! If you wish to upload to different locations, such as s3, whylabs or others, please make sure to check out our [other examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples) page.\n", + "\n", + "Hopefully this tutorial will help you get started to profile and observe your data behaviour in your Spark jobs with almost no friction :)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Periods in column names\n", + "\n", + "Spark interprets periods in column names as structured fields. This can cause problems when using Spark to profile dataframes created outside of Spark that have periods in their column names. For example:" + ], + "metadata": { + "id": "GubNHJy6LJDy" + } + }, + { + "cell_type": "code", + "source": [ + "pandas_df = pd.DataFrame({\"col1.mass\": [2.3, 4.1, 1.8], \"col1.volume\": [10.3, 8.7, 6.3]})\n", + "spark_df = spark.createDataFrame(pandas_df)\n", + "spark_df.show()\n", + "collect_column_profile_views(spark_df)" + ], + "metadata": { + "id": "pIOeZewIM9bs", + "outputId": "a6563674-45ac-468c-d7cf-30b9637284d7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 484 + } + }, + "execution_count": 21, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "+---------+-----------+\n", + "|col1.mass|col1.volume|\n", + "+---------+-----------+\n", + "| 2.3| 10.3|\n", + "| 4.1| 8.7|\n", + "| 1.8| 6.3|\n", + "+---------+-----------+\n", + "\n" + ] + }, + { + "output_type": "error", + "ename": "AnalysisException", + "evalue": "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col1`.`mass` cannot be resolved. Did you mean one of the following? 
[`col1`.`mass`, `col1`.`volume`].", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mspark_df\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpandas_df\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mspark_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshow\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mcollect_column_profile_views\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mspark_df\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/whylogs/api/pyspark/experimental/profiler.py\u001b[0m in \u001b[0;36mcollect_column_profile_views\u001b[0;34m(input_df, schema)\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0mwhylogs_pandas_map_profiler_with_schema\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpartial\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwhylogs_pandas_map_profiler\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 70\u001b[0;31m \u001b[0mprofile_bytes_df\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput_df_arrays\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmapInPandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwhylogs_pandas_map_profiler_with_schema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcp\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# type: ignore\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 71\u001b[0m column_profiles = profile_bytes_df.groupby(COL_NAME_FIELD).applyInPandas( # linebreak\n\u001b[1;32m 72\u001b[0m \u001b[0mcolumn_profile_bytes_aggregator\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/sql/pandas/map_ops.py\u001b[0m in \u001b[0;36mmapInPandas\u001b[0;34m(self, func, schema, barrier)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturnType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunctionType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPythonEvalType\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSQL_MAP_PANDAS_ITER_UDF\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m ) # type: ignore[call-overload]\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mudf_column\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcol\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0mjdf\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmapInPandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mudf_column\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexpr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbarrier\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msparkSession\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/sql/pandas/map_ops.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturnType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunctionType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPythonEvalType\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSQL_MAP_PANDAS_ITER_UDF\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m ) # type: ignore[call-overload]\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mudf_column\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcol\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0mjdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmapInPandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mudf_column\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexpr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbarrier\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msparkSession\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/sql/dataframe.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 3078\u001b[0m \"\"\"\n\u001b[1;32m 3079\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3080\u001b[0;31m \u001b[0mjc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3081\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mColumn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3082\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mColumn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1320\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1321\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1322\u001b[0;31m return_value = get_return_value(\n\u001b[0m\u001b[1;32m 1323\u001b[0m answer, self.gateway_client, self.target_id, self.name)\n\u001b[1;32m 1324\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/errors/exceptions/captured.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 183\u001b[0m \u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 184\u001b[0m \u001b[0;31m# JVM exception message.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 185\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mconverted\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 186\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 187\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mAnalysisException\u001b[0m: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col1`.`mass` cannot be resolved. Did you mean one of the following? [`col1`.`mass`, `col1`.`volume`]." + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "The error above is caused by Spark misinterpreting the periods in the column names. We can work around this limitation by replacing any periods in the column names:" + ], + "metadata": { + "id": "oeiVK0iaQ9NG" + } + }, + { + "cell_type": "code", + "source": [ + "renamed_df = spark_df.toDF(*(c.replace('.', '_') for c in spark_df.columns))\n", + "renamed_df.show()\n", + "collect_column_profile_views(renamed_df)" + ], + "metadata": { + "id": "ht1Ic5DGRd1G", + "outputId": "73a7cab9-90ee-4f34-d812-70120fd0a01b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 22, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "+---------+-----------+\n", + "|col1_mass|col1_volume|\n", + "+---------+-----------+\n", + "| 2.3| 10.3|\n", + "| 4.1| 8.7|\n", + "| 1.8| 6.3|\n", + "+---------+-----------+\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'col1_mass': ,\n", + " 'col1_volume': }" + ] + }, + "metadata": {}, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "âš  Be sure that any column names specified in whylogs schemas or UDFs match the renamed columns if you use this approach." 
+ ], + "metadata": { + "id": "I3-RSh1BR8oP" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CUSeRZehJc4W" + }, + "source": [ + "## Important note\n", + "\n", + "As you might have seen from the imports, currently this pyspark implementation is the **experimental** phase. We ran some benchmark ourselves with it, and for the sake of example, a `90Gb` dataset with 80M rows could be profiled in under 3 minutes! Cool, right? But we still want more users to try this on their own, see if there are places to be improved and give us feedback before we make it officially **the** spark module here.\n", + "Please, feel free to reach out to our [community Slack](https://communityinviter.com/apps/whylabs-community/rsqrd-ai-community) and interact with us there. We will love to hear from you :)" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "16a10773934acde374a1cd808bcd53b1085f60e17ec18f4c0c26564dd890a5a0" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.14" + }, + "colab": { + "provenance": [] } - ], - "source": [ - "import pandas as pd \n", - "\n", - "dataset_profile_view.to_pandas().head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Persisting as a file\n", - "\n", - "After collecting profiles, it is a good practice to store them as a file. This will allow you to later on read them back, merge with future profiles and track how is your data behaving along the way." - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "dataset_profile_view.write(path=\"my_super_awesome_profile.bin\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And that's it, you have just written a profile generated with spark to your local environment! If you wish to upload to different locations, such as s3, whylabs or others, please make sure to check out our [other examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples) page.\n", - "\n", - "Hopefully this tutorial will help you get started to profile and observe your data behaviour in your Spark jobs with almost no friction :)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Important note\n", - "\n", - "As you might have seen from the imports, currently this pyspark implementation is the **experimental** phase. We ran some benchmark ourselves with it, and for the sake of example, a `90Gb` dataset with 80M rows could be profiled in under 3 minutes! Cool, right? But we still want more users to try this on their own, see if there are places to be improved and give us feedback before we make it officially **the** spark module here. \n", - "Please, feel free to reach out to our [community Slack](https://communityinviter.com/apps/whylabs-community/rsqrd-ai-community) and interact with us there. 
We will love to hear from you :)" - ] - } - ], - "metadata": { - "interpreter": { - "hash": "16a10773934acde374a1cd808bcd53b1085f60e17ec18f4c0c26564dd890a5a0" - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.14" - } - }, - "nbformat": 4, - "nbformat_minor": 2 + "nbformat": 4, + "nbformat_minor": 0 }
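As a closing note on the "Persisting as a file" section above: the text mentions reading profiles back and merging them with future profiles, but the notebook never shows a merge. The sketch below assumes `DatasetProfileView.merge` behaves as a pairwise merge that returns a new view (verify against your installed whylogs version); everything else reuses calls already shown in this notebook:

```python
# Hedged sketch: profile two slices of the data separately and merge the views
# before writing a single combined profile. Assumes DatasetProfileView.merge
# exists in your whylogs version; check the whylogs documentation.
low_quality_df = spark_dataframe.filter("quality <= 5")
high_quality_df = spark_dataframe.filter("quality > 5")

profile_low = collect_dataset_profile_view(input_df=low_quality_df)
profile_high = collect_dataset_profile_view(input_df=high_quality_df)

merged_view = profile_low.merge(profile_high)
merged_view.write(path="merged_profile.bin")
```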