From 89188d78134ce11e1c31d5d2e98f56db52c50b3b Mon Sep 17 00:00:00 2001 From: richard-rogers <93153899+richard-rogers@users.noreply.github.com> Date: Wed, 4 Dec 2024 04:54:16 +0000 Subject: [PATCH] Rename Pyspark dataframe columns with . example (#1588) ## Description Adds example of renaming columns in a Pyspark dataframe with names containing periods. ## Related Closes whylabs/whylogs#1532 - [ ] I have reviewed the [Guidelines for Contributing](CONTRIBUTING.md) and the [Code of Conduct](CODE_OF_CONDUCT.md). --- .../integrations/Pyspark_Profiling.ipynb | 1771 ++++++++++------- 1 file changed, 1086 insertions(+), 685 deletions(-) diff --git a/python/examples/integrations/Pyspark_Profiling.ipynb b/python/examples/integrations/Pyspark_Profiling.ipynb index 8473ea2fcc..b929a6ad38 100644 --- a/python/examples/integrations/Pyspark_Profiling.ipynb +++ b/python/examples/integrations/Pyspark_Profiling.ipynb @@ -1,690 +1,1091 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", - ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling) to leverage the power of whylogs and WhyLabs together!*" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# PySpark Integration\n", - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/Pyspark_Profiling.ipynb)\n", - "\n", - "\n", - "Hi! Perhaps you're already feeling confident with our library, but you really wish there was an easy way to plug our profiling into your existing PySpark jobs. Well, glad you've made it here, because this is what we are going to cover in this example notebook 😃\n", - "\n", - "If you wish to have other insights on how to use whylogs, feel free to check our [other existing examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples), as they might be extremely useful!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Installing the extra dependency\n", - "\n", - "As we want to enable users to have exactly what they need to use from whylogs, the `pyspark` integration comes as an extra dependency. In order to have it available, simply uncomment and run the following cell:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "# Note: you may need to restart the kernel to use updated packages.\n", - "%pip install 'whylogs[spark]'" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Initializing a SparkSession\n", - "\n", - "Here we will initialize a SparkSession. I'm also setting the `pyarrow` execution config, because it makes our methods even more performant. \n", - "\n", - ">**IMPORTANT**: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from pyspark.sql import SparkSession" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()\n", - "arrow_config_key = \"spark.sql.execution.arrow.pyspark.enabled\"\n", - "spark.conf.set(arrow_config_key, \"true\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Reading the data\n", - "\n", - "For the sake of simplicity (and computational efforts, so you can run this notebook from your local machine), we will read the Wine Quality dataset, available in this URL: \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "from pyspark import SparkFiles\n", - "\n", - "data_url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n", - "spark.sparkContext.addFile(data_url)" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], - "source": [ - "spark_dataframe = spark.read.option(\"delimiter\", \";\").option(\"inferSchema\", \"true\").csv(SparkFiles.get(\"winequality-red.csv\"), header=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "-RECORD 0----------------------\n", - " fixed acidity | 7.4 \n", - " volatile acidity | 0.7 \n", - " citric acid | 0.0 \n", - " residual sugar | 1.9 \n", - " chlorides | 0.076 \n", - " free sulfur dioxide | 11.0 \n", - " total sulfur dioxide | 34.0 \n", - " density | 0.9978 \n", - " pH | 3.51 \n", - " sulphates | 0.56 \n", - " alcohol | 9.4 \n", - " quality | 5 \n", - "only showing top 1 row\n", - "\n" - ] - } - ], - "source": [ - "spark_dataframe.show(n=1, vertical=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "root\n", - " |-- fixed acidity: double (nullable = true)\n", - " |-- volatile acidity: double (nullable = true)\n", - " |-- citric acid: double (nullable = true)\n", - " |-- residual sugar: double (nullable = true)\n", - " |-- chlorides: double (nullable = true)\n", - " |-- free sulfur dioxide: double (nullable = true)\n", - " |-- total sulfur dioxide: double (nullable = true)\n", - " |-- density: double (nullable = true)\n", - " |-- pH: double (nullable = true)\n", - " |-- sulphates: double (nullable = true)\n", - " |-- alcohol: double (nullable = true)\n", - " |-- quality: integer (nullable = true)\n", - "\n" - ] - } - ], - "source": [ - "spark_dataframe.printSchema()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Profiling the data with whylogs\n", - "\n", - "Now that we have a Spark DataFrame in place, let's see how easy it is to profile our data with whylogs." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], - "source": [ - "from whylogs.api.pyspark.experimental import collect_column_profile_views\n", - "\n", - "column_views_dict = collect_column_profile_views(spark_dataframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Yeap. It's done. It is **that** easy.\n", - "\n", - "But what do we get with a `column_views_dict`? " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'alcohol': , 'chlorides': , 'citric acid': , 'density': , 'fixed acidity': , 'free sulfur dioxide': , 'pH': , 'quality': , 'residual sugar': , 'sulphates': , 'total sulfur dioxide': , 'volatile acidity': }\n" - ] - } - ], - "source": [ - "print(column_views_dict)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It is a dictionary with one `ColumnProfileView` object per column in your dataset. 
And we can inspect some of the metrics on each one of them, such as the counts for a given column" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(1599, 1599)" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "column_views_dict[\"density\"].get_metric(\"counts\").n.value, spark_dataframe.count()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Or their `mean` value:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0.9967466791744841" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "column_views_dict[\"density\"].get_metric(\"distribution\").mean.value" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And now let's check how accurate whylogs did store that `mean` calculation." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------+\n", - "| avg(density)|\n", - "+------------------+\n", - "|0.9967466791744831|\n", - "+------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.functions import mean\n", - "spark_dataframe.select(mean(\"density\")).show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It is not the literal exact value, but it gets really close, right? That is because we are not extracting the exact information, but we are also **not sampling** the data. `whylogs` will look at **every data point** and *statistically* decide wether or not that data point is relevant to the final calculation. \n", - "\n", - "Is it just me or this is extremely powerful? Yes, it is.\n", - "\n", - "> \"Cool! But what can I do with a bunch of `ColumnProfileView`'s from my Dataset? I want to see everything together\n", - "\n", - "Well, you've come to the right place, because we will inspect the next method that does just that!" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "from whylogs.api.pyspark.experimental import collect_dataset_profile_view\n", - "\n", - "dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Yes, that easy. You now have a `DatasetProfileView`. As you might have seen from other example notebooks in our repo, you can turn this *lightweight* object into a pandas DataFrame, and visualize all the important metrics that we've profiled, like this:" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
counts/ncounts/nulltypes/integraltypes/fractionaltypes/booleantypes/stringtypes/objectcardinality/estcardinality/upper_1cardinality/lower_1...distribution/mindistribution/q_10distribution/q_25distribution/mediandistribution/q_75distribution/q_90typeints/maxints/minfrequent_items/frequent_strings
column
alcohol159900159900065.00001065.00325665.000000...8.400009.300009.500010.2000011.1000012.00000SummaryType.COLUMNNaNNaNNaN
chlorides1599001599000153.000058153.007697153.000000...0.012000.060000.07000.079000.091000.10900SummaryType.COLUMNNaNNaNNaN
citric acid159900159900080.00001680.00401080.000000...0.000000.010000.09000.260000.430000.53000SummaryType.COLUMNNaNNaNNaN
density1599001599000439.557368445.310933433.943761...0.990070.994510.99560.996750.997860.99914SummaryType.COLUMNNaNNaNNaN
fixed acidity159900159900096.00002396.00481696.000000...4.600006.600007.10007.900009.2000010.70000SummaryType.COLUMNNaNNaNNaN
\n", - "

5 rows × 24 columns

\n", - "
" + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "pJfss1JOJc4I" + }, + "source": [ + ">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*
\n", + ">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Pyspark_Profiling) to leverage the power of whylogs and WhyLabs together!*" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dFWwDRveJc4M" + }, + "source": [ + "# PySpark Integration\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/integrations/Pyspark_Profiling.ipynb)\n", + "\n", + "\n", + "Hi! Perhaps you're already feeling confident with our library, but you really wish there was an easy way to plug our profiling into your existing PySpark jobs. Well, glad you've made it here, because this is what we are going to cover in this example notebook 😃\n", + "\n", + "If you wish to have other insights on how to use whylogs, feel free to check our [other existing examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples), as they might be extremely useful!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4xQXK66aJc4M" + }, + "source": [ + "## Installing the extra dependency\n", + "\n", + "As we want to enable users to have exactly what they need to use from whylogs, the `pyspark` integration comes as an extra dependency. In order to have it available, simply uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TpxG9uOMJc4N" + }, + "outputs": [], + "source": [ + "# Note: you may need to restart the kernel to use updated packages.\n", + "%pip install 'whylogs[spark]'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mpRGR0TCJc4O" + }, + "source": [ + "## Initializing a SparkSession\n", + "\n", + "Here we will initialize a SparkSession. I'm also setting the `pyarrow` execution config, because it makes our methods even more performant.\n", + "\n", + ">**IMPORTANT**: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "tMAPYoEeJc4P" + }, + "outputs": [], + "source": [ + "from pyspark.sql import SparkSession" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "Fb_rCVoPJc4P" + }, + "outputs": [], + "source": [ + "spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()\n", + "arrow_config_key = \"spark.sql.execution.arrow.pyspark.enabled\"\n", + "spark.conf.set(arrow_config_key, \"true\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mSExY4vGJc4P" + }, + "source": [ + "## Reading the data\n", + "\n", + "For the sake of simplicity (and computational efforts, so you can run this notebook from your local machine), we will read the Wine Quality dataset, available in this URL: \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "vgmlP2S2Jc4P" + }, + "outputs": [], + "source": [ + "from pyspark import SparkFiles\n", + "\n", + "data_url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n", + "spark.sparkContext.addFile(data_url)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "KzNoJLBCJc4Q" + }, + "outputs": [], + "source": [ + "spark_dataframe = spark.read.option(\"delimiter\", \";\").option(\"inferSchema\", \"true\").csv(SparkFiles.get(\"winequality-red.csv\"), header=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "D_Il3a7_Jc4R", + "outputId": "7d74162d-6eb7-446e-9e38-83f61d0e49e9", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "-RECORD 0----------------------\n", + " fixed acidity | 7.4 \n", + " volatile acidity | 0.7 \n", + " citric acid | 0.0 \n", + " residual sugar | 1.9 \n", + " chlorides | 0.076 \n", + " free sulfur dioxide | 11.0 \n", + " total sulfur dioxide | 34.0 \n", + " density | 0.9978 \n", + " pH | 3.51 \n", + " sulphates | 0.56 \n", + " alcohol | 9.4 \n", + " quality | 5 \n", + "only showing top 1 row\n", + "\n" + ] + } + ], + "source": [ + "spark_dataframe.show(n=1, vertical=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "y_Q85tjAJc4R", + "outputId": "442f78d2-6999-450d-ef74-976d9845dc99", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "root\n", + " |-- fixed acidity: double (nullable = true)\n", + " |-- volatile acidity: double (nullable = true)\n", + " |-- citric acid: double (nullable = true)\n", + " |-- residual sugar: double (nullable = true)\n", + " |-- chlorides: double (nullable = true)\n", + " |-- free sulfur dioxide: double (nullable = true)\n", + " |-- total sulfur dioxide: double (nullable = true)\n", + " |-- density: double (nullable = true)\n", + " |-- pH: double (nullable = true)\n", + " |-- sulphates: double (nullable = true)\n", + " |-- alcohol: double (nullable = true)\n", + " |-- quality: integer (nullable = true)\n", + "\n" + ] + } ], - "text/plain": [ - " counts/n counts/null types/integral types/fractional \\\n", - "column \n", - "alcohol 1599 0 0 1599 \n", - "chlorides 1599 0 0 1599 \n", - "citric acid 1599 0 0 1599 \n", - "density 1599 0 0 1599 \n", - "fixed acidity 1599 0 0 1599 \n", - "\n", - " types/boolean types/string types/object cardinality/est \\\n", - "column \n", - "alcohol 0 0 0 65.000010 \n", - "chlorides 0 0 0 153.000058 \n", - "citric acid 0 0 0 80.000016 \n", - "density 0 0 0 439.557368 \n", - "fixed acidity 0 0 0 96.000023 \n", - "\n", - " cardinality/upper_1 cardinality/lower_1 ... \\\n", - "column ... \n", - "alcohol 65.003256 65.000000 ... \n", - "chlorides 153.007697 153.000000 ... \n", - "citric acid 80.004010 80.000000 ... \n", - "density 445.310933 433.943761 ... \n", - "fixed acidity 96.004816 96.000000 ... 
\n", - "\n", - " distribution/min distribution/q_10 distribution/q_25 \\\n", - "column \n", - "alcohol 8.40000 9.30000 9.5000 \n", - "chlorides 0.01200 0.06000 0.0700 \n", - "citric acid 0.00000 0.01000 0.0900 \n", - "density 0.99007 0.99451 0.9956 \n", - "fixed acidity 4.60000 6.60000 7.1000 \n", - "\n", - " distribution/median distribution/q_75 distribution/q_90 \\\n", - "column \n", - "alcohol 10.20000 11.10000 12.00000 \n", - "chlorides 0.07900 0.09100 0.10900 \n", - "citric acid 0.26000 0.43000 0.53000 \n", - "density 0.99675 0.99786 0.99914 \n", - "fixed acidity 7.90000 9.20000 10.70000 \n", - "\n", - " type ints/max ints/min \\\n", - "column \n", - "alcohol SummaryType.COLUMN NaN NaN \n", - "chlorides SummaryType.COLUMN NaN NaN \n", - "citric acid SummaryType.COLUMN NaN NaN \n", - "density SummaryType.COLUMN NaN NaN \n", - "fixed acidity SummaryType.COLUMN NaN NaN \n", - "\n", - " frequent_items/frequent_strings \n", - "column \n", - "alcohol NaN \n", - "chlorides NaN \n", - "citric acid NaN \n", - "density NaN \n", - "fixed acidity NaN \n", - "\n", - "[5 rows x 24 columns]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" + "source": [ + "spark_dataframe.printSchema()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SsWuKLSbJc4R" + }, + "source": [ + "## Profiling the data with whylogs\n", + "\n", + "Now that we have a Spark DataFrame in place, let's see how easy it is to profile our data with whylogs." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "pycharm": { + "name": "#%%\n" + }, + "id": "aP4jgNpXJc4S" + }, + "outputs": [], + "source": [ + "from whylogs.api.pyspark.experimental import collect_column_profile_views\n", + "\n", + "column_views_dict = collect_column_profile_views(spark_dataframe)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CVYL1xEgJc4S" + }, + "source": [ + "Yeap. It's done. It is **that** easy.\n", + "\n", + "But what do we get with a `column_views_dict`?" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "id": "ADl0tMuGJc4S", + "outputId": "c78887c7-d6a8-4420-b9d2-f4311808c8f0", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'alcohol': , 'chlorides': , 'citric acid': , 'density': , 'fixed acidity': , 'free sulfur dioxide': , 'pH': , 'quality': , 'residual sugar': , 'sulphates': , 'total sulfur dioxide': , 'volatile acidity': }\n" + ] + } + ], + "source": [ + "print(column_views_dict)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SA4t1YMoJc4T" + }, + "source": [ + "It is a dictionary with one `ColumnProfileView` object per column in your dataset. 
And we can inspect some of the metrics on each one of them, such as the counts for a given column" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "pROki397Jc4T", + "outputId": "84a163ad-8dce-4209-d2a3-06a0c1c76cb8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(1599, 1599)" + ] + }, + "metadata": {}, + "execution_count": 10 + } + ], + "source": [ + "column_views_dict[\"density\"].get_metric(\"counts\").n.value, spark_dataframe.count()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R231V7EhJc4T" + }, + "source": [ + "Or their `mean` value:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "VzcsqEplJc4T", + "outputId": "a70602d8-d364-48f4-8de9-af5526455919", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.9967466791744841" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ], + "source": [ + "column_views_dict[\"density\"].get_metric(\"distribution\").mean.value" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GsCgZLMXJc4U" + }, + "source": [ + "And now let's check how accurate whylogs did store that `mean` calculation." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "RkUkf58wJc4U", + "outputId": "bb565093-a05e-42fc-f802-4b3f5c879b72", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "+------------------+\n", + "| avg(density)|\n", + "+------------------+\n", + "|0.9967466791744831|\n", + "+------------------+\n", + "\n" + ] + } + ], + "source": [ + "from pyspark.sql.functions import mean\n", + "spark_dataframe.select(mean(\"density\")).show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o_NspRo4Jc4U" + }, + "source": [ + "It is not the literal exact value, but it gets really close, right? That is because we are not extracting the exact information, but we are also **not sampling** the data. `whylogs` will look at **every data point** and *statistically* decide wether or not that data point is relevant to the final calculation.\n", + "\n", + "Is it just me or this is extremely powerful? Yes, it is.\n", + "\n", + "> \"Cool! But what can I do with a bunch of `ColumnProfileView`'s from my Dataset? I want to see everything together\n", + "\n", + "Well, you've come to the right place, because we will inspect the next method that does just that!" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "19nYwCzxJc4U" + }, + "outputs": [], + "source": [ + "from whylogs.api.pyspark.experimental import collect_dataset_profile_view\n", + "\n", + "dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xI6yLa-DJc4V" + }, + "source": [ + "Yes, that easy. You now have a `DatasetProfileView`. 
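If you prefer to keep working column by column, the dataset-level view can hand back the same per-column views. A small sketch, assuming `DatasetProfileView.get_column` is available in your installed whylogs version (it is not shown elsewhere in this notebook, so double-check against your release):

```python
# Hedged sketch: fetch a single column's profile from the dataset-level view
# and read the same distribution metric used earlier in this notebook.
# Assumes dataset_profile_view.get_column exists in your whylogs version.
density_view = dataset_profile_view.get_column("density")
print(density_view.get_metric("distribution").mean.value)
```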
As you might have seen from other example notebooks in our repo, you can turn this *lightweight* object into a pandas DataFrame, and visualize all the important metrics that we've profiled, like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "XZGmQnGDJc4V", + "outputId": "93c00a89-8de1-4155-f5dc-80d29b37d58e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 316 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " cardinality/est cardinality/lower_1 cardinality/upper_1 \\\n", + "column \n", + "alcohol 65.000010 65.000000 65.003256 \n", + "chlorides 153.000058 153.000000 153.007697 \n", + "citric acid 80.000016 80.000000 80.004010 \n", + "density 439.557368 433.943761 445.310933 \n", + "fixed acidity 96.000023 96.000000 96.004816 \n", + "\n", + " counts/inf counts/n counts/nan counts/null counts/true \\\n", + "column \n", + "alcohol 0 1599 0 0 0 \n", + "chlorides 0 1599 0 0 0 \n", + "citric acid 0 1599 0 0 0 \n", + "density 0 1599 0 0 0 \n", + "fixed acidity 0 1599 0 0 0 \n", + "\n", + " distribution/max distribution/mean ... type \\\n", + "column ... \n", + "alcohol 14.90000 10.422983 ... SummaryType.COLUMN \n", + "chlorides 0.61100 0.087467 ... SummaryType.COLUMN \n", + "citric acid 1.00000 0.270976 ... SummaryType.COLUMN \n", + "density 1.00369 0.996747 ... SummaryType.COLUMN \n", + "fixed acidity 15.90000 8.319637 ... SummaryType.COLUMN \n", + "\n", + " types/boolean types/fractional types/integral types/object \\\n", + "column \n", + "alcohol 0 1599 0 0 \n", + "chlorides 0 1599 0 0 \n", + "citric acid 0 1599 0 0 \n", + "density 0 1599 0 0 \n", + "fixed acidity 0 1599 0 0 \n", + "\n", + " types/string types/tensor frequent_items/frequent_strings \\\n", + "column \n", + "alcohol 0 0 NaN \n", + "chlorides 0 0 NaN \n", + "citric acid 0 0 NaN \n", + "density 0 0 NaN \n", + "fixed acidity 0 0 NaN \n", + "\n", + " ints/max ints/min \n", + "column \n", + "alcohol NaN NaN \n", + "chlorides NaN NaN \n", + "citric acid NaN NaN \n", + "density NaN NaN \n", + "fixed acidity NaN NaN \n", + "\n", + "[5 rows x 32 columns]" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
cardinality/estcardinality/lower_1cardinality/upper_1counts/infcounts/ncounts/nancounts/nullcounts/truedistribution/maxdistribution/mean...typetypes/booleantypes/fractionaltypes/integraltypes/objecttypes/stringtypes/tensorfrequent_items/frequent_stringsints/maxints/min
column
alcohol65.00001065.00000065.0032560159900014.9000010.422983...SummaryType.COLUMN015990000NaNNaNNaN
chlorides153.000058153.000000153.007697015990000.611000.087467...SummaryType.COLUMN015990000NaNNaNNaN
citric acid80.00001680.00000080.004010015990001.000000.270976...SummaryType.COLUMN015990000NaNNaNNaN
density439.557368433.943761445.310933015990001.003690.996747...SummaryType.COLUMN015990000NaNNaNNaN
fixed acidity96.00002396.00000096.0048160159900015.900008.319637...SummaryType.COLUMN015990000NaNNaNNaN
\n", + "

5 rows × 32 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe" + } + }, + "metadata": {}, + "execution_count": 14 + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "dataset_profile_view.to_pandas().head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4DsdTSFMJc4V" + }, + "source": [ + "## Persisting as a file\n", + "\n", + "After collecting profiles, it is a good practice to store them as a file. This will allow you to later on read them back, merge with future profiles and track how is your data behaving along the way." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uv5LsfqTJc4V" + }, + "outputs": [], + "source": [ + "dataset_profile_view.write(path=\"my_super_awesome_profile.bin\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvcl73JsJc4W" + }, + "source": [ + "And that's it, you have just written a profile generated with spark to your local environment! If you wish to upload to different locations, such as s3, whylabs or others, please make sure to check out our [other examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples) page.\n", + "\n", + "Hopefully this tutorial will help you get started to profile and observe your data behaviour in your Spark jobs with almost no friction :)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Periods in column names\n", + "\n", + "Spark interprets periods in column names as structured fields. This can cause problems when using Spark to profile dataframes created outside of Spark that have periods in their column names. For example:" + ], + "metadata": { + "id": "GubNHJy6LJDy" + } + }, + { + "cell_type": "code", + "source": [ + "pandas_df = pd.DataFrame({\"col1.mass\": [2.3, 4.1, 1.8], \"col1.volume\": [10.3, 8.7, 6.3]})\n", + "spark_df = spark.createDataFrame(pandas_df)\n", + "spark_df.show()\n", + "collect_column_profile_views(spark_df)" + ], + "metadata": { + "id": "pIOeZewIM9bs", + "outputId": "a6563674-45ac-468c-d7cf-30b9637284d7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 484 + } + }, + "execution_count": 21, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "+---------+-----------+\n", + "|col1.mass|col1.volume|\n", + "+---------+-----------+\n", + "| 2.3| 10.3|\n", + "| 4.1| 8.7|\n", + "| 1.8| 6.3|\n", + "+---------+-----------+\n", + "\n" + ] + }, + { + "output_type": "error", + "ename": "AnalysisException", + "evalue": "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col1`.`mass` cannot be resolved. Did you mean one of the following? 
[`col1`.`mass`, `col1`.`volume`].", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mspark_df\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpandas_df\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mspark_df\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshow\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mcollect_column_profile_views\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mspark_df\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/whylogs/api/pyspark/experimental/profiler.py\u001b[0m in \u001b[0;36mcollect_column_profile_views\u001b[0;34m(input_df, schema)\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0mwhylogs_pandas_map_profiler_with_schema\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpartial\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwhylogs_pandas_map_profiler\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 70\u001b[0;31m \u001b[0mprofile_bytes_df\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minput_df_arrays\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmapInPandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwhylogs_pandas_map_profiler_with_schema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcp\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# type: ignore\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 71\u001b[0m column_profiles = profile_bytes_df.groupby(COL_NAME_FIELD).applyInPandas( # linebreak\n\u001b[1;32m 72\u001b[0m \u001b[0mcolumn_profile_bytes_aggregator\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/sql/pandas/map_ops.py\u001b[0m in \u001b[0;36mmapInPandas\u001b[0;34m(self, func, schema, barrier)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturnType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunctionType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPythonEvalType\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSQL_MAP_PANDAS_ITER_UDF\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m ) # type: ignore[call-overload]\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mudf_column\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcol\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0mjdf\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmapInPandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mudf_column\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexpr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbarrier\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msparkSession\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/sql/pandas/map_ops.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 109\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturnType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunctionType\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mPythonEvalType\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSQL_MAP_PANDAS_ITER_UDF\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 110\u001b[0m ) # type: ignore[call-overload]\n\u001b[0;32m--> 111\u001b[0;31m \u001b[0mudf_column\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mudf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mcol\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mcol\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 112\u001b[0m \u001b[0mjdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmapInPandas\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mudf_column\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexpr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbarrier\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 113\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mDataFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjdf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msparkSession\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/sql/dataframe.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 3078\u001b[0m \"\"\"\n\u001b[1;32m 3079\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3080\u001b[0;31m \u001b[0mjc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3081\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mColumn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mjc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3082\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mColumn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1320\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1321\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1322\u001b[0;31m return_value = get_return_value(\n\u001b[0m\u001b[1;32m 1323\u001b[0m answer, self.gateway_client, self.target_id, self.name)\n\u001b[1;32m 1324\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/usr/local/lib/python3.10/dist-packages/pyspark/errors/exceptions/captured.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 183\u001b[0m \u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 184\u001b[0m \u001b[0;31m# JVM exception message.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 185\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mconverted\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 186\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 187\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mAnalysisException\u001b[0m: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `col1`.`mass` cannot be resolved. Did you mean one of the following? [`col1`.`mass`, `col1`.`volume`]." + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "The error above is caused by Spark misinterpreting the periods in the column names. We can work around this limitation by replacing any periods in the column names:" + ], + "metadata": { + "id": "oeiVK0iaQ9NG" + } + }, + { + "cell_type": "code", + "source": [ + "renamed_df = spark_df.toDF(*(c.replace('.', '_') for c in spark_df.columns))\n", + "renamed_df.show()\n", + "collect_column_profile_views(renamed_df)" + ], + "metadata": { + "id": "ht1Ic5DGRd1G", + "outputId": "73a7cab9-90ee-4f34-d812-70120fd0a01b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": 22, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "+---------+-----------+\n", + "|col1_mass|col1_volume|\n", + "+---------+-----------+\n", + "| 2.3| 10.3|\n", + "| 4.1| 8.7|\n", + "| 1.8| 6.3|\n", + "+---------+-----------+\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'col1_mass': ,\n", + " 'col1_volume': }" + ] + }, + "metadata": {}, + "execution_count": 22 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "âš  Be sure that any column names specified in whylogs schemas or UDFs match the renamed columns if you use this approach." 
+ ], + "metadata": { + "id": "I3-RSh1BR8oP" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CUSeRZehJc4W" + }, + "source": [ + "## Important note\n", + "\n", + "As you might have seen from the imports, currently this pyspark implementation is the **experimental** phase. We ran some benchmark ourselves with it, and for the sake of example, a `90Gb` dataset with 80M rows could be profiled in under 3 minutes! Cool, right? But we still want more users to try this on their own, see if there are places to be improved and give us feedback before we make it officially **the** spark module here.\n", + "Please, feel free to reach out to our [community Slack](https://communityinviter.com/apps/whylabs-community/rsqrd-ai-community) and interact with us there. We will love to hear from you :)" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "16a10773934acde374a1cd808bcd53b1085f60e17ec18f4c0c26564dd890a5a0" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.14" + }, + "colab": { + "provenance": [] } - ], - "source": [ - "import pandas as pd \n", - "\n", - "dataset_profile_view.to_pandas().head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Persisting as a file\n", - "\n", - "After collecting profiles, it is a good practice to store them as a file. This will allow you to later on read them back, merge with future profiles and track how is your data behaving along the way." - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "dataset_profile_view.write(path=\"my_super_awesome_profile.bin\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And that's it, you have just written a profile generated with spark to your local environment! If you wish to upload to different locations, such as s3, whylabs or others, please make sure to check out our [other examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples) page.\n", - "\n", - "Hopefully this tutorial will help you get started to profile and observe your data behaviour in your Spark jobs with almost no friction :)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Important note\n", - "\n", - "As you might have seen from the imports, currently this pyspark implementation is the **experimental** phase. We ran some benchmark ourselves with it, and for the sake of example, a `90Gb` dataset with 80M rows could be profiled in under 3 minutes! Cool, right? But we still want more users to try this on their own, see if there are places to be improved and give us feedback before we make it officially **the** spark module here. \n", - "Please, feel free to reach out to our [community Slack](https://communityinviter.com/apps/whylabs-community/rsqrd-ai-community) and interact with us there. 
We will love to hear from you :)" - ] - } - ], - "metadata": { - "interpreter": { - "hash": "16a10773934acde374a1cd808bcd53b1085f60e17ec18f4c0c26564dd890a5a0" - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.14" - } - }, - "nbformat": 4, - "nbformat_minor": 2 + "nbformat": 4, + "nbformat_minor": 0 }
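As a closing note on the "Persisting as a file" section above: the text mentions reading profiles back and merging them with future profiles, but the notebook never shows a merge. The sketch below assumes `DatasetProfileView.merge` behaves as a pairwise merge that returns a new view (verify against your installed whylogs version); everything else reuses calls already shown in this notebook:

```python
# Hedged sketch: profile two slices of the data separately and merge the views
# before writing a single combined profile. Assumes DatasetProfileView.merge
# exists in your whylogs version; check the whylogs documentation.
low_quality_df = spark_dataframe.filter("quality <= 5")
high_quality_df = spark_dataframe.filter("quality > 5")

profile_low = collect_dataset_profile_view(input_df=low_quality_df)
profile_high = collect_dataset_profile_view(input_df=high_quality_df)

merged_view = profile_low.merge(profile_high)
merged_view.write(path="merged_profile.bin")
```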