ydata-profiling's primary goal is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. Like the handy pandas df.describe() function, ydata-profiling delivers an extended analysis of a DataFrame while allowing the data analysis to be exported in different formats such as HTML and JSON.
The package outputs a simple and digested analysis of a dataset, including time-series and text.
Thank you for using and following pandas-profiling developments. We are thrilled to announce a new feature: from version 4.0.0 onwards, Spark is part of the Data Profiling family.
With its introduction came the need for a new name, one that decouples the concept of profiling from pandas DataFrames: ydata-profiling!
But fear not, pip install pandas-profiling will remain valid for a while, and we will keep investing in growing the best open-source solution for data profiling, so you can use it for even more use cases.
- Type inference: automatic detection of columns' data types (Categorical, Numerical, Date, etc.)
- Warnings: A summary of the problems/challenges in the data that you might need to work on (missing data, inaccuracies, skewness, etc.)
- Univariate analysis: including descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
- Multivariate analysis: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise interactions between variables
- Time-Series: including different statistical information relative to time-dependent data such as auto-correlation and seasonality, along with ACF and PACF plots
- Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
- Compare datasets: one-line solution to enable a fast and complete report on the comparison of datasets
- Flexible output formats: all analyses can be exported to an HTML report that can be easily shared with different parties, as JSON for easy integration in automated systems, and as a widget in a Jupyter Notebook.
The report contains three additional sections:
- Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
- Alerts: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
- Reproduction: technical details about the analysis (time, version and configuration)
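To make the Overview figures concrete, the sketch below computes comparable numbers with plain pandas. It is only an illustration of what those figures mean, not the package's own implementation; the DataFrame contents and the dictionary keys are hypothetical.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset: one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 2.0],
        "b": ["x", "y", "z", "y"],
    }
)

# Hand-rolled equivalents of the Overview figures (key names are illustrative).
overview = {
    "n_records": len(df),
    "n_variables": df.shape[1],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
    "memory_bytes": int(df.memory_usage(deep=True).sum()),
}

print(overview)
```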
- Want to scale? Check the latest release with Spark support!
- Looking for how you can do an EDA for Time-Series? Check this blogpost.
- You want to compare 2 datasets and get a report? Check this blogpost.
Spark support has been released, but we are always looking for an extra pair of hands. Check current work in progress!
ydata-profiling supports a variety of use cases. The documentation includes guides, tips and tricks for tackling them:
Use case | Description |
---|---|
Comparing datasets | Comparing multiple versions of the same dataset |
Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code |
Profiling large datasets | Tips on how to prepare data and configure ydata-profiling for working with large datasets |
Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset |
Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries |
Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations |
Start by loading your pandas DataFrame as you normally would, e.g. by using:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
To generate the standard profiling report, merely run:
profile = ProfileReport(df, title="Profiling Report")
There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.
The widgets interface displays the report as a set of widgets. In a Jupyter Notebook, run:
profile.to_widgets()
The HTML report can be directly embedded in a cell in a similar fashion:
profile.to_notebook_iframe()
To generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file("your_report.html")
Alternatively, the report's data can be obtained as JSON, either as a string or as a file:
# As a JSON string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
For standard formatted CSV files (which can be read directly by pandas without additional settings), the pandas_profiling executable can be used in the command line. The example below generates a report named Example Profiling Report, using a configuration file called default.yaml, in the file report.html by processing a data.csv dataset.
pandas_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html
Additional details on the CLI are available in the documentation.
The following example reports showcase the capabilities of the package across a wide range of datasets and data types:
- Census Income (US Adult Census data relating income with other demographic properties)
- NASA Meteorites (comprehensive set of meteorite landing - object properties and locations)
- Titanic (the "Wonderwall" of datasets)
- NZA (open data from the Dutch Healthcare Authority)
- Stata Auto (1978 Automobile data)
- Colors (a simple colors dataset)
- Vektis (Vektis Dutch Healthcare data)
- UCI Bank Dataset (marketing dataset from a bank)
- Russian Vocabulary (100 most common Russian words, showcasing unicode text analysis)
- Website Inaccessibility (website accessibility analysis, showcasing support for URL data)
- Orange prices and Coal prices (simple pricing evolution datasets, showcasing the theming options)
- USA Air Quality (Time-series air quality dataset EDA example)
- HCC (Open dataset from healthcare, showcasing compare between two sets of data, before and after preprocessing)
Additional details, including information about widget support, are available in the documentation.
You can install using the pip package manager by running:
pip install -U ydata-profiling
The package declares "extras", sets of additional dependencies:
- [notebook]: support for rendering the report in Jupyter notebook widgets.
- [unicode]: support for more detailed Unicode analysis, at the expense of additional disk space.
- [pyspark]: support for pyspark for big dataset analysis.
Install these with e.g.
pip install -U ydata-profiling[notebook,unicode,pyspark]
You can install using the conda package manager by running:
conda install -c conda-forge ydata-profiling
Download the source code by cloning the repository or by clicking on Download ZIP to download the latest stable version.
Install it by navigating to the proper directory and running:
pip install -e .
The profiling report is written in HTML and CSS, which means a modern browser is required.
You need Python 3 to run the package. Other dependencies can be found in the requirements files:
Filename | Requirements |
---|---|
requirements.txt | Package requirements |
requirements-dev.txt | Requirements for development |
requirements-test.txt | Requirements for testing |
setup.py | Requirements for widgets etc. |
To maximize its usefulness in real-world contexts, ydata-profiling has a set of implicit and explicit integrations with a variety of other actors in the Data Science ecosystem:
Integration type | Description |
---|---|
Other DataFrame libraries | How to compute the profiling of data stored in libraries other than pandas |
Great Expectations | Generating Great Expectations expectations suites directly from a profiling report |
Interactive applications | Embedding profiling reports in Streamlit, Dash or Panel applications |
Pipelines | Integration with DAG workflow execution tools like Airflow or Kedro |
Cloud services | Using ydata-profiling in hosted computation services like Lambda, Google Cloud or Kaggle |
IDEs | Using ydata-profiling directly from integrated development environments such as PyCharm |
Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:
- Stack Overflow: ideal for asking questions on how to use the package
- GitHub Issues: bugs, proposals for changes, feature requests
- Discord: ideal for project discussions, asking questions, collaborations, and general chat
- Email: project collaborations or sponsoring
Before reporting an issue on GitHub, check out Common Issues.
Learn how to get involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.