Pandas-profiling aims to ease exploratory data analysis for structured datasets, including time series. Our focus is to provide users with useful and robust statistics for the datasets encountered in industry, academia and elsewhere. Pandas-profiling is open source and encourages contributions from passionate community members.
In line with our aim, we identify the following themes:
-
Exploratory data analysis: The core of the package is summarizing a dataset by its main characteristics, complemented with warnings on data issues and visualisations.
Suggestions for contribution: Extend support to more data types (think of paths, locations or GPS coordinates, and ordinal data types), text data (e.g. encoding, vocabulary size, spelling errors, language detection), time series analysis, or even images (e.g. dimensions, EXIF).
Related: #7, #129, #190, #204 or create one.
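As an illustration of the text data suggestion above, here is a minimal sketch of a few statistics that could be computed for a column of free text, such as vocabulary size. The function name `summarize_text_column` and the keys it returns are hypothetical and not part of pandas-profiling; the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def summarize_text_column(values: Iterable[Optional[str]]) -> Dict[str, float]:
    """Compute a few illustrative statistics for a column of free text.

    Hypothetical helper for illustration; not part of pandas-profiling.
    """
    # Skip missing values, represented here as None.
    texts = [v for v in values if v is not None]
    # Naive lowercased whitespace tokenization; a real implementation
    # would likely need language-aware tokenization.
    tokens = [tok for text in texts for tok in text.lower().split()]
    vocabulary = Counter(tokens)
    return {
        "n_values": len(texts),
        "n_tokens": len(tokens),
        "vocabulary_size": len(vocabulary),
        "mean_length": sum(len(t) for t in texts) / len(texts) if texts else 0.0,
    }


summary = summarize_text_column(["the quick fox", "The lazy dog", None])
print(summary)
```

A contribution along these lines would also need to decide how such statistics are reported and which warnings they should trigger.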
-
Stability, Performance and Restricted environment compatibility: Data exploration takes place in all kinds of conditions, from the latest machine learning platforms with enormous datasets to managed environments in large corporations. pandas-profiling helps analysts, researchers and engineers alike in these cases. We do this by fixing bugs, improving performance on big datasets and adding environment compatibility.
Suggestions for contribution (Performance): Perform concurrency analysis or profile execution times and use the insights gained to improve performance (e.g. multiprocessing, Cython, Numba), or test the performance of pandas-profiling with big datasets and the data formats commonly used for them (such as Parquet).
Suggestions for contribution (Stability): Either review the code and add tests, or watch the issues page and the Stack Overflow tag to find current issues.
Related: #98, #122 or create one.
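One possible starting point for profiling execution times is Python's built-in `cProfile` module. The sketch below profiles a stand-in workload; `describe_column` and `describe_table` are hypothetical placeholders for the actual report generation and are not part of pandas-profiling.

```python
import cProfile
import io
import pstats
import random
import statistics


def describe_column(values):
    """Stand-in for per-column summary work (hypothetical, for illustration)."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),
    }


def describe_table(columns):
    """Summarize every column; this is the call being profiled."""
    return {name: describe_column(values) for name, values in columns.items()}


# Synthetic data: five columns of 10,000 random floats each.
random.seed(0)
columns = {f"col_{i}": [random.random() for _ in range(10_000)] for i in range(5)}

profiler = cProfile.Profile()
profiler.enable()
result = describe_table(columns)
profiler.disable()

# Render the ten most expensive calls by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the high-level functions that dominate the run, which is usually where concurrency or compiled code would pay off first.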
-
Interaction, presentation and user experience: As pandas-profiling eases exploratory data analysis, working with the package should reflect that. Interaction and user experience play a central role in working with the package. Working on interactive and static features is possible through the modular nature of the package: the user can configure which features to use.
Suggestions for contribution (interactivity): Interactivity allows for more user-friendly applications, including but not limited to on-demand analysis (don't compute what you don't want to see) and interactive histograms and correlations. This is ideal for smaller datasets, where we can compute this on the fly. ipywidgets would be a great place to start (e.g. a widget-based view).
Suggestions for contribution (presentation): Support forms of distribution other than HTML (for example PDF, or packaged as a GUI application via PyQt). Users should be able to share reports (improve the size of labels in graphs, add explanations to correlation matrices and allow for styling/branding).
Related: #161, #175, #191 or create one.
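The idea of on-demand analysis ("don't compute what you don't want to see") can be sketched with lazily evaluated properties. The class `LazyColumnReport` below is hypothetical and is not the pandas-profiling API; it only illustrates deferring each statistic until it is first requested and caching the result afterwards.

```python
from functools import cached_property
from typing import Dict, List


class LazyColumnReport:
    """Compute each statistic only when it is first requested.

    Hypothetical class for illustration; not the pandas-profiling API.
    """

    def __init__(self, values: List[float]) -> None:
        self.values = values
        self.computed: List[str] = []  # records which statistics were computed

    @cached_property
    def histogram(self) -> Dict[int, int]:
        self.computed.append("histogram")
        counts: Dict[int, int] = {}
        for v in self.values:
            bucket = int(v)  # unit-width bins, truncating toward zero
            counts[bucket] = counts.get(bucket, 0) + 1
        return counts

    @cached_property
    def mean(self) -> float:
        self.computed.append("mean")
        return sum(self.values) / len(self.values)


report = LazyColumnReport([0.5, 1.2, 1.8, 3.1])
print(report.mean)      # only the mean is computed here
print(report.computed)  # the histogram has not been touched yet
```

In a widget-based view, each tab or panel could trigger its own property in this way, so that opening the report stays cheap even when some analyses are expensive.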
-
Community: The success of this package demonstrates the power of sharing and working together. You are welcome as part of this community.
Suggestions for contribution: Let us know in our community if this package is of value to you. We are interested in how you use pandas-profiling in your work.
Related: #87 or create one.
-
Machine learning: pandas-profiling is not a machine learning package, even though many of our users use EDA as a step prior to developing their models. Our focus lies in exploratory data analysis. Any functionality that enables machine learning applications through more effective data profiling is welcome.
Related: #124, #173, #198 or create one.
-
Ensure the bug was not already reported by searching on GitHub under Issues.
-
If you're unable to find an open issue addressing the problem, open a new one. If possible, use the relevant bug report templates to create the issue.
-
Open a new GitHub pull request with the patch.
-
Ensure the PR description clearly describes the problem and solution. Include the relevant issue number if applicable.
We would like to thank everyone who has helped get us to where we are now.
See the Contributor Graph