I am encountering performance issues when generating a profiling report for more than 200 columns across 5 million records. I am applying almost all of the available metrics to generate the report: DataType, Entropy, Minimum, Maximum, Sum, StandardDeviation, Mean, MaxLength, MinLength, Histogram, Completeness, Distinctness, UniqueValueRatio, Uniqueness, CountDistinct, and Correlation. I am trying to generate a report similar to ydata-profiling (https://github.com/ydataai/ydata-profiling).
The job has been running for over 3 hours despite attempts to tune the Spark configuration. The logs show that each metric is calculated sequentially, and this sequential computation is what prolongs the runtime. Is it possible to parallelize this operation to improve efficiency?
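For context, this is roughly how the metrics are being applied, as a minimal sketch assuming Deequ's AnalysisRunner API; the `profileAllColumns` helper and the per-column analyzer list are illustrative only (numeric- and string-specific analyzers such as Minimum, Maximum, Mean, Sum, StandardDeviation, MaxLength, MinLength, and Correlation would be restricted to the appropriate columns):

```scala
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.metrics.Metric
import org.apache.spark.sql.DataFrame

// Register one set of analyzers per column and compute them in a single run.
// Only the type-agnostic analyzers are shown here for brevity.
def profileAllColumns(df: DataFrame): AnalyzerContext = {
  val perColumn: Seq[Analyzer[_ <: State[_], Metric[_]]] =
    df.columns.toSeq.flatMap { col =>
      Seq(
        Completeness(col),
        Distinctness(col),
        Uniqueness(col),
        UniqueValueRatio(col),
        CountDistinct(col),
        Entropy(col),
        Histogram(col),
        DataType(col)
      )
    }

  // All analyzers go into one analysis, but the expensive ones still show up
  // in the Spark UI as one job per column, executed one after another.
  perColumn
    .foldLeft(AnalysisRunner.onData(df)) { (builder, analyzer) =>
      builder.addAnalyzer(analyzer)
    }
    .run()
}
```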
Hi @rdsharma26, I did some more testing. From analyzing the Spark execution tasks, I believe the performance issue is that for metrics such as CountDistinct and Histogram, the calculation is done column by column in a sequential manner, so the more columns the DataFrame has, the longer the job runs. Parallelizing these per-column calculations would improve efficiency.
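As a possible workaround until the per-column computation is parallelized inside the library, here is a rough sketch of driver-side parallelism: split the columns into groups and submit one Deequ run per group from separate threads, so Spark can schedule those jobs concurrently (spark.scheduler.mode=FAIR helps them share executors). The `profileInParallel` helper, group size, thread-pool size, and analyzer subset below are placeholders, not part of the Deequ API:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.metrics.Metric
import org.apache.spark.sql.DataFrame

// Submit one analysis per group of columns concurrently from the driver,
// so the per-column Spark jobs can overlap instead of running back to back.
def profileInParallel(df: DataFrame, groupSize: Int = 20): Seq[AnalyzerContext] = {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  val cached = df.cache() // reuse the scanned data across the concurrent runs

  val futures = cached.columns.grouped(groupSize).toSeq.map { cols =>
    Future {
      val analyzers: Seq[Analyzer[_ <: State[_], Metric[_]]] =
        cols.toSeq.flatMap(c => Seq(Completeness(c), CountDistinct(c), Histogram(c)))

      analyzers
        .foldLeft(AnalysisRunner.onData(cached)) { (builder, analyzer) =>
          builder.addAnalyzer(analyzer)
        }
        .run()
    }
  }

  Await.result(Future.sequence(futures), Duration.Inf)
}
```

This only overlaps the per-group jobs from the driver side; it does not change how each individual analyzer is computed internally.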