Spark Table Stats

Spark Table Stats computes per-column summary statistics efficiently. The current design aims to generate the statistics below in only two passes over the data, using repartitionAndSortWithinPartitions with a custom partitioner and foreachPartition with custom accumulators.

Summary Statistics By Column:

  • Sum
  • Average
  • Standard Deviation
  • Max
  • Min
  • Cardinality (The number of records of frequency / total records)
  • Count Nulls
  • Count Empties
  • Top (K) Values by Frequency (NOT COMPLETED)
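
Most of the statistics above can be gathered in a single scan if each partition keeps a small mergeable accumulator per column. The following is a minimal sketch of that idea in plain Python, not the project's actual accumulator: the `ColumnStats` class and its field names are hypothetical, and it assumes the column holds numeric values, `None` for nulls, and blank strings for empties.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical per-column accumulator; all names are illustrative."""
    count: int = 0              # numeric (non-null, non-empty) values seen
    total: float = 0.0          # running sum
    sum_sq: float = 0.0         # running sum of squares
    minimum: float = math.inf
    maximum: float = -math.inf
    nulls: int = 0
    empties: int = 0

    def add(self, value):
        """Fold one cell into the accumulator."""
        if value is None:
            self.nulls += 1
        elif isinstance(value, str) and value.strip() == "":
            self.empties += 1
        else:
            x = float(value)  # assumption: remaining values are numeric
            self.count += 1
            self.total += x
            self.sum_sq += x * x
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)
        return self

    def merge(self, other):
        """Combine the accumulator of another partition into this one."""
        self.count += other.count
        self.total += other.total
        self.sum_sq += other.sum_sq
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)
        self.nulls += other.nulls
        self.empties += other.empties
        return self

    @property
    def average(self):
        return self.total / self.count if self.count else None

    @property
    def std_dev(self):
        """Population standard deviation from the running sums."""
        if not self.count:
            return None
        mean = self.total / self.count
        variance = self.sum_sq / self.count - mean * mean
        return math.sqrt(max(variance, 0.0))  # guard against rounding below 0
```

Because `merge` is associative and commutative, partial results from different partitions can be combined in any order. The sum-of-squares formula is simple but can lose precision on large values; a Welford-style update would be a more stable alternative.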

TODO:

  • Top(K) - Evaluate the opportunity to use combineByKey, creating an empty min-priority queue for each key. Merge a value into the queue while its size is < K. Once the size is >= K, merge a value only if it exceeds the smallest element; if so, add it and remove the smallest element.
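
The Top(K) idea above can be sketched as the three functions that combineByKey expects. This is an illustration only, written in plain Python with `heapq` so it runs without Spark. It assumes each value is a hypothetical `(count, value)` pair keyed by column name and that `K` is at least 1; the `simulate` helper stands in for what Spark would do across partitions.

```python
import heapq

K = 3  # hypothetical queue size; assumed to be >= 1


def create_combiner(item):
    """Start a min-heap for a key from its first item."""
    return [item]


def merge_value(heap, item):
    """Merge one item into a heap that holds at most K items."""
    if len(heap) < K:
        heapq.heappush(heap, item)
    elif item > heap[0]:
        # The item beats the current smallest: swap it in.
        heapq.heapreplace(heap, item)
    return heap


def merge_combiners(left, right):
    """Merge the heap built on one partition into another."""
    for item in right:
        merge_value(left, item)
    return left


def simulate(partitions):
    """Local stand-in for combineByKey over a list of partitions."""
    result = {}
    for pairs in partitions:
        local = {}
        for key, item in pairs:
            if key in local:
                local[key] = merge_value(local[key], item)
            else:
                local[key] = create_combiner(item)
        for key, heap in local.items():
            if key in result:
                result[key] = merge_combiners(result[key], heap)
            else:
                result[key] = heap
    # Report each key's items from most to least frequent.
    return {key: sorted(heap, reverse=True) for key, heap in result.items()}
```

With PySpark, the same three functions could presumably be passed as `rdd.combineByKey(create_combiner, merge_value, merge_combiners)` on an RDD of `(column, (count, value))` pairs. Each heap never grows beyond K items, so the data shuffled per key stays small.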

Collaborators:

Eric, Roderick, Brad