Snippets of Spark/Scala code used for handy feature engineering.
Many seem simple, but they save a lot of lines of code and make the codebase more readable.
We standardize our datasets to follow a template of:
- groupByCol: the column around which the features are built,
- eventDateCol: column indicating the date of an event of interest,
- valueCol: column with values associated with an event,
- pivotCol or pivotCols: the column or columns on which we build the features' functional or domain granularity.
In the example used throughout this code, we compute features describing a client's card usage:
- groupByCol = id_client,
- eventDateCol = timestamp_interaction,
- valueCol = montant_dhs,
- pivotCols = lieu_interaction and type_interaction.
Given the following dataset:
id_client | timestamp_interaction | montant_dhs | lieu_interaction | type_interaction |
---|---|---|---|---|
sgjuL0CYJf | 2019-01-01T09:01:25.000 | 1500 | GAB MAROC | retrait_gab |
17AbPYC2M7 | 2019-02-01T01:08:23.000 | 1207.10 | MAROC | paiement_tpe |
sgjuL0CYJf | 2019-01-03T07:01:21.000 | 200 | GAB MAROC | retrait_gab |
sgjuL0CYJf | 2019-04-01T01:06:19.000 | 0 | GAB MAROC | consultation_solde |
17AbPYC2M7 | 2019-01-05T05:01:17.000 | 500 | GAB MAROC | retrait_gab |
17AbPYC2M7 | 2019-06-01T01:04:15.000 | 299 | INTERNATIONAL | paiement_tpe |
17AbPYC2M7 | 2019-01-07T03:01:13.000 | 0 | MAROC | pin_errone |
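For readers who want to run the snippets end to end, here is one way to build this sample dataset (the variable name `raw` and the Spark session boilerplate are illustrative, not part of the library):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("card-usage-features").getOrCreate()
import spark.implicits._

// Sample dataset following the template above
val raw = Seq(
  ("sgjuL0CYJf", "2019-01-01T09:01:25.000", 1500.00, "GAB MAROC",     "retrait_gab"),
  ("17AbPYC2M7", "2019-02-01T01:08:23.000", 1207.10, "MAROC",         "paiement_tpe"),
  ("sgjuL0CYJf", "2019-01-03T07:01:21.000", 200.00,  "GAB MAROC",     "retrait_gab"),
  ("sgjuL0CYJf", "2019-04-01T01:06:19.000", 0.00,    "GAB MAROC",     "consultation_solde"),
  ("17AbPYC2M7", "2019-01-05T05:01:17.000", 500.00,  "GAB MAROC",     "retrait_gab"),
  ("17AbPYC2M7", "2019-06-01T01:04:15.000", 299.00,  "INTERNATIONAL", "paiement_tpe"),
  ("17AbPYC2M7", "2019-01-07T03:01:13.000", 0.00,    "MAROC",         "pin_errone")
).toDF("id_client", "timestamp_interaction", "montant_dhs", "lieu_interaction", "type_interaction")
```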
We apply basic transformations: a `basedt` column holds the base date from which the milestone windows look back, and a `pivot` column concatenates the two pivot columns into a single feature granularity:

```scala
import org.apache.spark.sql.functions.{col, concat_ws, lit}

val df = raw
  .withColumn("basedt", lit("2019-07-01"))
  .withColumn("pivot", concat_ws("_", col("lieu_interaction"), col("type_interaction")))
```
This Spark/Scala code builds features (which could later be pivoted):
```scala
val result = df.getCountEventsWithinMilestonesFeatures(
  "timestamp_interaction", "basedt", "id_client", "pivot", "", Seq(1, 3, 6))
```
Counts of events for each month milestone. In the example below, we count, for each client and pivot value, the number of times a card usage (i.e. an event) was recorded within each month milestone before the base date.
id_client | pivot | nb_1m | nb_3m | nb_6m |
---|---|---|---|---|
17AbPYC2M7 | MAROC_pin_errone | 0 | 0 | 1 |
17AbPYC2M7 | MAROC_paiement_tpe | 0 | 0 | 1 |
17AbPYC2M7 | INTERNATIONAL_paiement_tpe | 1 | 1 | 1 |
17AbPYC2M7 | GAB_MAROC_retrait_gab | 0 | 0 | 1 |
sgjuL0CYJf | GAB_MAROC_retrait_gab | 0 | 0 | 2 |
sgjuL0CYJf | GAB_MAROC_consultation_solde | 0 | 1 | 1 |
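getCountEventsWithinMilestonesFeatures is provided by the repository as an extension method on DataFrame. Its actual implementation lives in the repo; the following is only a minimal sketch of the underlying pattern (the function name, the `age_months` helper column, and the closed window boundaries are our assumptions):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: count events per (group, pivot) pair that fall within the last
// k months of the base date, one aggregate column per milestone k.
def countEventsWithinMilestones(df: DataFrame,
                                eventDateCol: String, baseDateCol: String,
                                groupByCol: String, pivotCol: String,
                                milestones: Seq[Int]): DataFrame = {
  // Age of each event in months, relative to the base date
  val withAge = df.withColumn("age_months",
    months_between(to_date(col(baseDateCol)), to_timestamp(col(eventDateCol))))
  val aggs = milestones.map { k =>
    // Assumed convention: an event counts toward milestone k when 0 <= age <= k
    sum(when(col("age_months").between(0, k), 1).otherwise(0)).alias(s"nb_${k}m")
  }
  withAge.groupBy(col(groupByCol), col(pivotCol)).agg(aggs.head, aggs.tail: _*)
}
```

Conditional aggregation keeps everything in a single pass: each milestone becomes one aggregate expression instead of one filtered pass over the data.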
```scala
val result = df.getCountsSumsValuesWithinMilestonesFeatures(
  "timestamp_interaction", "basedt", "montant_dhs", "id_client", "pivot", "mt", Seq(1, 3, 6))
```
Counts and sums of a value column associated with an event. In the example below, for each month milestone we count the number of times a card usage (i.e. an event) was recorded for each client and sum the amounts associated with the usage type.
id_client | pivot | nb_mt_1m | nb_mt_3m | nb_mt_6m | sum_mt_1m | sum_mt_3m | sum_mt_6m |
---|---|---|---|---|---|---|---|
17AbPYC2M7 | MAROC_pin_errone | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 |
17AbPYC2M7 | MAROC_paiement_tpe | 0 | 0 | 1 | 0.0 | 0.0 | 1207.1 |
17AbPYC2M7 | INTERNATIONAL_paiement_tpe | 1 | 1 | 1 | 299.0 | 299.0 | 299.0 |
17AbPYC2M7 | GAB_MAROC_retrait_gab | 0 | 0 | 1 | 0.0 | 0.0 | 500.0 |
sgjuL0CYJf | GAB_MAROC_retrait_gab | 0 | 0 | 2 | 0.0 | 0.0 | 1700.0 |
sgjuL0CYJf | GAB_MAROC_consultation_solde | 0 | 1 | 1 | 0.0 | 0.0 | 0.0 |
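A similarly hedged sketch of the counts-plus-sums variant, extending the counting pattern above (again, the names and window boundaries are assumptions):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: for each milestone k, emit both a count (nb_<prefix>_<k>m) and a
// sum of the value column (sum_<prefix>_<k>m) over the last k months.
def countsSumsWithinMilestones(df: DataFrame,
                               eventDateCol: String, baseDateCol: String,
                               valueCol: String, groupByCol: String,
                               pivotCol: String, prefix: String,
                               milestones: Seq[Int]): DataFrame = {
  val withAge = df.withColumn("age_months",
    months_between(to_date(col(baseDateCol)), to_timestamp(col(eventDateCol))))
  val aggs = milestones.flatMap { k =>
    val inWindow = col("age_months").between(0, k)
    Seq(sum(when(inWindow, 1).otherwise(0)).alias(s"nb_${prefix}_${k}m"),
        sum(when(inWindow, col(valueCol)).otherwise(0)).alias(s"sum_${prefix}_${k}m"))
  }
  withAge.groupBy(col(groupByCol), col(pivotCol)).agg(aggs.head, aggs.tail: _*)
}
```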
```scala
val result = df.getRatioFeatures(
  "timestamp_interaction", "basedt", "montant_dhs", "id_client", "pivot", Seq((1, 6, 6, 12)))
```
Ratios capture trend features. In the example below, we compute the evolution of the amounts between the 1-6 month and the 6-12 month periods before the base date.
id_client | pivot | ratio_1_6_6_12 |
---|---|---|
17AbPYC2M7 | MAROC_pin_errone | 0.0 |
17AbPYC2M7 | MAROC_paiement_tpe | 7.096804156525518 |
17AbPYC2M7 | INTERNATIONAL_paiement_tpe | 0.0 |
17AbPYC2M7 | GAB_MAROC_retrait_gab | 6.2166061010848646 |
sgjuL0CYJf | GAB_MAROC_retrait_gab | 5.303304908059076 |
sgjuL0CYJf | GAB_MAROC_consultation_solde | 0.0 |
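The exact ratio definition lives in the repo; the values above are consistent with a smoothed log ratio, log((1 + sum over the recent window) / (1 + sum over the older window)), with open window boundaries. A sketch under those assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: trend feature comparing a recent window (a..b months before the
// base date) with an older window (c..d months), smoothed with +1 so an
// empty window cannot cause a division by zero (two empty windows give 0).
def ratioFeatures(df: DataFrame,
                  eventDateCol: String, baseDateCol: String, valueCol: String,
                  groupByCol: String, pivotCol: String,
                  windows: Seq[(Int, Int, Int, Int)]): DataFrame = {
  val withAge = df.withColumn("age_months",
    months_between(to_date(col(baseDateCol)), to_timestamp(col(eventDateCol))))
  val aggs = windows.map { case (a, b, c, d) =>
    // Open boundaries assumed from the example values
    val recent = sum(when(col("age_months") > a && col("age_months") < b, col(valueCol)).otherwise(0))
    val older  = sum(when(col("age_months") > c && col("age_months") < d, col(valueCol)).otherwise(0))
    log((recent + 1) / (older + 1)).alias(s"ratio_${a}_${b}_${c}_${d}")
  }
  withAge.groupBy(col(groupByCol), col(pivotCol)).agg(aggs.head, aggs.tail: _*)
}
```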
More information can be found at https://issam.ma/jekyll/update/2020/03/23/datacience-feature-engineering.html (in French).