Author: Carlos Hernandez Alvarado
Email: [email protected]
University: University of Puerto Rico at Mayaguez
This project is part of the Venmito Data Engineering Challenge, where the goal was to process, filter, and analyze data from multiple file formats (`.json`, `.yml`, `.csv`, `.xml`). The data was cleaned, merged, and transformed into uniform Pandas DataFrames to derive meaningful insights and generate visualizations.
- I started by reading files from the `data` folder in different formats (`.json`, `.yml`, `.csv`, `.xml`).
- Each file was converted into a Pandas DataFrame for uniformity and easier analysis.
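The loading step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `load_table` is a hypothetical helper name, and it assumes each JSON or YAML file holds a list of records. XML is left out because the layout of `transactions.xml` is not described here; `pandas.read_xml` or `xml.etree.ElementTree` would be the usual starting points.

```python
import json
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load one .csv, .json, or .yml/.yaml file into a DataFrame (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".json":
        with path.open(encoding="utf-8") as handle:
            records = json.load(handle)
        # json_normalize flattens nested objects into dotted column names.
        return pd.json_normalize(records)
    if suffix in {".yml", ".yaml"}:
        import yaml  # provided by the pyyaml package

        with path.open(encoding="utf-8") as handle:
            records = yaml.safe_load(handle)
        return pd.json_normalize(records)
    raise ValueError(f"Unsupported file type: {path.name}")
```

Dispatching on the file suffix keeps every source behind one call, so later steps only ever see DataFrames.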
- The `people.json` and `people.yml` files were merged into a single People DataFrame.
- Columns from `people.json`: `id`, `first_name`, `last_name`, `telephone`, `email`.
- Columns from `people.yml`: `Android`, `Desktop`, `iPhone`, `city`.
- This ensured a unified structure for all people-related data.
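One way the merge above might look is sketched below. The join key is an assumption: the text does not say how the two sources were aligned, so this sketch assumes both DataFrames carry a shared `id` column and uses an outer join so that a person present in only one file is kept. `build_people` is a hypothetical name.

```python
import pandas as pd

JSON_COLS = ["id", "first_name", "last_name", "telephone", "email"]
YML_COLS = ["Android", "Desktop", "iPhone", "city"]


def build_people(people_json, people_yml, key="id"):
    """Combine both people sources into one DataFrame.

    Assumption: `key` exists in both inputs; the real join key is not
    stated in this README.
    """
    left = people_json[JSON_COLS]
    right = people_yml[[key] + YML_COLS]
    # Outer join: people found in only one source are kept, with NaN gaps.
    return left.merge(right, on=key, how="outer")
```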
- The `promotions.csv` file was filtered using the People DataFrame.
- First, the data was merged using the `telephone` column with an inner join.
- Next, another merge was performed using `email` (left DataFrame) and `client_email` (right DataFrame).
- A final merge with an outer join was executed to consolidate all information.
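The three merges above can be sketched as follows. This is an illustration under assumptions: the promotions table is assumed to have `telephone` and `client_email` columns and no column named `email` or `person_id`, and the person's `id` is renamed to a hypothetical `person_id` to avoid name clashes.

```python
import pandas as pd


def filter_promotions(people, promotions):
    """Keep promotions that match a known person by phone or by email."""
    promo_cols = list(promotions.columns)
    ids = people.rename(columns={"id": "person_id"})

    # Step 1: inner join on the shared telephone column.
    by_phone = ids[["person_id", "telephone"]].merge(
        promotions, on="telephone", how="inner"
    )
    # Step 2: inner join on email (left) and client_email (right).
    by_email = (
        ids[["person_id", "email"]]
        .merge(promotions, left_on="email", right_on="client_email", how="inner")
        .drop(columns="email")
    )
    # Step 3: outer join on every shared column, so a promotion matched
    # both ways appears once and single matches are all retained.
    return by_phone.merge(by_email, on=promo_cols + ["person_id"], how="outer")
```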
- The `transfers.csv` file was filtered by ensuring both `sender_id` and `recipient_id` exist in the People DataFrame.
- This step ensures that only valid transfers between known people are retained.
- The `transactions.xml` file was filtered using the `telephone` column from the People DataFrame.
- This ensured that only transactions related to people in the unified DataFrame were considered.
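Both the transfers filter and the transactions filter are membership checks against the People DataFrame, which could be sketched with `Series.isin`. The function names are hypothetical, and the name of the phone column in the transactions data (`phone` here) is an assumption, so it is exposed as a parameter.

```python
import pandas as pd


def filter_transfers(transfers, people):
    """Keep transfers whose sender and recipient are both known people."""
    known_ids = people["id"]
    mask = transfers["sender_id"].isin(known_ids) & transfers["recipient_id"].isin(known_ids)
    return transfers[mask].reset_index(drop=True)


def filter_transactions(transactions, people, phone_col="phone"):
    """Keep transactions whose phone number belongs to a known person.

    Assumption: `phone_col` names the phone column in the transactions data.
    """
    mask = transactions[phone_col].isin(people["telephone"])
    return transactions[mask].reset_index(drop=True)
```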
- Filtered data was saved into the `data/processed` folder as `.csv` files:
  - `people_filtered.csv`
  - `promotions_filtered.csv`
  - `transfers_filtered.csv`
  - `transactions_filtered.csv`
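A small sketch of the export step, assuming the filtered DataFrames are held in a dictionary keyed by dataset name (`people`, `promotions`, and so on). `save_filtered` is a hypothetical helper.

```python
from pathlib import Path

import pandas as pd


def save_filtered(frames, out_dir="data/processed"):
    """Write each DataFrame to <out_dir>/<name>_filtered.csv and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder on first run
    written = []
    for name, frame in frames.items():
        target = out / f"{name}_filtered.csv"
        frame.to_csv(target, index=False)  # drop the positional index
        written.append(target)
    return written
```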
- Used a `groupby` operation on `item` and `quantity`.
- Calculated the total quantity sold for each item.
- Visualization: A Line Plot was generated with:
- X-axis: Item Names
- Y-axis: Total Quantity Sold
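The aggregation and plot described above might be sketched like this. The function names and the `total_quantity` column name are illustrative; Matplotlib is imported inside the plotting function so the aggregation can run on its own.

```python
import pandas as pd


def total_quantity_by_item(transactions):
    """Sum the quantity sold for each item."""
    totals = transactions.groupby("item", as_index=False)["quantity"].sum()
    return totals.rename(columns={"quantity": "total_quantity"})


def plot_totals(totals):
    """Line plot of total quantity sold per item."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(totals["item"], totals["total_quantity"], marker="o")
    ax.set_xlabel("Item Names")
    ax.set_ylabel("Total Quantity Sold")
    ax.tick_params(axis="x", rotation=45)  # keep long item names readable
    fig.tight_layout()
    return fig
```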
- Analyzed `sender_id` and `recipient_id` to determine the total amount sent and received by each person.
- Created an `amount_left` column (`amount_received - amount_sent`) to measure net balance.
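A minimal sketch of the balance calculation, assuming the transfers data has an `amount` column (the column name is not stated here). People who only sent or only received money get a zero on the missing side.

```python
import pandas as pd


def transfer_balances(transfers):
    """Per-person totals sent and received, plus the net balance."""
    sent = transfers.groupby("sender_id")["amount"].sum().rename("amount_sent")
    received = transfers.groupby("recipient_id")["amount"].sum().rename("amount_received")
    # Align on person id; a missing side means nothing sent or received.
    balances = pd.concat([sent, received], axis=1).fillna(0)
    balances["amount_left"] = balances["amount_received"] - balances["amount_sent"]
    balances.index.name = "id"
    return balances.reset_index()
```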
Visualizations:
- Graph Plot:
- Nodes: Represent person IDs.
- Edges: Represent the amount sent between individuals.
- Tabular View: Displaying columns: `id`, `amount_sent`, `amount_receive`, `amount_left`.
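The graph view could be built with NetworkX along these lines. This sketch assumes an `amount` column and treats transfers as directed (sender to recipient); repeated transfers between the same pair are summed into one edge. `build_transfer_graph` is a hypothetical name.

```python
import networkx as nx
import pandas as pd


def build_transfer_graph(transfers):
    """Directed graph: nodes are person IDs, edge attribute `amount` is the total sent."""
    edges = transfers.groupby(["sender_id", "recipient_id"], as_index=False)["amount"].sum()
    return nx.from_pandas_edgelist(
        edges,
        source="sender_id",
        target="recipient_id",
        edge_attr="amount",
        create_using=nx.DiGraph,
    )
```

The resulting graph can then be drawn with `nx.draw_networkx`, with `nx.draw_networkx_edge_labels` showing the amounts.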
- Grouped by `promotion` and counted how many clients responded "Yes".
- Calculated the percentage of positive responses for each promotion.
Visualizations:
- Combo Chart:
- X-axis: Promotions
- Y-axis: Total Promotions
- Line plot showing the percentage of positive responses.
- Pie Chart:
- Displays the Distribution of Yes Responses by Promotion.
- The largest slice is highlighted to emphasize the promotion with the highest success rate.
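The response-rate calculation and the highlighted pie chart might look roughly like this. The `responded` column name is an assumption, as are the helper names and the output columns `total`, `yes`, and `yes_pct`.

```python
import pandas as pd


def promotion_response_rates(promotions):
    """Count responses and "Yes" answers per promotion, with the Yes percentage."""
    flagged = promotions.assign(is_yes=promotions["responded"].eq("Yes"))
    summary = flagged.groupby("promotion", as_index=False).agg(
        total=("is_yes", "size"),
        yes=("is_yes", "sum"),
    )
    summary["yes_pct"] = 100 * summary["yes"] / summary["total"]
    return summary


def plot_yes_pie(summary):
    """Pie chart of Yes responses, with the largest slice pulled out."""
    import matplotlib.pyplot as plt

    largest = summary["yes"].idxmax()
    explode = [0.1 if i == largest else 0.0 for i in summary.index]
    fig, ax = plt.subplots()
    ax.pie(summary["yes"], labels=summary["promotion"], explode=explode, autopct="%1.1f%%")
    return fig
```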
- Clone the Repository:
git clone <repository_url>
cd <project_folder>
- Install Required Libraries:
- Pandas
- Matplotlib
- NetworkX
- PyYAML
pip install pandas matplotlib networkx pyyaml