This code demonstrates a Spark Streaming and Kafka implementation using a real-life e-commerce product-recommendation example: for any item_id present in the data directory, we return the best recommended items.
Disclaimer: this is not the best tutorial for learning and implementing a recommendation engine. The recommendations here come from a simple item-based collaborative filter written in plain Python, with no machine-learning libraries.
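To give a feel for what item-based collaborative filtering looks like in plain Python, here is a minimal sketch (hypothetical helper functions, not the exact code in `kafka-product-recom.py`). It treats each item as the set of users who interacted with it and scores item pairs by cosine similarity over those binary vectors:

```python
import math
from collections import defaultdict

def item_similarities(user_item_pairs):
    """Compute cosine similarity between items from (user_id, item_id) pairs.

    Each item is a binary vector over users, so the cosine similarity of
    items a and b is |users in common| / sqrt(|users of a| * |users of b|).
    """
    users_by_item = defaultdict(set)
    for user, item in user_item_pairs:
        users_by_item[item].add(user)

    sims = {}
    items = list(users_by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = users_by_item[a] & users_by_item[b]
            if common:
                score = len(common) / math.sqrt(
                    len(users_by_item[a]) * len(users_by_item[b]))
                sims[(a, b)] = sims[(b, a)] = score
    return sims

def recommend(item_id, sims, top_n=3):
    """Return the top-N items most similar to item_id."""
    scored = [(b, s) for (a, b), s in sims.items() if a == item_id]
    return [b for b, _ in sorted(scored, key=lambda x: -x[1])[:top_n]]
```

Real implementations usually weight interactions (e.g. by recency via the timestamp column) rather than treating them as binary, but the pairwise-similarity structure is the same.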
- Create a `/data` folder with the two files below.
- `item-data.csv`: Item Id and Item Name, e.g.
  `19444 | Radhe gold aata 25 kg`
- `user-item.csv`: User Id and Item Id mapped to a timestamp, forming the matrix for collaborative filtering, e.g.
  `49721 | 19853 | 2020-09-18T20:02:20+05:30`
- Zookeeper
- Kafka
- Spark: 2.4
- Spark Streaming Kafka package: the required assembly jar (`spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar`) ships with the code and is passed via `--jars`.
- Python: 3.7 (to match Spark 2.4)
All these commands (except the one-time topic registrations) go in separate console windows on the terminal.
- Zookeeper: `zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties`
  (Zookeeper is the coordinator that helps run the show.)
- Kafka server: `kafka-server-start /usr/local/etc/kafka/server.properties`
  (The Kafka server syncs data flow between producers and consumers.)
- Register 1st topic: `kafka-topics --create --zookeeper localhost:2181 --topic idpushtopic --partitions 1 --replication-factor 1`
  (`idpushtopic` is the topic that receives the id of the product to recommend for. Run once only.)
- Register 2nd topic: `kafka-topics --create --zookeeper localhost:2181 --topic prodRecommSend --partitions 1 --replication-factor 1`
  (`prodRecommSend` is the topic to which the stream pushes the recommendations. Run once only.)
- Producer: `kafka-console-producer --broker-list localhost:9092 --topic idpushtopic`
  (Open the producer in one window; it pushes the product id to the Spark stream.)
- Spark Streaming: `spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar kafka-product-recom.py localhost:2181 idpushtopic`
  (The stream fetches the id, runs the product recommendation, then acts as a producer and pushes the result to the second topic.)
- Consumer: `kafka-console-consumer --bootstrap-server localhost:9092 --topic prodRecommSend --from-beginning`
  (Open the consumer in one window; it receives the recommended product list from the Spark stream.)
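Inside the Spark job, the per-message logic reduces to: parse the item id out of the Kafka message, look up recommendations, and serialize a reply for the `prodRecommSend` topic. A Kafka-free sketch of that step (function name and output format are hypothetical; the real job wires this into a DStream via the 0-8 Kafka connector):

```python
def handle_message(message, item_names, recommender):
    """Turn one raw Kafka message (an item id string) into a reply line.

    item_names: {item_id: item_name} loaded from item-data.csv.
    recommender: callable mapping an item_id to a list of recommended ids.
    Returns a string suitable for pushing to the prodRecommSend topic.
    """
    item_id = message.strip()
    if item_id not in item_names:
        return f"{item_id}: unknown item"
    recs = recommender(item_id)
    # Show names where known, falling back to the raw id.
    names = ", ".join(item_names.get(r, r) for r in recs)
    return f"{item_id} ({item_names[item_id]}) -> {names}"
```

Keeping this step a pure function makes it easy to unit-test without a running Kafka or Spark cluster.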
- Start the producer in one window and push an item id, e.g. 19444, to test.
- You'll see processing activity in the Spark Streaming window.
- Simultaneously, the recommendations arrive in the consumer window.
- Vaibhav Magon