A proof of concept for collecting real-time clickstream data using JavaScript, Divolte Collector, Apache Kafka, Kafka Streams, Apache Druid and Apache Superset.
At the end of the YouTube video attached here, we compare our results with Microsoft Clarity and Google Analytics. The comparison is just for fun, as those platforms are complete products built over years by large companies.
You can visit the website as a client, and then open the Apache Superset dashboard to see the results in real time.
Apache Superset dashboard credentials:
username: admin
password: admin
- Website: http://soufianeodf.tech
- Kafka Manager (CMAK): http://soufianeodf.tech:9000
- Apache Druid: http://soufianeodf.tech:8888
- Apache Superset: http://soufianeodf.tech:8080
- A tool developed with Selenium and Python (used to simulate user visits to the website)
- JavaScript (used with the ipstack tool to collect user information when visiting the website)
- Divolte Collector (used as a server to collect clickstream data in Apache Kafka)
- Apache Avro (used inside Divolte Collector as a schema for the payload)
- Apache Kafka (used as a publish/subscribe system)
- Kafka Manager (CMAK) (used as a management dashboard for the Apache Kafka cluster)
- Kafka Streams (used to convert the Avro payload to JSON)
- Apache Druid (used as a high performance real-time analytics database)
- Apache Superset (used as a dashboard for data visualization)
- Mapbox (used in Apache Superset in order to use maps)
- Docker and Swarm (used for containerization and deployment)
- Ansible (used to facilitate deployment on remote servers)
- DigitalOcean (used as the cloud infrastructure)
- Divolte Collector for real-time clickstream
- Divolte Collector with Apache Kafka for real-time clickstream
- Docker
- Ansible (if you want to automate the deployment on a remote server)
git clone https://github.com/soufianeodf/youtube-divolte-kafka-druid-superset.git
cd youtube-divolte-kafka-druid-superset
- Add the Microsoft Clarity and Google Analytics tags to the header of index.html.
- Change the divolte-ip-address value to the IP address or DNS name of your Divolte server in index.html.
- You can change the nginx config file if you want.
- You can adapt the payload sent from main.js.
You can modify the Divolte Collector config files and adapt them to your needs.
You can control all the config variables of Zookeeper, Apache Kafka and Kafka Manager from docker-compose.yml.
You can modify the Kafka Streams variables in the application.properties file.
Make sure that the Avro file is the same as the one you have on the Divolte Collector server.
Don't forget to rebuild the Java .jar after you make any change.
You can modify the Apache Druid config file if you want.
After running Apache Druid, to filter payloads having null as country value, we use the following:
{
"type":"not",
"field":{
"type":"selector",
"dimension":"country",
"value":null
}
}
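As a rough illustration of what the filter above expresses, the sketch below builds the same structure in Python and applies its meaning to a couple of rows locally. The helper names (`not_null_filter`, `keep_row`) and the sample rows are hypothetical; Druid evaluates the real filter server-side, so this only mirrors the intent (drop events whose country is null).

```python
import json


def not_null_filter(dimension):
    """Build the same 'not' + 'selector' filter shown above for a given dimension."""
    return {
        "type": "not",
        "field": {"type": "selector", "dimension": dimension, "value": None},
    }


def keep_row(row, flt):
    """Local stand-in for the filter's meaning: keep rows whose dimension is not null.

    This is only an illustration; it is not how Druid evaluates filters internally.
    """
    field = flt["field"]
    return row.get(field["dimension"]) != field["value"]


# Hypothetical sample events, made up for this example.
sample_rows = [
    {"country": "Morocco", "browser": "Firefox"},
    {"country": None, "browser": "Chrome"},
]

country_filter = not_null_filter("country")
kept = [row for row in sample_rows if keep_row(row, country_filter)]

# json.dumps turns Python's None into JSON null, matching the filter above.
print(json.dumps(country_filter, indent=2))
print(kept)
```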
superset.sh is the script responsible for setting the username and password of the Apache Superset dashboard, among other things. Make sure you execute it after Apache Superset is up and running.
Apache Superset uses Mapbox under the hood to render maps, so you need to set the Mapbox key in the config file:
MAPBOX_API_KEY = "your_mapbox_token"
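If you would rather not commit the token to the repository, one possible variant is to read it from the environment inside the same Python config file. This is only a sketch: it assumes the container exports an environment variable named MAPBOX_API_KEY, which is a choice made for this example and not something the project sets up for you.

```python
import os

# Assumption: the token is exported as the MAPBOX_API_KEY environment variable
# (for instance through docker-compose.yml). Falls back to an empty string,
# in which case map charts will not have a working token.
MAPBOX_API_KEY = os.environ.get("MAPBOX_API_KEY", "")
```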
After running Apache Superset, connect it to Apache Druid with the following connection URI:
druid://<User>:<password>@<Host>:<Port-default-8888>/druid/v2/sql
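To avoid typos when filling in the template above, the URI can be assembled with a small helper. This is a minimal sketch: the function name `druid_sqlalchemy_uri` and the example host and credentials are hypothetical, and percent-encoding the credentials is a precaution added here for passwords containing characters such as `@`.

```python
from urllib.parse import quote


def druid_sqlalchemy_uri(host, port=8888, user=None, password=None):
    """Fill in the druid://<User>:<password>@<Host>:<Port>/druid/v2/sql template."""
    credentials = ""
    if user:
        # Percent-encode so special characters do not break the URI.
        credentials = quote(user, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"druid://{credentials}{host}:{port}/druid/v2/sql"


# Hypothetical values, for illustration only.
print(druid_sqlalchemy_uri("druid-host", user="admin", password="p@ss"))
```

Paste the resulting string into the database connection form in Apache Superset.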
You need to build your images and push them to your Docker Hub repository, because Docker Swarm assumes that the images are already built and exist in a Docker registry.
Adapt docker-compose.yml to your needs, and then build and push the images to your Docker Hub repository as below:
docker-compose build
docker-compose push
The Ansible project is heavily inspired by pg3io/ansible-do-swarm, shout-out to its author.
The Ansible playbook performs the following tasks:
- Create droplets in DigitalOcean.
- Install Docker on the created droplets.
- Create a Docker Swarm cluster with a single manager.
- Copy the docker-compose.yml and superset.sh files to the manager node.
- Run Docker Swarm.
- Execute superset.sh.
All variables of the playbook can be found in vars.yml:
- do_token : your DigitalOcean API token.
- droplets : list of droplets to deploy; the first one in the list becomes the manager.
- do_region : datacenter location. To list the available regions: curl -X GET --silent "https://api.digitalocean.com/v2/regions?per_page=999" -H "Authorization: Bearer " |jq -r '{name: .regions[].name, regions_id: .regions[].slug}'
- do_size : droplet size. To list the available sizes: curl -X GET --silent "https://api.digitalocean.com/v2/sizes?per_page=999" -H "Authorization: Bearer " |jq -r '.sizes[] .slug' | sort
- ssh_key_ids : register an SSH key in your DigitalOcean account, then obtain its ID with the following command: curl -X GET -H 'Content-Type: application/json' -H 'Authorization: Bearer '$DOTOKEN "https://api.digitalocean.com/v2/account/keys" 2>/dev/null | jq '.ssh_keys[] | {name: .name, id: .id}'
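If jq is not available, the SSH key lookup above can be approximated in Python with the standard library only. This is a sketch under assumptions: the function names are hypothetical, the response shape (`ssh_keys` entries with `name` and `id`) is taken from the jq expression above, and the sample payload is made up so the example runs without a token or network access.

```python
import json
import urllib.request

API_KEYS_URL = "https://api.digitalocean.com/v2/account/keys"


def extract_ssh_keys(payload):
    """Mirror the jq filter: keep only the name and id of each registered key."""
    return [{"name": key["name"], "id": key["id"]} for key in payload.get("ssh_keys", [])]


def fetch_ssh_keys(token):
    """Call the DigitalOcean API with a bearer token and return name/id pairs."""
    request = urllib.request.Request(
        API_KEYS_URL,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return extract_ssh_keys(json.load(response))


# Made-up payload for illustration; a real call would be fetch_ssh_keys(token).
sample = {"ssh_keys": [{"id": 123, "name": "laptop", "fingerprint": "..."}]}
print(extract_ssh_keys(sample))
```

The printed ids are the values to put in ssh_key_ids in vars.yml.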
cd ansible/
ansible-playbook do-swarm.yml -e do_token="<DO TOKEN>"
Issue: Unexpected Exception: name 'basestring' is not defined when invoking ansible2
Solution: run pip uninstall dopy, then pip3 install git+https://github.com/eodgooch/[email protected]#egg=dopy
Issue: The CSRF session token is missing
Solution: set WTF_CSRF_ENABLED = False in the Apache Superset config file.
In the video, I used a Selenium tool to simulate visits to the website from different browsers, operating systems and countries, as described in the image below, to check whether the clickstream solution we built captures those hits accurately:
The Selenium tool that simulates website user visits is private at the moment because it is still in development; it will be made public as soon as it is completed.
Licensed under the MIT License.