Dewu Builds Trillion Level Monitoring System Based on AutoMQ
Guide:
Dewu, also known as Poizon, is a prominent Chinese e-commerce platform that is popular among millennials and Gen Z. Initially focused on trading authentic sneakers, Dewu has expanded its offerings to include a wide range of fashion items, sports gear, and accessories. Apache Kafka is a vital piece of infrastructure for Dewu's observability platform. Dewu is a leading global trend e-commerce platform, and its rapid business growth in recent years has posed significant challenges to the stability of that platform. This growth has driven a rapid increase in the compute and storage costs of Dewu's Apache Kafka cluster, and because Apache Kafka lacks elasticity, the cluster has struggled to support Dewu's business during peak promotions.
Dewu used to rely on Kafka to build its observability platform, and scaling operations required many man-days every quarter. AutoMQ offloads storage to OSS, makes the computing layer stateless, and remains fully compatible with Kafka. Since introducing it, Dewu has achieved fully automatic elastic scaling without human intervention, greatly reducing cloud resource costs and saving up to 85% of expenses.
This case study summarizes the key points of a talk given by Hao Hao, Head of Stable Production at Dewu, at a meetup organized by AutoMQ. The complete video can be viewed at the end of the article.
At the meetup, Hao Hao shared how Dewu's SLA climbed from the fourth tier in the industry to a consistent place in the top tier. His experience resonated with the audience.
This might be the reason why Dewu could rapidly elevate its SLA to industry-leading levels. Dewu has built an end-to-end observability system, promoted blue-green deployment and a cross-city active-active architecture, and constructed chaos engineering infrastructure, while continuing to explore advanced technologies in the field of stability.
Stability has its sophisticated aspects, but it also requires diligent hard work. Having smart people work consistently on a problem might capture the essence of Dewu's approach to stability. Hao mentioned that he worked persistently on business scenario management and alarm rule sorting for three years, from 2021 to 2023, with a different direction and goal each year.
When discussing how Kafka is used in Dewu's stability work, Hao shared that during the 2023 Double Eleven event, Kafka's inability to scale at peak times led to the degradation of some trace-related product capabilities. Dewu's business has been growing rapidly, with data volume increasing significantly every three months, so the Kafka cluster had to be expanded just as often. Each expansion was a multi-day ordeal that caused much distress. To cope with sudden traffic surges and achieve rapid scaling, Dewu began evaluating AutoMQ in the second half of 2023. Here, Hao shared Dewu's expectations and thoughts on introducing AutoMQ.
The goal of introducing AutoMQ is to reduce costs and improve efficiency, but that goal should not be understood superficially. First, if cost reduction lowers "effectiveness," the damage to the business could outweigh the savings, which is unacceptable for Dewu. Second, the reduction has to be significant: traditional methods such as shortening the data lifecycle or optimizing encoding and decoding can achieve a 10% or 20% reduction, whereas a substantial cost reduction requires architectural optimization of the entire pipeline.
The choice of AutoMQ also took human efficiency into account, as Dewu cannot afford to invest manpower in ineffective research.
This reflects Hao Hao's engineering philosophy on how to choose when a new technology emerges. Hao Hao has two viewpoints:
● The new technical solution must not disrupt the compatibility of existing businesses; it should be transparent and imperceptible to the upper layers.
● The technical solution should not be overly complex; it must be simple and easy to understand, as Dewu cannot invest a significant amount of manpower in learning a complex system.
AutoMQ's architecture is 100% compatible with Apache Kafka®, separates storage from computation, and offloads complexity in a cloud-native manner, so it aligns closely with Dewu's selection criteria. Ultimately, Dewu introduced AutoMQ to reduce costs, enhance system stability, and improve operational efficiency. After observing and analyzing the features of the new system, the team found that AutoMQ matched their requirements, which led them to integrate it and keep improving on that integration.
When the audience asked about the core value that cloud computing brings to Dewu, Hao Hao's response was humorous yet thought-provoking.