---
slug: 168-designing-a-metric-system
id: 168-designing-a-metric-system
title: Designing a metric system
date: 2019-08-26 11:58
comments: true
tags: system design
slides: false
references:
---

Requirements

Log vs. metric: a log records an event that happened; a metric is a measurement of the health of a system.

We assume that the system's purpose is to serve metrics (counters, conversion rates, timers, etc.) for monitoring system performance and health. If the conversion rate drops drastically, the system should alert the on-call engineer.

  1. Monitoring business metrics such as the sign-up funnel's conversion rate
  2. Supporting queries sliced by dimensions such as browser (IE/Chrome/Safari) and platform (iOS/Android/Desktop)
  3. Data visualization
  4. Scalability and availability
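The alerting requirement above (notify the on-call when the conversion rate drops drastically) could be sketched as a relative-drop check against a baseline. This is a minimal illustration, not a prescribed design: `AlertRule`, `should_alert`, the metric name, and the 50% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Hypothetical rule: fire when a rate falls too far below its baseline."""
    metric: str
    max_relative_drop: float  # 0.5 means "alert if the rate drops by more than half"


def should_alert(rule: AlertRule, baseline_rate: float, current_rate: float) -> bool:
    """Return True if current_rate dropped by more than the allowed fraction."""
    if baseline_rate <= 0:
        # No meaningful baseline to compare against, so do not alert.
        return False
    relative_drop = (baseline_rate - current_rate) / baseline_rate
    return relative_drop > rule.max_relative_drop


# Assumed example values: a 40% baseline conversion rate.
rule = AlertRule(metric="signup_conversion_rate", max_relative_drop=0.5)
print(should_alert(rule, baseline_rate=0.40, current_rate=0.10))  # drop of about 75%
print(should_alert(rule, baseline_rate=0.40, current_rate=0.35))  # drop of about 12.5%
```

A relative threshold is used here rather than an absolute one so that the same rule shape can apply to funnels with very different baseline rates.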

Architecture

Two ways to build the system:

  1. Push Model: InfluxDB/Telegraf/Grafana
  2. Pull Model: Prometheus/Grafana

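The pull model is more scalable because it decreases the number of requests going into the metrics database: the collector decides when and how often to scrape each target, so there is no write hot path and no write-concurrency issue on the database side.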
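To make the request-count argument concrete, here is a toy in-memory simulation of the pull model. It is only a sketch under simplifying assumptions: `Target` and `PullCollector` are hypothetical classes, there is no real network or Prometheus client involved, and a naive push client that writes once per event is the implied comparison (in practice push agents such as Telegraf also batch writes).

```python
from collections import Counter


class Target:
    """Hypothetical application instance that aggregates counters in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.counters: Counter = Counter()

    def record(self, metric: str) -> None:
        # Recording is a local increment; nothing is sent to the collector here.
        self.counters[metric] += 1

    def scrape(self) -> dict:
        # Snapshot that a collector would read from an endpoint on the target.
        return dict(self.counters)


class PullCollector:
    """Hypothetical collector that decides when to read each target."""

    def __init__(self, targets: list) -> None:
        self.targets = targets
        self.requests_handled = 0
        self.latest: dict = {}

    def scrape_all(self) -> None:
        # One request per target per scrape interval, regardless of event volume.
        for target in self.targets:
            self.latest[target.name] = target.scrape()
            self.requests_handled += 1


targets = [Target(f"app-{i}") for i in range(3)]
for target in targets:
    for _ in range(1000):
        target.record("signup_impression")

collector = PullCollector(targets)
collector.scrape_all()

events = sum(t.counters["signup_impression"] for t in targets)
print(events, collector.requests_handled)  # prints "3000 3"
```

In this toy run, 3000 recorded events cost the collector three scrape requests; a client that pushed every event individually would have issued 3000 writes.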

[Figure: architecture of the two models.
InfluxDB Push Model: Server Farm, Telegraf (write), InfluxDB, REST API, Grafana.
Prometheus Pull Model: Application (client library), 3rd Party Application (Exporter), Prometheus (pull; Retrieval, Service Discovery, Storage, PromQL), Alertmanager (PagerDuty, Email), Web UI / Grafana / API Clients.]

Features and Components

Measuring Sign-up Funnel

Take a four-step sign-up flow on a mobile app as an example:

INPUT_PHONE_NUMBER -> VERIFY_SMS_CODE -> INPUT_NAME -> INPUT_PASSWORD

Every step has an IMPRESSION phase and a POST_VERIFICATION phase, and emits a metric event in each phase, like this:

{
  "sign_up_session_id": "uuid",
  "step": "VERIFY_SMS_CODE",
  "os": "iOS",
  "phase": "POST_VERIFICATION",
  "status": "SUCCESS",
  // ... ts, contexts, ...
}
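A small helper for constructing such events might look like the sketch below. The field names follow the example above; the function name `build_event`, the validation rules, and the choice to make `status` optional (so that IMPRESSION events carry no status) are assumptions for illustration. The `ts` field is filled with a Unix timestamp, which is also an assumption about its format.

```python
import json
import time
import uuid

STEPS = ("INPUT_PHONE_NUMBER", "VERIFY_SMS_CODE", "INPUT_NAME", "INPUT_PASSWORD")
PHASES = ("IMPRESSION", "POST_VERIFICATION")


def build_event(session_id, step, os_name, phase, status=None):
    """Build one funnel event as a plain dict (hypothetical helper)."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step}")
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    event = {
        "sign_up_session_id": session_id,
        "step": step,
        "os": os_name,
        "phase": phase,
    }
    if status is not None:
        # Assumed: only POST_VERIFICATION events report an outcome.
        event["status"] = status
    event["ts"] = time.time()  # assumed timestamp format: Unix seconds
    return event


session_id = str(uuid.uuid4())
impression = build_event(session_id, "VERIFY_SMS_CODE", "iOS", "IMPRESSION")
verified = build_event(session_id, "VERIFY_SMS_CODE", "iOS", "POST_VERIFICATION", "SUCCESS")
print(json.dumps(verified))
```

Sharing one `sign_up_session_id` across all events of a sign-up attempt keeps the steps of that attempt linkable at query time.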

Consequently, we can query the overall conversion rate of the VERIFY_SMS_CODE step on iOS as:

(count of events with step=VERIFY_SMS_CODE, os=iOS, phase=POST_VERIFICATION, status=SUCCESS) / (count of events with step=VERIFY_SMS_CODE, os=iOS, phase=IMPRESSION)
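The ratio above can be computed directly over a list of events. This is a minimal in-memory sketch, assuming events are dicts with the fields from the example event; the function name `conversion_rate` and the sample data are made up for illustration, and a real deployment would express the same ratio in the metrics database's query language instead.

```python
def conversion_rate(events, step, os_name):
    """Successful POST_VERIFICATION events divided by IMPRESSION events."""
    successes = sum(
        1
        for e in events
        if e["step"] == step
        and e["os"] == os_name
        and e["phase"] == "POST_VERIFICATION"
        and e.get("status") == "SUCCESS"
    )
    impressions = sum(
        1
        for e in events
        if e["step"] == step and e["os"] == os_name and e["phase"] == "IMPRESSION"
    )
    if impressions == 0:
        # No impressions means the rate is undefined, not zero.
        return None
    return successes / impressions


# Made-up sample: 4 iOS impressions, 3 successes, 1 failure, plus Android noise.
sample = (
    [{"step": "VERIFY_SMS_CODE", "os": "iOS", "phase": "IMPRESSION"}] * 4
    + [{"step": "VERIFY_SMS_CODE", "os": "iOS", "phase": "POST_VERIFICATION", "status": "SUCCESS"}] * 3
    + [{"step": "VERIFY_SMS_CODE", "os": "iOS", "phase": "POST_VERIFICATION", "status": "FAILURE"}]
    + [{"step": "VERIFY_SMS_CODE", "os": "Android", "phase": "IMPRESSION"}] * 2
)

print(conversion_rate(sample, "VERIFY_SMS_CODE", "iOS"))  # prints 0.75
```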

Data Visualization

Grafana is mature enough to handle the data visualization work. If you do not want to expose the whole Grafana site, you can embed individual panels in your own pages with an iframe.