Zipkin 2.4
Zipkin 2.4 increases HTTP collection performance, introduces a new quick start and adds Chinese language support.
HTTP collector performance
Prior to Zipkin 2.4, our HTTP collector was written as a normal Spring WebMVC application. Usually this is fine, and it is still fine for our query endpoints. This section describes how we ended up switching to raw Undertow to implement Zipkin's POST endpoints. Tests show up to 5x throughput with no connection errors. You don't need to do anything but upgrade.
Under a surge of POST requests, threads could back up, resulting in timeout exceptions. Timeouts are another way of saying "you held up clients", which is bad because tracing isn't supposed to hurt clients. Worse, failures here occur before metrics are recorded, making the count of failures invisible. This led to custom code in the stackdriver-zipkin proxy to help unveil thread pool issues. To simplify troubleshooting and reduce custom code, we now implement our HTTP collector directly at the network layer. Benchmarks show dramatically better throughput, without network errors and with less memory pressure. Performance is a work in progress, so please help if you can.
PS: Some ask why Undertow and not Netty. The quick answer is that Undertow is already supported in Spring Boot 1.5. Also, Netty versions are sensitive, and we already have to play games to ensure that, for example, Cassandra's Netty doesn't conflict with gRPC's Netty. Undertow being a bit obscure helps avoid conflicts.
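For readers who want to exercise the POST endpoints discussed above, here is a minimal sketch. It assumes a Zipkin server listening on localhost:9411 and the v2 JSON span format posted to /api/v2/spans. The helper names `make_span` and `post_spans`, the `ZIPKIN_URL` variable and the sample values are hypothetical, chosen only for illustration.

```shell
# Sketch only: assumes a Zipkin server at localhost:9411 unless ZIPKIN_URL is set.
ZIPKIN_URL="${ZIPKIN_URL:-http://localhost:9411}"

# Hypothetical helper: print a JSON list holding one minimal v2 span.
# Args: trace id (16 hex chars), span name, service name,
#       timestamp (epoch microseconds), duration (microseconds).
# The span id reuses the trace id, which is typical for a root span.
make_span() {
  printf '[{"traceId":"%s","id":"%s","name":"%s","timestamp":%s,"duration":%s,"localEndpoint":{"serviceName":"%s"}}]' \
    "$1" "$1" "$2" "$4" "$5" "$3"
}

# Hypothetical helper: POST a JSON span list and print only the HTTP status code.
post_spans() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -X POST "$ZIPKIN_URL/api/v2/spans" \
    -H 'Content-Type: application/json' \
    -d "$1"
}
```

With a server running, `post_spans "$(make_span 463ac35c9f6413ad get frontend "$(date +%s)000000" 1500)"` should print the status code of the collector's response; an accepted request is expected to report 202.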
New quickstart
For about 2 years, we've used Maven Central's query API to find the latest version of the server and download it in a single request. Recently, this stopped working and broke our ability to do a quick start. Thanks to hard work by @abesto, we have a replacement script, hosted on our HTTPS endpoint, that does the same. Simply copy/paste the command below to grab the latest server:
$ curl -sSL https://zipkin.io/quickstart.sh | bash -s
i18n and Chinese language support
Zipkin has many Chinese-speaking, or should we say Chinese-reading, users. Distributed tracing has a lot of vocabulary, and some things can get confused in translation. Thanks to @gzchenyong from China Telecom, Zipkin's UI now includes labels and tooltips in Chinese. There's even more to do, so if you can help, join @MrGlaucus, who's taking this work further.
Other improvements
- UI now loads in IE 11
- The server "exec" jar is now a few megabytes smaller, under 50MiB, thanks to eliminating some unused dependencies
- The server "exec" jar's MD5s were incorrect. They are now fixed
- Prometheus duration metrics could result in double-counting. This is now fixed
- ES_TIMEOUT controls the connect, read and write socket timeouts for the Elasticsearch API
- @dos65 fixed rendering numeric service names in the UI
- @mikewrighton added guards against writing huge data in thrift
- @shakuzen made the build fail more gracefully when JDK 9 is in use (JDK 9 support is on the way)
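As an illustration of the ES_TIMEOUT setting listed above, the sketch below sets it through the environment before starting the server. The host and the timeout value are assumptions for the example, and the value is assumed to be in milliseconds; check the server README for the unit and default in your version.

```shell
# Assumed values, for illustration only.
export STORAGE_TYPE=elasticsearch
export ES_HOSTS=http://localhost:9200
# Assumed to be milliseconds: applies to connect, read and write socket timeouts.
export ES_TIMEOUT=20000
# Then start the server in the same shell, for example: java -jar zipkin.jar
```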