Refactor async exporter #642

Merged · 25 commits · May 15, 2019

Conversation

@reyang (Contributor) commented May 2, 2019

This is a preview to collect early feedback. I'm trying to address the following things:

  1. Refactor the exporter, avoiding the extra concept of "transport".
  2. Provide an upper bound for the queue, instead of letting it grow unbounded and exhaust memory.
  3. Give flush and _stop clear semantics, based on eventing. Both return None on timeout, or otherwise the actual time taken. (A sketch covering points 2 and 3 follows at the end of this comment.)
  4. Behave consistently with PeriodTask, using the concept of interval instead of wait_time (which doesn't count the time taken to transform/send data).
  5. Avoid the concept of sync vs. async exporters; all exporters should be async. In the future we might provide a sync interface for exporters to capture certain contextual information. The concept will be similar to the way operating systems handle IRQs using DPCs.

2, 3 and 4 are important requirements for the Azure exporter.
I'm trying to take a general approach, and would be happy to either keep it inside Azure or contribute it to the core opencensus-python lib.
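
To make points 2 and 3 concrete, here is a minimal sketch of what a bounded queue with event-based flush could look like (the class and method names are illustrative, not the final API; it assumes a worker thread drains the queue and sets the event once the batch before it was handled):

import queue
import threading
import time

class BoundedQueue(object):
    def __init__(self, capacity):
        # Upper bound (point 2): the queue never grows beyond `capacity`.
        self._queue = queue.Queue(maxsize=capacity)

    def put(self, item):
        try:
            self._queue.put(item, block=False)
        except queue.Full:
            pass  # drop instead of exhausting memory; should be surfaced as a metric

    def flush(self, timeout=None):
        # Event-based flush (point 3): enqueue an event and wait for the
        # worker thread to signal that everything before it was handled.
        start_time = time.time()
        event = threading.Event()
        self._queue.put(event)
        if not event.wait(timeout):
            return None  # timed out
        return time.time() - start_time  # actual time taken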

@reyang (Contributor, Author) commented May 2, 2019

Here is an example of how it works:

import time
from opencensus.ext.azure.common.exporter import BaseExporter

class AsyncPrintExporter(BaseExporter):
    def emit(self, batch, event=None):
        # if event is EXIT, persist the data before trying to send
        print(batch, event)
        if event:
            event.set()

x = AsyncPrintExporter(export_interval=1, max_batch_size=3)
x.export([1, 2, 3, 4, 5, 6, 7])
print('time taken to flush', x._queue.flush(timeout=5.0))
time.sleep(2)
x.export([8, 9, 10, 11])
time.sleep(5)
x.export([12, 13, 14, 15, 16])

Output:

(1, 2, 3) None
(4, 5, 6) None
(7,) QueueEvent(SYNC(timeout=5.0))
time taken to flush 0.0029981136322021484
() None
(8, 9, 10) None
(11,) None
() None
() None
() None
(12, 13, 14) None
(15, 16) QueueEvent(EXIT)

@c24t (Member) commented May 2, 2019

Points 2, 3, 4, and 5 sound like clear and important improvements, and I think you should push the changes here up into the core library.

I'm less sure about making exporters inherit from this class. Composition seems better in this case, even if we only have a single implementation for the queue (or "transport") class -- it's surprising that creating an exporter spawns a background thread, makes it difficult to test exporters in isolation, etc. We'll also need a strategy for using multiple exporters eventually, and each having their own background thread will make this difficult.

I'll take a second pass and add comments on the implementation.

self._thread.join(timeout=wait_time)
if self._thread.is_alive():
    return
return time.time() - start_time  # time taken to flush
Member:

Something to keep in mind for the future: this is a good use case for time.monotonic, but it's not clear that it's worth it if we still have to fall back to time.time for py2 compatibility.
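
For reference, the usual fallback idiom (not part of this PR) would be something like:

try:
    from time import monotonic  # py3: immune to system clock adjustments
except ImportError:
    from time import time as monotonic  # py2 fallback: can jump with clock changes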

@songy23 (Contributor) commented May 2, 2019

Agree with @c24t. It's great to have 2, 3, 4, 5, which also matches the way exporters are implemented in other languages.

For 2, it would be great if you could also share your ideas on what happens when the event queue gets full (census-instrumentation/opencensus-specs#262).

For 4, you can also take a look at https://github.com/census-instrumentation/opencensus-go/blob/master/metric/metricexport/reader.go#L59-L73.

@reyang (Contributor, Author) commented May 3, 2019

Points 2, 3, 4, and 5 sound like clear and important improvements, and I think you should push the changes here up into the core library.

Sure. I will take a step-by-step approach:

  1. [This PR] Move the Queue and BaseExporter to opencensus.common (instead of making them trace-specific; I think we will need to use them for logs as well).
  2. [This PR] Move AzureExporter to this new mechanism.
  3. [In separate PRs] Move existing stuff to the new model.
  4. [In a separate PR] Retire the old mechanism.

I'm less sure about making exporters inherit from this class. Composition seems better in this case, even if we only have a single implementation for the queue (or "transport") class -- it's surprising that creating an exporter spawns a background thread, makes it difficult to test exporters in isolation, etc. We'll also need a strategy for using multiple exporters eventually, and each having their own background thread will make this difficult.

That's good feedback. I think eventually we might also have a scenario where one exporter has multiple worker threads taking data from the same queue. We will probably have three concepts here: queue, exporter, and worker.

self._worker.start()
atexit.register(self._worker.stop, options.grace_period)

# Ideally we don't want to have emit and export
Contributor Author:

@c24t @songy23 I've dumped some thoughts here, please comment inline with your feedback.

@reyang (Contributor, Author) commented May 7, 2019

@c24t here is a simple diagram of what the final state should look like, please review.

[architecture diagram: queues and workers owned by the core SDK; exporters as consumers]

Queues and workers will be owned by the core SDK as global instances. Traces/logs/metrics exporters will be resource consumers (providing an export method) instead of owning any queue or thread.

In the future, we might want to move the storage to a common place as well.

return
elapsed_time = time.time() - start_time
wait_time = timeout and max(timeout - elapsed_time, 0)
if event.wait(timeout):
Member:

It's surprising to see the exporter set this event instead of the queue.

Contributor Author:

I think only the exporter would know when the batch got handled, rather than relying on the return value of export. This gives the exporter the flexibility to do async work.

Member:

I see, but the only reason you need the event is to get the duration? It seems better to me not to force exporters to handle the event, especially if the event would be internal to the queue otherwise.

What do you imagine using the duration for?

Contributor Author:

I see, but the only reason you need the event is to get the duration?

Besides the duration, the exporter can tell whether there is an explicit intention to flush the queue, or whether the application is exiting.

It seems better to me not to force exporters to handle the event, especially if the event would be internal to the queue otherwise.

Yep, it is better not to force every exporter to handle the event, unless they explicitly ask for it (and benefit from it).

What do you imagine using the duration for?

I think the duration is less important than the event itself. The event could be useful for telling whether we're about to exit, or whether there is an explicit intention to flush.

Member:

If the goal is to let exporters do some cleanup work on shutdown, what do you think about having a separate shutdown API method instead of the event?

@bogdandrutu suggests an interface like this, implemented by the worker + queue class and the exporters:

interface SpanConsumer

void addSpans(List<Span>)
# add all spans to the queue

void shutdown()
# 1. stop accepting new spans (addSpans now raises)
# 2. flush existing spans
# 3. call shutdown on next consumer in pipeline
# 4. cleanup work, e.g. shut down worker thread

For the exporters, addSpans is export. (A Python rendering follows below.)
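
A rough Python sketch of that interface (hypothetical names, just to make the shutdown steps concrete):

class SpanConsumer(object):
    def add_spans(self, spans):
        # Add all spans to the queue; raises once shutdown has been called.
        raise NotImplementedError

    def shutdown(self):
        # 1. Stop accepting new spans (add_spans now raises).
        # 2. Flush existing spans.
        # 3. Call shutdown on the next consumer in the pipeline.
        # 4. Clean up, e.g. stop the worker thread.
        raise NotImplementedError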

Contributor Author:

Trying to understand the proposal here:

  1. Looks like addSpans will run under the user's context, and return immediately after it enqueues the spans (or discards them if the queue is full).
  2. shutdown is a blocking API, which does the cleanup work and returns once the cleanup is done. (Will shutdown take an input such as remaining_time, or could it block indefinitely?)
  3. How do we plan to do flush?

Member:

For (2), passing a timeout down the chain seems like it could work. I see how a long blocking call could be a problem here.

For (3), shutdown is effectively a flush: the queue would send all spans to the exporter, and the exporter would try to send all spans to the backend. Where would you expect to call flush?

Contributor Author:

One benefit of flush is that users would know that if it returns successfully, the telemetry data is safe. The application can also decide whether it should proceed or stop if the telemetry is important (e.g. auditing events).
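
For example (a sketch, assuming flush returns the elapsed time on success and None on timeout, as described earlier in this PR):

elapsed = exporter.flush(timeout=5.0)
if elapsed is None:
    # Timed out: auditing events may not have been delivered yet,
    # so the application can decide to retry, wait longer, or abort.
    raise RuntimeError('failed to flush telemetry within 5 seconds')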

src = self.src
dst = self.dst
while True:
    batch = src.gets(dst.max_batch_size, dst.export_interval)
Member:

Seems odd to get these from the exporter instead of making them attributes of the worker.

Contributor Author:

I was referring to the design of the networking layer, where the MTU is defined by the low-level stack.
The worker shouldn't know the MTU, right?

Member:

What is the worker in the networking analogy? As it is, it seems the worker does know the MTU, it just has to get it from the exporter.

Removing this from the exporter would mean the worker only has to call export; including it means it's part of the exporter API.

Compare this to the java implementation where the worker has the batch size and export interval.

Contributor Author:

Worker + Queue seems to be the upper-level network stack.

I was considering the multiple-exporter scenario, for example exporter A with MTU=100 and interval=1s, and exporter B with MTU=1000 and interval=10s. The aggregated exporter, which exports to both A and B, could have interval=1 and MTU=100, while accumulating data internally and sending it to B in bigger batches. (A sketch follows below.)

If exporters configure the interval and MTU, the aggregated exporter needs a way to get that knowledge from A and B, right?
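
A sketch of such an aggregated exporter (hypothetical, only to illustrate accumulating bigger batches for B):

class AggregatedExporter(object):
    def __init__(self, exporter_a, exporter_b):
        # Advertise the smallest MTU / shortest interval of the underlying exporters.
        self.max_batch_size = min(exporter_a.max_batch_size, exporter_b.max_batch_size)
        self.export_interval = min(exporter_a.export_interval, exporter_b.export_interval)
        self._a = exporter_a
        self._b = exporter_b
        self._buffer_for_b = []

    def export(self, batch):
        self._a.export(batch)  # A can handle small batches at a short interval
        self._buffer_for_b.extend(batch)
        while len(self._buffer_for_b) >= self._b.max_batch_size:
            self._b.export(self._buffer_for_b[:self._b.max_batch_size])
            del self._buffer_for_b[:self._b.max_batch_size]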

Member:

I'd prefer to avoid this altogether by having a separate queue for each exporter, in which case the aggregated exporter doesn't have to know anything about the exporters it forwards spans to.

Contributor Author:

Do you mean each exporter will have its own queue + worker?

Member:

I think that's the simplest solution: the exporter class itself doesn't own the queue, and may even be stateless if you don't need e.g. to persist spans to disk. Each exporter comes with a factory for initializing the queue and worker.

In this case the tracer would send spans to multiple queues, one for each exporter. E.g. as SpanQueueRegistry here:

exporter package:
   # create the queue, worker, and exporter 
  (Exporter, SpanQueue) get_exporter(config) 
  
  class Exporter
    # called by the worker thread
    export(list<Span>)

class SpanQueueRegistry
  # add spans to all queues to be exported
  void add_spans(list<Span>) 

# tracer adds spans to queue instead of calling export directly
Tracer(SpanQueueRegistry)
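
Rendered as a Python sketch (the names follow the pseudocode above; the queue's puts method is an assumption):

class SpanQueueRegistry(object):
    def __init__(self):
        self._queues = []

    def register(self, span_queue):
        self._queues.append(span_queue)

    def add_spans(self, spans):
        # Fan spans out to each exporter's own queue; every queue is drained
        # by its own worker at that exporter's batch size and interval.
        for span_queue in self._queues:
            span_queue.puts(spans)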

@reyang (Contributor, Author) commented May 14, 2019:

I gave this approach a try yesterday and it seems to create other problems.
Given this PR has been running for a long time, we should probably split it into several stages.
Today the exporter owns the queue and the worker; we can decouple the queue first.

Here is my proposal:

  1. Have the queue in the common namespace. Use AzureExporter as an experiment/PoC. [This PR]
  2. Let the core SDK take ownership of queue creation, and have all exporters switch to that queue. [Next PR] After this, the exporters will not block the core SDK.
  3. Prototype a queue multiplexer, which takes data from the queue and sends it to multiple exporters. In that PR, we can explore how to manage worker threads.
  4. Decouple the worker.

@c24t does this sound okay? If yes, I will move the Worker class to AzureExporter for now, so we can focus on the Queue class in this PR.

Member:

That sounds great, and since the changes are largely contained in the Azure exporter I don't see any problem merging this PR as (1) into master. We can figure out the API changes for the other exporters in (2).

(Resolved, outdated review thread on opencensus/common/schedule/__init__.py.)
while True:
    batch = src.gets(dst.max_batch_size, dst.export_interval)
    if batch and isinstance(batch[-1], QueueEvent):
        dst.emit(batch[:-1], event=batch[-1])
Member:

Any way to do this without exposing the event to the exporter?

Contributor Author:

I was thinking of this as a way to explicitly tell the exporter about the intention.
For example, if we have a slow network, the exporter can decide to persist the data locally to prevent data loss on exit/flush.

@reyang (Contributor, Author) commented May 8, 2019:

There are two possible ways in my mind to make it optional for exporters:

  1. Have a base exporter that handles the event by default.
class BaseExporter(object):
    def export_internal(self, batch, event):
        try:
            return self.export(batch)
        finally:
            if event:
                event.set()

    def export(self, batch):
        pass
  2. Use runtime inspection (a bit dirty).
export_thunk = exporter.export
if 'event' in inspect.signature(export_thunk).parameters:
    def export(batch, event):
        try:
            exporter.export(batch)
        finally:
            if event:
                event.set()
    export_thunk = export

Member:

(1) looks better to me, but does shutdown solve the same problem this does?

# payload = transform(span_data)
# self.transmit(payload)
def emit(self, batch, event=None):
    raise NotImplementedError  # pragma: NO COVER
Member:

I think it's cleaner if we remove emit and make export the only API method here. What about making the batch size a configurable option in the exporter package, but not an attribute of the exporter class?
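
Something like this, perhaps (a sketch of the suggestion, not the code in this PR):

# exporter package: the batch size is package-level configuration,
# not an attribute of the exporter class.
DEFAULT_MAX_BATCH_SIZE = 100

class Exporter(object):
    def export(self, batch):
        # The only API method; called by the worker thread with a batch of spans.
        raise NotImplementedError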

Contributor Author:

Yep, we will remove emit and only have export here.

I wonder what we should do when there are multiple exporters and they try to configure the batch size. Having the MTU concept as part of the exporter seems to be an advantage: if we have something like an aggregated exporter, it can determine its MTU based on the underlying exporters' MTUs.

Contributor Author:

I updated the comment earlier this afternoon; please take a look at the diff and see if it is better explained.

Member:

The more I think about supporting multiple exporters, the better it sounds to have one queue per exporter. If we did this, the tracer could add spans to a wrapper that adds a span to the queue for each registered exporter. This is what I assume you mean by aggregated exporter, and it's what the agent does in multiconsumer.go.

If the aggregated exporter has a single underlying queue, we can only drain the queue as fast as the longest export interval (/smallest MTU), which could cause us to drop spans that other exporters would otherwise export in time. It also makes the implementation more complex.

For now all we need to do is make sure the design in this PR doesn't preclude multiple exporters; we don't actually have to support them yet.

# Exporter defines the MTU (max_batch_size) and export_interval.
# There can be one worker for each queue, or multiple workers for each
# queue, or shared workers among queues (e.g. queue for traces, queue
# for logs).
Member:

We still have to solve the problem of multiple exporters. If there are multiple workers per queue, they'll either have to process each item at the same time as the others, or queue items somewhere else to support multiple batch sizes/export intervals.

Contributor Author:

Yep.

try:
    self._queue.put(item, block, timeout)
except queue.Full:
    pass  # TODO: log data loss
Member:

This is a good use case for metrics; we should try to emit the same metric for unexported spans across all clients.
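
A minimal sketch of counting the dropped items before wiring this up to a real metric (the names here are made up):

import queue
import threading

_dropped_spans = 0
_dropped_lock = threading.Lock()

def put_or_drop(q, item):
    global _dropped_spans
    try:
        q.put(item, block=False)
    except queue.Full:
        # Count the drop so all clients can report the same
        # "dropped spans" metric instead of losing data silently.
        with _dropped_lock:
            _dropped_spans += 1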

Contributor Author:

Yep!

Contributor Author:

Here are the metrics we should consider: census-instrumentation/opencensus-specs#262 (comment).

@c24t (Member) left a comment:

LGTM, let's revisit the API in the next PR.
