# [WIP] TSP description, vision and guidelines #91

Open · wants to merge 1 commit into base: `master`
**README.md** · 100 changes: 95 additions & 5 deletions
# trace-server-protocol

Specification of the Trace Server Protocol

This protocol is built to decouple the backend and frontend of trace analysers, allowing traces to reside and be analysed on the backend, and visual models to be exchanged with a variety of clients.
> **Collaborator:** I actually like this sentence. It describes what the idea of the protocol is, as well as the separation of front-end and back-end and their responsibilities.


The protocol is meant to be RESTful, over HTTP.
> **Collaborator:** I agree to remove this sentence since the protocol is not fully RESTful. I commented more below.

Specification of the Trace Server Protocol (TSP).

The specification is currently written in **OpenAPI 3.0** and can be viewed, nicely rendered, in the [GitHub pages][tspGhPages].

**👋 Want to help?** Read our [contributor guide][contributing].

## About, why and goals of the TSP:

Similarly to the philosophy behind the Language Server Protocol (LSP),
> **Collaborator:** "Similarly to the Language Server Protocol (LSP)" instead of "Similarly to the philosophy behind the Language Server Protocol (LSP)".

the TSP is an open, HTTP-based protocol for use between *analytics or
> **Collaborator:** LSP is a protocol based on JSON-RPC, but it's not HTTP-based. On the contrary, the TSP is an HTTP-based protocol. Please correct.
>
> The idea of the TSP comes from the LSP, though, in the sense of having the domain-specific logic on the server side, while the TSP transports the relevant data to a client. With that, client implementations can be exchanged as well as server implementations, as long as they implement the TSP.

interactive visualization web applications* and *servers that provide
> **Collaborator:** "analytics or interactive web applications" is not a clear description. There is a client and a server, where the client can be a command-line application (e.g. for automation), or there can be graphical front-ends like web applications.

trace analysis specific features* such as:
> **Collaborator:** I think we need a chapter at the beginning describing what the applications that use the TSP are for. The applications are trace and log analysis tools that extract data from traces and logs (time series) of applications (or HW nodes). The tool will provide views, graphs, and metrics that are easier to understand than plain-text dumps. Then we can talk about the roles of server, client, and TSP.


- generation of data that can be interpreted as charts from trace analysis
- selection, filtering and correlation of elements in charts using data
contained in the traces
- statistics calculation from data contained in the traces

The goal of the protocol is to allow trace analysis support to be
implemented and distributed independently of any given analytics or
interactive visualization web application.

## What the TSP is:

The TSP is a RESTful API on top of the HTTP protocol that makes it possible to:
> **Collaborator:** It's not RESTful, even if the initial idea was to have a RESTful API. I suggest changing it to "Cloud API". I used that term in a recent presentation at EclipseCon 2023, and I have also seen the term used for other applications.


- analyze time series data (e.g. traces and logs);
> **@abhinava-ericsson** (Apr 4, 2023): I think this has been one of the biggest sources of contention. Is it:
>
> 1. "Analyze time series data (e.g. traces and logs)", or
> 2. "Analyze computational trace and log data (e.g. CPU traces, GPU traces, software logs, etc.)"?
>
> This will decide what is "domain-specific" (i.e. are traces/logs themselves a domain, or are GPU traces a domain). Since the TSP and the viewer can't be made directly aware of anything domain-specific, this choice will dictate the kind of explicit endpoints and parameters we can have, versus what we need to pass only implicitly as optional embedded parameters.
>
> The choice is also important because if we:
>
> - **Choose 1:**
>   - (a) the TSP will have the potential to be used for analyzing non-trace data, e.g. financial data;
>   - (b) we may end up making it difficult (if not impossible) to support some specific trace use case (especially if the use case doesn't fall under the draw-and-navigate-chart category);
>   - (c) we can defer making choice 2 until later, when we actually find such a use case. But we would probably want to rethink the TSP architecture anyway at that point to reduce complexity.
> - **Choose 2:**
>   - (a) at some point, probably soon enough, we will end up taking a decision about the TSP architecture which will disallow choice 1 in the future;
>   - (b) if trace analysis and visualization indeed are limited to drawing and navigating charts, we may end up over-specializing the TSP for known trace types.

> **Collaborator:** The goal of the TSP was to show the results of the analysis of traces and logs coming from different sources (HW, SW, network) and different layers in applications. They all have in common that they contain time-ordered events that have a timestamp and a payload. The payload is domain-specific, and hence it can contain all kinds of information pertinent to the domain. The TSP doesn't have any restrictions on what the payload has to be.
>
> The TSP provides analysis results in data structures provided by trace server back-ends that do the computations and know about the domain they are analyzing. The data structures of analysis results are like UI models that can easily be serialized and visualized. Hence there are data structures for tables, trees, XY charts, and time graphs (Gantt charts). Since it handles time series, many data structures have a common time axis. But it's not limited to that: it can have other x-axes to show other results (distribution charts), and it can have tables for summary information like computed statistics.
>
> Because of the nature of time series, timestamps and time ranges play an important role in the TSP.
>
> CPUs, GPUs, processes, etc. are specific to the trace data and the back-end implementation that analyses time series data. If these concepts were added into the TSP specification, then we might end up having to add other "concepts" for other use cases (network, spans, and so on).
>
> I think CPUs, GPUs, and processes should be abstracted in the trace data by using common identifiers that are associated with a given chart element. For example, a row can have a CPU identifier (key) and a value, and a trace event will have a CPU field shown in a table column. With this, a trace event can be correlated with a row in another graph.

- return the result of the analysis in the form of data that represents
*charts* or *chart components*;
- *navigate (or explore)* through the returned data;
> **Collaborator:** Navigation and correlation are, to me, more client concepts. For example, the client can decide to synchronize all the open views to a given timestamp upon user interaction. The protocol will allow requesting data based on timestamps to enable that use case.


Summarizing:

> the TSP is an API to explore data and navigate charts that have been
> generated from the interpretation of time series data.

> **Collaborator:** Navigation is a client concept. The TSP will allow clients to navigate charts by using its APIs, i.e. by letting them specify time-related query parameters. Maybe rephrase it.

With the term *chart* we mean different types of visualization, such as
XY charts, timegraph charts, tables, pie charts, etc.

With the term *chart components* we mean different elements that
can be used to build or enrich charts, such as arrows, annotations,
bookmarks, etc.

> **Collaborator:** You could also mention the default chart components, like rows, columns, states, and lines.
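
To make *chart components* concrete, here is a rough sketch, in the same pseudocode style as the TSP-examples.md file below, of what fetching annotations for a chart could look like. The endpoint and field names are illustrative assumptions, not part of the current specification:

```
GET tsp/api/experiments/{expUUID}/outputs/{outputId}/annotations
{"parameters":
  {"requested_timerange": {"start": 111111111, "end": 222222222}}
}
srv (ret): {"annotations":
  [{"time": 123456789, "duration": 1000,
    "label": "lost events", "type": "warning"}]}  // illustrative shape only
```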

With the terms *exploration and navigation* we mean interactions with
the charts, such as:

> **Collaborator:** The list below contains actions that can be performed through a client. The TSP API will allow querying the back-end with specific query parameters to perform these actions. Each client can choose what to provide as an action.
>
> I think the tricky part we are facing is providing certain actions in the default client implementation (theia-trace-extension) without knowing the domain. There are some common behaviours (e.g. query an XY chart for a given time range with a certain resolution, synchronize all views to a given time range). Other things that would require some domain-specific logic are trickier. The TSP has to allow the abstraction of domain-specific logic. I think this causes a lot of our discussion, because it's not clear how to achieve that.
>
> It would be much simpler to have a web front-end for domain A, a web front-end for domain B, and so on, where each front-end "knows about" the domain and has special implementations to achieve whatever is useful for that domain.

- **filter** specific components of the charts. For example:
  - remove from the chart all components that do not match a condition;
  - return components of the chart that do match a condition;

  > **Collaborator:** We have to distinguish between client-side filtering and server-side filtering. Trace analyzers are supposed to handle a large amount of data, and only a subset of the data is returned to the client; the data is often sampled to avoid busting memory constraints and to minimize the messages exchanged. The client can choose to filter the data on the client side; however, it won't be able to show accurate filter results, or might not be able to apply a filter on certain data, because the data is not all available on the client side. The server will need to be queried for more detailed results.
- **select** a subset of all data points of a chart. For example:
  - given a Timegraph chart, return only the data between a specific
    time interval (sketched below)
- **select** specific components of the charts. For example:
  - when analyzing the trace of a SW execution, select only some
    components of the chart that represent specific SW functions;
  - when analyzing the trace of a SW execution, select only some
    components of the chart executed on specific CPUs;
  - given a Timegraph chart, automatically select areas of the chart
    where the components match a specific condition (**zoom-on-crash-area**)
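
As a rough illustration of *selection*, reusing the `requested_timerange` convention from the examples in TSP-examples.md below (parameter names are not normative), returning only the data between a time interval could look like:

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111,
     "end": 222222222}}  // only states intersecting this interval are returned
}
```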
- **correlate** components between charts. For example:
  - given 2 charts generated by the analysis of the same time series data,
    correlate the chart components that represent the same
    information. For example, given a table containing the max length
    of an interval in a time chart, identify the corresponding
    interval in the time chart (**go-to-min/max**)

  > **Collaborator:** To be able to correlate components, there need to be common indicators in the components. In the current version, time is one such indicator. Other indicators are provided by metadata associated with a component; for example, a time graph row has metadata, and a table has headers. We could have the metadata always available in the returned data, or on demand ("tooltips" for states). It's the back-end's responsibility to provide that data through the APIs defined in the TSP.

- **correlate** chart components and their information. For example:
  - given an interval in a time chart, extract all information related
    to that interval (e.g. **source location**)
- **order** specific parts of the charts. For example:
  - return the **N longest** intervals of a timegraph chart;

  > **Collaborator:** This is done through a specific analysis in the back-end; the front-end doesn't know about it. I see this feature as a table with a time-range column and N rows. The client implementation can choose to provide a user interaction to navigate (synchronize the views) to that trace location.

- **customize** charts and chart components. For example:
  - given a table containing statistics from the analysis of all input data,
    generate a new table containing statistics from the analysis of a
    subset of the input data;

  > **Collaborator:** I'm not clear on how a client implementation knows what the subset of the input data is, and how the client can pass this information. Please clarify.

  - given a time chart where intervals represent thread execution over
    time, and they are connected using arrows, generate a new time chart
    showing the longest path of connected intervals (e.g. **critical path**)

> **Collaborator:** I'm not clear here either on how the client implementation would know to provide the action of showing the longest path of connected intervals. Please clarify.

## Guidelines for the TSP implementation

The main resources (endpoints) of the API should be:

- **data sources** (e.g. time series data such as traces and logs). These
are analyzed and used to create the data that can be retrieved;
- **charts** (e.g. tables, XY charts), used to retrieve data representing
a chart;
- **chart components** (e.g. annotations, styles), used to retrieve data
representing specific information for a chart;

The *exploration and navigation* should be achieved by:

- parameters in the HTTP message body (usually when POST-ing), or
  as query strings (usually when GET-ing). These parameters should be:
  - **data source related**. For example, considering that data sources
    have a strong relation to time, a classic parameter can be "time"
    related (e.g. *requested_timerange*);
  - **chart related**. For example, when "talking" to a resource representing
    an XY chart, a parameter can be related to "XY dimensions" (e.g.
    *requested_Y_elements*);
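
A minimal sketch of how these two kinds of parameters could combine in a single request, in the same pseudocode style as TSP-examples.md (the parameter names are the guideline's own examples, not a fixed schema):

```
POST tsp/api/experiments/{expUUID}/outputs/XY/{outputId}/xy
{"parameters":
  {"requested_timerange":                      // data source related
    {"start": 111111111, "end": 222222222},
   "requested_Y_elements": [3, 7]}             // chart related
}
```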

## Current version

The current version of the specification is implemented and supported by the [Trace Compass trace-server][tcServer] (the reference implementation) and by the [tsp-typescript-client][tspClient].
**TSP-examples.md** · 120 changes: 120 additions & 0 deletions (new file)
# How to follow TSP guidelines
> **Collaborator:** I think it's good to have concrete examples explaining how certain things can be achieved. Each example should have a description, a sequence diagram, and a description of the message content.


This file/document will be removed. It is meant to give some examples
of how to follow the TSP guidelines when extending the TSP.

NOTE:

- the examples below are just pseudocode, to try to give a concrete
  feeling of how to implement the exploration/navigation features and
  follow the TSP guidelines. They are not meant to be seen as concrete
  proposals to extend/modify the existing TSP;
- the GET / POST methods are chosen just to be more RESTful (i.e. to try
  not to use a POST when we do not create new resources);

  > **Collaborator:** Yeah, as mentioned above, the TSP is not RESTful. One deviation from it is that POST is used to query data and not just to create new resources. The reason for that is to be able to pass query parameters inside the payload and not just in the URL.

- the parameters in the requests are always written as body params,
  just to make it easier to write. But they do not have to be body params.

## Timegraph states (a.k.a. intervals)

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111,
     "end": 222222222,
     "nbTimes": 1920},  // to follow "TSP guidelines", would it be better "precision"? or "samples"?
   "requested_items": [1,2]  // to follow "TSP guidelines", would it be better "requested_row_ids"?
  }
}
```
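
For context, a response to such a request could have roughly the following shape. This is a hypothetical sketch; the actual model is defined in the OpenAPI specification:

```
srv (ret): {"model":
  {"rows": [
    {"entryId": 1,
     "states": [{"start": 111111111, "end": 111222333, "label": "RUNNING"}, ...]},
    {"entryId": 2, "states": [...]}
  ]}
}
```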

## Filter rows of a timegraph

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111,
     "end": 222222222,
     "nbTimes": 1920},  // to follow "TSP guidelines", would it be better "precision"? or "samples"?
   "requested_row_labels":
     [CPU0,CPU2]  // these are the labels of the rows. How does the cli know the names of the rows?
                  // The cli "asks" the server for the labels of the rows in the chart
                  // (e.g. from the timegraph/tree endpoint)
  }
}
```

> **Collaborator:** Row labels are not unique. For example, where rows represent processes, the same name can appear multiple times. Hence we have a unique ID associated with each row to identify it. The row IDs are unique within a given view. To correlate items between views, some other things need to be provided. For example, time graph rows have a key-value map as metadata; so a process row has metadata with pid, tid, ppid, exec_name, etc.

> **Collaborator:** This API is already in place. The Gantt chart of the time graph (right side) can be queried by passing a list of row IDs. Filtering of rows can be implemented by only showing certain rows.

> **Collaborator:** You are suggesting here that the client fetches the labels, or what I called metadata. This would be on demand. The advantage is minimizing the size of the data for regular queries, but it adds a round-trip to the server. Currently both behaviours are implemented: for rows, the metadata is embedded; for states, since there is a huge amount of them, the metadata is on demand ("tooltip", which we could rename). For other tables, the column header is also the metadata.
>
> I see some need for a better definition and implementation of this extra information called metadata, for each chart and chart component.
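
Building on the metadata idea in the comments above, a row entry returned by the tree endpoint could carry such a key-value map. A hypothetical sketch (field names illustrative, keys taken from the comment):

```
srv (ret): {"entries": [
  {"id": 12, "labels": ["my_process"],
   "metadata": {"pid": 4242, "tid": 4243, "ppid": 1, "exec_name": "my_process"}}
]}
```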

## Filter states (a.k.a. intervals) of a timegraph chart

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111,
     "end": 222222222,
     "nbTimes": 1982},
   "requested_intervals":
     [ThreadA*,FunctionB*,BankTransactionC*]
  }
}
```

> **Collaborator:** The proposed solution for that is to provide a filter data structure to the back-end, which will be applied when querying data (here, states) from the back-end.
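
One possible shape for such a filter data structure, as a purely hypothetical sketch (`requested_filter` and its fields are made up for illustration):

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111, "end": 222222222, "nbTimes": 1982},
   "requested_filter":                         // hypothetical parameter
     {"expressions": [ThreadA*, FunctionB*]}   // match conditions applied server-side
  }
}
```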

## Filter states (a.k.a. intervals) of the chart with fullsearch
> **Collaborator:** Yeah, we are still struggling with this. The reason why there is a full search is performance: having to search the whole data set can be slow. So it's proposed to allow querying only sampled data instead of the whole data set in the requested interval.

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111,
     "end": 222222222,
     "nbTimes": inf},  // or "max"; either way, give the idea that we are trying to get all possible samples
   "requested_intervals":
     [ThreadA*,FunctionB*]
  }
}
```

> **Collaborator:** We still need the actual number of states to be returned here. We still want to have one single state per time, but the type of state that is returned for a given time might change after applying the filter. Hence we need another parameter to indicate a full (deep, or "inf") search.
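
Following the comment above, one way to keep `nbTimes` as the sampling resolution and indicate the search depth separately would be an extra flag (hypothetical sketch):

```
GET tsp/api/experiments/{expUUID}/outputs/timeGraph/{outputId}/states
{"parameters":
  {"requested_timerange":
    {"start": 111111111, "end": 222222222, "nbTimes": 1920},  // still one state per sampled time
   "requested_intervals": [ThreadA*, FunctionB*],
   "full_search": true   // hypothetical flag: apply the filter to the whole data set, not only the samples
  }
}
```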

## Correlate components between charts (e.g. go-to-max)

An example of how to jump from a value in a table to the interval/state
that it represents in another chart.

First, ask the table for some info:

```
cli (ask): GET tsp/api/experiments/{expUUID}/outputs/<chart-type>/<chart-id>/tree
{"parameters":{"requested_times":[0,100000000]}} // Side question: why is "requested_times" needed?
srv (ret): {"headers":[{Min},{Max},...], "entries" :[{1 sec},{2 min},...]}
```

In order to implement the "go-to-max" functionality, the client asks the server
to return the "time series data" (in this case an interval) that was used to calculate
the value "2 min" at the table index 1,1:

```
cli (ask): GET tsp/api/experiments/{expUUID}/outputs/<chart-type>/<chart-id>/tree
{"parameters":{"table_row": 1, "table_col": 1}}
srv (ret): {"data":[{"start": 1234, "end": 2345, "label": "ThreadA"}]}
```

> **Collaborator:** So, this is an on-demand query of the min or max duration, which would be OK. The front-end would need to be instructed in some way when and when not to fetch this on-demand information. With your suggestion, a column with header "Max" or "Min" would have a special meaning and would instruct the client implementation to ask for the min or max duration.
>
> What is currently implemented is that a data provider for such a tree can have columns of type "Time Range". This indicates that the cells have time-range data in a special format that can be parsed as start and end times. How would you indicate to the client implementation to provide the UI action as well as do the remote call?
>
> The introduction of a "Time Range" data type can be re-used in other places: any "Time Range" value can be used to "select", "zoom", or "navigate" to.
>
> Both solutions are valid.

The client then uses that info to "go to max", i.e. zoom in, filter, or whatever.
In theory it could also be possible to enable the server to "auto-align" other charts;
not sure if that is a good idea.

## Customize (e.g. select a subset of the input data to use to create a new chart)

The cli asks the "time series input data" (e.g. the event table) to return some info on the input data:

```
cli (ask): GET tsp/api/experiments/{expUUID}/outputs/table/<chart-id>/columns
srv (ret): {"model":[{"name": "timestamp", "type":"number", ...}, {"name": "Device", "type":"string", ...}, ...]}
```

Cli asks to create a new statistics table using a subset of the input data.
The subset is selected using some of the previously returned params.

```
cli (ask): POST tsp/api/experiments/{expUUID}/outputs/<chart-type>
{"parameters":{"outputId/chart-id":"my.custom.chart", "include":[{"Device": "CPU0"},...]}}
```
> **Collaborator:** Right now, the API of the virtual table sends the column IDs to the back-end (requested_items) to request that only certain columns be returned. For data-tree tables, I think it's not possible to remove columns in the back-end call.
>
> Showing a new table instead of updating the existing one is the client implementation's choice. Both options could be provided.
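
For reference, the column-selection behaviour described in the comment above could look roughly like this; `requested_items` is the parameter named in the comment, while the endpoint and paging parameters are illustrative:

```
cli (ask): POST tsp/api/experiments/{expUUID}/outputs/table/<chart-id>/lines
{"parameters":
  {"requested_items": [0, 2],    // IDs of the columns to return
   "requested_table_index": 0,   // illustrative paging parameters
   "requested_table_count": 100}
}
```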