
[WIP] TSP description, vision and guidelines #91

Open
wants to merge 1 commit into master
Conversation

@frallax frallax commented Apr 4, 2023

My interpretation of the TSP. A tentative description of what it is, what it aims to accomplish and the use cases it aims to solve.

Some initial guidelines to follow when extending the TSP to accommodate new use cases.

@marco-miller marco-miller requested a review from bhufmann April 4, 2023 17:01

The TSP is a RESTful API on top of the HTTP protocol that enables clients to:

- analyze time series data (e.g. traces and logs);
@abhinava-ericsson abhinava-ericsson Apr 4, 2023
I think this has been one of the biggest sources of contention.
Is it:

  1. "Analyze time series data (e.g. traces and logs)"
    or
  2. "Analyze computational trace and log data (e.g. CPU traces, GPU traces, software logs, etc.)"

This will decide what is "domain-specific" (i.e. are traces/logs themselves a domain, or are GPU traces a domain?). Since the TSP and the viewer can't be made directly aware of anything domain-specific, this choice will dictate the kind of explicit endpoints and parameters we can have, versus what we need to pass only implicitly as optional embedded parameters.

The choice is also important because if we:

  • Choose 1:
    a) The TSP will have the potential to be used to analyze non-trace data, e.g. financial data.
    b) We may end up making it difficult (if not impossible) to support some specific trace use cases (especially ones that don't fall under the draw-and-navigate-chart category).
    c) We can defer making choice 2 until later, when we actually find such a use case. But we would probably want to rethink the TSP architecture anyway at that point to reduce complexity.

  • Choose 2:
    a) At some point, probably soon enough, we will end up taking a decision about the TSP architecture which will disallow choice 1 in the future.
    b) If trace analysis and visualization is indeed limited to drawing and navigating charts, we may end up over-specializing the TSP for known trace types.

Collaborator

The goal of the TSP was to show the results of analyses of traces and logs coming from different sources (HW, SW, network) and from different layers of applications. They all have in common that they contain time-ordered events with a timestamp and a payload. The payload is domain-specific, and hence it can contain all kinds of information pertinent to the domain. The TSP doesn't impose any restrictions on what the payload has to be.

The TSP provides analysis results in data structures produced by trace server back-ends, which do the computations and know about the domain they are analyzing. The data structures of analysis results are like UI models that can easily be serialized and visualized. Hence there are data structures for tables, trees, XY charts, and time graphs (Gantt charts). Since it handles time series, many data structures share a common time axis. But it's not limited to that: it can have other x-axes to show other results (e.g. distribution charts), and it can have tables for summary information like computation statistics.

Because of the nature of time series, timestamps and time ranges play an important role in the TSP.

CPUs, GPUs, processes, etc. are specific to the trace data and to the back-end implementation that analyses the time series data. If these concepts were added to the TSP specification, then we might end up having to add other "concepts" for other use cases (networks, spans, and so on).

I think CPUs, GPUs and processes should be abstracted in the trace data by using common identifiers that are associated with a given chart element. For example, a row can have a CPU identifier (key) and a value, and a trace event will have a CPU field shown in a table column. With this, a trace event can be correlated with a row in another graph.
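To illustrate the abstraction idea above, a row in a chart response could carry a domain-agnostic key/value pair that a client correlates with a table column, without the TSP itself knowing what a "CPU" is. This is only a sketch; the `metadata` field name and its contents are hypothetical, not part of the current spec:

```
srv (ret): {"rows": [
  {"id": 42,
   "labels": ["softirq_raise"],
   "metadata": {"cpu": "0"}}   // hypothetical key/value map; "cpu" is an opaque identifier to the TSP
]}
```

A client could then match the `cpu` value against a CPU column in an events table to correlate the two views, with no CPU-specific endpoint in the protocol.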

@bhufmann

I haven't had a chance to comment on this PR. I'll provide my feedback in the coming days.

@bhufmann bhufmann left a comment

Thanks for this update. I think many things are in line with the current version of the TSP.

One thing that is not clear is how to enable custom behaviors in the client implementation through the TSP without the client "knowing" what domain is being visualized.

Another thing not handled in the TSP is how to add customization of the server, e.g. creating user-defined views through some input definition. Not sure if that is part of the scope of this PR.

## About, why and goals of the TSP:

Similarly to the philosophy behind the Language Server Protocol (LSP),
the TSP is an open, HTTP-based protocol for use between *analytics or
Collaborator
LSP is a protocol based on JSON-RPC, but it's not HTTP-based. On the contrary, the TSP is an HTTP-based protocol. Please correct.

The idea of the TSP comes from the LSP, though, in that the domain-specific logic lives on the server side and the TSP transports the relevant data to a client. With that, client implementations can be exchanged, as well as server implementations, as long as they implement the TSP.
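For comparison, a minimal sketch of the two transports (the LSP method shown is a real one; the TSP endpoint follows the experiment routes used elsewhere in this thread):

```
// LSP: JSON-RPC messages over stdio or a socket
{"jsonrpc": "2.0", "id": 1, "method": "textDocument/definition", "params": {...}}

// TSP: plain HTTP requests against resource-style endpoints
GET tsp/api/experiments/{expUUID}/outputs
```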


## What the TSP is:

The TSP is a RESTful API on top of the HTTP protocol that enables clients to:
Collaborator
It's not RESTful, even if the initial idea was to have a RESTful API. I suggest changing it to "Cloud API". I used that term in a recent presentation at EclipseCon 2023, and I have also seen it used for other applications.


The specification is currently written in **OpenAPI 3.0** and can be pretty-visualized in the [github pages][tspGhPages].

**👋 Want to help?** Read our [contributor guide][contributing].

## About, why and goals of the TSP:

Similarly to the philosophy behind the Language Server Protocol (LSP),
Collaborator
"Similarly to the Language Server Protocol (LSP)" instead of "Similarly to the philosophy behind the Language Server Protocol (LSP)"


This protocol is built to decouple the backend and frontend of trace analysers, allowing traces to reside and be analysed on the backend, and visual models to be exchanged with a variety of clients.

The protocol is meant to be RESTful, over HTTP.
Collaborator
I agree to remove this sentence since the protocol is not fully RESTful. I commented more below.

@@ -1,15 +1,105 @@
# trace-server-protocol

Specification of the Trace Server Protocol

This protocol is built to decouple the backend and frontend of trace analysers, allowing traces to reside and be analysed on the backend, and visual models to be exchanged with a variety of clients.
Collaborator
I actually like this sentence. It describes what the idea of the protocol is as well as the separation of front-end and back-end and their responsibilities.

"end": 222222222,
"nbTimes": 1982},
"requested_intervals":
[ThreadA*,FunctionB*,BankTransactionC*]
Collaborator
The proposed solution for that is to provide a filter data structure to the back-end, which will be applied when querying data from the back-end (here, states).
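A hypothetical shape for such a filter structure, passed along with the usual query parameters (the `filter` and `expressions` field names are illustrative, not part of the current spec):

```
cli (ask): POST tsp/api/experiments/{expUUID}/outputs/<chart-type>/<chart-id>/states
{"parameters": {
  "requested_timerange": {"start": 111111111, "end": 222222222, "nbTimes": 1982},
  "filter": {"expressions": ["ThreadA*", "FunctionB*", "BankTransactionC*"]}}}  // applied server-side
```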

}
```

## Filter states (a.k.a. intervals) of the chart with fullsearch
Collaborator
Yeah, we are still struggling with this. The reason why there is a full search is performance: having to search the whole data set can be slow. So it's proposed to allow querying only sampled data instead of the whole data set in the requested interval.

{"requested_timerange":
{"start": 111111111,
"end": 222222222,
"nbTimes": inf}, // or "max"; anything that gives the idea that we are trying to get all possible samples
Collaborator
We still need to have the actual number of states returned here. We still want a single state per time, but the type of state returned for a given time might change after applying the filter.

Hence we need another parameter to indicate a full (deep or "inf") search.
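In other words, keep `nbTimes` as the actual number of samples to return, and add a separate flag for the deep search. A sketch with a hypothetical `full_search` parameter name:

```
{"requested_timerange":
  {"start": 111111111,
   "end": 222222222,
   "nbTimes": 1982},      // still the real sample count to return
 "full_search": true}     // hypothetical flag: search the whole data set, not just the samples
```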

```
cli (ask): GET tsp/api/experiments/{expUUID}/outputs/<chart-type>/<chart-id>/tree
{"parameters":{"table_row": 1, "table_col": 1}}
srv (ret): {"data":[{"start": 1234, "end": 2345, "label": "ThreadA"}]}
Collaborator
So, this is an on-demand query of the min or max duration, which would be OK. The front-end would need to be instructed in some way when and when not to fetch this on-demand information. With your suggestion, a column with header 'Max' or 'Min' would have a special meaning and would instruct the client implementation to ask for the min and max duration.

What is currently implemented is that a data provider for such a tree can have columns of type "Time Range". This indicates that the cells contain time range data in a special format that can be parsed as a start and end time. How would you indicate to the client implementation to provide the UI action as well as do the remote call?

The Time Range data type can be re-used in other places: any "Time Range" value can be used to "select", "zoom" or "navigate" to.

Both solutions are valid.
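A sketch of the currently implemented variant, where a tree column declares a "Time Range" data type so the client knows the cell can be parsed as a start/end pair (exact field names are illustrative):

```
srv (ret): {"headers": [{"name": "Duration", "dataType": "TIME_RANGE"}],
            "entries": [{"labels": ["[1234,2345]"]}]}  // client may offer select/zoom/navigate on this value
```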

```
cli (ask): POST tsp/api/experiments/{expUUID}/outputs/<chart-type>
{"parameters":{"outputId/chart-id":"my.custom.chart", "include":[{"Device": "CPU0"},...]}}
```
Collaborator
Right now, the API of the virtual table sends the column IDs to the back-end (requested_items) to request that only certain columns be returned. For data-tree tables, I think it's not possible to remove columns in the back-end call.

Whether to show a new table or update the existing one is the client implementation's choice. Both options could be provided.
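For reference, a sketch of how the virtual table call can restrict columns today, using `requested_items` for column IDs (the endpoint and field names follow the pattern used elsewhere in this thread, so treat the details as illustrative):

```
cli (ask): POST tsp/api/experiments/{expUUID}/outputs/table/<chart-id>/lines
{"parameters": {
  "requested_table_index": 0,
  "requested_table_count": 100,
  "requested_items": [0, 2]}}   // column IDs to include in the response
```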
