[FEATURE]PPL new trendline command #3013

Open
YANG-DB opened this issue Sep 12, 2024 · 2 comments · May be fixed by #3071
Labels
enhancement New feature or request PPL Piped processing language

Comments

@YANG-DB
Member

YANG-DB commented Sep 12, 2024

Is your feature request related to a problem?

Add a new PPL trendline command to support computing moving averages of fields.

We would like to support two flavours of moving average:

SMA : Simple moving average

  • f[i]: The value of field 'f' in the i-th data-point
  • n: The number of data-points in the moving window (period)
  • t: The current time index

SMA(t) = (1/n) * Σ(f[i]), where i = t-n+1 to t
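
As a minimal sketch of the SMA formula above (not the proposed PPL implementation), the following Python function keeps a running sum over the last n values. The function name `sma` is hypothetical, and returning `None` until the window holds n data-points is an assumption; the issue does not say what the first n-1 rows should contain.

```python
from collections import deque
from typing import Iterable, List, Optional


def sma(values: Iterable[float], n: int) -> List[Optional[float]]:
    """Simple moving average: SMA(t) = (1/n) * sum(f[i]) for i = t-n+1 .. t."""
    if n < 1:
        raise ValueError("n must be at least 1")
    window = deque()      # holds at most the last n values
    running_sum = 0.0
    result: List[Optional[float]] = []
    for value in values:
        window.append(value)
        running_sum += value
        if len(window) > n:
            # Drop the value that just left the window.
            running_sum -= window.popleft()
        # Assumption: emit None until the window is full.
        result.append(running_sum / n if len(window) == n else None)
    return result
```

For example, `sma([1, 2, 3, 4, 5], 3)` leaves the first two positions empty and averages each full window of three values after that.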


WMA : Weighted moving average

WMA(t) = Σ(w[i] * f[i]) / Σ(w[i]), where i = t-n+1 to t
Where w[i] is the weight for the i-th data-point.

In a typical WMA, the weights are linearly decreasing from the most recent to the oldest data-point:
w[i] = n - (t - i), where i = t-n+1 to t

The complete formulation would be:
WMA(t) = Σ((n - (t - i)) * f[i]) / Σ(n - (t - i)), where i = t-n+1 to t
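
The linearly weighted formula above can be sketched in Python as follows. This is illustrative only: the name `wma` is hypothetical, and as an assumption it returns `None` wherever fewer than n data-points are available. With weights n - (t - i), the newest point gets weight n and the oldest gets weight 1, so the denominator is always n(n+1)/2.

```python
from typing import List, Optional, Sequence


def wma(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Weighted moving average with weights w[i] = n - (t - i)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sum of the weights 1 + 2 + ... + n.
    denominator = n * (n + 1) / 2
    result: List[Optional[float]] = []
    for t in range(len(values)):
        if t < n - 1:
            # Assumption: no value until the window is full.
            result.append(None)
            continue
        numerator = sum(
            (n - (t - i)) * values[i] for i in range(t - n + 1, t + 1)
        )
        result.append(numerator / denominator)
    return result
```

With `values = [1, 2, 3, 4]` and `n = 3`, the value at t = 2 is (1*1 + 2*2 + 3*3) / 6 = 14/6.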


Example

The next command computes a 5-point trendline over monthly event counts:

source=t | stats count(date_month) | trendline sma(5, count) AS trend | fields  trend

The next command would compute a 5-point simple moving average of the 'cpu_usage' field and store it in a new field called 'smooth_cpu'.

source=t| trendline sma(5,cpu_usage) as smooth_cpu

Multiple trendlines could be calculated in a single command, such as:

| trendline sma(10,memory) as mem_trend wma(5,network_traffic) as net_trend
@jduo jduo linked a pull request Oct 12, 2024 that will close this issue
@jduo

jduo commented Oct 17, 2024

Cross-posting from the PPL PR:

I have this almost hooked up. I loaded the students table, which has name, gpa, and grad_year fields. When I issue the PPL query below, it seems to use the schema from the implied ProjectOperator instead of the schema from the TRENDLINE command, even though I overrode TrendlineOperator#schema() to build a schema based only on the computations list:

{ "query" : "source=students | TRENDLINE SMA(1, gpa) as foo " }

I get the following JSON result of null arrays:

{
  "schema": [
    { "name": "grad_year", "type": "long" },
    { "name": "name", "type": "string" },
    { "name": "gpa", "type": "float" }
  ],
  "datarows": [
    [ null, null, null ],
    [ null, null, null ],
    [ null, null, null ]
  ],
  "total": 3,
  "size": 3
}

However, if I change the PPL to use an alias that happens to have the same name as the original field:

{ "query" : "source=students | TRENDLINE SMA(1, gpa) as gpa " }

I get data back correctly for one of the array elements in each row.

Is it correct that ProjectOperator does not use the schema from its input?

I would expect the only field out of this schema to be the one computation in trendline ("foo"), rather than all 3 fields in the real index, but perhaps I'm mistaken here.

I also noticed that the PhysicalPlan class mentions that only ProjectOperator should implement schema():

"[BUG] schema can been only applied to " + "ProjectOperator, instead of %s",

Is the proper way to specify a schema for an operator to implement the operator then create a ProjectOperator on top of it?

Thanks

@jduo

jduo commented Oct 17, 2024

Possible design for trendline output schema:

  1. If the field in the input is not in the trendline computations, it shows up unaltered.
  2. If the field is used in trendline and the computation alias is the same as the field name, it gets replaced with the trendline computation.
  3. If the field is used in trendline and the computation alias has a different name than the field name, it shows up as a new field in the result.
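
The three rules above could be sketched roughly as follows, treating one output row as an ordered mapping. This is a hypothetical illustration, not code from the PR: the function name `trendline_output_row` and the `computed` argument (alias mapped to the trendline value for that row) are assumptions, as is the reading of rule 3 that the original field is kept and the alias is appended after it.

```python
from typing import Any, Dict


def trendline_output_row(
    row: Dict[str, Any], computed: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge trendline results into an input row.

    row:      field name -> input value (insertion order is the schema order)
    computed: alias -> trendline value for this row (hypothetical shape)
    """
    # Rule 1: start from the input, so untouched fields pass through unaltered.
    out = dict(row)
    for alias, value in computed.items():
        # Rule 2: an alias equal to an existing field name replaces that
        # field in place (dict assignment keeps the original position).
        # Rule 3: a new alias is appended as an extra field.
        out[alias] = value
    return out
```

Under these assumptions, `SMA(1, gpa) as gpa` would overwrite the gpa column, while `SMA(1, gpa) as foo` would return name, gpa, and grad_year unchanged plus a trailing foo column.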
