[FEATURE]PPL new trendline command #3013
Comments
Cross-posting from the PPL PR: I have this almost hooked up. I loaded the students table, which has name, gpa, and grad_year fields. When I issue the PPL query below, the result appears to use the schema from the implied ProjectOperator instead of the schema from the TRENDLINE command, even though I overrode TrendlineOperator#schema() to build a schema from the computations list:

{ "query" : "source=students | TRENDLINE SMA(1, gpa) as foo " }

I get the following JSON result, in which every row is an array of nulls:

{
  "schema": [
    { "name": "grad_year", "type": "long" },
    { "name": "name", "type": "string" },
    { "name": "gpa", "type": "float" }
  ],
  "datarows": [
    [ null, null, null ],
    [ null, null, null ],
    [ null, null, null ]
  ],
  "total": 3,
  "size": 3
}

However, if I change the PPL to use an alias that happens to have the same name as the original field:

{ "query" : "source=students | TRENDLINE SMA(1, gpa) as gpa " }

I get data back correctly for one of the array elements in each row.

Is it correct that ProjectOperator does not use the schema from its input? I would expect the only field in this schema to be the one trendline computation ("foo"), rather than all 3 fields in the real index, but perhaps I'm mistaken here.

I also noticed that the PhysicalPlan class mentions that only ProjectOperator should implement schema():
Is the proper way to specify a schema for an operator to implement the operator then create a ProjectOperator on top of it? Thanks
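One way to reason about the symptom described above is with a toy model. The sketch below is not the plugin's code: `project` is a hypothetical stand-in for a projection that looks up each schema column by name in every incoming row, and the gpa values are made up for illustration. Under that assumption, rows that only carry the key "foo" produce all-null output against the index schema, while rows that carry the key "gpa" fill exactly one cell, which matches both observations.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def project(rows: List[Row], schema_names: List[str]) -> List[List[Optional[Any]]]:
    """Hypothetical name-based projection.

    Emits one cell per schema column; a column missing from the row becomes None.
    """
    return [[row.get(name) for name in schema_names] for row in rows]


# Column order taken from the JSON response shown in the comment.
index_schema = ["grad_year", "name", "gpa"]

# Assumed trendline output: each row carries only the computed alias.
# The numeric values are invented for this illustration.
rows_alias_foo = [{"foo": 3.5}, {"foo": 3.1}]
rows_alias_gpa = [{"gpa": 3.5}, {"gpa": 3.1}]

foo_result = project(rows_alias_foo, index_schema)  # every cell is None
gpa_result = project(rows_alias_gpa, index_schema)  # only the gpa cell is filled
```

If the real projection behaves like this model, the output schema is coming from the projected index fields rather than from the trendline computations, which is the question the comment raises.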
Possible design for trendline output schema:
Is your feature request related to a problem?
Adding a new PPL trendline command to support computing moving averages of fields. We would like to support two flavours of moving average:
SMA : Simple moving average
SMA(t) = (1/n) * Σ(f[i]), where i = t-n+1 to t
WMA : Weighted moving average
WMA(t) = Σ(w[i] * f[i]) / Σ(w[i]), where i = t-n+1 to t
Where w[i] is the weight for the i-th data-point.
In a typical WMA, the weights are linearly decreasing from the most recent to the oldest data-point:
w[i] = n - (t - i), where i = t-n+1 to t
The complete formulation would be:
WMA(t) = Σ((n - (t - i)) * f[i]) / Σ(n - (t - i)), where i = t-n+1 to t
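The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the proposed implementation: the function names `sma` and `wma` are hypothetical, and returning None for the first n-1 points (where a full window is not yet available) is an assumption, since the formulas do not say how incomplete windows should be handled.

```python
from typing import List, Optional, Sequence


def sma(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Simple moving average: SMA(t) = (1/n) * sum(f[i]) for i = t-n+1 .. t."""
    out: List[Optional[float]] = []
    for t in range(len(values)):
        if t < n - 1:
            out.append(None)  # assumption: no value until the window is full
        else:
            window = values[t - n + 1 : t + 1]
            out.append(sum(window) / n)
    return out


def wma(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Weighted moving average with linear weights w[i] = n - (t - i)."""
    denominator = n * (n + 1) / 2  # sum of the weights 1 + 2 + ... + n
    out: List[Optional[float]] = []
    for t in range(len(values)):
        if t < n - 1:
            out.append(None)  # assumption: no value until the window is full
        else:
            window = values[t - n + 1 : t + 1]
            # Position k in the window (0 = oldest) has weight k + 1,
            # so the oldest point gets 1 and the most recent gets n.
            numerator = sum((k + 1) * v for k, v in enumerate(window))
            out.append(numerator / denominator)
    return out


sma_result = sma([1, 2, 3, 4, 5], 3)
wma_result = wma([1, 2, 3, 4, 5], 3)
```

For a rising series the WMA sits above the SMA at every point, because the most recent (largest) values carry the highest weights. With n = 1 both functions return the input unchanged.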
Example
The next command shows a 5-month trendline of events by month.
The next command would compute a 5-point simple moving average of the 'cpu_usage' field and store it in a new field called 'smooth_cpu'.
Multiple trendlines could be calculated in a single command, such as