-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce RecordBuilder concept to split up Archiver code #20394
Conversation
…d since archiving is a recursive process
…before this change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@diosmosis the record builders look nice 👍 Makes it easier to have these separated.
The only concern I have is that the aggregation of range reports might still be slowish when there are say 20 goals configured but the data is only requested for one goal. Then we still also need to aggregate the data for the other 19 goals if I understand this correctly? Less concerned around day archives but more when aggregating existing reports to build the range.
I understand there's a problem around knowing which reports need to be processed?
I was hoping that when for example a report like below is requested:
$archive = Archive::build($idSite, 'range', '2021-02-03,2023-03-03', $segment);
$dataTable = $archive->getDataTable('Goal_1_nb_conversions');
That we would only need to aggregate the Goal_1_nb_conversions
records but not anything else. Or at least only aggregate the records for the same goal (Goal_1_*
). As having 20 goals makes it effectively 20 times slower than needed otherwise.
Was this where there is a problem around partial archives and knowing what reports we already have?
No, this would only be an issue for day periods where we're constrained by the query. Multiple periods can just archive one report if we want. Just because a RecordBuilder returns metadata for, say, 100 goals, doesn't mean we have to aggregate all of them. We could add some code to just aggregate the requested report. Note, though, that this logic is not in this PR since it would require implementing RecordBuilders everywhere first. Actually... I guess it would need to support Archiver only use still... I could try and do it here, if a RecordBuilder can be found for a report. I'll give that a shot, but the ultimate idea was to get rid of done flags altogether, otherwise for range archives there would be nothing but partial archives, which complicates things (eg, we'd have to check that a partial archives we found has the requested data, but only for reports where we are doing single report archiving, before we query the data).
The partial archive problem was mostly just for the "simple" solution that applied the aggregation change automatically to every report. If we request, eg, Actions_actions_url first for a range (and there is no all plugins archive since we don't create those anymore for ranges), we'll get that data. But Actions_nb_keywords, which is built from the sitesearch record, will be set to 0 (since we skip building the sitesearch record). If the user then looks at the visits overview, distinct keywords will be 0 until they look at the sitesearch record. Edge cases like this would be unavoidable since Archivers just aggregate everything for a plugin, not just related reports/metrics. RecordBuilders fix that. |
…rd needed is created by a RecordBuilder
…ta is present within them. if some are not present, only archive those in a new partial archive.
…rent archive request
I went over the PR and left a ton of inline comments/questions that we can go over together as a team. Some of those might be just from me not fully understanding stuff potentially even before the change, but otherwise I think it helped me to read through the whole change line by line. |
Applied some PR fixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of minor code improvement suggestions like adding or improving (return) type hints.
Generally this would then be good to merge I think. We would consider merging this before releasing the Matomo 5 beta.
@diosmosis Are you going to add / update documentation around the new record builders? Would be awesome to have that documented e.g. here:
https://developer.matomo.org/guides/archiving
https://developer.matomo.org/guides/archiving-behavior-specification
We could maybe even consider adding something to the migration guide, so plugin developers know how to migrate their plugins:
https://developer.matomo.org/guides/migrate-matomo-4-to-5
…reporting whether archives were found or not
…ted for use in RecordBuilders that need to manually insert data
…ing column name to op
…rdBuilders event for requested plugin since it is expected for those event handlers to perform queries
@sgiehl @michalkleiner made the requested changes + a couple others that might require another quick review.
That's not currently a part of my work but I might do it at some later point. For now I created an issue: matomo-org/developer-documentation#733. Do note that RecordBuilders are not currently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some more comments in terms of security. Otherwise this PR looks good to me to merge.
{ | ||
$requestedReport = null; | ||
if (SettingsServer::isArchivePhpTriggered()) { | ||
$requestedReport = Request::fromRequest()->getStringParameter('requestedReport', ''); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: The new request class doesn't do any sanitizing. I had a rough look where this parameter is passed through. Looks like there shouldn't be any risk of using the possibly user provided value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now.
@diosmosis I'll merge that one now. Feel free to rebase the PRs that were built on that one and add a needs review
label to those we shall review and merge next.
Description:
Refs #7573
This PR introduces a new concept to more granularly encapsulate archiving logic. Instead of a single Archiver class that aggregates all numeric and blob records, they are put into smaller RecordBuilder classes that are meant to encapsulate as few log table queries as possible, while still tying together metrics/reports that are dependent on each other.
The introduction of this pattern would have the following benefits:
Splitting up large Archiver classes.
Better query origin hints. Right now we add the plugin name to LogAggregator queries, like,
SELECT /* Goals */
, now the RecordBuilder class name is added as well (if defined), which allows faster identification of the source of a query.Possibility of simplified archiving logic as the archiving process could be modified to no longer need done flags. The done flags exist to primarily differentiate between "all plugin" archives or "single plugin" archives or "partial archives", but these "types" of archives are only necessary because Matomo groups many log aggregation queries within the same unit of logic, ie, Archivers. With smaller classes (ie, RecordBuilders), we no longer need to think about whether it's more efficient to only archive for a single plugin or every plugin when only a single record is requested.
The archiving process changes from:
to:
the
Rules
class should reduce to having two public methods:isRequestAuthorizedToArchive()
andshouldArchiveAllPlugins()
Changes:
Review