Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce RecordBuilder concept to split up Archiver code #20394

Merged
merged 65 commits into from
Jun 5, 2023

Conversation

diosmosis
Copy link
Member

@diosmosis diosmosis commented Feb 23, 2023

Description:

Refs #7573

This PR introduces a new concept to more granularly encapsulate archiving logic. Instead of a single Archiver class that aggregates all numeric and blob records, they are put into smaller RecordBuilder classes that are meant to encapsulate as few log table queries as possible, while still tying together metrics/reports that are dependent on each other.

The introduction of this pattern would have the following benefits:

  • Splitting up large Archiver classes.

  • Better query origin hints. Right now we add the plugin name to LogAggregator queries, like, SELECT /* Goals */, now the RecordBuilder class name is added as well (if defined), which allows faster identification of the source of a query.

  • Possibility of simplified archiving logic as the archiving process could be modified to no longer need done flags. The done flags exist to primarily differentiate between "all plugin" archives or "single plugin" archives or "partial archives", but these "types" of archives are only necessary because Matomo groups many log aggregation queries within the same unit of logic, ie, Archivers. With smaller classes (ie, RecordBuilders), we no longer need to think about whether it's more efficient to only archive for a single plugin or every plugin when only a single record is requested.

    The archiving process changes from:

    • find latest archives containing reports for site(s), period(s), segment (keeping in mind that there are "all", "single plugin" or "single report" plugins)
    • figure out if the done flag has an old ts_archived
    • based on parameter values (ie, period = range? segment being used?) determine what type of new archive to make if out of date
    • if all plugins archive, archive all plugins w/ single idarchive. if single plugin archive, archive one plugin w/ single idarchive. if single report, archive one report w/ plugin done flag and done value = 5.

    to:

    • find idarchives for site(s), period(s), segment
    • look for data by name within idarchives where ts_archived is within ttl
    • for data that is not found (because it does not exist or is too old), find associated RecordBuilders and invoke, archiving data. (note: this can still take into account whether the browser can trigger archiving or not)

the Rules class should reduce to having two public methods: isRequestAuthorizedToArchive() and shouldArchiveAllPlugins()

Changes:

  • Introduce RecordBuilder base class.
  • Get plugin RecordBuilders (if any) in Plugin\Archiver, and invoke in callAggregateDayReport/callAggregateMultipleReports.
  • Split Goals Archiver into two classes, GeneralGoalsRecords (which performs one log aggregation query to build many records) and ProductRecord (which performs 2+ queries to build 2 records). ProductRecord is parameterized and is added via an event.

Review

@diosmosis diosmosis marked this pull request as draft February 23, 2023 22:53
@diosmosis diosmosis marked this pull request as ready for review February 26, 2023 22:50
@diosmosis diosmosis marked this pull request as draft February 26, 2023 22:52
@diosmosis diosmosis marked this pull request as ready for review February 26, 2023 23:25
@diosmosis diosmosis added the Needs Review PRs that need a code review label Feb 26, 2023
@diosmosis diosmosis added this to the 5.0.0 milestone Feb 26, 2023
Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis the record builders look nice 👍 Makes it easier to have these separated.

The only concern I have is that the aggregation of range reports might still be slowish when there are say 20 goals configured but the data is only requested for one goal. Then we still also need to aggregate the data for the other 19 goals if I understand this correctly? Less concerned around day archives but more when aggregating existing reports to build the range.

I understand there's a problem around knowing which reports need to be processed?

I was hoping that when for example a report like below is requested:

$archive = Archive::build($idSite, 'range', '2021-02-03,2023-03-03', $segment);
$dataTable = $archive->getDataTable('Goal_1_nb_conversions');

That we would only need to aggregate the Goal_1_nb_conversions records but not anything else. Or at least only aggregate the records for the same goal (Goal_1_*). As having 20 goals makes it effectively 20 times slower than needed otherwise.

Was this where there is a problem around partial archives and knowing what reports we already have?

@diosmosis
Copy link
Member Author

@tsteur

The only concern I have is that the aggregation of range reports might still be slowish when there are say 20 goals configured but the data is only requested for one goal. Then we still also need to aggregate the data for the other 19 goals if I understand this correctly? Less concerned around day archives but more when aggregating existing reports to build the range.

No, this would only be an issue for day periods where we're constrained by the query. Multiple periods can just archive one report if we want. Just because a RecordBuilder returns metadata for, say, 100 goals, doesn't mean we have to aggregate all of them. We could add some code to just aggregate the requested report. Note, though, that this logic is not in this PR since it would require implementing RecordBuilders everywhere first. Actually... I guess it would need to support Archiver only use still... I could try and do it here, if a RecordBuilder can be found for a report. I'll give that a shot, but the ultimate idea was to get rid of done flags altogether, otherwise for range archives there would be nothing but partial archives, which complicates things (eg, we'd have to check that a partial archives we found has the requested data, but only for reports where we are doing single report archiving, before we query the data).

Was this where there is a problem around partial archives and knowing what reports we already have?

The partial archive problem was mostly just for the "simple" solution that applied the aggregation change automatically to every report. If we request, eg, Actions_actions_url first for a range (and there is no all plugins archive since we don't create those anymore for ranges), we'll get that data. But Actions_nb_keywords, which is built from the sitesearch record, will be set to 0 (since we skip building the sitesearch record). If the user then looks at the visits overview, distinct keywords will be 0 until they look at the sitesearch record. Edge cases like this would be unavoidable since Archivers just aggregate everything for a plugin, not just related reports/metrics. RecordBuilders fix that.

@michalkleiner
Copy link
Contributor

I went over the PR and left a ton of inline comments/questions that we can go over together as a team. Some of those might be just from me not fully understanding stuff potentially even before the change, but otherwise I think it helped me to read through the whole change line by line.

@diosmosis
Copy link
Member Author

Applied some PR fixes.

Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of minor code improvement suggestions like adding or improving (return) type hints.

Generally this would then be good to merge I think. We would consider merging this before releasing the Matomo 5 beta.
@diosmosis Are you going to add / update documentation around the new record builders? Would be awesome to have that documented e.g. here:
https://developer.matomo.org/guides/archiving
https://developer.matomo.org/guides/archiving-behavior-specification

We could maybe even consider adding something to the migration guide, so plugin developers know how to migrate their plugins:
https://developer.matomo.org/guides/migrate-matomo-4-to-5

core/Archive.php Outdated Show resolved Hide resolved
core/ArchiveProcessor/RecordBuilder.php Outdated Show resolved Hide resolved
core/DataAccess/ArchiveSelector.php Outdated Show resolved Hide resolved
core/ArchiveProcessor/RecordBuilder.php Outdated Show resolved Hide resolved
core/ArchiveProcessor/RecordBuilder.php Outdated Show resolved Hide resolved
plugins/Goals/RecordBuilders/GeneralGoalsRecords.php Outdated Show resolved Hide resolved
plugins/Goals/RecordBuilders/GeneralGoalsRecords.php Outdated Show resolved Hide resolved
plugins/Goals/RecordBuilders/GeneralGoalsRecords.php Outdated Show resolved Hide resolved
plugins/Goals/RecordBuilders/ProductRecord.php Outdated Show resolved Hide resolved
plugins/Goals/RecordBuilders/ProductRecord.php Outdated Show resolved Hide resolved
diosmosis added 5 commits May 26, 2023 18:56
…reporting whether archives were found or not
…ted for use in RecordBuilders that need to manually insert data
…rdBuilders event for requested plugin since it is expected for those event handlers to perform queries
@diosmosis
Copy link
Member Author

@sgiehl @michalkleiner made the requested changes + a couple others that might require another quick review.

Are you going to add / update documentation around the new record builders? Would be awesome to have that documented e.g. here:
https://developer.matomo.org/guides/archiving
https://developer.matomo.org/guides/archiving-behavior-specification

We could maybe even consider adding something to the migration guide, so plugin developers know how to migrate their plugins:
https://developer.matomo.org/guides/migrate-matomo-4-to-5

That's not currently a part of my work but I might do it at some later point. For now I created an issue: matomo-org/developer-documentation#733. Do note that RecordBuilders are not currently @api and the Archiver methods have not yet been deprecated (up to a decision maker when/if this would be).

Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some more comments in terms of security. Otherwise this PR looks good to me to merge.

{
$requestedReport = null;
if (SettingsServer::isArchivePhpTriggered()) {
$requestedReport = Request::fromRequest()->getStringParameter('requestedReport', '');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The new request class doesn't do any sanitizing. I had a rough look where this parameter is passed through. Looks like there shouldn't be any risk of using the possibly user provided value.

core/ArchiveProcessor/Record.php Show resolved Hide resolved
Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now.
@diosmosis I'll merge that one now. Feel free to rebase the PRs that were built on that one and add a needs review label to those we shall review and merge next.

@sgiehl sgiehl merged commit adcae6d into 5.x-dev Jun 5, 2023
@sgiehl sgiehl deleted the record-builders-poc branch June 5, 2023 07:52
@sgiehl sgiehl changed the title introduce RecordBuilder concept to split up Archiver code and use in Goals introduce RecordBuilder concept to split up Archiver code Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Do not close PRs with this label won't be marked as stale by the Close Stale Issues action Needs Review PRs that need a code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants