introduce RecordBuilder concept to split up Archiver code #20394

diosmosis · 2023-02-23T22:53:40Z

Description:

This PR introduces a new concept to encapsulate archiving logic more granularly. Instead of a single Archiver class that aggregates all numeric and blob records, the aggregation is split into smaller RecordBuilder classes, each meant to encapsulate as few log table queries as possible while still tying together metrics/reports that depend on each other.

The introduction of this pattern would have the following benefits:

- Splitting up large Archiver classes.
- Better query origin hints. Right now we add the plugin name to LogAggregator queries, e.g. SELECT /* Goals */. Now the RecordBuilder class name is added as well (if defined), which allows faster identification of the source of a query.
- Possibility of simplified archiving logic, as the archiving process could be modified to no longer need done flags. The done flags exist primarily to differentiate between "all plugin" archives, "single plugin" archives and "partial archives", but these "types" of archives are only necessary because Matomo groups many log aggregation queries within the same unit of logic, i.e., Archivers. With smaller classes (i.e., RecordBuilders), we no longer need to think about whether it's more efficient to archive for a single plugin or for every plugin when only a single record is requested.
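To illustrate the query origin hint described in the list above, here is a minimal sketch. The helper names `buildQueryOriginHint()` and `addOriginHint()`, the exact hint format, and the namespace in the example are assumptions made for illustration; they are not taken from this PR's code.

```php
<?php
// Hypothetical helpers, NOT Matomo's actual implementation.

/**
 * Builds a SQL comment naming the plugin and, if given, the RecordBuilder class.
 */
function buildQueryOriginHint(string $pluginName, ?string $recordBuilderClass = null): string
{
    // Keep only identifier characters and namespace separators so the value
    // can never close the SQL comment early.
    $clean = static function (string $value): string {
        return preg_replace('/[^A-Za-z0-9_\\\\]/', '', $value);
    };

    $hint = $clean($pluginName);
    if ($recordBuilderClass !== null && $recordBuilderClass !== '') {
        // Use the short class name to keep the hint compact.
        $parts = explode('\\', $clean($recordBuilderClass));
        $hint .= ' ' . end($parts);
    }

    return '/* ' . $hint . ' */';
}

/**
 * Inserts the hint right after the leading SELECT keyword.
 */
function addOriginHint(string $sql, string $hint): string
{
    return preg_replace('/^\s*SELECT\b/i', 'SELECT ' . $hint, $sql, 1);
}

// The namespace below is an assumed example value.
echo addOriginHint(
    'SELECT idvisit FROM log_visit',
    buildQueryOriginHint('Goals', 'Plugins\Goals\RecordBuilders\GeneralGoalsRecords')
), "\n";
```

With a hint like this, a slow query seen in the database process list can be traced to one RecordBuilder rather than to a whole plugin.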

The archiving process changes from:
- find latest archives containing reports for site(s), period(s), segment (keeping in mind that there are "all", "single plugin" or "single report" archives)
- figure out if the done flag has an old ts_archived
- based on parameter values (ie, period = range? segment being used?) determine what type of new archive to make if out of date
- if all plugins archive, archive all plugins w/ single idarchive. if single plugin archive, archive one plugin w/ single idarchive. if single report, archive one report w/ plugin done flag and done value = 5.
to:
- find idarchives for site(s), period(s), segment
- look for data by name within idarchives where ts_archived is within ttl
- for data that is not found (because it does not exist or is too old), find associated RecordBuilders and invoke, archiving data. (note: this can still take into account whether the browser can trigger archiving or not)
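The proposed flow can be sketched with plain arrays. This is a toy in-memory model only: the function names, the array shapes and the builder name `ActionsUrlRecords` are assumptions for illustration, and real archive lookups go through the database rather than arrays.

```php
<?php
// Toy model of the proposed lookup; names and data shapes are assumptions.

/**
 * @param array<int, array<string, int>> $archives  idarchive => [record name => ts_archived]
 * @param string[]                       $requested record names being requested
 * @return array{found: array<string, int>, missing: string[]}
 */
function findFreshRecords(array $archives, array $requested, int $now, int $ttl): array
{
    $found = [];
    foreach ($archives as $idArchive => $records) {
        foreach ($requested as $name) {
            $tsArchived = $records[$name] ?? null;
            // A record counts as found only if it exists and is within the TTL.
            if ($tsArchived !== null && $now - $tsArchived <= $ttl) {
                $found[$name] = $idArchive;
            }
        }
    }

    $missing = array_values(array_diff($requested, array_keys($found)));

    return ['found' => $found, 'missing' => $missing];
}

/**
 * @param string[]                $missing  record names that still need archiving
 * @param array<string, string[]> $builders builder name => record names it produces
 * @return string[] names of the builders that must be invoked
 */
function buildersFor(array $missing, array $builders): array
{
    $toInvoke = [];
    foreach ($builders as $builder => $records) {
        if (array_intersect($records, $missing)) {
            $toInvoke[] = $builder;
        }
    }

    return $toInvoke;
}

$archives = [
    10 => ['Goal_1_nb_conversions' => 1000], // fresh
    11 => ['Actions_actions_url' => 100],    // too old
];
$result = findFreshRecords($archives, ['Goal_1_nb_conversions', 'Actions_actions_url'], 1100, 150);
$toInvoke = buildersFor($result['missing'], [
    'GeneralGoalsRecords' => ['Goal_1_nb_conversions'],
    'ActionsUrlRecords'   => ['Actions_actions_url'], // hypothetical builder name
]);
```

In this example only the builder for the stale Actions record is selected, and no done flag is consulted at any point.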

The Rules class should then reduce to two public methods: isRequestAuthorizedToArchive() and shouldArchiveAllPlugins().

Changes:

- Introduce RecordBuilder base class.
- Get plugin RecordBuilders (if any) in Plugin\Archiver, and invoke in callAggregateDayReport/callAggregateMultipleReports.
- Split Goals Archiver into two classes, GeneralGoalsRecords (which performs one log aggregation query to build many records) and ProductRecord (which performs 2+ queries to build 2 records). ProductRecord is parameterized and is added via an event.
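As a rough picture of the pattern, the sketch below shows a parameterized builder that produces several records from the rows of a single query. `SketchRecordBuilder` and `SketchGoalRecords` are simplified stand-ins invented for this example, and the row shape is assumed; the real base class added in this PR works with an ArchiveProcessor and has a different interface.

```php
<?php
// Simplified stand-ins, NOT the RecordBuilder class added in this PR.

abstract class SketchRecordBuilder
{
    /** @return string[] names of the records this builder can produce */
    abstract public function getRecordNames(): array;

    /**
     * @param array<int, array<string, mixed>> $logRows rows of one log aggregation query
     * @return array<string, int|float> record name => value
     */
    abstract protected function aggregate(array $logRows): array;

    public function build(array $logRows): array
    {
        $records = $this->aggregate($logRows);

        // Return only declared records so callers can rely on the metadata.
        return array_intersect_key($records, array_flip($this->getRecordNames()));
    }
}

final class SketchGoalRecords extends SketchRecordBuilder
{
    /** @var int[] */
    private $goalIds;

    // The builder is parameterized: one instance covers the given goals.
    public function __construct(array $goalIds)
    {
        $this->goalIds = $goalIds;
    }

    public function getRecordNames(): array
    {
        $names = [];
        foreach ($this->goalIds as $id) {
            $names[] = "Goal_{$id}_nb_conversions";
            $names[] = "Goal_{$id}_revenue";
        }

        return $names;
    }

    protected function aggregate(array $logRows): array
    {
        // One pass over the rows of a single query yields many records.
        $records = array_fill_keys($this->getRecordNames(), 0);
        foreach ($logRows as $row) {
            $id = $row['idgoal'];
            if (!in_array($id, $this->goalIds, true)) {
                continue;
            }
            $records["Goal_{$id}_nb_conversions"] += 1;
            $records["Goal_{$id}_revenue"] += $row['revenue'];
        }

        return $records;
    }
}

$builder = new SketchGoalRecords([1, 2]);
$records = $builder->build([
    ['idgoal' => 1, 'revenue' => 5.0],
    ['idgoal' => 1, 'revenue' => 2.5],
    ['idgoal' => 3, 'revenue' => 9.0], // ignored: goal 3 is not handled by this instance
]);
```

Because each builder declares the records it produces, the archiving code can decide which builders to run without knowing how they aggregate.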


tsteur

@diosmosis the record builders look nice 👍 Makes it easier to have these separated.

The only concern I have is that the aggregation of range reports might still be slowish when there are say 20 goals configured but the data is only requested for one goal. Then we still also need to aggregate the data for the other 19 goals if I understand this correctly? Less concerned around day archives but more when aggregating existing reports to build the range.

I understand there's a problem around knowing which reports need to be processed?

I was hoping that when for example a report like below is requested:

$archive = Archive::build($idSite, 'range', '2021-02-03,2023-03-03', $segment);
$dataTable = $archive->getDataTable('Goal_1_nb_conversions');

That we would only need to aggregate the Goal_1_nb_conversions records but not anything else. Or at least only aggregate the records for the same goal (Goal_1_*). As having 20 goals makes it effectively 20 times slower than needed otherwise.

Was this where there is a problem around partial archives and knowing what reports we already have?

diosmosis · 2023-03-03T00:23:04Z

@tsteur

> The only concern I have is that the aggregation of range reports might still be slowish when there are say 20 goals configured but the data is only requested for one goal. Then we still also need to aggregate the data for the other 19 goals if I understand this correctly? Less concerned around day archives but more when aggregating existing reports to build the range.

No, this would only be an issue for day periods, where we're constrained by the query. Multiple-period archives can just archive one report if we want. Just because a RecordBuilder returns metadata for, say, 100 goals doesn't mean we have to aggregate all of them. We could add some code to aggregate just the requested report. Note, though, that this logic is not in this PR, since it would require implementing RecordBuilders everywhere first. Actually... I guess it would still need to support Archiver-only use... I could try to do it here, if a RecordBuilder can be found for a report. I'll give that a shot, but the ultimate idea was to get rid of done flags altogether; otherwise, for range archives there would be nothing but partial archives, which complicates things (e.g., we'd have to check that a partial archive we found has the requested data, but only for reports where we are doing single-report archiving, before we query the data).
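A minimal sketch of "just aggregate the requested report" for multi-period archives follows. The function names and array shapes are assumptions for illustration; the point is only that the record list a builder declares can be narrowed before child-period data is summed.

```php
<?php
// Illustrative only; function names and data shapes are assumptions.

/**
 * @param string[]    $recordNames     all records a builder declares
 * @param string|null $requestedReport e.g. 'Goal_1_nb_conversions'; null or '' means "all"
 * @return string[] the records that actually need aggregating
 */
function selectRecordsToAggregate(array $recordNames, ?string $requestedReport): array
{
    if ($requestedReport === null || $requestedReport === '') {
        return $recordNames;
    }

    return array_values(array_filter(
        $recordNames,
        function (string $name) use ($requestedReport): bool {
            return $name === $requestedReport;
        }
    ));
}

/**
 * Sums numeric records of the child periods, touching only the selected names.
 *
 * @param array<int, array<string, int|float>> $periodArchives one entry per child period
 * @param string[]                             $recordNames
 * @return array<string, int|float>
 */
function aggregateAcrossPeriods(array $periodArchives, array $recordNames): array
{
    $totals = array_fill_keys($recordNames, 0);
    foreach ($periodArchives as $records) {
        foreach ($recordNames as $name) {
            $totals[$name] += $records[$name] ?? 0;
        }
    }

    return $totals;
}

$declared = ['Goal_1_nb_conversions', 'Goal_2_nb_conversions'];
$selected = selectRecordsToAggregate($declared, 'Goal_1_nb_conversions');
$totals = aggregateAcrossPeriods(
    [
        ['Goal_1_nb_conversions' => 3, 'Goal_2_nb_conversions' => 7],
        ['Goal_1_nb_conversions' => 4],
    ],
    $selected
);
```

Here the second goal's data is never read, which is the saving the 20-goals question is about. The result would have to be stored as a partial archive, which is the complication described in the reply.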

> Was this where there is a problem around partial archives and knowing what reports we already have?

The partial archive problem was mostly just for the "simple" solution that applied the aggregation change automatically to every report. If we request, eg, Actions_actions_url first for a range (and there is no all plugins archive since we don't create those anymore for ranges), we'll get that data. But Actions_nb_keywords, which is built from the sitesearch record, will be set to 0 (since we skip building the sitesearch record). If the user then looks at the visits overview, distinct keywords will be 0 until they look at the sitesearch record. Edge cases like this would be unavoidable since Archivers just aggregate everything for a plugin, not just related reports/metrics. RecordBuilders fix that.


michalkleiner · 2023-05-22T13:49:05Z

I went over the PR and left a ton of inline comments/questions that we can go over together as a team. Some of those might be just from me not fully understanding stuff potentially even before the change, but otherwise I think it helped me to read through the whole change line by line.

diosmosis · 2023-05-22T19:21:43Z

Applied some PR fixes.


sgiehl

Left a couple of minor code improvement suggestions like adding or improving (return) type hints.

Generally this would then be good to merge I think. We would consider merging this before releasing the Matomo 5 beta.
@diosmosis Are you going to add / update documentation around the new record builders? Would be awesome to have that documented e.g. here:
https://developer.matomo.org/guides/archiving
https://developer.matomo.org/guides/archiving-behavior-specification

We could maybe even consider adding something to the migration guide, so plugin developers know how to migrate their plugins:
https://developer.matomo.org/guides/migrate-matomo-4-to-5


diosmosis · 2023-05-27T21:13:40Z

@sgiehl @michalkleiner made the requested changes + a couple others that might require another quick review.

> Are you going to add / update documentation around the new record builders? Would be awesome to have that documented e.g. here:
> https://developer.matomo.org/guides/archiving
> https://developer.matomo.org/guides/archiving-behavior-specification
>
> We could maybe even consider adding something to the migration guide, so plugin developers know how to migrate their plugins:
> https://developer.matomo.org/guides/migrate-matomo-4-to-5

That's not currently a part of my work but I might do it at some later point. For now I created an issue: matomo-org/developer-documentation#733. Do note that RecordBuilders are not currently @api and the Archiver methods have not yet been deprecated (up to a decision maker when/if this would be).

sgiehl

Left some more comments in terms of security. Otherwise this PR looks good to me to merge.

sgiehl · 2023-06-02T13:54:37Z

core/Archive.php

+    {
+        $requestedReport = null;
+        if (SettingsServer::isArchivePhpTriggered()) {
+            $requestedReport = Request::fromRequest()->getStringParameter('requestedReport', '');


Note: The new Request class doesn't do any sanitizing. I had a rough look at where this parameter is passed through. It looks like there shouldn't be any risk in using the possibly user-provided value.
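One way to guard a user-provided report name is an allow-list check before the value is used. The sketch below is an assumption about what such a check could look like, not the check that was added to the PR; the character set (letters, digits, underscores, dashes) and both function names are chosen for illustration.

```php
<?php
// Hypothetical validation; the allowed character set is an assumption.

function isValidRecordName(string $name): bool
{
    // \A and \z anchor the whole string, so a trailing newline is rejected too.
    return preg_match('/\A[A-Za-z0-9_-]+\z/', $name) === 1;
}

/**
 * Returns the report name if it looks like a record name, otherwise null.
 */
function sanitizeRequestedReport(string $raw): ?string
{
    $raw = trim($raw);

    return isValidRecordName($raw) ? $raw : null;
}

var_dump(sanitizeRequestedReport('Goal_1_nb_conversions'));
var_dump(sanitizeRequestedReport("Goal_1'; DROP TABLE x"));
```

Rejecting anything outside a small character set is simpler to reason about than escaping, since record names are identifiers rather than free text.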


sgiehl

Looks good to me now.
@diosmosis I'll merge that one now. Feel free to rebase the PRs that were built on that one and add a needs review label to those we shall review and merge next.

diosmosis added 2 commits February 23, 2023 13:23

introduce RecordBuilder concept and re-organize Goals archiving code via RecordBuilders

6a388a0

fix loop iteration bug

5f42975

diosmosis marked this pull request as draft February 23, 2023 22:53

diosmosis added 8 commits February 25, 2023 15:54

Merge branch '5.x-dev' into record-builders-poc

a9c3168

split ecommerce records recordbuilder into 3 separate records

5921f9d

make sure Goals::getRecordMetadata() behaves like old archiver code

b2d8e10

make sure recordbuilder archive processor is restored after being used since archiving is a recursive process

2be26f2

just make ArchiveProcessor a parameter

0db7cb5

check for plugin before calling buildMultiplePeriod()

fda7ee7

do not invoke record builders if archiver has no plugin (happens during tests)

b2ea252

insert empty DataTables (as this appears to be the existing behavior before this change)

6cfe3ee

diosmosis marked this pull request as ready for review February 26, 2023 22:50

diosmosis marked this pull request as draft February 26, 2023 22:52

diosmosis added 2 commits February 26, 2023 14:57

add RecordBuilder class name to aggregation query hint

43bd346

clear up in-source todo

c24d3aa

diosmosis marked this pull request as ready for review February 26, 2023 23:25

diosmosis added the "Needs Review" label Feb 26, 2023

diosmosis added this to the 5.0.0 milestone Feb 26, 2023

tsteur reviewed Mar 2, 2023


diosmosis added 10 commits March 3, 2023 06:54

attempt only archiving requested report if range archive and the record needed is created by a RecordBuilder

c2de813

refactor ArchiveSelector::getArchiveIds() to provide result with string keys

0139de3

when all found archives are partial archives, check that requested data is present within them. if some are not present, only archive those in a new partial archive.

f7e3831

return correct value in Model::getRecordsContainedInArchives()

70b5bab

fix if formatting

d05f243

existingArchives can be falsy

1609d31

existing archives can be null if the check is not relevant to the current archive request

234467a

do not archive dependent segments if only processing the specific requested report

cb1cf08

fix more tests

21b62d8

fix LoaderTest

99599e9

michalkleiner reviewed May 22, 2023

core/Plugin/Archiver.php (outdated, resolved)

plugins/Goals/RecordBuilders/GeneralGoalsRecords.php (outdated, resolved; three comment threads)

diosmosis added 2 commits May 22, 2023 12:00

apply review feedback

8a7c75d

remove stray debugging change

831e000

michalkleiner added 5 commits May 25, 2023 22:55

Merge branch '5.x-dev' into record-builders-poc

cf3def7

Merge remote-tracking branch 'origin/5.x-dev' into record-builders-poc

fd5f69e

Update variable name for consistency

b8bccd7

Remove unnecessary array_filter since a valid class name never has an empty segment

76a2823

Add TODOs

e59621d

sgiehl requested changes May 26, 2023


diosmosis added 5 commits May 26, 2023 18:56

add comment on why we look for data within partial archives prior to reporting whether archives were found or not

bb1b2f3

typehint fixes + make insertBlobRecord (formerly insertRecord) protected for use in RecordBuilders that need to manually insert data

94c9796

more typehints

0a30895

in aggregateNumericMetrics() allow operationsToApply to be array mapping column name to op

6a5c2fd

optimization: when getting recordbuilders, only post Archiver.addRecordBuilders event for requested plugin since it is expected for those event handlers to perform queries

b99f84d

diosmosis and others added 2 commits May 30, 2023 02:22

default to null if default column aggregation operation is not specified

0b197bb

Merge branch '5.x-dev' into record-builders-poc

f06453b

sgiehl reviewed Jun 2, 2023


diosmosis and others added 3 commits June 2, 2023 13:32

add check for invalid record name to Record

264c500

allow dashes in record name since entity IDs can be used in them

540e6a2

Merge branch '5.x-dev' into record-builders-poc

cdafadc

sgiehl approved these changes Jun 5, 2023


sgiehl merged commit adcae6d into 5.x-dev Jun 5, 2023

sgiehl deleted the record-builders-poc branch June 5, 2023 07:52

sgiehl changed the title ~~introduce RecordBuilder concept to split up Archiver code and use in Goals~~ introduce RecordBuilder concept to split up Archiver code Aug 1, 2023
