GOV.UK GA4 and PII process pages edits #284

Merged
merged 2 commits into from
Nov 22, 2024
35 changes: 20 additions & 15 deletions source/data-sources/ga/ga4-logs/index.html.md.erb
@@ -1,40 +1,45 @@
---
title: GA4 access logs
title: GOV.UK GA4 access logs
weight: 9
last_reviewed_on: 2024-11-20
last_reviewed_on: 2024-11-21
review_in: 6 months
---

# GA4 access logs
The GA4 access logs data details usage of the [GOV.UK Google Analytics GA4 data](/data-sources/ga/ga4/) via the Google Analytics Data API.
# GOV.UK GA4 access logs
## Content
The GOV.UK GA4 access logs data details usage of the [GOV.UK Google Analytics GA4 data](/data-sources/ga/ga4/) via the Google Analytics Data API.
This includes usage of the GOV.UK GA4 user interface and Looker Studio connections, as well as direct querying of the API, but does not include usage of the data [exported to BigQuery](/data-sources/ga/ga4-bq/).

The fields in this dataset and their descriptions can be seen in the [schema table below](/data-sources/ga/ga4-logs/#schema).

## Access
Access to the dataset is limited to GA4 user admins, but reporting
Access to the BigQuery dataset is limited to GA4 user admins.
However, the [GA4 usage report](/products/ga4-usage-dashboard/) which visualises this data is shared with all GDS performance analysts and the Data Services Google group.

[Summarised data is also provided to SPOCs](/products/govuk-ga4-spocs/) for their department.

### Location
The data is located in BigQuery under the `ga4-analytics-352613.ga4_logs` dataset, in the [GA4 Analytics project](/tools/google-cloud-platform/gcp-projects/#ga4-analytics).

The data is sharded, with a table created each day with the date as a suffix (in the format YYYYMMDD).
The data is partitioned on the `epoch_time_micros` timestamp.
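
As a minimal sketch of how the partitioning can be used, the query below restricts results to the last 7 days by filtering on `epoch_time_micros`, so that BigQuery only needs to scan the matching partitions. It assumes the table is named `ga4_logs` inside the `ga4_logs` dataset (as described under Data collection) and uses the field names from the schema on this page; the 7-day window and the output column name are illustrative choices only.

```SQL
-- Sketch only: table path assumed to be <project>.<dataset>.ga4_logs
SELECT
  access_mechanism,
  SUM(access_count) AS total_accesses  -- illustrative output name
FROM `ga4-analytics-352613.ga4_logs.ga4_logs`
-- Filtering on the partitioning column limits the partitions scanned
WHERE epoch_time_micros >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY access_mechanism
ORDER BY total_accesses DESC
```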

## Set-up
### Data collection
This data is generated by querying the [Google Analytics Admin API](https://developers.google.com/analytics/devguides/config/admin/v1/access-api).
A Google Cloud Run function is triggered by a Cloud Scheduler job to run every day at 6am, retrieving the data and appending it to the `ga4_logs` table in the dataset mentioned above.

### Schema
| field name | type | mode |
| field name | type | mode | description |
| --- | --- | --- | --- |
| epochTimeMicros | TIMESTAMP | NULLABLE |
| userEmail | STRING | NULLABLE |
| accessMechanism | STRING | NULLABLE |
| accessorAppName | STRING | NULLABLE |
| dataApiQuotaCategory | STRING | NULLABLE |
| reportType | STRING | NULLABLE |
| accessCount | INTEGER | NULLABLE |
| dataApiQuotaPropertyTokensConsumed | INTEGER | NULLABLE |
| epoch_time_micros | TIMESTAMP | NULLABLE | The time, in Unix microseconds since the epoch, at which the GA user accessed GA reporting data |
| user_email | STRING | NULLABLE | The user's email address |
| access_mechanism | STRING | NULLABLE | The mechanism through which a user accessed GA reporting data, for example 'Google Analytics User Interface' or 'Google Analytics API' |
| accessor_app_name | STRING | NULLABLE | The name of the application that accessed Google Analytics reporting data, for example 'Looker Studio' or 'Power BI' |
| api_quota_category | STRING | NULLABLE | The quota category for the Data API request, for example 'Core' or 'Realtime' |
| report_type | STRING | NULLABLE | The type of reporting data that the GA user accessed, for example 'Realtime' or 'Free form exploration' |
| access_count | INTEGER | NULLABLE | The number of times GA reporting data was accessed. Note that every report viewed can result in one or more data access events |
| api_tokens_consumed | INTEGER | NULLABLE | The number of property quota tokens consumed for Data API requests |
| domain | STRING | NULLABLE | The email domain, taken from the user's email address |
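
To illustrate how a value like `domain` relates to `user_email`, the sketch below extracts everything after the `@` sign and totals `access_count` per domain. The sample rows and the `sample_logs` name are made up for illustration, and the regular expression is an assumption about one way to derive the domain, not necessarily how the pipeline does it.

```SQL
-- Sketch only: sample_logs and its rows are hypothetical
WITH sample_logs AS (
  SELECT 'first.analyst@example.com' AS user_email, 3 AS access_count
  UNION ALL
  SELECT 'second.analyst@example.org', 5
  UNION ALL
  SELECT 'third.analyst@example.com', 2
)
SELECT
  -- Capture the part of the address after the '@'
  REGEXP_EXTRACT(user_email, r'@(.+)$') AS domain,
  SUM(access_count) AS total_accesses
FROM sample_logs
GROUP BY domain
ORDER BY total_accesses DESC
```

Because the sample data is defined inline, this can be run in the BigQuery console without access to the real dataset.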

### Retention
The data retention is currently set to 2 years.
25 changes: 14 additions & 11 deletions source/data-sources/ga/ga4/index.html.md.erb
@@ -1,23 +1,24 @@
---
title: GOV.UK GA4
weight: 1
last_reviewed_on: 2024-11-12
last_reviewed_on: 2024-11-21
review_in: 6 months
---

# GOV.UK GA4
Google Analytics 4 (GA4) is used to collect data on the usage of GOV.UK.

This data can be accessed via the Google Analytics 4 user interface, the GA4 Looker Studio connection, and the Google Analytics Data API.
This data can be accessed via the [Google Analytics 4 user interface](/analysis/govuk-ga4/use-ga4/ga4-interface/), the [GA4 Looker Studio connection](/analysis/govuk-ga4/use-ga4/looker-studio/), and the [Google Analytics Data API](/analysis/govuk-ga4/use-ga4/analytics-data-api/).

This data is also exported to [BigQuery](/data-sources/ga/ga4-bq/).

Information on how to understand and use this data can be found in the [Analysis section of this site](/analysis/govuk-ga4/).

## Content
## Data source contents

Data was first collected into this property on 23/09/22.
The events captured have changed significantly over time, and so early data quality may be patchy.
Data quality notes and a dashboard containing annotations can be found on the [GOV.UK GA4 data quality page](/data-sources/ga/ga4/data-quality/#gov-uk-ga4-data-quality).

Data is collected from dataLayer pushes using [Google Tag Manager](/processes/gtm/), which applies any appropriate transformations to the data and then sends it to GA4.
More information on the dataLayer pushes implemented on GOV.UK can be found in the [Implementation record](https://docs.publishing.service.gov.uk/analytics/events.html).
@@ -30,7 +31,7 @@ Organisations without SPOCs should contact the [Analytics and Insights team](mai
For more information, see [the GA access policy](/processes/ga-access/).

### Location
There are three GA4 GOV.UK properties. These correspond to the integration, staging, and production or live GOV.UK websites.
There are three GA4 GOV.UK properties. These correspond to the integration, staging, and production GOV.UK websites.

## GA4 settings
### Property set-up
@@ -42,14 +43,14 @@ At set-up, we chose the following reporting options:
- Business size - Very Large
- When asked about our business objectives (which determines which standard reports GA4 will provide in the property), we chose 'Get baseline reports'

No [subproperties](https://support.google.com/analytics/answer/11525732) are set up for the GOV.UK GA4 properties.
No [subproperties](https://support.google.com/analytics/answer/11525732) have been set up for the GOV.UK GA4 properties.
Instead, we are using BigQuery and Looker Studio to clean, filter, and control access to the data.

### Data collection

GA4's inbuilt 'Enhanced measurement' feature is not in use on GOV.UK as it was determined that it would not meet our data collection needs.
However, default GA4 dimensions have been used wherever there is a clear fit to the event being captured to ensure that GDS will benefit from any default reporting functionality associated with those events and dimensions.
Tracking has been implemented on GOV.UK using custom dataLayer pushes which are sent to GA4 with Google Tag Manager.
However, default GA4 event names and dimensions have been used wherever there is a clear fit to the information being captured to ensure that GDS will benefit from any default reporting functionality associated with those events and dimensions.
Tracking has been implemented on GOV.UK using custom dataLayer pushes which are sent to GA4 via Google Tag Manager.
More information on this approach can be found in our [Implementation record](https://docs.publishing.service.gov.uk/analytics/).

Google Signals data collection and Ads personalisation are disabled for GDS GA4 properties within the interface.
@@ -60,20 +61,22 @@ Granular location and device data collection is enabled to allow reporting on th
### Data processing and modification

At present, we have not created any custom events or designated any events as key events (formerly known as conversions) within GA4.
We have tested the custom event data import feature to ensure that it would meet our needs to join additional metadata to the events we collect, but have not yet implemented any custom data imports.
We are using the custom event data import feature to join additional metadata to the events we collect.
More information on GOV.UK GA4 data modifications can be found in the [Policies and processes section](/processes/ga-modifications/).

[Expanded datasets](/analysis/govuk-ga4/use-ga4/ga4-expanded-datasets/) are being used to reduce the impact of cardinality on various key reports.
You can use the [steps in this guidance](/analysis/govuk-ga4/use-ga4/ga4-expanded-datasets/) to view the expanded datasets we have set up and to request new expanded datasets.

We are also not using GA4's inbuilt data redaction feature on GOV.UK.
We are not using GA4's inbuilt data redaction feature on GOV.UK.
Instead, we are applying redaction to strip out potential Personally Identifiable Information (PII) from values before they are pushed to the dataLayer (see our [documentation on this in GitHub](https://github.com/alphagov/govuk_publishing_components/blob/main/docs/analytics-ga4/pii-remover.md)).
Additional PII redaction is also applied in Google Tag Manager as a fail-safe.
Additional PII redaction is also applied in Google Tag Manager as a precaution.
If you spot any PII or other data that should not be present in the GA4 datasets, please raise a Zendesk ticket or contact the [GA4 support inbox](mailto:[email protected]).
Our PII process is documented in the [Policies and processes section](/processes/ga-pii/).
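
The idea of pattern-based redaction can be sketched in BigQuery SQL as follows. This is not the redaction code used on GOV.UK (that runs before values reach the dataLayer and again in Google Tag Manager); the regular expression, the `[email]` placeholder, and the sample values are all assumptions chosen for illustration.

```SQL
-- Sketch only: sample_values, the pattern and the placeholder are hypothetical
WITH sample_values AS (
  SELECT 'https://www.gov.uk/search?q=jane.doe@example.com&page=2' AS raw_value
  UNION ALL
  SELECT 'https://www.gov.uk/browse/benefits'
)
SELECT
  raw_value,
  -- Replace anything that looks like an email address (including a URL-encoded '@')
  REGEXP_REPLACE(raw_value, r'[^\s=/?&]+(?:@|%40)[^\s=/?&]+', '[email]') AS redacted_value
FROM sample_values
```

A similar pattern could be used with `REGEXP_CONTAINS` to look for values that may need reporting through the PII process.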

### Retention

A data retention period for this data is set in the GA4 user interface. For the integration, staging, and production GOV.UK sites, this is set to 38 months.
A data retention period for this data is set in the GA4 user interface.
For the integration, staging, and production GOV.UK sites, this is set to 38 months.

## Governance

2 changes: 1 addition & 1 deletion source/data-sources/ga/index.html.md.erb
@@ -1,7 +1,7 @@
---
title: Google Analytics
weight: 1
last_reviewed_on: 2024-11-15
last_reviewed_on: 2024-11-21
review_in: 6 months
---

27 changes: 16 additions & 11 deletions source/data-sources/index.html.md.erb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: Data sources
weight: 40
last_reviewed_on: 2024-05-23
last_reviewed_on: 2024-11-21
review_in: 6 months
---

@@ -10,13 +10,18 @@ review_in: 6 months
This section contains guidance on the key data sources we use in Data Services and across GDS.
These include:

- [GA4](./ga/ga4/)
- [GA4 (BigQuery export)](./ga/ga4-bq/)
- [GA4 flattened data](./ga/ga4-flat/)
- [Universal Analytics](./ga/ua/)
- [Universal Analytics flattened data](./ga/ua-flat/)
- [GA4 access logs](./ga/ga4-logs/)
- [Google Cloud Platform logs](./gcp-logs/)
- [GA Settings Database](./ga/ga-settings/)
- [GOV.UK User Feedback](./user-feedback/)
- [Google Search Console for GOV.UK](./gsc/)
- [www.gov.uk GA4 data](/data-sources/ga/ga4/ga4/)
- [www.gov.uk GA4 data in BigQuery](/data-sources/ga/ga4/ga4-bq/)
- [www.gov.uk flattened GA4 data in BigQuery](/data-sources/ga/ga4-flat/)
- [GOV.UK Publishing GA4](/data-sources/ga/publishing-ga4/)
- [GOV.UK Blogs GA4](/data-sources/ga/blogs-ga4/)
- [GA4 usage logs](/data-sources/ga/ga4-logs/)
- [Historic GOV.UK analytics data](/data-sources/ga/historic-ua/)
- [www.gov.uk Universal Analytics data in BigQuery](/data-sources/ga/ua/)
- [www.gov.uk flattened Universal Analytics data in BigQuery](/data-sources/ga/ua-flat/)
- [GA4 access logs](/data-sources/ga/ga4-logs/)
- [GA settings database](/data-sources/ga/ga-settings/)
- [Google Search Console for GOV.UK](/data-sources/gsc/)
- [Google Cloud Platform logs](/data-sources/gcp-logs/)
- [GOV.UK User Feedback](/data-sources/user-feedback/)
- [Google Search Console for GOV.UK](/data-sources/gsc/)
11 changes: 7 additions & 4 deletions source/processes/ga-pii/index.html.md.erb
Original file line number Diff line number Diff line change
@@ -1,20 +1,23 @@
---
title: GA4 PII process
weight: 5
last_reviewed_on: 2024-08-27
last_reviewed_on: 2024-11-21
review_in: 6 months
---

# PII process (Google Analytics data)

Personally Identifiable Information (PII) is data that could potentially identify a specific individual.

We [apply redaction](/data-sources/ga/ga4/#data-processing-and-modification) to try to ensure that we do not collect any PII into Google Analytics, but it is still possible that some slips through on occasion.
We apply redaction to try to ensure that we do not collect any PII into Google Analytics.
Details of the redaction applied can be found on [data source pages](/data-sources/).
For example, for our [GOV.UK GA4 data](/data-sources/ga/ga4/#data-processing-and-modification), we apply redaction in both the dataLayer and Google Tag Manager.
However, it is still possible that some unwanted data slips through on occasion.

If you notice any PII in GOV.UK GA4, please raise a Zendesk ticket as quickly as possible.
If you notice any PII in GDS-owned GA4 data, please raise a Zendesk ticket as quickly as possible.

## Process
The Analytics team will:
The Insights and Analytics team will:

- Confirm whether the data identified is Personally Identifiable Information
- Ascertain how long and why this data has been collected
6 changes: 3 additions & 3 deletions source/products/content-data/index.html.md.erb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: Content Data app
weight: 34
last_reviewed_on: 2024-11-13
last_reviewed_on: 2024-11-21
review_in: 6 months
---

@@ -56,9 +56,9 @@ Further technical documentation can be found on the [Developer docs](https://doc

The table supplying the Content Data app with GA4 data can be found in BigQuery under the data set `govuk-content-data.ga4.GA4 dataform`.
This is updated every day with the previous day's data via a scheduled query ('Daily content data') in the same project (see the code below).
This query is currently scheduled to run at 9:00am.
This query is currently scheduled to run at 9:30am.

A script then imports yesterday's data from the `GA4 dataform` table into Content Data every day at 9:10am UTC.
A script then imports yesterday's data from the `GA4 dataform` table into Content Data every day at 9:40am UTC.

If the import fails for any reason, it needs to be manually re-run. The process and SQL code to use can be found on the [Content Data backfilling page](/products/content-data/content-data-processing/).
Once the `GA4 dataform` table in BigQuery has been updated, the GOV.UK Platform Engineering team need to be alerted to re-run the import.
25 changes: 13 additions & 12 deletions source/products/ga4-project-use-dashboard/index.html.md.erb
Original file line number Diff line number Diff line change
@@ -1,36 +1,35 @@
---
title: Use the GA4 Project Use Dashboard
title: GA4 project use dashboard
weight: 89
last_reviewed_on: 2024-11-18
last_reviewed_on: 2024-11-21
review_in: 6 months
---

# GA4 Project Use Dashboard
# GA4 project use dashboard

The [GA4 Project Use Dashboard](https://lookerstudio.google.com/reporting/90c992bc-5473-4e27-a662-b100d36b22d3/page/fSyIE) uses log data to show what BigQuery datasets and tables in the `ga4-analytics-352613` project are being accessed by users across Government.
The [GA4 project use dashboard](https://lookerstudio.google.com/reporting/90c992bc-5473-4e27-a662-b100d36b22d3/page/fSyIE) uses log data to show what BigQuery datasets and tables in the `ga4-analytics-352613` project are being accessed by users across Government.

The report is a single page, and provides visualisations to show the most frequently referenced tables and datasets, and from which departments these are being accessed. The report’s main table breaks down the data by dataset, table name, the email address of the user running the query, and which organisation they belong to. For some queries it is possible to ascertain which connector was used, and this information will be displayed in the `connector_type` column.

By monitoring this report, we can monitor the use of the datasets and tables contained within the GA4 GCP project, even from users who sit outside of GDS.

## Using the GA4 Project Use Dashboard
## Using the GA4 project use dashboard

### Get access to the GA4 Project Use Dashboard
### Get access to the GA4 project use dashboard

The [report](https://lookerstudio.google.com/reporting/90c992bc-5473-4e27-a662-b100d36b22d3/page/fSyIE) can be viewed by anyone with a @digital.cabinet-office.gov.uk email address.

It can be edited by anyone in the GDS performance analysts Google group.

## How the GA4 Project Use Dashboard works
## How the GA4 project use dashboard works

### Data sources

The GA4 Project Use Dashboard uses a pre-aggregated BigQuery table as its data source. This table is called `log_test_4` and can be found in the `ga4_log_data` dataset within the `gds-bq-data` project.
The GA4 project use dashboard uses a pre-aggregated BigQuery table as its data source.
This table is called `log_test_4` and can be found in the `ga4_log_data` dataset within the `gds-bq-data` project.

The table is updated each day from a scheduled query, and processes information from stored BigQuery logs.

### The scheduled query in full

The scheduled query used to build the table is as follows:

```SQL
@@ -87,6 +86,8 @@ ORDER BY date ASC)
SELECT distinct principal_email, dataset_name, table_name_clean, date, connector_type, TO_JSON_STRING(service_data) as service_data FROM data_2
```

It works by grabbing various fields from the `_AllLogs` table, extracting with REGEX the dataset and table names from the auth_info.resource field. It also determines the user’s email and timestamp, as well as the connection type (scheduled query, Google Sheets, Looker Studio or other) from the log’s service_data field.
This works by grabbing various fields from the `_AllLogs` table, extracting with REGEX the dataset and table names from the auth_info.resource field.
The query also determines the user’s email and timestamp, as well as the connection type (scheduled query, Google Sheets, Looker Studio or other) from the log’s service_data field.
Finally, the results are filtered to include only distinct references to resources where the method is either 'jobservice.insert' or 'jobservice.query'.

Finally, the results are filtered to include only distinct references to resources where the method is either 'jobservice.insert' or 'jobservice.query'. This processing means that the table is then connected directly to the charts in Looker Studio without having to create a Looker Studio custom query.
This processing means that the table can be connected to directly using the Looker Studio BigQuery connector, without the need for a custom query.
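
The regular-expression step described above can be sketched in isolation as follows. Most of the scheduled query is collapsed in this view, so this is not the production code: the `sample_log` name, the sample resource path, and the two patterns are assumptions, based on BigQuery resource names taking the form `projects/<project>/datasets/<dataset>/tables/<table>`.

```SQL
-- Sketch only: sample_log and its resource value are hypothetical
WITH sample_log AS (
  SELECT 'projects/ga4-analytics-352613/datasets/ga4_logs/tables/ga4_logs' AS resource
)
SELECT
  -- Capture the path segment that follows 'datasets/'
  REGEXP_EXTRACT(resource, r'datasets/([^/]+)') AS dataset_name,
  -- Capture the path segment that follows 'tables/'
  REGEXP_EXTRACT(resource, r'tables/([^/]+)') AS table_name
FROM sample_log
```

`REGEXP_EXTRACT` returns `NULL` when a pattern does not match, so a resource that refers only to a dataset would produce a `NULL` table name rather than an error.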