Update current docs to reference unified views (#561)
* corrected typo

* updated schema page; added stub tcpinfo schema page; added new views, updated list of tests and core services; reversed changelog order

* remaining updates re: unified views; reorganized 301s; moved some content from old docs into appropriate spots

* updated web100 page, link in data page

* updated views section
critzo authored Apr 17, 2020
1 parent c55ba50 commit 0ca1cb9
Showing 39 changed files with 514 additions and 473 deletions.
13 changes: 7 additions & 6 deletions _pages/04-data.md
@@ -27,12 +27,18 @@ There is typically at least a 24-hour delay between data collection and data pub
* [NDT]({{site.baseurl}}/tests/ndt)
* Network Diagnostic Tool (NDT) measures characteristics of a TCP connection under heavy load.
* NDT data is processed by the M-Lab ETL Pipeline.
* More information is available at [Internet2](http://software.internet2.edu/ndt/){:target="_blank"} and [GitHub](https://github.com/ndt-project/ndt){:target="_blank"}.
* More technical information is available on [GitHub](https://github.com/ndt-project/ndt){:target="_blank"}.
* Protocols: _ndt7 (coming soon)_, [ndt5]({{ site.baseurl }}/tests/ndt/ndt5), [web100]({{ site.baseurl }}/tests/ndt/web100)
* [NDT Raw Data](https://console.developers.google.com/storage/browser/archive-measurement-lab/ndt/){:target="_blank"} - [NDT BigQuery Dataset](https://bigquery.cloud.google.com/dataset/measurement-lab:ndt)
* [Neubot DASH]({{site.baseurl}}/tests/neubot)
* Neubot measured the Internet in order to gather data useful to study broadband performance, network neutrality, and Internet censorship.
* More information is available at [Nexa Center](https://neubot.nexacenter.org/){:target="_blank"} and [GitHub](https://github.com/neubot){:target="_blank"}.
* [Neubot Raw Data](https://console.developers.google.com/storage/browser/archive-measurement-lab/neubot/){:target="_blank"}
* [Reverse Traceroute]({{site.baseurl}}/tests/reverse_traceroute)
* Reverse traceroute measures the network path back to a user from selected network endpoints, and provides a rich source of information on network routing and topology.
* Reverse Traceroute data is not processed by the M-Lab ETL Pipeline.
* More information is available at [Reverse Traceroute](https://research.cs.washington.edu/networking/astronomy/reverse-traceroute.html){:target="_blank"}
* [Reverse Traceroute Raw Data](https://console.cloud.google.com/storage/browser/m-lab_revtr){:target="_blank"}
* [WeHe]({{site.baseurl}}/tests/wehe)
* Wehe uses your device to exchange Internet traffic recorded from real, popular apps like YouTube and Spotify, and attempts to tell you whether your ISP is giving different performance to an app's network traffic.
* More information is available from the [WeHe website](https://dd.meddle.mobi/){:target="_blank"} and [GitHub](https://dd.meddle.mobi/codeanddata.html){:target="_blank"}.
@@ -93,11 +99,6 @@ There is typically at least a 24-hour delay between data collection and data pub
* Pathload2 measured the available bandwidth of an Internet connection.
* More information is available at [https://code.google.com/p/pathload2-gatech/](https://code.google.com/p/pathload2-gatech/){:target="_blank"}.
* [Pathload2 Raw Data (archived)](https://console.developers.google.com/storage/browser/archive-measurement-lab/pathload2/){:target="_blank"}
* [Reverse Traceroute]({{site.baseurl}}/tests/reverse_traceroute)
* Reverse traceroute measures the network path back to a user from selected network endpoints, and provides a rich source of information on network routing and topology.
* Reverse Traceroute data is not processed by the M-Lab ETL Pipeline.
* More information is available at [Reverse Traceroute](https://research.cs.washington.edu/networking/astronomy/reverse-traceroute.html){:target="_blank"}
* [Reverse Traceroute Raw Data](https://console.cloud.google.com/storage/browser/m-lab_revtr){:target="_blank"}
* [SamKnows]({{site.baseurl}}/tests/samknows)
* The SamKnows performance testing platform is used by the USA's Federal Communications Commission (FCC), European Commission, UK government (Ofcom), Brazilian government (Anatel), Singapore's IDA and other government-backed studies worldwide.
* SamKnows infrastructure includes off-net test servers hosted by M-Lab, and the M-Lab and SamKnows teams coordinate regularly to support the various regulatory reporting periods of data collection conducted by SamKnows.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -1,8 +1,9 @@
---
layout: page
layout: redirect
title: "Network Diagnostic Tool (NDT) - BigQuery Schema"
permalink: /data/docs/bq/schema/ndt/
breadcrumb: data
redirect_to: "https://measurementlab.net/tests/ndt/"
---

# Network Diagnostic Tool (NDT) BigQuery Schema
@@ -1,8 +1,9 @@
---
layout: page
layout: redirect
title: "Sidestream - BigQuery Schema"
permalink: /data/docs/bq/schema/sidestream/
breadcrumb: data
redirect_to: "https://measurementlab.net/tests/sidestream/#sidestream-schema"
---

# Sidestream BigQuery Schema
@@ -1,8 +1,9 @@
---
layout: page
layout: redirect
title: "Traceroute - BigQuery Schema"
permalink: /data/docs/bq/schema/traceroute/
breadcrumb: data
redirect_to: "https://measurementlab.net/tests/traceroute/#traceroute-schema"
---

# Traceroute BigQuery Schema
@@ -99,5 +100,3 @@ Paris Traceroute collects network path information for every connection used by
| `paris_traceroute_hop.dest_geolocation.rtt` | `float (repeated)` | |

</div>


File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion _pages/bigquery_quickstart.md
@@ -92,4 +92,4 @@ M-Lab has confirmed that the following test did not incur billing:
* The service account was added to the M-Lab Discuss Google Group
* The Google Cloud SDK was installed on a Linux computer and configured to use the test GCP project, and its service account
* Several queries were made to datasets within the `measurement-lab` project
* Though these queries appeared in the GCP BigQuery query hostory for the test project, no billing transactions were present for the queries
* Though these queries appeared in the GCP BigQuery query history for the test project, no billing transactions were present for the queries
153 changes: 76 additions & 77 deletions _pages/bigquery_schema.md
@@ -1,13 +1,13 @@
---
layout: page
title: "BigQuery Schemas"
title: "M-Lab Dataset Schemas and Changelog"
permalink: /data/docs/bq/schema/
breadcrumb: data
---

# BigQuery Schemas
# M-Lab Dataset Schemas and Changelog

## Background
## How Data is Collected

* Each M-Lab test consists of a **client** and a **server**.
* Whenever an M-Lab user starts a test, the client and server interact to measure different aspects of that user's connection.
@@ -18,69 +18,75 @@ breadcrumb: data

## M-Lab Hosted BigQuery Datasets, Tables, and Views

M-Lab publishes BigQuery tables and views for tests that have implemented a parser in our [ETL pipeline](https://github.com/m-lab/etl){:target="_blank"}. The following list provides links to schema pages for each test we publish to BigQuery. Please visit the page for each dataset's schema for more information.
M-Lab publishes BigQuery tables and views for tests that have implemented a parser in our [ETL pipeline](https://github.com/m-lab/etl){:target="_blank"}. The following list provides links to schema pages for each test we publish to BigQuery. Additionally, M-Lab publishes "Core Services and Platform Data" datasets that provide information about the M-Lab platform infrastructure. Please visit the page for each dataset's schema for more information.

* [Network Diagnostic Tool (NDT)]({{ site.base_url }}/data/docs/bq/schema/ndt)
* [Paris Traceroute]({{ site.base_url }}/data/docs/bq/schema/traceroute)
* [Sidestream]({{ site.base_url }}/data/docs/bq/schema/sidestream)

Additionally, M-Lab publishes the following datasets that provide information about M-Lab platform infrastructure.

* [Utilization]({{ site.base_url }}/data/docs/bq/schema/utilization)
### Measurement Data (Active Tests)

## Datasets Hosted by Third-Party Researchers

Researchers who host their tests on the M-Lab platform have the option to host test results on M-Lab infrastructure as described in the previous section, or to host that data elsewhere. The M-Lab hosted tests below do not provide their data on M-Lab infrastructure. Please consult each project's website or contact their maintainers for information about these tests' schemas.
* [Network Diagnostic Tool (NDT)]({{ site.base_url }}/data/docs/bq/schema/ndt)

* [BISmark]({{site.baseurl}}/tests/bismark) - [Project BISmark website](http://projectbismark.net/){:target="_blank"}
* [MobiPerf]({{site.baseurl}}/tests/mobiperf) - [MobiPerf website](http://www.mobiperf.com/){:target="_blank"}
* [SamKnows]({{site.baseurl}}/tests/samknows) - [SamKnows website](https://www.samknows.com/){:target="_blank"}
### Current M-Lab Core Services and Platform Data

## BigQuery Datasets Named Using M-Lab Measurement Services & Data Types
* [Switch Utilization]({{ site.base_url }}/data/docs/bq/schema/utilization)
* [TCP INFO]({{ site.base_url }}/tests/tcp-info/)
* [Traceroute]({{ site.base_url }}/data/docs/bq/schema/traceroute)

Datasets in the `measurement-lab` project in BigQuery are named for each measurement service, and views within each dataset contain data relevant to that service. Prior to May 2019, M-Lab published versioned tables and views in a dataset called `release`. The transition in our naming of datasets, tables, and views is [discussed on our blog]({{ site.baseurl }}/blog/bq-datasets). In brief, the table below summarizes our old and new datasets and tables/views as discussed on the blog.
### Retired Core Services and Platform Data for Historical Analysis

<div class="table-condensed" markdown="1">
* [Sidestream]({{ site.base_url }}/data/docs/bq/schema/sidestream)

| Measurement Service | Old Datasets and Views | New Datasets and Views |
|:--------------------------|:------------------------------|:-------------------|
| NDT | * measurement-lab.base_tables.ndt<br>* measurement-lab.release.ndt_all<br>* measurement-lab.release.ndt_downloads<br>* measurement-lab.release.ndt_uploads<br> |* measurement-lab.ndt.web100<br>* measurement-lab.ndt.recommended<br>* measurement-lab.ndt.downloads<br>* measurement-lab.ndt.uploads |
| Paris Traceroute |* measurement-lab.base_tables.traceroute |* measurement-lab.aggregate.traceroute |
| Sidestream |* measurement-lab.base_tables.sidestream |* measurement-lab.sidestream.web100 |
| Switch |* measurement-lab.base_tables.switch |* measurement-lab.utilization.switch |
## M-Lab BigQuery Schemas - Changelog

</div>
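
The old-to-new naming in the table above can be sketched as a small lookup for updating saved queries. This is an illustration only, not an M-Lab tool: it assumes the old and new names in each table row pair up in the order listed, and the helper name `migrate_query` is hypothetical. Column names and schemas may differ between old and new views, so a rewritten query still needs to be checked by hand.

```python
import re

# Old -> new names, copied from the table above.
# Assumption: entries within a row correspond positionally.
OLD_TO_NEW = {
    "measurement-lab.base_tables.ndt": "measurement-lab.ndt.web100",
    "measurement-lab.release.ndt_all": "measurement-lab.ndt.recommended",
    "measurement-lab.release.ndt_downloads": "measurement-lab.ndt.downloads",
    "measurement-lab.release.ndt_uploads": "measurement-lab.ndt.uploads",
    "measurement-lab.base_tables.traceroute": "measurement-lab.aggregate.traceroute",
    "measurement-lab.base_tables.sidestream": "measurement-lab.sidestream.web100",
    "measurement-lab.base_tables.switch": "measurement-lab.utilization.switch",
}

# Match an old name only when it is not followed by another word character,
# so names such as ndt_all_legacysql are left untouched.
_PATTERN = re.compile(
    "(" + "|".join(re.escape(name) for name in OLD_TO_NEW) + r")(?!\w)"
)


def migrate_query(sql: str) -> str:
    """Replace old dataset/view names in a query string with the new ones."""
    return _PATTERN.sub(lambda match: OLD_TO_NEW[match.group(1)], sql)
```

The negative lookahead is a deliberate choice: a plain string replace would also rewrite the front of longer names that are not in the table.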
### [v5] - 2020-04

* Following the M-Lab 2.0 platform upgrade completed in November 2019
* NDT data from the now deprecated web100 based [ndt](https://github.com/m-lab/ndt/) has been archived in the dataset `measurement-lab.ndt.web100`
* NDT data from the new, TCP INFO based [ndt-server](https://github.com/m-lab/ndt-server/) is now provided in `measurement-lab.ndt.ndt5`
* associated TCP INFO data for all ndt5 tests is now provided in `measurement-lab.ndt.tcpinfo`
* Views from web100 ndt are now deprecated, superseded by new "unified" views
* The following Views provide access only to data from the web100 legacy platform:
* `measurement-lab.ndt.recommended`
* `measurement-lab.ndt.downloads`
* `measurement-lab.ndt.uploads`
* Unified views of all NDT data published
* Two new historical views of all NDT data are now available, and provide only NDT tests that meet our [criteria] for valid, research quality tests.
* `measurement-lab.ndt.unified_downloads`
* `measurement-lab.ndt.unified_uploads`
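
As a minimal sketch of how the two unified views named above might be referenced, the snippet below only builds a StandardSQL query string; it does not contact BigQuery. The helper name `count_tests_query` and the `tests` alias are hypothetical, and no column names are assumed because the view schemas are not shown here.

```python
# View names are taken from the changelog entry above.
UNIFIED_VIEWS = {
    "download": "measurement-lab.ndt.unified_downloads",
    "upload": "measurement-lab.ndt.unified_uploads",
}


def count_tests_query(direction: str) -> str:
    """Return a StandardSQL query that counts rows in one unified view."""
    try:
        view = UNIFIED_VIEWS[direction]
    except KeyError:
        raise ValueError(f"unknown direction: {direction!r}") from None
    # Backticks are required because the project name contains a hyphen.
    return f"SELECT COUNT(*) AS tests FROM `{view}`"
```

The resulting string could then be pasted into the BigQuery console or passed to a client library of your choice.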

## M-Lab BigQuery Schemas - Changelog
### [v4] - 2019-05

### [v2] - 2016-03
* In previous [release convention]({{ site.baseurl }}/blog/etl-pipeline/#new-etl-pipeline-and-transition-to-new-bigquery-tables) a hierarchy of releases, release candidates “rc”, versioned release candidates, and versioned intermediate views were published, but they will cease being updated with new data starting May 6, 2019.
* BigQuery datasets named after M-Lab measurement services & data types.
* Each measurement service (ndt, traceroute, sidestream, utilization) will have a corresponding BigQuery dataset and view in the `measurement-lab` project, managed by our [data reprocessing service](https://github.com/m-lab/etl-gardener){:target="_blank"}.
* LegacySQL support is now deprecated, but a single LegacySQL view of the legacy data may be kept for historical purposes.
* Only StandardSQL is supported in any new views of the comprehensive reprocessed data.
* Views that combine legacy tables and recently parsed data will no longer be offered.
* Historically, Paris Traceroute data was collected for every measurement service. For this data type, a view in the `aggregate` dataset is now provided.
* Over the next year, M-Lab will restructure the traceroute schema to support reprocessing using the [Gardener service](https://github.com/m-lab/etl-gardener), and to unify the schema for historical and future data collection by [Scamper](https://www.caida.org/tools/measurement/scamper/){:target="_blank"}.

* Began the publication of per project "fast tables" for NDT, NPAD, Paris Traceroute, and Sidestream.
* `plx.google:m_lab.ndt.all`
* `plx.google:m_lab.npad.all`
* `plx.google:m_lab.paris-traceroute.all`
* Continued the publication of v1 monthly tables, and published a [migration guide]({{ site.baseurl }}/data/docs/bq/legacymigration/).
* Deprecated fields in v2 "fast tables":
* `type`
* `project`
* `web100_log_entry.is_last_entry`
* `web100_log_entry.group_name`
### [v3.1.1] - 2018-07

### [v2.1] - 2016-11
* Publish official Switch tables from the DISCO dataset.

* The field `blacklist_flags` was added to v2 per project "fast tables", and historical data from 2010-01-01 to 2015-10-02 was re-parsed to add this annotation, due to a [switch discard issue related to traffic microbursts]({{ site.baseurl }}/blog/traffic-microbursts-and-their-effect-on-internet-measurement/).
Published **tables** and views are:

### [v3] - 2017-05
* **measurement-lab.legacy.ndt** (data ~ 2015-01-01 - 2017-05-10)
* **measurement-lab.legacy.ndt_pre2015** (data ~ 2009-02-18 - 2014-12-31)
* **measurement-lab.base_tables.ndt**
* **measurement-lab.base_tables.switch**

* Began publication to new date partitioned table and updated schema to support the new, open source, ETL pipeline.
* Data publication to v2 tables stopped at this time.
* **measurement-lab.rc**
* **measurement-lab.release_v3_1**
* **measurement-lab.release**
* _measurement-lab.release.ndt_all_
* _measurement-lab.release.ndt_all_legacysql_
* _measurement-lab.release.ndt_downloads_
* _measurement-lab.release.ndt_downloads_legacysql_
* _measurement-lab.release.ndt_uploads_
* _measurement-lab.release.ndt_uploads_legacysql_

### [v3.0.1] - 2017-10
### [v3.1] - 2018-02

* The schema for v3.0.1 tables was updated, removing an alpha feature called deltas, which attempted to log the differences between test snaplogs instead of the final test values. This feature will be revisited in future schema updates.
* Newly released data annotation engine added geolocation and some metadata to tests from 2016 to present.
* Published a series of beta BigQuery views for NDT data, to allow data queries across both v2 and v3.0.x tables.
* Published traceroute and sidestream table to replace v2 versions, migrated data, re-annotated data.
* First official release of v3 tables, with all historical data re-parsed, and annotated with geolocation metadata.

### [v3.0.2] - 2017-12

@@ -94,38 +100,31 @@ Datasets in the `measurement-lab` project in BigQuery are named for each measure
* Previous versions of our tables will be referenced by versions 1.0, 2.0, etc. in our documentation but actual table names will not be changed.
* Re-ran historical annotations for traceroute, npad, and sidestream data due to a bug where some geolocation annotations were not present in all past test data.

### [v3.1] - 2018-02

* First official release of v3 tables, with all historical data re-parsed, and annotated with geolocation metadata.
### [v3.0.1] - 2017-10

### [v3.1.1] - 2018-07
* The schema for v3.0.1 tables was updated, removing an alpha feature called deltas, which attempted to log the differences between test snaplogs instead of the final test values. This feature will be revisited in future schema updates.
* Newly released data annotation engine added geolocation and some metadata to tests from 2016 to present.
* Published a series of beta BigQuery views for NDT data, to allow data queries across both v2 and v3.0.x tables.
* Published traceroute and sidestream table to replace v2 versions, migrated data, re-annotated data.

* Publish official Switch tables from the DISCO dataset.
### [v3] - 2017-05

Published **tables** and views are:
* Began publication to new date partitioned table and updated schema to support the new, open source, ETL pipeline.
* Data publication to v2 tables stopped at this time.

* **measurement-lab.legacy.ndt** (data ~ 2015-01-01 - 2017-05-10)
* **measurement-lab.legacy.ndt_pre2015** (data ~ 2009-02-18 - 2014-12-31)
* **measurement-lab.base_tables.ndt**
* **measurement-lab.base_tables.switch**
### [v2.1] - 2016-11

* **measurement-lab.rc**
* **measurement-lab.release_v3_1**
* **measurement-lab.release**
* _measurement-lab.release.ndt_all_
* _measurement-lab.release.ndt_all_legacysql_
* _measurement-lab.release.ndt_downloads_
* _measurement-lab.release.ndt_downloads_legacysql_
* _measurement-lab.release.ndt_uploads_
* _measurement-lab.release.ndt_uploads_legacysql_
* The field `blacklist_flags` was added to v2 per project "fast tables", and historical data from 2010-01-01 to 2015-10-02 was re-parsed to add this annotation, due to a [switch discard issue related to traffic microbursts]({{ site.baseurl }}/blog/traffic-microbursts-and-their-effect-on-internet-measurement/).

### [v4] - 2019-05
### [v2] - 2016-03

* In previous [release convention]({{ site.baseurl }}/blog/etl-pipeline/#new-etl-pipeline-and-transition-to-new-bigquery-tables) a hierarchy of releases, release candidates “rc”, versioned release candidates, and versioned intermediate views were published, but they will cease being updated with new data starting May 6, 2019.
* BigQuery datasets named after M-Lab measurement services & data types.
* Each measurement service (ndt, traceroute, sidestream, utilization) will have a corresponding BigQuery dataset and view in the `measurement-lab` project, managed by our [data reprocessing service](https://github.com/m-lab/etl-gardener){:target="_blank"}.
* LegacySQL support is now deprecated, but a single LegacySQL view of the legacy data may be kept for historical purposes.
* Only StandardSQL is supported in any new views of the comprehensive reprocessed data.
* Views that combine legacy tables and recently parsed data will no longer be offered.
* Historically, Paris Traceroute data was collected for every measurement service. For this data type, a view in the `aggregate` dataset is now provided.
* Over the next year, M-Lab will restructure the traceroute schema to support reprocessing using the [Gardener service](https://github.com/m-lab/etl-gardener), and to unify the schema for historical and future data collection by [Scamper](https://www.caida.org/tools/measurement/scamper/){:target="_blank"}.
* Began the publication of per project "fast tables" for NDT, NPAD, Paris Traceroute, and Sidestream.
* `plx.google:m_lab.ndt.all`
* `plx.google:m_lab.npad.all`
* `plx.google:m_lab.paris-traceroute.all`
* Continued the publication of v1 monthly tables, and published a [migration guide]({{ site.baseurl }}/data/docs/bq/legacymigration/).
* Deprecated fields in v2 "fast tables":
* `type`
* `project`
* `web100_log_entry.is_last_entry`
* `web100_log_entry.group_name`