Create public request database #33
All our current datasets are in PostgreSQL databases. How can I dump those to SQLite? Other than StackOverflow answers suggesting to modify the pg_dump output to remove unsupported statements, I've found two tools: https://github.com/simonw/db-to-sqlite and https://github.com/caiiiycuk/postgresql-to-sqlite I tried db-to-sqlite to make an SQLite database out of one view in the database from my master's thesis: pipx install 'db-to-sqlite[postgresql]'
db-to-sqlite "postgresql://<user>:<password>@<host>:<port>/<database>" ma.db --table filtered_requests That produced a ~335MB file after 1.5 minutes. I opened that with Datasette: datasette ma.db And everything is working as expected. But queries take ~1s. EDIT: That's not quite true. For some reason, listing the first page takes almost a second, but subsequent pages take only ~50ms. Even a query like |
One thing we need to keep in mind: In rowid tables, Datasette (expectedly) uses the rowid to identify (and, importantly, link to) individual rows. However, those are not stable:
There are also WITHOUT ROWID tables for which one needs to define an explicit primary key in the |
So, now we need to decide on a common schema that we can transform the existing datasets to. Luckily, they all use a very similar schema (e.g. https://github.com/baltpeter/thesis-mobile-consent-dialogs/blob/main/schema.sql, https://github.com/datenanfragen/android-data-safety-label-analysis/blob/main/schema.sql, https://github.com/baltpeter/ios-privacy-violations/blob/main/schema.sql). I think it's a good idea to use TrackHAR's export type Request = {
/** The time when the request was sent. */
startTime: Date;
/** The HTTP method used. */
method: string;
/** The host name of the request. */
host: string;
/** The full path of the request, including the query string. */
path: string;
/** The full URL, but without the query string. This is useful for matching in the adapters. */
endpointUrl: string;
/** The request body, if any. */
content?: string;
/** The port of the request. */
port: string;
/** The scheme of the request. */
scheme: 'http' | 'https';
/** The HTTP version of the request. */
httpVersion: string;
/** The headers included in the request. */
headers?: { name: string; value: string }[];
/** The cookies set through request. */
cookies?: { name: string; value: string }[];
}; That would mean listing headers and cookies as JSON arrays in the rows instead of having separate tables for them. But to be honest, I prefer that for this use case anyway. It makes the schema simpler, and otherwise you'd probably pretty much always be |
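As an aside, here is a minimal sketch of how the JSON headers could still be queried directly in SQLite, using the built-in json_each/json_extract functions (the table and column names follow the schema proposed below; 'User-Agent' is just an illustrative header name):
select requests.host,
       json_extract(je.value, '$.name') as headerName,
       json_extract(je.value, '$.value') as headerValue
from requests, json_each(requests.headers) as je
where json_extract(je.value, '$.name') = 'User-Agent';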
We then need to extend that with the information in my I would use a composite primary key: The ID from the original dataset (can be whatever) plus a |
Actually, on second thought, maybe we want to be more flexible. What if we later want to add requests from websites? |
That would leave us with something like this: CREATE TABLE "requests" (
"id" TEXT,
"dataset" TEXT,
"initiator" TEXT,
"platform" TEXT NOT NULL,
"runType" TEXT,
"startTime" TEXT NOT NULL,
"method" TEXT NOT NULL,
"httpVersion" TEXT NOT NULL,
"scheme" TEXT NOT NULL,
"host" TEXT NOT NULL,
"port" INTEGER NOT NULL,
"path" TEXT NOT NULL,
"content" BLOB,
"headers" TEXT NOT NULL,
"cookies" TEXT NOT NULL,
PRIMARY KEY("dataset","id")
) WITHOUT ROWID; Some notes:
|
I think the best way to get the data from the datasets into this format is to construct a temporary view. First step for my master's thesis: select filtered_requests.id as id,
'informed-consent' as dataset,
filtered_requests.name || '@' || version as initiator,
platform,
run_type as runType,
start_time as startTime,
method,
http_version as httpVersion,
scheme,
host,
port,
content_raw as content,
coalesce((select json_agg(json_build_object('name', h.name, 'value', h.values[1]))
from headers h
where h.request = filtered_requests.id), '[]'::json) as headers,
coalesce((select json_agg(json_build_object('name', c.name, 'value', c.values[1]))
from cookies c
where c.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests; But I still need to add the background traffic with |
Here's the query I used to create the view that includes both types of requests: create view informedConsentRequests as
(select filtered_requests.id as id,
'informed-consent' as dataset,
filtered_requests.name || '@' || version as initiator,
platform,
run_type as runType,
start_time as startTime,
method,
http_version as httpVersion,
scheme,
host,
port,
content_raw as content,
coalesce((select json_agg(json_build_object('name', h.name, 'value', h.values[1]))
from headers h
where h.request = filtered_requests.id), '[]'::json) as headers,
coalesce((select json_agg(json_build_object('name', c.name, 'value', c.values[1]))
from cookies c
where c.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests)
union all
(select requests.id as id,
'informed-consent' as dataset,
null as initiator,
platform,
run_type as runType,
requests.start_time as startTime,
method,
http_version as httpVersion,
scheme,
host,
port,
content_raw as content,
coalesce((select json_agg(json_build_object('name', h.name, 'value', h.values[1]))
from headers h
where h.request = requests.id), '[]'::json) as headers,
coalesce((select json_agg(json_build_object('name', c.name, 'value', c.values[1]))
from cookies c
where c.request = requests.id), '[]'::json) as cookies
from requests
join runs r on r.id = requests.run
join apps a on r.app = a.id
where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id)); The subqueries for the headers and cookies make this very slow but that's not too much of a problem since I only want to export the full view once and then never touch it again. |
I severely underestimated how long this was going to take. The In the meantime, I tried creating a second materialized view, hoping that it would be faster, but I aborted that after almost two hours when I have no idea what is going on. The two queries making up the union "only" take less than two minutes each to run. How can combining them into a view take this ridiculously long? |
Okay, I think I've found the source of my confusion. I'm pretty sure DataGrip helpfully automatically inserted a Even if I manually add a |
Okay. But if computing 500 rows already takes 2 minutes, I shouldn't be surprised that 220k rows take a long, long time. :D So, the subqueries are a lot worse than I figured. But thinking about this some more, it does actually make sense: there are 1.6M header rows and 150k cookie rows, and it has to go through all of these for each of the 220k rows… I confirmed this by running the query with That clearly shows that the headers and cookies (though the latter much less so) are the problem: |
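One thing that might help here (a guess, not verified against this database): indexes on the request columns of headers and cookies, so that each correlated subquery doesn't have to scan all 1.6M header rows again:
create index if not exists headers_request_idx on headers (request);
create index if not exists cookies_request_idx on cookies (request);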
My first idea for optimisation was to precompute the headers and cookies JSON like this: with cookiesJson as (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
from cookies c
group by c.request),
headersJson as (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
from headers h
group by h.request)
select filtered_requests.id as id,
'informed-consent' as dataset,
filtered_requests.name || '@' || version as initiator,
platform,
run_type as runType,
start_time as startTime,
method,
http_version as httpVersion,
scheme,
host,
port,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests; However, that didn't help at all. The query still takes the exact same amount of time. |
I don't really understand why that is. Computing Looking at the query plan, it's almost exactly the same, with just an additional "Subquery Scan" for each of the subqueries. Clearly, I'm misunderstanding something here. |
It appears as though the common table expressions (the If I instead make them into materialized views like this: create materialized view cookiesJson as (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
from cookies c
group by c.request);
create materialized view headersJson as (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
from headers h
group by h.request);
select filtered_requests.id as id,
'informed-consent' as dataset,
filtered_requests.name || '@' || version as initiator,
platform,
run_type as runType,
start_time as startTime,
method,
http_version as httpVersion,
scheme,
host,
port,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests limit 501; Now the
I'm sure this can be improved a lot more. We should probably be using Whatever. Considering how few datasets in that legacy format we need to import, I think I'll just take the ~80% improvement. Those also have fewer requests. |
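One variant along those lines that I'd expect to behave better (just a sketch, not benchmarked here): joining the precomputed views once instead of running a correlated subquery per row:
select filtered_requests.id as id,
       -- ... same columns as above ...
       coalesce(hj.json, '[]'::json) as headers,
       coalesce(cj.json, '[]'::json) as cookies
from filtered_requests
         left join headersJson hj on hj.request = filtered_requests.id
         left join cookiesJson cj on cj.request = filtered_requests.id;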
One annoyance of the schema we ended with: Because In the old schema, we had both a DB Browser for SQLite displays this as I would expect: |
I thought that SQLite's weak type system might help us here:
But alas, it doesn't. If I alter the |
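In the meantime, the bodies can at least be inspected ad hoc by casting in the query (this assumes the blob actually contains UTF-8 text; genuinely binary bodies would come out garbled):
select id, host, path, cast(content as text) as contentText
from requests
limit 20;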
But that's okay. If I install the datasette-render-binary plugin ( Still not ideal (because now the content column doesn't wrap and is ridiculously long) but we can fix this with CSS or our own custom plugin. Either way, not a priority right now. |
Here's how I exported the dataset from the data safety label analysis. First I created the views: create materialized view cookiesJson as (
select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json from cookies c group by c.request
);
create materialized view headersJson as (
select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json from headers h group by h.request
);
create materialized view "worryingConfessionsRequests" as
(select filtered_requests.id as id,
'worrying-confessions' as dataset,
filtered_requests.name || '@' || version as initiator,
'android' as platform,
'no-interaction' as "runType",
start_time as "startTime",
method,
http_version as "httpVersion",
scheme,
host,
port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = filtered_requests.id),
'[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests)
union all
(select requests.id as id,
'worrying-confessions' as dataset,
null as initiator,
'android' as platform,
'no-interaction' as "runType",
requests.start_time as "startTime",
method,
http_version as "httpVersion",
scheme,
host,
port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
from requests
join runs r on r.id = requests.run
join apps a on r.app = a.id
where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id)); This only took 23s (but that dataset also only has 20k requests). Then I exported that view to an SQLite database: db-to-sqlite "postgresql://dsl:<pw>@localhost:5432/dsl" worrying-confessions.db --table worryingConfessionsRequests --progress Finally, I renamed the exported table to |
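For reference, the rename in the exported SQLite file is a plain ALTER TABLE; the target name requests is taken from the combined-database queries further down:
alter table worryingConfessionsRequests rename to requests;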
And for the iOS analysis: create materialized view "iosWatchingYouRequests" as
(select filtered_requests.id as id,
'ios-watching-you' as dataset,
a.name || '@' || a.version as initiator,
'ios' as platform,
'no-interaction' as "runType",
filtered_requests.start_time as "startTime",
method,
http_version as "httpVersion",
scheme,
host,
port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = filtered_requests.id),
'[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests
join runs r on r.id = filtered_requests.run
join apps a on r.app = a.name)
union all
(select requests.id as id,
'ios-watching-you' as dataset,
null as initiator,
'ios' as platform,
'no-interaction' as "runType",
requests.start_time as "startTime",
method,
http_version as "httpVersion",
scheme,
host,
port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
from requests
where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id)); Followed by: db-to-sqlite "postgresql://ios:<pw>@localhost:5433/ios" ios-watching-you.db --table iosWatchingYouRequests --progress Here, creating the view took a bit longer (2.5 mins for 41k requests). And finally, I again renamed the table to |
And finally, the first Android analysis. create materialized view cookiesJson as (
select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json from cookies c group by c.request
);
create materialized view headersJson as (
select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json from headers h group by h.request
);
create materialized view "doTheyTrackRequests" as
(select filtered_requests.id as id,
'do-they-track' as dataset,
a.name || '@' || a.version as initiator,
'android' as platform,
'no-interaction' as "runType",
filtered_requests.start_time as "startTime",
method,
null as "httpVersion",
null as scheme,
host,
null as port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = filtered_requests.id),
'[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests
join runs r on r.id = filtered_requests.run
join apps a on r.app = a.name)
union all
(select requests.id as id,
'do-they-track' as dataset,
null as initiator,
'android' as platform,
'no-interaction' as "runType",
requests.start_time as "startTime",
method,
null as "httpVersion",
null as scheme,
host,
null as port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
from requests
where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id)); For that one, we unfortunately hadn't quite settled on the final schema yet and weren't collecting HTTP version, scheme, and port. I had to set those to null. For 68k requests, creating the view took a little more than 9 minutes. Then: db-to-sqlite "postgresql://projprakt:<pw>@localhost:5431/projprakt" do-they-track.db --table doTheyTrackRequests --progress And finally, I again renamed the table to |
The Postgres views threw away my casing and made the column names lower-case because I didn't put them in double quotes. I manually fixed that in the SQLite tables. |
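For reference, a sketch of what that fix looks like in SQLite (RENAME COLUMN needs SQLite 3.25 or newer; runType, startTime, and httpVersion are the affected columns from the schema above):
alter table requests rename column runtype to "runType";
alter table requests rename column starttime to "startTime";
alter table requests rename column httpversion to "httpVersion";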
Oh no. I forgot to export the |
Okay, I've recreated the databases above and updated the comments. The one for my master's thesis is still exporting. But I'll use the chance to also document the SQL for that one: create materialized view cookiesJson as (select request, json_agg(json_build_object('name', c.name, 'value', c.values[1])) as json
from cookies c
group by c.request);
create materialized view headersJson as (select request, json_agg(json_build_object('name', h.name, 'value', h.values[1])) as json
from headers h
group by h.request);
create materialized view "informedConsentRequests" as
(select filtered_requests.id as id,
'informed-consent' as dataset,
filtered_requests.name || '@' || version as initiator,
platform,
run_type as "runType",
start_time as "startTime",
method,
http_version as "httpVersion",
scheme,
host,
port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = filtered_requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = filtered_requests.id), '[]'::json) as cookies
from filtered_requests)
union all
(select requests.id as id,
'informed-consent' as dataset,
null as initiator,
platform,
run_type as "runType",
requests.start_time as "startTime",
method,
http_version as "httpVersion",
scheme,
host,
port,
path,
content_raw as content,
coalesce((select json from headersJson where headersJson.request = requests.id), '[]'::json) as headers,
coalesce((select json from cookiesJson where cookiesJson.request = requests.id), '[]'::json) as cookies
from requests
join runs r on r.id = requests.run
join apps a on r.app = a.id
where not exists(select 1 from filtered_requests where requests.id = filtered_requests.id)); And in the SQLite table, I normalized the runType values: update requests set runType = 'no-interaction' where runType = 'initial';
update requests set runType = 'consent-dialog-accepted' where runType = 'accepted';
update requests set runType = 'consent-dialog-rejected' where runType = 'rejected'; |
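A quick check that the normalization caught all values could look like this (just a sanity check, nothing more):
select runType, count(*) from requests group by runType;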
We want to combine those databases into one. We start by opening a new database file: sqlite3 data.db Load the
Then, we can attach all databases: attach 'do-they-track.db' as doTheyTrack;
attach 'informed-consent.db' as informedConsent;
attach 'ios-watching-you.db' as iosWatchingYou;
attach 'worrying-confessions.db' as worryingConfessions;
attach 'monkey-july-2023.db' as monkeyJuly2023; Create the table: create table "requests" (
"id" text,
"dataset" text,
"initiator" text,
"platform" text not null,
"runType" text,
"startTime" text not null,
"method" text not null,
"httpVersion" text,
"endpointUrl" text,
"scheme" text,
"host" text not null,
"port" integer,
"path" text not null,
"content" blob,
"headers" text not null,
"cookies" text not null,
primary key("dataset", "id")
) without rowid; And insert all the data: insert into requests
select id, dataset, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies from (
with vendors as materialized (
select id, dataset, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies,
case
when initiator is null then null
when instr(initiator,'.') = 0 then initiator
else regex_replace('\.[^.]+@.+?$', initiator, '')
end as vendor,
-- For the `do-they-track` requests, we don't know the scheme. But it really doesn't make sense to consider that for this, anyway.
regex_replace('\?.+$', host || path, '') as _endpointUrl
from (
select * from monkeyJuly2023.requests
union all
select * from worryingConfessions.requests
union all
select * from informedConsent.requests
union all
select * from iosWatchingYou.requests
union all
select * from doTheyTrack.requests
)
)
select * from vendors where
-- Only include requests that are made to the same endpointUrl by apps from at least two different vendors (https://github.com/tweaselORG/meta/issues/33#issuecomment-1658348929).
_endpointUrl in (
select _endpointUrl from vendors group by _endpointUrl having count(distinct vendor) >= 2
union
select _endpointUrl from vendors where vendor is null
)
-- Filter out iOS system background traffic as that may contain authentication values (https://github.com/tweaselORG/meta/issues/33#issuecomment-1660099572),
and (not platform = 'ios' or initiator is not null)
); EDIT: Updated according to #33 (comment) |
In #33 (comment), I decided to not include the endpoint URL in the exports from the datasets and to rather dynamically generate that through a view. The problem is that that needs a So instead, I'll modify the schema as such: create table "requests" (
"id" text,
"dataset" text,
"initiator" text,
"platform" text not null,
"runType" text,
"startTime" text not null,
"method" text not null,
"httpVersion" text,
"endpointUrl" text,
"scheme" text,
"host" text not null,
"port" integer,
"path" text not null,
"content" blob,
"headers" text not null,
"cookies" text not null,
primary key("dataset", "id")
) without rowid; And manually populated that for the existing rows by loading sqlite-regex: wget -O regex0.so https://github.com/asg017/sqlite-regex/releases/download/v0.2.2/linux-x86_64-regex0.so
sqlite3 out.db
> .load ./regex0 And then: update requests set endpointUrl = regex_replace('\?.+$', scheme || '://' || host || path, '') where scheme is not null; |
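A quick sanity check of what that expression produces (with the regex extension loaded; the URL is a made-up example):
-- yields 'https://tracker.example.com/v1/events'
select regex_replace('\?.+$', 'https://tracker.example.com/v1/events?id=123', '');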
We should also install the |
Actually no, that is not a good idea. I should instead do that in the dataset exports. I want those to be immutable blobs and Instead, I now added the |
It might be a good idea to start Datasette with |
As we don't care too much about endpoints for individual apps and want to avoid making companies unnecessarily angry about exposing their internal APIs, we have decided to only publish requests to endpoints that are contacted by apps from at least two different vendors. As we only have app IDs, we use Apple's definition for determining the vendor:
Here's the query I came up with (that was quite fiddly and I ended up needing the create view "filteredRequests" as
with vendors as materialized (
select id, dataset, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies,
case
when initiator is null then null
when instr(initiator,'.') = 0 then initiator
else regex_replace('\.[^.]+@.+?$', initiator, '')
end as vendor,
-- For the `do-they-track` requests, we don't know the scheme. But it really doesn't make sense to consider that for this, anyway.
regex_replace('\?.+$', host || path, '') as _endpointUrl
from requests
)
select * from vendors where _endpointUrl in (
select _endpointUrl from vendors group by _endpointUrl having count(distinct vendor) >= 2
union
select _endpointUrl from vendors where vendor is null
)
; |
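To illustrate the vendor logic with a made-up bundle ID (again with the sqlite-regex extension loaded): the regex strips the last ID component together with the version suffix, leaving the vendor prefix:
-- yields 'com.example'
select regex_replace('\.[^.]+@.+?$', 'com.example.someapp@1.2.3', '') as vendor;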
We'll also need to link our legal notice and privacy policy, of course. We can do that using a custom footer template. For that, we need to run with |
I'll also add a CREATE TABLE "datasets" (
"slug" TEXT,
"title" TEXT NOT NULL,
"description" TEXT NOT NULL,
"url" TEXT NOT NULL,
"sourceCodeUrl" TEXT,
PRIMARY KEY("slug")
) WITHOUT ROWID; Content so far: INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('do-they-track', 'Do they track? Automated analysis of Android apps for privacy violations', 'To get a grasp of how common and extensive data collection in Android apps really is, Malte and Benni developed a series of scripts to download, install and start Android apps in an emulator and to collect and analyse their network traffic.
The apps were run for 60 seconds in an Android 11 emulator without any user input, meaning that no consent was given.', 'https://benjamin-altpeter.de/doc/presentation-android-privacy.pdf', '');
INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('informed-consent', 'Informed Consent? A Study of “Consent Dialogs” on Android and iOS', 'Consent dialogs have become ubiquitous with seemingly every website and app pleading users to agree to their personal data being processed and their behaviour being tracked, often with the help of tens or even hundreds of third-party companies. For his master’s thesis, Benni studied consent dialogs on Android and iOS in an automated and dynamic manner, analysing 4,388 popular apps from both platforms. Among other things, he measured the effect of the user’s choice in the consent dialog by comparing the traffic from before any interaction with the traffic after accepting and rejecting the dialog and analysing contacted trackers and transmitted data types.
The apps were initially run for 60 seconds without interaction in an Android 11 emulator and on a physical iPhone 7 running iOS 14.8. In the runs of requests with a `runType` of `no-interaction`, nothing else was done. For the ones with `consent-dialog-accepted`, after 60 seconds, a discovered consent dialog was accepted and then they were left running without interaction for another 60 seconds. The same was done for the ones with `consent-dialog-rejected`, but the consent dialog was rejected here.', 'https://benjamin-altpeter.de/doc/thesis-consent-dialogs.pdf', 'https://github.com/baltpeter/thesis-mobile-consent-dialogs');
INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('ios-watching-you', 'iOS watching you: Automated analysis of “zero-touch” privacy violations under iOS', 'As a follow-up project to the “Do they track?”, Benni also looked at the iOS ecosystem. He ended up analysing 1,001 apps from the top charts of the German App Store as of May 2021.
The apps were run for 60 seconds without interaction on a physical iPhone 8 running iOS 14.5.1.', 'https://benjamin-altpeter.de/doc/presentation-ios-privacy.pdf', 'https://github.com/baltpeter/ios-privacy-violations');
INSERT INTO "main"."datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('monkey-july-2023', 'Traffic collection for TrackHAR adapter work (July 2023)', 'For writing new adapters for TrackHAR and properly documenting the old ones, Benni ran a traffic collection on 804 apps on Android and 1062 apps on iOS from the top charts.
The Android apps were run in an Android 11 emulator for 60 seconds, receiving random input from `adb monkey`; as such, it is possible/likely that consent was given when requested. The iOS apps were run on a physical iPhone X running iOS 15.6.1 for 60 seconds without any interaction.', 'https://github.com/tweaselORG/experiments/issues/1', 'https://github.com/tweaselORG/experiments/tree/main/monkey-july-2023');
INSERT INTO "datasets" ("slug", "title", "description", "url", "sourceCodeUrl") VALUES ('worrying-confessions', 'Worrying confessions: A look at data safety labels on Android', 'In 2022, the Google Play Store introduced a data safety section in order to give users accessible insights into apps’ data collection practices. To verify the declarations, Benni recorded the network traffic of 500 apps, finding more than one quarter of them transmitting tracking data not declared in their data safety label.
The apps were run for 60 seconds in an Android 11 emulator without any user input.', 'https://www.datarequests.org/blog/android-data-safety-labels-analysis/', 'https://github.com/datenanfragen/android-data-safety-label-analysis'); |
As we had a That's a nice feature but annoyingly space-intensive in this case. We can turn that off using a {
"databases": {
"data": {
"tables": {
"datasets": {
"label_column": "slug"
}
}
}
}
} Now, we still get a link but not the full title in each row: |
I just realized that the iOS datasets also include requests to Apple endpoints with cookies of our actual Apple IDs… :D |
A time limit of 1 second for queries stops us from doing interesting things. Let's allow ten seconds instead: |
I tried going to https://appleid.apple.com/account/manage and changing my password with the "Sign out of Apple devices and websites associated with your Apple ID." checkbox enabled. And while the iPhones now both do pester me to sign in again and many endpoints do return a 401, laughably I can still do quite a lot with the old authentication tokens, including downloading apps and even buying new (free) ones! |
Going through the same process again didn't help either. |
Looks like the only choice we have is to filter out these requests. From looking through the data for a while, I'm pretty confident that using my filtered requests view for that is sufficient. |
I have written a quick script for creating the DB. That seems better than always updating that comment. :D |
I've found what the problem was here: It was the suggested facets. If I set |
I've started setting up the instance on an Uberspace. Unfortunately, I ran into a few problems:
For now, I'll just live with that and remove mentions of the broken features from the homepage. @zner0L can decide what (if anything) we do about this. |
That is an even bigger problem than I initially thought. It's not just annoying for interactive use but it also means that we can't query for requests that match an adapter (since the endpoint URLs there can be regexes). |
Two more useful extensions we could install (but which probably won't work on Uberspace, either): |
There were a few things that bugged me about how our database was rendered by Datasette that I fixed through two custom plugins:
|
Another thing I noticed while doing adapter work: It is really annoying not to have an endpoint URL for all requests. Thus, I decided to backfill that for the |
We've decided to move to a Hetzner server. |
Done. https://data.tweasel.org/ is now running on Hetzner, and all extensions work. |
The barrier of entry for looking into how individual trackers work is currently very high. To be able to reliably reverse-engineer endpoints, you need a large dataset of (mobile app) traffic with hundreds of thousands of requests from thousands of apps. Even though our tools and libraries make it easier to collect such a dataset, this still makes it really hard for anyone who hasn't spent significant prior effort. And they won't have access to historical data either way. That is obviously not in our interest. Our goal is to shine a light on how trackers work and what they collect, and as such we of course want as many people as possible researching them.
This is also relevant for TrackHAR. We need to create documentation on why/how we have concluded what certain values transmitted to a tracking endpoint mean (tweaselORG/TrackHAR#9). To be able to do that in a way that is replicable by others, we need to be able to reference the actual requests we have based our conclusions on.
As such, we want to create a public repository/database of tracking requests. We already have multiple large datasets of mobile tracking requests from previous research projects and will be continuously collecting new requests through our platform.
This will finally also greatly benefit trackers.tweasel.org. We already have static example values for the existing adapters from one of our datasets (tweaselORG/tracker-wiki#3). With a public request database, we could be computing this data on the fly and even provide a long list of real example requests for each adapter.