Write README and docstrings #7

Merged
merged 7 commits into from
Apr 14, 2023
90 changes: 85 additions & 5 deletions README.md
@@ -2,7 +2,12 @@

> Library for detecting tracking data transmissions from traffic in HAR format.

For research into mobile privacy and for complaints against tracking, it is important to know what data is transmitted in a request to a tracking server. However, these requests come in a huge variety of formats and are often heavily nested and/or obfuscated, which hinders straightforward automatic analysis. TrackHAR aims to address this problem. It takes recorded traffic in a [HAR file](http://www.softwareishard.com/blog/har-12-spec/) as input and returns a parsed list of the transmitted data (and, optionally, additional metadata like the tracking company and the location in the data) for each request it can handle.

To achieve this, TrackHAR uses adapters written for specific tracking endpoints. In our [research](https://benjamin-altpeter.de/doc/thesis-consent-dialogs.pdf), we have found that generic approaches (like indicator matching in the raw transmitted plain text or [base64-encoded](https://github.com/baltpeter/base64-search) request content) are not sufficient due to the frankly ridiculous nesting and obfuscation we observed. In addition, approaches that search for static honey data values can never capture dynamic data types such as free disk space and current RAM usage, or low-entropy values like the operating system version (e.g. `11`).
However, we have also noticed that a comparatively small number of tracking endpoints make up a large portion of all app traffic. This makes our adapter-based approach feasible for detecting most of the transmitted tracking data. But it will never be possible to write an adapter for every request. As such, we plan to implement [support for indicator matching](https://github.com/tweaselORG/TrackHAR/issues/6) in the future, as a fallback for requests not covered by any adapter.
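
To make the limitation of indicator matching concrete, here is a minimal sketch of plain-text matching. It is not part of TrackHAR: `findIndicator` and the request body are made up for illustration. A high-entropy honey value such as a device ID matches exactly where it was transmitted, while a low-entropy value like the OS version `11` also matches inside unrelated data.

```ts
// Hypothetical helper, not part of TrackHAR: return the offsets of every
// occurrence of `indicator` in the raw request content.
const findIndicator = (content: string, indicator: string): number[] => {
    const offsets: number[] = [];
    let index = content.indexOf(indicator);
    while (index !== -1) {
        offsets.push(index);
        index = content.indexOf(indicator, index + 1);
    }
    return offsets;
};

// Made-up request body for illustration.
const body = 'device_id=355d2c7e339c6855&os_version=11&ts=1681471100&build=2.11.0';

// The device ID is found once, at the position where it was actually sent.
const idMatches = findIndicator(body, '355d2c7e339c6855');

// `11` is found in the OS version, but also in the timestamp and build number.
const versionMatches = findIndicator(body, '11');

console.log(idMatches, versionMatches);
```

Without knowing the structure of the request, the matcher cannot tell which of the `11` matches is the OS version. An adapter that knows the endpoint's format does not have this problem.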

An important additional goal of TrackHAR is to produce output from which human-readable documentation can be generated automatically, so that people can comprehend why we detected each data transmission. This is especially important for submitting complaints against illegal tracking to the data protection authorities. The generation of these reports is not handled by TrackHAR itself, but this requirement influences the design of our adapters and return values. As a result, the adapters are not regular functions that know how to handle a request, but implement a specific custom decoding "language" that can more easily be parsed and reasoned about automatically.
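
The idea of describing decoding as data rather than as a function can be sketched as follows. This is not TrackHAR's actual adapter format: the `DecodingStep` type, the step names, and the `runSteps` and `describeSteps` helpers are all hypothetical and only illustrate why a declarative description is easier to document automatically.

```ts
// Hypothetical step types; TrackHAR's real decoding language may differ.
type DecodingStep =
    | { op: 'parseQueryString'; input: string; output: string }
    | { op: 'parseJson'; input: string; output: string };

// Parse `a=1&b=2` into an object using only standard decoding functions.
const parseQueryString = (query: string): Record<string, string> => {
    const result: Record<string, string> = {};
    for (const pair of query.split('&')) {
        if (!pair) continue;
        const separator = pair.indexOf('=');
        const key = separator === -1 ? pair : pair.slice(0, separator);
        const value = separator === -1 ? '' : pair.slice(separator + 1);
        result[decodeURIComponent(key)] = decodeURIComponent(value.replace(/\+/g, ' '));
    }
    return result;
};

// Resolve a dotted path like `form.payload` against the decoded variables.
const getPath = (vars: Record<string, unknown>, path: string): unknown =>
    path
        .split('.')
        .reduce<unknown>(
            (value, key) =>
                typeof value === 'object' && value !== null ? (value as Record<string, unknown>)[key] : undefined,
            vars
        );

// Interpret the steps in order; each step reads one value and writes a new variable.
const runSteps = (steps: DecodingStep[], initial: Record<string, unknown>): Record<string, unknown> => {
    const vars: Record<string, unknown> = { ...initial };
    for (const step of steps) {
        const input = getPath(vars, step.input);
        if (typeof input !== 'string') throw new Error(`Input "${step.input}" is not a string.`);
        vars[step.output] = step.op === 'parseJson' ? JSON.parse(input) : parseQueryString(input);
    }
    return vars;
};

// Because the steps are plain data, they can also be turned into an explanation.
const describeSteps = (steps: DecodingStep[]): string[] =>
    steps.map((s, i) => `${i + 1}. Apply ${s.op} to "${s.input}" and store the result in "${s.output}".`);

// A made-up request: a form-encoded body that carries URL-encoded JSON.
const exampleSteps: DecodingStep[] = [
    { op: 'parseQueryString', input: 'body', output: 'form' },
    { op: 'parseJson', input: 'form.payload', output: 'payload' },
];
const decoded = runSteps(exampleSteps, { body: 'os_version=13&payload=%7B%22id%22%3A%22abc%22%7D' });
console.log(decoded.payload, describeSteps(exampleSteps));
```

A regular function that does the same decoding would be shorter, but a report generator could not inspect it. With the steps as data, the same definition drives both the decoding and its description.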

## Installation

@@ -17,15 +22,90 @@ yarn add trackhar

A full API reference can be found in the [`docs` folder](/docs/README.md).

<!--
## Example usage

Use the `process()` function to parse traffic from a HAR file and extract the transmitted data:

```ts
import { readFile } from 'fs/promises';
import { process as processHar } from 'trackhar';

(async () => {
    // Read the HAR file whose path was passed as the first CLI argument.
    const har = await readFile(process.argv[2], 'utf-8');

    // The result contains one entry per request in the HAR file.
    const data = await processHar(JSON.parse(har));
    for (const request of data) console.log(request, '\n');
})();
```

The output will look something like this for a HAR file containing two requests:

```ts
undefined

[
  {
    adapter: 'yandex/appmetrica',
    property: 'otherIdentifiers',
    context: 'query',
    path: 'deviceid',
    reasoning: 'obvious property name',
    value: 'cc89d0f3866e62c804a5a6f81f4aad3b'
  },
  {
    adapter: 'yandex/appmetrica',
    property: 'otherIdentifiers',
    context: 'query',
    path: 'android_id',
    reasoning: 'obvious property name',
    value: '355d2c7e339c6855'
  },
  {
    adapter: 'yandex/appmetrica',
    property: 'osName',
    context: 'query',
    path: 'app_platform',
    reasoning: 'obvious property name',
    value: 'android'
  },
  {
    adapter: 'yandex/appmetrica',
    property: 'osVersion',
    context: 'query',
    path: 'os_version',
    reasoning: 'obvious property name',
    value: '13'
  },
]
```

The first request could not be handled by any adapter, so it is returned as `undefined`. The second request was handled by the `yandex/appmetrica` adapter, which detected the transmission of two IDs as well as the operating system name and version.
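
When post-processing the result, keep in mind that unhandled requests appear as `undefined` entries. The following sketch works on a hard-coded sample instead of calling TrackHAR. The `Transmission` type only mirrors the fields visible in the example output and is not imported from the library, and `summarize` is a hypothetical helper.

```ts
// Mirrors the fields shown in the example output; not imported from TrackHAR.
type Transmission = {
    adapter: string;
    property: string;
    context: string;
    path: string;
    reasoning: string;
    value: string;
};
type RequestResult = Transmission[] | undefined;

// Hypothetical helper: count the unhandled requests and collect the distinct
// detected properties per adapter.
const summarize = (results: RequestResult[]) => {
    const propertiesByAdapter: Record<string, string[]> = {};
    let unhandled = 0;
    for (const result of results) {
        if (!result) {
            unhandled++;
            continue;
        }
        for (const { adapter, property } of result) {
            const properties = propertiesByAdapter[adapter] || [];
            if (properties.indexOf(property) === -1) properties.push(property);
            propertiesByAdapter[adapter] = properties;
        }
    }
    return { unhandled, propertiesByAdapter };
};

// Shared fields of the sample entries, taken from the example output.
const base = { adapter: 'yandex/appmetrica', context: 'query', reasoning: 'obvious property name' };
const sample: RequestResult[] = [
    undefined,
    [
        { ...base, property: 'otherIdentifiers', path: 'deviceid', value: 'cc89d0f3866e62c804a5a6f81f4aad3b' },
        { ...base, property: 'otherIdentifiers', path: 'android_id', value: '355d2c7e339c6855' },
        { ...base, property: 'osName', path: 'app_platform', value: 'android' },
        { ...base, property: 'osVersion', path: 'os_version', value: '13' },
    ],
];

const summary = summarize(sample);
console.log(summary);
```

Checking for `undefined` before iterating matters here, because looping over an unhandled entry directly would throw.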

If you are only interested in the transmitted data and don't need the additional metadata, you can use the `valuesOnly` option:

```ts
import { readFile } from 'fs/promises';
import { process as processHar } from 'trackhar';

(async () => {
    const har = await readFile(process.argv[2], 'utf-8');

    const data = await processHar(JSON.parse(har), { valuesOnly: true });
    for (const request of data) console.log(request, '\n');
})();
```

For our HAR file, this will produce the following output:

```ts
undefined

{
  otherIdentifiers: [ 'cc89d0f3866e62c804a5a6f81f4aad3b', '355d2c7e339c6855' ],
  osName: [ 'android' ],
  osVersion: [ '13' ]
}
```
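
Judging from the two example outputs, the values-only format looks like the full result grouped by `property`. The sketch below illustrates that relationship under this assumption. It is not TrackHAR's implementation: `toValuesOnly` and the reduced `Transmission` type are hypothetical, and the sample is abbreviated from the full output.

```ts
// Reduced to the fields needed for the grouping; not imported from TrackHAR.
type Transmission = { adapter: string; property: string; path: string; value: string };

// Hypothetical helper: group the values of one request's result by property.
// Unhandled requests stay `undefined`.
const toValuesOnly = (result: Transmission[] | undefined): Record<string, string[]> | undefined => {
    if (!result) return undefined;
    const grouped: Record<string, string[]> = {};
    for (const { property, value } of result) {
        (grouped[property] = grouped[property] || []).push(value);
    }
    return grouped;
};

const fullResult: Transmission[] = [
    { adapter: 'yandex/appmetrica', property: 'otherIdentifiers', path: 'deviceid', value: 'cc89d0f3866e62c804a5a6f81f4aad3b' },
    { adapter: 'yandex/appmetrica', property: 'otherIdentifiers', path: 'android_id', value: '355d2c7e339c6855' },
    { adapter: 'yandex/appmetrica', property: 'osName', path: 'app_platform', value: 'android' },
    { adapter: 'yandex/appmetrica', property: 'osVersion', path: 'os_version', value: '13' },
];

const valuesOnly = toValuesOnly(fullResult);
console.log(valuesOnly);
```

The grouped form is more convenient when only the transmitted values matter, but it drops the `path` and `reasoning` needed to explain a detection.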
-->

## License
