Various tools used for AppGoblin. Mostly scripts that crawl the Google Play and Apple App Stores for as many apps as possible and check for their app-ads.txt files.
Scrapers:
- pull apps from the Apple App Store and Google Play Store top lists
- pull apps from some 3rd party stores
- unzip/decompile Android APKs and iOS IPAs to look for 3rd party tracking/advertising tools
- Various SQL DDL for https://github.com/ddxv/app-store-dash
App-ads.txt files are crawled based on the Interactive Advertising Bureau's Tech Lab specs. https://iabtechlab.com/ads-txt/
Collecting app-ads.txt files takes several more steps than collecting regular ads.txt files:
- Get store ids
- Get the developer URL for each store id
- Crawl the developer URL
These steps make it a bit more challenging to collect app-ads.txt files, as you first need a list of store_ids to check. For that reason this project comes bundled with two crawlers, one for the Apple iTunes store and one for the Google Play store, to discover new ids. If you do not need to discover new ids, you can skip the crawlers and insert your store_ids directly into the database's store_apps table.
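The last of the steps above (crawling the developer URL) can be sketched roughly as follows. This is a minimal illustration, not the project's actual crawler: the function names `app_ads_txt_url` and `parse_app_ads_txt` are hypothetical, the host handling is a simplification of the IAB rules (it only strips a leading `www.`), and fetching the file over HTTP is left out.

```python
from urllib.parse import urlsplit


def app_ads_txt_url(developer_url: str) -> str:
    """Build the app-ads.txt URL for a developer website (simplified)."""
    # Accept bare domains by assuming https when no scheme is given.
    if "://" not in developer_url:
        developer_url = "https://" + developer_url
    host = urlsplit(developer_url).hostname or ""
    # Drop one leading "www." so the file is requested from the base host.
    if host.startswith("www."):
        host = host[len("www."):]
    return f"https://{host}/app-ads.txt"


def parse_app_ads_txt(text: str) -> list[dict]:
    """Parse app-ads.txt content into a list of seller records."""
    records = []
    for raw_line in text.splitlines():
        # Everything after "#" is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        # Data records need at least: ad domain, publisher id, relationship.
        # Variable lines such as "contact=..." are skipped here.
        if len(fields) < 3:
            continue
        relationship = fields[2].upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue
        records.append(
            {
                "ad_domain": fields[0].lower(),
                "publisher_id": fields[1],
                "relationship": relationship,
                # The certification authority id is optional.
                "certification_auth_id": fields[3]
                if len(fields) > 3 and fields[3]
                else None,
            }
        )
    return records


sample = """# example file
google.com, pub-123, DIRECT, f08c47fec0942fa0
contact=ads@example.com
example.net, 42, reseller # inline comment
"""
url = app_ads_txt_url("https://www.example.com/games/about")
records = parse_app_ads_txt(sample)
```

Here `url` is the location a crawler would request, and `records` holds one dict per valid line of the sample file.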
- Python environment: Python 3.11/3.12
- Requires: NodeJS
- PostgreSQL: 15
- Init db using pg-ddl/db.sql
- Setup python environment
python3.12 -m venv .virtualenv
source .virtualenv/bin/activate
pip install -r requirements.txt
npm install --save google-play-scraper
- App discovery is quite slow. To speed things up, you can insert a pre-existing list of store_ids directly into PostgreSQL's public.store_apps table.
- From your environment run
python main.py
to check setup. This will also help verify the database connection is working.
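For the store_id injection mentioned in the setup steps, a rough sketch of preparing rows for a bulk insert might look like the following. The column names `store` and `store_id`, the numeric store code `1`, and the helper `build_store_app_rows` are assumptions made for illustration only; check pg-ddl/db.sql for the real schema before using anything like this.

```python
def build_store_app_rows(store_ids, store):
    """Return deduplicated (store, store_id) tuples ready for a bulk insert."""
    seen = set()
    rows = []
    for raw in store_ids:
        store_id = raw.strip()
        # Skip blank lines and ids that were already seen.
        if not store_id or store_id in seen:
            continue
        seen.add(store_id)
        rows.append((store, store_id))
    return rows


# Hypothetical column names; confirm against pg-ddl/db.sql.
INSERT_SQL = (
    "INSERT INTO public.store_apps (store, store_id) "
    "VALUES (%s, %s) ON CONFLICT DO NOTHING"
)

# Example input, e.g. lines read from a text file of store ids.
# The store code 1 is a placeholder, not the project's real value.
rows = build_store_app_rows(
    ["com.example.app\n", "", "com.example.app", "com.example.other"],
    store=1,
)
```

With a PostgreSQL driver such as psycopg, the rows could then be written with `cursor.executemany(INSERT_SQL, rows)` inside a transaction.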
-t --use-ssh-tunnel
Use to toggle SSH port forwarding for connecting to a remote PostgreSQL database.
-n --new-apps-check
Crawl stores to discover new app store ids. This is limited and not fast. By default it will check top apps for each category and collection.
-u --update-app-details
Update app store details. Requires that you have some app store ids already in store_apps. Will keep ids up to date within a few days for big apps and a week for smaller apps.
-a --app-ads-txt-scrape
Requires that store_ids and their details (which include the developer URLs) have been scraped. Will crawl the developer's URL for an app-ads.txt file. Results are stored in app_ads_entrys for each URL.
-l --limit-processes
Run only 1 instance of adscrawler. Will check to see if /adscrawler/main.py is already in process. If it already exists the program will not run.
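As an illustration of how the options above fit together, here is a minimal sketch that declares the same flags with Python's `argparse`. It is not the project's actual main.py; the `build_parser` function and the help strings are assumptions, and only the flag names come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command line flags as simple on/off switches."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument(
        "-t", "--use-ssh-tunnel", action="store_true",
        help="Use SSH port forwarding to reach a remote PostgreSQL database",
    )
    parser.add_argument(
        "-n", "--new-apps-check", action="store_true",
        help="Crawl stores to discover new app store ids",
    )
    parser.add_argument(
        "-u", "--update-app-details", action="store_true",
        help="Update details for store ids already in store_apps",
    )
    parser.add_argument(
        "-a", "--app-ads-txt-scrape", action="store_true",
        help="Crawl developer URLs for app-ads.txt files",
    )
    parser.add_argument(
        "-l", "--limit-processes", action="store_true",
        help="Exit if another instance is already running",
    )
    return parser


# Equivalent to running: python main.py -n -l
args = build_parser().parse_args(["-n", "-l"])
```

`argparse` turns each long flag into an attribute with underscores, so `args.new_apps_check` and `args.limit_processes` are true here while the other three stay false.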