Contact: Alyssa Zhu [email protected]
The UK Biobank Data Parser (ukbb_parser) is a python-based package that allows easy interfacing with the large UK Biobank dataset (Zhu et al., OHBM 2019).
To avoid repetitive calls for standard demographic variables, we created default output columns for sex (self-reported and genetic), age at the time of the neuroimaging scan, race/ethnicity, and education. Self-reported race (Data-Field 21000) is coded first by the larger race group due to inconsistent reporting of ethnicity. Genetic ethnic grouping (Data-Field 22006) and the first 10 genetic principal components (Data-Field 22009) are also provided. Education (Data-Field 6138) is coded according to educational levels in the United Kingdom. These levels are converted to standardized education levels based on the International Standard Classification of Education framework, and estimated years of education is derived using the conversion strategy published by Okbay et al. (2016).
See ReleaseNotes.md for version updates.
Zhu, A.H., Salminen, L.E., Thompson, P.M., Jahanshad, N. "The UK Biobank Data Parser: a tool with built in and customizable filters for brain studies." (2019) Organization for Human Brain Mapping. Rome, Italy, June 9-13, 2019.
Please note that this parser uses the Click package, which sometimes has problems installing with Python 3.7. Any Python version <=3.6 will be compatible.
As this package has not been added to PyPI, please follow the instructions below carefully.
Two-step process:
-
To clone the repository:
git clone https://github.com/USC-IGC/ukbb_parser.git
-
After downloading or cloning (in the same directory that
git clone
was run in):pip install ./ukbb_parser
One-step process:
pip install git+https://github.com/USC-IGC/ukbb_parser.git
N.B. The above installation commands will require administrator/root privileges of the system. If you don't have administrator/root privileges, please try adding the --user
flag.
Example:
pip install --user ./ukbb_parser
ukbb_parser --help
ukbb_parser [subcommand] --help
ukbb_parser [subcommand] [--flags inputs]
The parse subcommand will filter the input CSV file with the given input parameters.
The output csv will contain a set of default columns: eid, age at visits, race, education levels, healthy control cohorts (see output HTML for more details), and sex.
Required Inputs:
-i, --incsv CSV
File path of the downloaded UK Biobank data csv-o, --out TEXT
Output prefix
Optional Inputs:
--incon ICD10Code
ICD10 Diagnosis codes you wish to include. Least common denominators are acceptable, e.g. --incon F. Ranges are also acceptable inputs, e.g., F10-F20, and can span across letters--excon ICD10Code
ICD10 Diagnosis codes you wish to exclude. Least common denominators are acceptable, e.g. --incon F. Ranges are also acceptable inputs, e.g., F10-F20, and can span across letters--insr SRCode
Self-Report codes you wish to include. Ranges are acceptable inputs.--exsr SRCode
Self-Report codes you wish to exclude. Ranges are acceptable inputs.--incat Category
Categories you wish to include. Ranges are acceptable inputs, e.g. 100-200--excat Category
Categories you wish to exclude. Ranges are acceptable inputs, e.g. 100-200--inhdr HeaderName
Columns names you wish to include. Ranges are acceptable inputs, e.g. 100-200 If you wish to have all the available data, please use --inhdr all--exhdr HeaderName
Columns names you wish to exclude. Ranges are acceptable inputs, e.g. 100-200--subjects SubID
A list of participant IDs to include--dropouts dropouts
CSV(s) containing eids of participants who have dropped out of the study--img_subs_only
Use this flag to only keep data of participants with an imaging visit.--img_visit_only
Use this flag to only keep data acquired during the imaging visit.--no_convert
Use this flag if you don't want the demographic conversions run--long_names
Use this flag to replace datafield numbers in column names with datafield titles--rcols
Use this flag if spreadsheet has columns names under R convention (e.g., "X31.0.0")--fillna TEXT
Use this flag to fill blank cells with the flag input, e.g., NA--chunksize INTEGER
Use this flag to set the processing chunk size, Default:1000
The incon/excon/insr/exsr/incat/excat/inhdr/exhdr/combine flags can all be used multiple times, e.g., --incat 110 --incat 134
.
Example usage:
ukbb_parser parse --incsv ukbb_spreadsheet.csv -o ukbb_subset --incat 100 --excat 135 --incon G35
The inventory subcommand will create binary columns indicating the presence of specified data. The data that can be inventoried are ICD-10 conditions, self-report (non-cancer) conditions, and careers.
As can be viewed on the UK Biobank data showcase, these data are organized in a hierarchical tree structure with five levels. At the end of each of the branches are selectable codes, i.e., codes that are used in and can be queried for in the UK Biobank data spreadsheet. By default, the inventory subcommand will aggregate all selectable codes within the requested level/code input into a single binary column. To obtain a column for the aggregate column and an additional column for each of the selectable codes within the requested level/code combination, please use --all_codes
. All selectable codes regardless of level can also be obtained using --level S --code all
.
Code inputs should be derived from the datafield-specific coding pages available on the UK Biobank showcase. If there are spaces in the codes, please use quotes around the entry (see example below). Multiple levels and codes can be used at once, but please ensure that levels and codes are provided in corresponding pairs.
N.B. If --long_names
has been used in the creation of an input spreadsheet for inventory, the code will not work.
The coding tree for the ICD-10 conditions can be viewed in part on the pages of either of the datafields (41202 and 41204) or on the coding page.
Notes
- For top level codes (
--level 0
), use the chapter number as the--code
input, e.g., "Chapter II" to indicate "Neoplasms". - For Level 1 (
--level 1
), add "Block" to the beginning of the--code
input, e.g., "Block A00-09" to indicate "Intestinal infectious diseases"
The coding tree for the self-report (non-cancer) conditions can be viewed on the datafield page or on the coding page.
Notes
- As can be seen on the coding page, the upper levels of the self-report conditions have non-unique codes (-1). Please use the "Meaning" as the
--code
input instead.
The coding tree for careers can be viewed on the datafield page 22617) or on the coding page. Please note that the coding on the coding page is only available to logged-in users.
Also note that this inventory binarization only produces one column per job code, i.e., a job code will be noted as present so long as it has been indicated in any of the visits/instances.
Inputs:
--incsv CSV
File path of downloaded UK Biobank CSV--outcsv CSV
File path to write out to--rcols
Use this flag if spreadsheet has columns names under R convention (e.g., "X31.0.0")--datatype type
Data to inventory; Valid choices include: icd10, self_report, careers--level level
Level to inventory by; N.B. Please input 0 for the Top level or S for selectable codes--code code
Codes to inventory; Use the option 'all' to inventory all categories in the given level; Please use level-appropriate codes; Ranges are allowed for selectable codes--all_codes
(optional) Use this flag if you'd like to additionally obtain individual inventories of all codes--chunksize INTEGER
Use this flag to set the processing chunk size, Default:1000
Example usage:
ukbb_parser inventory --incsv ukbb_spreadsheet.csv --outcsv jobs_inventory.csv --datatype careers --level 3 --code 6113
ukbb_parser inventory --incsv ukbb_spreadsheet.csv --outcsv jobs_inventory.csv --datatype careers --level 3 --code 6113 --level 4 --code 1151010 --all_codes
ukbb_parser inventory --incsv ukbb_spreadsheet.csv --outcsv icd10_inventory.csv --datatype icd10 --level 0 --code "Chapter V" --level 1 --code "Block G00-G09"
ukbb_parser inventory --incsv ukbb_spreadsheet.csv --outcsv jobs_inventory.csv --datatype self_report --level S --code all
The update subcommand will combine and/or update CSV files that were downloaded at different times (e.g., after a refresh).
Inputs:
--previous CSV
File path of previous downloaded UK Biobank CSV--new CSV
File path of new downloaded UK Biobank CSV or a CSV of processed results--outcsv CSV
File path to write newly updated CSV to
Example usage:
ukbb_parser update --previous ukbb_2018_spreadsheet.csv --new ukbb_2019_spreadsheet.csv --outcsv combined_ukbb_2018_2019_spreadsheet.csv
The check subcommand will determine if the queried Datafield or Category is included in the downloaded CSV file.
Inputs:
--incsv CSV
File path of downloaded UK Biobank CSV--datafield df
Datafield to check for--category cat
Category to check for
Example usage:
ukbb_parser check --incsv ukbb_spreadsheet.csv --datafield 31
ukbb_parser check --incsv ukbb_spreadsheet.csv --category 100
In Development
Zhu, A.H., Salminen, L.E., Thompson P.M., Jahanshad N. "The UK Biobank Data Parser: a tool with built in and customizable filters for brain studies." (2019) Organization for Human Brain Mapping. Rome, Italy, June 9-13, 2019.
Okbay, A., Beauchamp, J.P., Fontana, M.A., Lee, J.J., Pers, T.H., Rietveld, C.A., ... & Oskarsson, S. (2016). Genome-wide association study identifies 74 loci associated with educational attainment. Nature, 533(7604), 539.
This package is developed using the UK Biobank Resource under Application Number 11559.
Funding was provided in part by NIH R01-AG059874 (Jahanshad).