NielsenReader defines the classes RetailReader and PanelReader to facilitate easy processing of the Kilts Center's Nielsen IQ Data.
- RetailReader processes Retail Scanner Data
- PanelReader processes Consumer Panel Data
The main advantages of this package are:
- Speed. this is meant to be at least 10x faster (and can be more than 100x faster) than pd.read_csv() or pd.read_table()
- Supports partial reading (you can read in specific products, specific cities, stores, etc.) without having to load the entire file
- It understands the Kilts/Nielsen directory structure -- so you can just download from Kilts and go.
- It saves highly compressed (and fast) .parquet files
- Optional support for various fixes/updates issued by Kilts.
These classes are built on pyarrow 10.0.0
Apache Arrow which you will need to install.
To install with pip simply type:
pip install git+https://github.com/chrisconlon/kiltsnielsen
To install manually from Github (if your pip isn't setup correctly). After cloning, navigate to the correct directory:
git clone https://github.com/chrisconlon/kiltsnielsen.git
sudo python setup.py install
Make sure your python installation has the latest pyarrow (as well as pandas and NumPy)
Using pip:
pip install pyarrow
pip install --upgrade pyarrow
Using pip (requirements.txt):
python -m pip install -r requirements.txt
Using conda:
conda intall pyarrow
conda update pyarrow
Information about the data can be found at the Kilts Center's Website for the Nielsen Dataset.
Check with your institution to gain access to the data. Once you have gained access, download files as follows:
- Construct file extracts using the Kilts File Selection System. Note you must separately gain access to the File Selection System after applying for the data.
- The data are available in .tgz files. Data can be downloaded by group, module, and/or year
- Unzip the .tgz files
- Panelist data can be downloaded directly from Globus. The data are small enough for a typical work machine. The data are available in .tgz files
- Unzip the .tgz files.
Importantly, make sure all files are unzipped and preserved in the original Nielsen structure before using the methods provided here. (Do not rearrange the directory structure.)
This shows how we process the retail scanner data for Backus Conlon Sinkinson (2021)
import pandas as pd
from kiltsreader import RetailReader
from pathlib import Path
# Read these from your nielsen directory -- change this
cereal_dir = Path.cwd()
# save your ouptut here -- change this
save_dir = Path.cwd()
# Specify which dmas /modules /columns to keep
dmas = [506, 517, 556, 602, 751]
cereal = 1344
# supermarkets stores only (no superstores/pharmacies/etc)
channels =['F']
# read cereal -- auto-write
rr = RetailReader(cereal_dir)
rr.filter_years(drop=[2006, 2019])
rr.read_stores()
rr.filter_stores(keep_dmas=dmas, keep_channels=channels)
rr.read_products(keep_modules=[cereal])
rr.read_sales()
# we can access the underlying data from the rr object
# add the dma (city) and retailer_code (chain) info directly to sales data
# then save the data in chunks by dma_code (city)
rr.df_sales = pd.merge(rr.df_sales,
rr.df_stores[['store_code_uc','panel_year','dma_code']],
on=['store_code_uc','panel_year'])
rr.write_data(save_dir, stub="cereal", as_table=True, separator='dma_code')
from kiltsreader import RetailReader, PanelReader
- Locate your Nielsen Retail Scanner and Consumer Panel data separately
- Open
Example.py
- Replace
dir_retail
anddir_panel
with the locations of your Retail Scanner and Consumer Panel Data respectively - Replace the
KEEP_GROUPS
,KEEP_MODULES
,DROP_YEARS
,KEEP_YEARS
,KEEP_STATES
, andKEEP_CHANNEL
with your relevant selection - Run
Example.py
to verify the code works.
class RetailReader(dir_read=path.Path.cwd(), verbose=True)
RetailReader defines the class object used to read in the Nielsen Retail Scanner Data (see above)
- dir_read(pathlib Path object, optional): points to the location of the Retail Scanner Files. Should be named
nielsen_extract
or something similar, containing the subfolderRMS
. Default is the current working directory. - verbose(bool): if
True
, prints updates after processing files. Displays size of files processed. Default is True.
filter_years()
: Selects years for which to process annual scanner data files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.read_products()
: Reads in the set of product characteristics (and filters them).read_stores()
: Reads in the full stores list.filter_stores()
: filter stores by State, DMA, and channel-type. You MUST runread_stores()
first.read_rms()
: the RMS data contains theupc_ver_uc
information for re-used UPCs.read_extra()
: PopulatesRetailReader.df_variations
with brand variations data, typically located inMasterFiles/Latest/brand_variations.tsv
. Lists brand codes, descriptions, and any alternative descriptions.read_sales()
: this reads in the majority of the scanner data.write_data()
: after reading in the data, this writes the tables as.parquet
files.
df_products
(pandas DataFrame): default empty, stores products data after processingdf_sales
(pandas DataFrame): default empty, stores sales data after processingdf_stores
(pandas DataFrame): default empty, stores store data after processingdf_rms
(pandas DataFrame): default empty, stores RMS versions data after processingdf_extra
(pandas DataFrame): default empty, stores extra product data after processingall_years
(list): list of years for which data will be processedfiles_product
(pathlib Path object): stores name of product characteristic filefiles_rms
(list of pathlib Path objects): stores names of annual rms_versions filesfiles_stores
(list of pathlib Path objects): stores names of annual stores filesfiles_extra
(list of pathlib Path objects): stores names of annual extra filesfiles_sales
(list of pathlib Path objects)stores names of annual sales filesall_years
(list): list of years included in data. Updates with filtering
Functions also described with docstrings in the NielsenReader.py
file
RetailReader.filter_years(keep=None, drop=None):
Selects years for which to process annual sales files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data. Populates RetailReader.all_years
- keep(list of integers, optional): list of years to keep, e.g.
range(2004, 2013)
. Can only include years that are already present in the data, e.g. specifyingkeep=[1999]
will result in an empty set of years - drop(list of integers, optional): list of years to remove, e.g. [2006, 2009, 2013]
RetailReader.filter_sales(keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None): Selects product groups (outer category) and product modules (inner category) for which to process annual sales files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.
- keep_groups(list of integers, optional): list of product groups to keep, e.g.
keep_groups=[1508, 1048]
- drop_groups(list of integers, optional): list of product groups to exclude, e.g.
drop_groups = [1046]
. Takes precedence if there is any overlap with keep_groups - keep_groups(list of integers, optional): list of product modules to keep, e.g.
keep_modules=[1481, 1482]
- drop_groups(list of integers, optional): list of product modules to exclude, e.g.
drop_groups = [1483]
. Takes precedence if there is any overlap with keep_module
RetailReader.read_rms():
Populates RetailReader.df_rms
Processes the annual RMS versions files, which map reused UPCs to the appropriate version based on year
RetailReader.read_products(upc_list=None, keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None):
Populates RetailReader.df_products
.
Reads in the set of product characteristics, typically located in Master_Files/Latest/products.tsv
. Optionally elects product groups (outer category) and product modules (inner category) for which to process annual sales files.
Note that the function does NOT carry over the filtered set of groups and modules from RetailReader.filter_sales()
. The RetailReader.read_products()
function is redundant with the PanelReader.read_products()
function, and therefore allows the user to read in the full set of products and their characteristics even while reading only a subset of sales.
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required) - keep_groups(list of integers, optional): list of product groups to keep, e.g.
keep_groups=[1508, 1048]
- drop_groups(list of integers, optional): list of product groups to exclude, e.g.
drop_groups = [1046]
. Takes precedence if there is any overlap with keep_groups - keep_groups(list of integers, optional): list of product modules to keep, e.g.
keep_modules=[1481, 1482]
- drop_groups(list of integers, optional): list of product modules to exclude, e.g.
drop_groups = [1483]
. Takes precedence if there is any overlap with keep_module
RetailReader.read_extra(years=None, upc_list=None)
Populates RetailReader.df_extra
Selects annual Products Extra files for all post-filtering years. Redundant with PanelReader.read_extra()
Product group and module filtering are not possible.
Note that UPCs may be repeated for different years. Use RMS versions to select appropriate years. Differences between multiple years for a single UPC may be due not to actual product changes but rather due to Nielsen filling in previously missing data. See Nielsen documentation for an in-depth description
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required) - years(list of integers, optional): selects years for which to. Note that previous year-filtering of the RetailReader object will carry over unless a new set of years is specified.
RetailReader.read_stores():
Populates RetailReader.df_stores
Stores files are common to all product groups and modules, so processing will be unaffected by RetailReader.filter_years
or RetailReader.filter_sales
RetailReader.filter_stores(keep_dma=None, drop_dma=None, keep_states=None, drop_states=None, keep_channel=None, drop_dma=None):
Updates RetailReader.df_stores
- keep_dmas(list of integers, optional): list of DMAs (Designated Market Areas) to keep, e.g.
keep_dma=[801, 503]
- drop_dmas(list of integers, optional): list of DMAs (Designated Market Areas) to exclude, e.g.
drop_dma=[602]
. Takes precedence if there is any overlap with keep_dma. - keep_states(list of strings, optional): list of states to keep, with list in two-character format, e.g.
keep_states=['TX', 'CA']
- drop_states(list of integers, optional): list of states to exclude, with list in two-character format, e.g.
drop_states=['NJ', 'NY']
Takes precedence if there is any overlap with keep_states. - keep_channels(list of characters, optional): list of channels (store types) to keep, e.g.
keep_channels=['F', 'G']
. See Nielsen documentation for explanation of channels and list of options. - drop_channels(list of characters, optional): list of channels (store types) to exclude, e.g.
drop_channels = ['C']
. Takes precedence if there is any overlap with keep_channels. See Nielsen documentation for explanation of channels and list of options.
Note: Must be run AFTER RetailReader.read_stores()
. Pre-filtering is not possible.
RetailReader.read_sales(incl_promo = True): This is the main function to read scanner data.
- Populates
RetailReader.df_sales
- Reads in the weekly, store x upc level sales data, post-filter if any have been applied.
- Uses pyarrow methods to filter and read the data to minimize memory and time use.
- Warning: May still require large amounts of memory/CPU.
- incl_promo(boolean, optional): Setting to
False
skips the promo and display fields.
RetailReader.write_data(dir_write = path.Path.cwd(), stub = 'out', compr = 'brotli', as_table = False, separator = 'panel_year') Writes the pandas DataFrames of the RetailReader class to parquet format (see class description abvove).
- dir_write(pathlib Path object, optional): folder within which to write the parquets. Defalt is current working directory.
- stub(str): initial string to name all files. Files will be named 'stub'_'[file type].parquet', e.g. 'out_stores.parquet'
- compr(str): type of compression used for generating parquets. Default is brotli
- as_table(bool): whether to write as pyarrow separated row-tables. See
Example.py
for instance of how to write and read row-groups. Requires a separator to generate rows for the row-tables. Useful if you seek to preserve space when reading in files by using only one row-group at a time. - separator(column name): variable on which to separate row-groups when saving as a pyarrow table. Note that [for now] the separator must be common to all files. Default is
panel_year
. If separator is not present in the file, it cannot be saved as a pyarrow table. If you are looking to save just a single file with a specific separator, modify the following snippet:
RetailReader.read_{FILE_TYPE}()
RR.write_data(dir_write, separator = {VARIABLE_NAME})
RetailReader.get_module(file_sales=): returns product module (inner category) corresponding to particular sales file.
- file_sales (pathlib Path object): Retail Scanner sales file, e.g. 1046_2006.tsv
RetailReader.get_group(file_sales=): returns product group (outer category) corresponding to particular sales file.
- file_sales (pathlib Path object): Retail Scanner sales file, e.g. 1046_2006.tsv
class PanelReader(dir_read=path.Path.cwd(), verbose=True) PanelReader defines the class object used to read in the Nielsen Consumer Panel data.
- dir_read(pathlib Path object, optional): points to the location of the Consumer Panel data files. Likely named
Panel
or something similar. Subfolders should be years. Default is the current working directory. - verbose(bool): if
True
, prints updates after processing files. Displayes size of files processed. Default is true.
filter_years()
: Selects years for which to process annual panelist, purchase, trips, and extra files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data.read_retailers()
: Reads the retailers file, which list retailer codes and channels (and filters them).read_products()
: Reads in the set of product characteristics (and filters them).read_extra()
: Selects annual Products Extra files for all post-filtering years.read_variations()
: PopulatesPanelReader.df_variations
with brand variations data, typically located inMasterFiles/Latest/brand_variations.tsv
. Lists brand codes, descriptions, and any alternative descriptions.read_year()
: this does most of the work, and reads in panelists, trips, purchases for a single yearread_annual()
: this is the main function repeatedly callsread_year()
to read in panelists, trips, purchases for multiple yearswrite_data()
: after reading in the data, this writes the tables as.parquet
filesread_revised_panelists()
: Corrects the Panelist data using errata provided by Nielsen for issues not yet incorporated into the data as of October 2021. Must have already calledPanelReader.read_annual()
,PanelReader.read_products()
,PanelReader.read_variations()
,PanelReader.read_retailers()
process_open_issues()
: Corrects the product extra and panelist data using errata provided by Nielsen for two specific issues:ExtraAttributes_FlavorCode
: adds missing flavor code and flavor description to 2010 products extra characteristics filePanelist_maleHeadBirth_femaleHeadBirth
: corrects issue with male head of household birth month
df_products
(pandas DataFrame): default empty, stores products data after from Master Files processingdf_variations
(pandas DataFrame): default empty, stores brand_variations data from Master Files after processingdf_retailers
(pandas DataFrame): default empty, stores retailers data from Master Files after processingdf_trips
(pandas DataFrame): default empty, stores annual trips data after processingdf_panelists
(pandas DataFrame): default empty, stores annual panelists data after processingdf_purchases
(pandas DataFrame): default empty, stores annual purchases after processingdf_extra
(pandas DataFrame): default empty, stores annual products extra characteristics data after processingfiles_annual
(list of pathlib Path objects)files_product
(list of pathlib Path objects): stores name of unrevised products filefiles_variation
(list of pathlib Path objects): stores name of unrevised brand variations filefiles_retailers
(list of pathlib Path objects): stores name of unrevised retailers filefiles_trips
(list of pathlib Path objects): stores names of annual tripes filesfiles_panelists
(list of pathlib Path objects): stores names of annual panelists filesfiles_purchases
(list of pathlib Path objects): stores names of annual purchases filesfiles_extra
(list of pathlib Path objects): stores names of annual products extra filesall_years
(list): list of years included in data. Updates with filtering
PanelReader.filter_years(keep=None, drop=None)
Selects years for which to process annual panelist, purchase, trips, and extra files. Used in pre-processing to limit required memory if desired; otherwise later functions will process all available data. Updates PanelReader.all_years
- keep(list of integers, optional): list of years to keep, e.g. range(2004, 2013). Can only include years that are already present in the data, e.g. specifying
keep=[1999]
will result in an empty set of years - drop(list of integers, optional): list of years to remove, e.g. [2006, 2009, 2013]
PanelReader.read_retailers():
Populates PanelReader.df_retailers
Processes the Master retailers file, which list retailer codes and channels.
Note that the file may be later revised following a call to PanelReader.read_revised_panelists()
or PanelReader.process_open_issues()
PanelReader.read_products(upc_list=None, keep_groups=None, drop_groups=None, keep_modules=None, drop_modules=None)
Populates PanelReader.df_products
, which should be identical to RetailReader.df_products
following a call to RetailReader.read_products()
.
Reads in the set of product characteristics, typically located in Master_Files/Latest/products.tsv
. Optionally elects product groups (outer category) and product modules (inner category) for which to process annual sales files.
Note that the file may be later revised following a call to PanelReader.read_revised_panelists()
or PanelReader.process_open_issues()
. Such updating is only possible through PanelReader
; the Retail Scanner files do not contain any revisions.
- upc_list(list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required) - keep_groups(list of integers, optional): list of product groups to keep, e.g.
keep_groups=[1508, 1048]
- drop_groups(list of integers, optional): list of product groups to exclude, e.g.
drop_groups = [1046]
. Takes precedence if there is any overlap with keep_groups - keep_groups(list of integers, optional): list of product modules to keep, e.g.
keep_modules=[1481, 1482]
- drop_groups(list of integers, optional): list of product modules to exclude, e.g.
drop_groups = [1483]
. Takes precedence if there is any overlap with keep_module
PanelReader.read_extra(years=None, upc_list=None)
Populates PanelReader.df_extra
Selects annual Products Extra files for all post-filtering years. Redundant with RetailReader.read_extra(). Does not have any revisions as of October 2021.
Product group and module filtering are not possible; only UPC and year filtering are available.
Note that UPCs may be repeated for different years. Use RMS versions to select appropriate years. Differences between multiple years for a single UPC may be due not to actual product changes but rather due to Nielsen filling in previously missing data. See Nielsen documentation for an in-depth description.
- years (list of integers, optional): selects years for which to. Note that previous year-filtering of the PanelReader object will carry over unless a new set of years is specified
- upc_list (list of integers, optional): list of Universal Product Codes to keep, e.g.
upc_list=[002111039080, 009017445929]
(leading zeros not required)
PanelReader.read_variations()
Populates PanelReader.df_variations
with brand variations data, typically located in MasterFiles/Latest/brand_variations.tsv
.
Lists brand codes, descriptions, and any alternative descriptions.
PanelReader.read_year(year, keep_dmas=None, drop_dmas=None, keep_states=None, drop_states=None,)
Populates PanelReader.df_panelists
, PanelReader.df_purchases
, and PanelReader.df_trips
Processes a single year of annual data (panelists, purchases, and trips data). Useful if you seek to only process one year at a time; otherwise use read_annual
as described below
- year (int): single year to process
- keep_dmas (list of integers): list of DMAs (Designated Market Areas) to keep, e.g.
keep_dma=[801, 503]
- drop_dmas (list of integers): list of DMAs (Designated Market Areas) to exclude, e.g.
drop_dma=[602]
. Takes precedence if there is any overlap with keep_dma. - keep_states (list of strings): list of states to keep, with list in two-character format, e.g.
keep_states=['TX', 'CA']
- drop_states (list of strings): list of states to exclude, with list in two-character format, e.g.
drop_states=['NJ', 'NY']
Takes precedence if there is any overlap with keep_states.
PanelReader.read_annual(keep_states=None, drop_states=None, keep_dmas=None, drop_dmas=None)
Processes all years (post-PanelReader.filter_years()
) with repeated calls to PanelReader.read_year()
- keep_dmas (list of integers): list of DMAs (Designated Market Areas) to keep, e.g.
keep_dma=[801, 503]
- drop_dmas (list of integers): list of DMAs (Designated Market Areas) to exclude, e.g.
drop_dma=[602]
. Takes precedence if there is any overlap with keep_dma. - keep_states (list of strings): list of states to keep, with list in two-character format, e.g.
keep_states=['TX', 'CA']
- drop_states (list of strings): list of states to exclude, with list in two-character format, e.g.
drop_states=['NJ', 'NY']
Takes precedence if there is any overlap with keep_states.
PanelReader.write_data(dir_write = path.Path.cwd(), stub = 'out', compr = 'brotli', as_table = False, separator = 'panel_year') Writes the pandas DataFrames of the RetailReader class to parquet format (see class description abvove).
- dir_write(pathlib Path object, optional): folder within which to write the parquets. Defalt is current working directory.
- stub(str): initial string to name all files. Files will be named 'stub'_'[file type].parquet', e.g. 'out_stores.parquet'
- compr(str): type of compression used for generating parquets. Default is brotli
- as_table(bool): whether to write as pyarrow separated row-tables. See
Example.py
for instance of how to write and read row-groups. Requires a separator to generate rows for the row-tables. Useful if you seek to preserve space when reading in files by using only one row-group at a time. - separator(column name): variable on which to separate row-groups when saving as a pyarrow table. Note that [for now] the separator must be common to all files. Default is
panel_year
. If separator is not present in the file, it cannot be saved as a pyarrow table. If you are looking to save just a single file with a specific separator, modify the following snippet:
RetailReader.read_{FILE_TYPE}()
RR.write_data(dir_write, separator = {VARIABLE_NAME})
PanelReader.read_revised_panelists()
Updates PanelReader.df_products
, PanelReader.df_variations
, PanelReader.df_retailers
, and PanelReader.df_panelists
Corrects the Panelist data using errata provided by Nielsen for issues not yet incorporated into the data as of October 2021. Must have already called PanelReader.read_annual()
, PanelReader.read_products()
, PanelReader.read_variations()
, PanelReader.read_retailers()
PanelReader.process_open_issues()
Updates PanelReader.df_panelists
and PanelReader.df_extra
Corrects the product extra and panelist data using errata provided by Nielsen for two specific issues:
ExtraAttributes_FlavorCode
: adds missing flavor code and flavor description to 2010 products extra characteristics filePanelist_maleHeadBirth_femaleHeadBirth
: corrects issue with male head of household birth month