- 1. Stata to Pandas Cross-Walk
- 2. Introduction
- 3. Pending & Unresolved Issues
- 4. Data Management
- 5. Exporting Pandas data to Stata
- 6. Also useful
- 7. Questions, Comments, & Contributions
This repo uses Stata example data sets to crosswalk a variety of Stata-Pandas equivalent code.
- auto2.dta, available at http://www.stata-press.com/data/r15/auto2.dta
- hbp2.dta, available at http://www.stata-press.com/data/r15/hbp2.dta
- autoexpense.dta, available at http://www.stata-press.com/data/r15/autoexpense.dta
- autosize.dta, available at http://www.stata-press.com/data/r15/autosize.dta
For pandas, this crosswalk assumes the following import statements:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Additional optional setting
pd.set_option('display.max_rows', 8)
- If anyone knows how to make multi-line code blocks with syntax highlighting inside a markdown table, let me know.
- At least one programmer has provided the option to execute Stata commands from Python.
- Also helpful is the Jupyter Notebooks Kernel.
- A highlighted section below, Exporting Pandas data to Stata, is for those having trouble with that task.
Description | Stata Code | Pandas Code |
---|---|---|
Load example data | use http://www.stata-press.com/data/r15/auto2.dta | exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta') |
Display data | list | exfile |
Display subset of observations | list in 1/10 or list in 10/20 | exfile.head(n=10) or exfile[9:20] |
Display subset of variables | list make price mpg trunk | exfile[['make','price','mpg','trunk']] |
Display subset of obs & vars | list make price mpg trunk in 1/10 | exfile[['make','price','mpg','trunk']].head(n=10) |
Display specific observation | list make price mpg trunk if _n == 10 | The most Stata-like is exfile.iloc[9]; exfile.loc is also available when a label-based index exists. |
Display observation(s) by logic | list make price mpg trunk if mpg > 30 | exfile[exfile['mpg'] > 30] |
List variable names and/or get variable information | desc or describe | exfile.dtypes and exfile.describe() or for var in exfile.columns: print(var) |
Generate new text variable | gen newtxt = "Some text here" | exfile['newtxt'] = 'Some text here' |
Change text variable value | replace newtxt = "Newer text that is really looooooong." | exfile['newtxt'] = 'Newer text that is really looooooong.' |
Replace or change text variable based on existing variable | replace newtxt = substr(newtxt, 1, 10) or gen newest = substr(newtxt, 1, 10) | exfile['newtxt'] = exfile['newtxt'].str.slice(0,10) or exfile['newest'] = exfile['newtxt'].str.slice(0,10) or exfile['newest'] = exfile['newtxt'].apply(lambda x: x[0:10]) |
Split text variables | split make, parse(" ") | exfile['ms'] = exfile['make'].str.split(' ') * |
Replace text in variables | replace make = subinstr(make," ","-",.) | exfile['make'] = exfile['make'].str.replace(' ', '-') |
Replace or change value based on existing values | replace headroom = 9.0 if headroom == 3.0 | exfile['headroom'] = exfile['headroom'].replace(3.0,9.0) or exfile['foreign'] = exfile['foreign'].replace('Domestic','USA') |
Generate new int variable | gen newnum = 10 | exfile['newnum'] = 10 |
Change int variable value | replace newnum = 222 | exfile['newnum'] = 222 |
Generate dummy based on text variable | gen fgn = (foreign == "Foreign") | exfile['fgn'] = np.where(exfile['foreign']=='Foreign', 1, 0) or exfile['fgn'] = exfile['foreign'].apply(lambda x: 1 if x == 'Foreign' else 0) |
Generate new variable relative to other variable | gen rtd_disp = displacement / 10 or gen mpg2 = mpg * mpg or gen lprice = ln(price) | exfile['rtd_disp'] = exfile['displacement'] / 10 or exfile['mpg2'] = exfile['mpg'] * exfile['mpg'] or exfile['lprice'] = np.log(exfile['price']) or exfile['mpg2'] = [i * i for i in exfile['mpg']] or exfile['mpg2'] = exfile['mpg'].apply(lambda x: x * x) or exfile['mpg_weight'] = exfile.apply(lambda row: row['mpg']+row['weight'],axis=1) |
Generate new variable equal to _n or index | gen sorter = _n | exfile['sorter'] = np.arange(0,len(exfile.index)) |
Rename a variable | rename mpg milespg | Two steps: exfile['milespg'] = exfile['mpg'] del exfile['mpg'] One step: exfile.rename(columns = {'mpg':'milespg'}, inplace=True) |
Delete variable(s) | drop newtxt newnum | exfile = exfile.drop(columns=['newtxt','newnum']) or exfile = exfile.drop(['newtxt','newnum'], axis=1) or exfile.drop(columns="newtxt", inplace = True) or del exfile['newtxt'] del exfile['newnum'] |
Keep variable(s) | keep make price mpg | exfile = exfile[['make','price','mpg']] |
Delete specific observation | drop if _n == 10 | exfile = exfile.drop(9) |
Delete observation(s) by logic | drop if mpg > 30 or keep if mpg < 31 | exfile = exfile[exfile['mpg'] < 31] or exfile = exfile[~(exfile['mpg'] > 30)] |
Sorting by a variable ascending | sort price | exfile = exfile.sort_values(by=['price']) |
Sorting by a variable descending | gsort -price | exfile = exfile.sort_values(by=['price'], ascending=False) |
Display summary statistics (specific variables) | sum price mpg weight | exfile[['price','mpg','weight']].describe() or exfile[['price','mpg','weight']].describe().T |
Convert numbers to strings | tostring price, gen(pricestr) | exfile['price_str'] = exfile['price'].astype(str) |
Destring strings | destring pricestr, gen(pricenum) | exfile['price_num'] = exfile['price_str'].astype(int) |
- Splits behave differently in Stata & Pandas. More development needed here.
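- A note on destringing: the astype(int) approach in the table raises an error if any value fails to parse. A more forgiving sketch uses pd.to_numeric (the column names here are illustrative, not from the example data):

```python
import pandas as pd

# Hypothetical string column with one unparseable value.
df = pd.DataFrame({'price_str': ['4099', '4749', 'n/a']})

# errors='coerce' turns unparseable strings into NaN rather than raising,
# similar in spirit to Stata's destring with the force option.
df['price_num'] = pd.to_numeric(df['price_str'], errors='coerce')
```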
A function that Stata provides quickly and simply, count, is not as cleanly implemented in Pandas.
set more off
clear all
use http://www.stata-press.com/data/r15/auto2.dta
count if foreign == 1
# Counting a categorical. Map the category you want to count into a series.
# Then use `series.count()` to get non-NaN instances.
exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
exfile['foreign'].map({1:1}).count()
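A simpler alternative worth knowing is to sum a boolean mask. Note that pd.read_stata applies value labels by default, so the comparison may need the label text rather than the underlying numeric code; the mock data below (standing in for auto2.dta) assumes labels have been applied:

```python
import pandas as pd

# Mock data standing in for auto2.dta's labeled foreign variable
# (pd.read_stata applies value labels by default, giving strings here).
exfile = pd.DataFrame({'foreign': ['Domestic', 'Foreign', 'Foreign', 'Domestic']})

# Summing a boolean mask replicates Stata's `count if`.
n_foreign = (exfile['foreign'] == 'Foreign').sum()
```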
Description | Stata Code | Pandas Code |
---|---|---|
Load example data | use http://www.stata-press.com/data/r15/hbp2.dta | exfile = pd.read_stata('http://www.stata-press.com/data/r15/hbp2.dta') |
One-way tabulation | tab year or tab race | exfile['year'].value_counts() or exfile['race'].value_counts() |
Two-way tabulation | tab year race | pd.crosstab(exfile['year'], exfile['race']) |
Two-way tabulation with row option that normalizes by row | tab year race, row | pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=1) Also consider pd.pivot_table() Crosstab Or Pivot Table (In Pandas) Deciding When to Use Which |
Two-way tabulation with col option that normalizes by column | tab year race, col | pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=0) |
Three-way tabulation | table year race sex | pd.crosstab(exfile['year'], [exfile['sex'], exfile['race']]) |
Encode a categorical (that was originally string) | encode sex, gen(sex_cat) | exfile['sex_cat'] = exfile['sex'].astype('category') then exfile['sex_cat_code'] = exfile['sex_cat'].cat.codes |
Create an array of dummies from categorical | tab sex, gen(sex_) | exfile = pd.get_dummies(exfile, columns=['sex']) |
Collapse an array of dummies back to a single categorical | StataList Post: https://www.statalist.org/forums/forum/general-stata-discussion/general/1413286-convert-dummy-variables-into-a-categorical-variable?p=1413305#post1413305 | df[['sex_','sex_female','sex_male']].idxmax(axis='columns') Also, SOF Post: https://stackoverflow.com/a/51275990/9572143 |
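A tiny self-contained round trip may make the dummy expand-and-collapse pattern clearer. This is a sketch using mock data; the sex_Female and sex_Male column names simply follow get_dummies' default prefix convention:

```python
import pandas as pd

# Mock data standing in for a string categorical.
df = pd.DataFrame({'sex': ['Male', 'Female', 'Female']})

# Explode the categorical into dummies, like Stata's `tab sex, gen(sex_)`.
dummies = pd.get_dummies(df, columns=['sex'])

# Collapse back: idxmax returns the column name holding the max (the 1)
# in each row, then the prefix is stripped off.
recovered = dummies[['sex_Female', 'sex_Male']].idxmax(axis='columns')
recovered = recovered.str.replace('sex_', '', regex=False)
```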
Practice tip for those transitioning from Stata to Python: where Stata lets you reference rows and columns with the very human-readable optional arguments row and col, Python wants an axis number. To make Python code easier to read, one option is to declare row and col variables. An example that builds on the two-way tabulation examples above:
>>> import pandas as pd
>>> exfile = pd.read_stata('http://www.stata-press.com/data/r15/hbp2.dta')
>>> row = 1; col = 0
>>>
>>> # Now use the row and col variables instead of an axis index.
>>> pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=row)
race White Black Hispanic
year
1988 0.104167 0.812500 0.083333
1989 0.077670 0.796117 0.126214
1990 0.195652 0.686957 0.117391
1991 0.207547 0.664151 0.128302
1992 0.200873 0.620087 0.179039
1993 0.147410 0.701195 0.151394
>>> pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=col)
race White Black Hispanic
year
1988 0.025510 0.050453 0.025478
1989 0.040816 0.106080 0.082803
1990 0.229592 0.204398 0.171975
1991 0.280612 0.227684 0.216561
1992 0.234694 0.183700 0.261146
1993 0.188776 0.227684 0.242038
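An alternative to declaring row and col variables: pandas also accepts the string names 'index' and 'columns' for the axis argument, which can read more naturally than a bare number. Mock data is used here so the snippet stands alone:

```python
import pandas as pd

# Mock data standing in for hbp2.dta (values illustrative).
exfile = pd.DataFrame({'year': [1988, 1988, 1989, 1989],
                       'race': ['White', 'Black', 'White', 'Black']})

# axis='columns' is the string spelling of axis=1: the lambda receives each
# row, so dividing by r.sum() normalizes every row to total 1.
tab = pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r / r.sum(),
                                                        axis='columns')
```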
Description | Stata Code | Pandas Code |
---|---|---|
Load example data | use http://www.stata-press.com/data/r15/autoexpense.dta and use http://www.stata-press.com/data/r15/autosize.dta | autoexp = pd.read_stata('http://www.stata-press.com/data/r15/autoexpense.dta') and autosiz = pd.read_stata('http://www.stata-press.com/data/r15/autosize.dta') |
Merge autoexpense autosize (using make as the key variable) | After loading autosize.dta merge 1:1 make using http://www.stata-press.com/data/r15/autoexpense.dta | pd.merge(autoexp,autosiz, on='make', how='outer') |
Mismatched defaults. By default, Stata performs what Pandas would refer to as an outer merge, meaning "use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically." cite. More simply, the result will include all records from both datasets.
The default in Pandas is an inner merge, which means "use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys." cite. Again more simply, the result will only include records that matched in both datasets.
For Stata users looking to replicate Stata's behavior in a merge operation, it is necessary to specify the how='outer' argument in the pd.merge() call.
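A minimal sketch of the difference, using mock data (the makes and values below are illustrative, not the actual file contents):

```python
import pandas as pd

# Mock frames loosely echoing the auto data (values illustrative).
autoexp = pd.DataFrame({'make': ['AMC Concord', 'AMC Pacer', 'Audi 5000'],
                        'price': [4099, 4749, 9690]})
autosiz = pd.DataFrame({'make': ['AMC Concord', 'AMC Pacer', 'BMW 320i'],
                        'weight': [2930, 3350, 2650]})

# Pandas default (inner): only the two matched makes survive.
inner = pd.merge(autoexp, autosiz, on='make')

# Stata-like default (outer): all four makes appear, with NaN where unmatched.
outer = pd.merge(autoexp, autosiz, on='make', how='outer')
```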
The missing output. By default Stata performs the merge operation while also adding a variable called _merge, which indicates for each observation whether that observation came from the master dataset (equivalent to the left side), the using dataset (equivalent to the right side), or both datasets. Additionally, Stata provides output which can help verify the merge operation was successful. To replicate Stata's behavior in Pandas, users can add the indicator=True argument to the pd.merge() statement. This indicator argument adds a variable that Pandas also calls _merge, which can then be tabulated or cross-tabulated with other variables to assess merge results. A full example, using the above:
autoexp = pd.read_stata('http://www.stata-press.com/data/r15/autoexpense.dta')
autosiz = pd.read_stata('http://www.stata-press.com/data/r15/autosize.dta')
df = pd.merge(autoexp,autosiz, on='make', how='outer', indicator=True)
df['_merge'].value_counts()
both 5
right_only 1
left_only 0
Name: _merge, dtype: int64
Another example using mock data:
import random
data1 = {'var1':[1,2,3,4,5],
'dat1':[random.randrange(10,99,1) for i in range(5)]}
data2 = {'var1':[2,3,4,5,6],
'dat2':[random.randrange(10,99,1) for i in range(5)]}
df = pd.merge(pd.DataFrame(data1), pd.DataFrame(data2), on='var1',
how='outer', indicator=True)
df['_merge'].value_counts()
both 4
right_only 1
left_only 1
Name: _merge, dtype: int64
Description | Stata Code | Pandas Code |
---|---|---|
Load example data | use http://www.stata-press.com/data/r15/capop.dta and use http://www.stata-press.com/data/r15/txpop.dta | capop = pd.read_stata('http://www.stata-press.com/data/r15/capop.dta') and txpop = pd.read_stata('http://www.stata-press.com/data/r15/txpop.dta') |
Append CA population with TX population | After loading txpop.dta append using http://www.stata-press.com/data/r15/capop.dta | pd.concat([capop,txpop]) or pd.concat([capop,txpop]).reset_index() |
Append and mark sources | append using http://www.stata-press.com/data/r15/capop.dta, generate(source) | pd.concat([capop,txpop],keys=['ca','tx']) |
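To get something closer to Stata's generate(source) output, a flat column rather than a hierarchical index, one option is to tag each frame before concatenating. This is a sketch with mock data; the column name source is illustrative:

```python
import pandas as pd

# Mock population frames (values illustrative).
capop = pd.DataFrame({'pop': [100, 200]})
txpop = pd.DataFrame({'pop': [300, 400]})

# Tagging each frame before concatenating yields a flat source column,
# closer to Stata's append ..., generate(source) than hierarchical keys.
combined = pd.concat([capop.assign(source='ca'),
                      txpop.assign(source='tx')],
                     ignore_index=True)
```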
Description | Stata Code | Pandas Code |
---|---|---|
Load example data | use http://www.stata-press.com/data/r15/reshape1.dta | exfile = pd.read_stata('http://www.stata-press.com/data/r15/reshape1.dta') |
Reshape from wide to long | reshape long inc ue, i(id) j(year) | exfile = pd.wide_to_long(exfile, stubnames=['inc','ue'], i=['id','sex'], j='year') for a more Stata-like result, then: exfile = exfile.reset_index(0).reset_index(0).reset_index(0) |
Reshape long to wide | reshape wide inc ue, i(id) j(year) | Quick version: exfile2 = exfile.pivot_table(values=['sex','inc','ue'], columns='year', index='id') |
While Pandas provides a wide_to_long function, it does not provide a long_to_wide counterpart. Below is code that will produce a long-to-wide reshape more consistent with Stata's results.
# Load example data that is in long format.
exfile = pd.read_stata('http://www.stata-press.com/data/r15/reshape6.dta')
# Prepare wide dataframes for each variable that changes over j.
exfile['inc_idx'] = 'inc' + exfile.year.astype(str)
inc = exfile.pivot(index='id',columns='inc_idx',values='inc')
exfile['ue_idx'] = 'ue' + exfile.year.astype(str)
ue = exfile.pivot(index='id',columns='ue_idx',values='ue')
# Concatenate / Append individual wide datasets.
exfile2 = pd.concat([inc,ue],axis=1).reset_index()
# Gather values for variables that do not change over j.
exfile_sex = pd.DataFrame(exfile[['id','sex']])
exfile_sex = exfile_sex.pivot_table(index='id', values='sex').reset_index()
# Merge variables that do not change over j.
exfile3 = pd.merge(exfile_sex, exfile2, on='id')
exfile3
Another example from Dean McGrath who writes Pivoting A Pandas DataFrame at Towards Data Science. Version with adjustments:
import pandas as pd
# Example Long Data
df = pd.DataFrame(data={'staff_no': [9999] * 5,
'name': ['Dean'] * 5,
'year': [2016, 2017, 2018, 2019, 2020],
'hours': [349, 231, 876, 679, 976]})
# Convert Long To Wide
df = df.pivot(index='name', columns='year', values='hours').reset_index()
A wide to long example based on an article from Soner Yildirim who writes 8 Ways to Transform Pandas Dataframes at Towards Data Science.
import pandas as pd
# Example Long Data
df = pd.DataFrame(data={'staff_no': [9999] * 5,
'name': ['Dean'] * 5,
'year': [2016, 2017, 2018, 2019, 2020],
'hours': [349, 231, 876, 679, 976]})
# Convert Long To Wide
df = df.pivot(index='name', columns='year', values='hours').reset_index()
# Convert Wide To Long
pd.melt(df, id_vars='name').head()
Stata Version
use http://www.stata-press.com/data/r15/auto2.dta
foreach var in price mpg weight length {
    sum `var'
    gen z`var' = (`var' - r(mean)) / r(sd)
}
list price mpg weight length zprice zmpg zweight zlength in 1/5
Python Version
import pandas as pd
from scipy.stats import zscore
exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
for var in ['price','mpg','weight','length']:
    exfile['z{}'.format(var)] = exfile[[var]].apply(zscore)
exfile.head()
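Note a subtle difference: Stata's r(sd) is the sample standard deviation (ddof=1), while scipy.stats.zscore defaults to the population version (ddof=0), so the two z-scores diverge slightly. To match Stata exactly, pass ddof=1 to zscore, or compute the z-score directly in pandas (mock prices shown so the snippet stands alone):

```python
import pandas as pd

# Mock price column (values illustrative).
exfile = pd.DataFrame({'price': [4099, 4749, 3799, 4816, 7827]})

# pandas' .std() uses ddof=1 by default, matching Stata's r(sd); the result
# therefore differs slightly from scipy.stats.zscore's ddof=0 default.
exfile['zprice'] = ((exfile['price'] - exfile['price'].mean())
                    / exfile['price'].std())
```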
Stata Version
clear all
input str10 Observer Day1 Day2 Day3
"Adam" 3 7 8
"Ken" 6 4 6
"Zita" 7 6 4
"Sam" 4 6 2
end
forvalues i = 1/3 {
    sum Day`i'
    gen zDay`i' = (Day`i' - r(mean)) / r(sd)
}
list
Python Version
import pandas as pd
from scipy.stats import zscore
df = pd.DataFrame({'Observer':['Adam','Ken','Zita','Ari','Sam'],
'Day1':[3,6,7,8,4],
'Day2':[7,4,6,5,6],
'Day3':[8,6,4,3,2]})
for i in range(1,4):
    df['zDay{}'.format(i)] = df[['Day{}'.format(i)]].apply(zscore)
df
Stata Version - Loop Through Files In Directory
// Get list of files that are csv
local files : dir . files "*.csv"
// Loop through file list and save each one in Stata .dta
foreach file in `files' {
    import delimited using "`file'"
    save "`file'.dta", replace
    clear
}
Python Version - Loop Through Files In Directory
import os
import pandas as pd
# Loop through files that are csv and save each in Stata .dta
for filename in os.listdir():
    if filename[-4:] == '.csv':
        pd.read_csv(filename).to_stata('{}.dta'.format(filename[:-4]))
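A pathlib-based variant of the same loop, which handles extension matching and renaming without string slicing. This sketch writes a sample CSV into a temporary directory purely so it is self-contained; in practice you would glob your real data directory:

```python
import pathlib
import tempfile
import pandas as pd

# Work in a temporary directory with one sample CSV (purely illustrative).
tmp = pathlib.Path(tempfile.mkdtemp())
pd.DataFrame({'a': [1, 2]}).to_csv(tmp / 'example.csv', index=False)

# glob('*.csv') matches the same files as the extension check above,
# and with_suffix swaps .csv for .dta.
for path in tmp.glob('*.csv'):
    pd.read_csv(path).to_stata(path.with_suffix('.dta'))
```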
Stata Version - Loop Through Levels Of A Categorical
sysuse auto
// Use the repair record categorical as an example
levelsof rep78, local(levels)
foreach lev of local levels {
    // Perform any function specific to each lev (summary stats for price)
    sum price if rep78 == `lev'
}
Python Version - Loop Through Levels Of A Categorical
import pandas as pd
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
# Use the repair record categorical as an example
for lev in df['rep78'].unique():
    # Perform any function specific to each lev (summary stats for price)
    print(f'\n\nDisplay results for repair record {lev}')
    print(df[df['rep78'] == lev]['price'].describe())
A problem that happens when saving to Stata is that pandas.DataFrame.to_stata sometimes writes unicode characters even though the format it uses does not support unicode. Documented on this issue. A simplistic explanation is that unicode characters can throw off the expected character count in Stata data. A crude solution is to make sure each character occupies only one character space:
# Define function that finds and replaces offensive characters.
def fix_char_ct(bad_text):
    ret_txt = ''
    for item in bad_text:
        ret_txt += item if len(item.encode(encoding='utf_8')) == 1 else ''
    return(ret_txt)
# Use apply to clean problematic text.
df['Problematic_Txt'] = df['Problematic_Txt'].apply(fix_char_ct)
Another frequent problem is that pandas.DataFrame.to_stata seems to have trouble writing the object data type. A solution is:
# Define function that finds object data types, converts to string.
def obj_to_string(df):
    for obj_col in list(df.select_dtypes(include=['object']).columns):
        df[obj_col] = df[obj_col].astype(str)
    return(df)
# Pass dataframe with object data types to function.
df = obj_to_string(df)
Acceptable variable names in Stata are more limited than those in Pandas. To help, pandas.DataFrame.to_stata is able to make corrections automatically. However, sometimes the default corrections might not be preferred. A solution is to rename columns before writing to Stata:
# This function cleans a string so that only letters a-z and digits 0-9 remain.
# Also removes spaces. Optional case argument controls variable name character case.
def clean_word(word, *, case='lower'):
    import re
    if case == 'lower':
        return(''.join(re.findall(r'[a-zA-Z0-9]', word.lower())))
    elif case == 'upper':
        return(''.join(re.findall(r'[a-zA-Z0-9]', word.upper())))
    elif case == 'asis':
        return(''.join(re.findall(r'[a-zA-Z0-9]', word)))
    else:
        raise Exception('Argument (case) incorrectly specified. '
                        'Default is "lower". Alternate options '
                        'are "upper" and "asis".')
# This function cleans a list of column names so that only letters a-z and
# digits 0-9 remain. Also removes spaces. Makes sure each column name is
# unique. If duplicates are present, a warning prints with an explanation.
def clean_cols(clst, *, case='lower'):
    import warnings
    newcols = []
    for col in clst:
        newcols.append(clean_word(col, case=case))
    if len(clst) != len(set(newcols)):
        warnings.warn('\nDuplicates in column list. '
                      '\nDuplicates appended with location.')
        newestcols = []
        suffix = 0
        for i in newcols:
            if newcols.count(i) > 1:
                newestcols.append(i + str(suffix))
            else:
                newestcols.append(i)
            suffix += 1
        return(newestcols)
    else:
        return(newcols)
# Using the above functions.
df.columns = clean_cols(df.columns)
# or
df.columns = clean_cols(df.columns, case='upper')
# or
df.columns = clean_cols(df.columns, case='asis')
See here for a demonstration of writing Variable Labels.
Quickly display table (a Pandas DataFrame) that lists variables, variable descriptions (variable labels), and summary statistics.
import pandas as pd
auto = 'http://www.stata-press.com/data/r15/auto.dta'
reader = pd.io.stata.StataReader(auto)
exfile = pd.read_stata(auto)
pd.merge(pd.DataFrame(reader.variable_labels(), index=['Label']).transpose(),
exfile.describe(include='all').transpose(),
left_index=True,
right_index=True)
Label | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
make | Make and Model | 74 | 74 | Pont. Le Mans | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
price | Price | 74 | NaN | NaN | NaN | 6165.26 | 2949.5 | 3291 | 4220.25 | 5006.5 | 6332.25 | 15906 |
mpg | Mileage (mpg) | 74 | NaN | NaN | NaN | 21.2973 | 5.7855 | 12 | 18 | 20 | 24.75 | 41 |
rep78 | Repair Record 1978 | 69 | NaN | NaN | NaN | 3.4058 | 0.989932 | 1 | 3 | 3 | 4 | 5 |
headroom | Headroom (in.) | 74 | NaN | NaN | NaN | 2.99324 | 0.845995 | 1.5 | 2.5 | 3 | 3.5 | 5 |
trunk | Trunk space (cu. ft.) | 74 | NaN | NaN | NaN | 13.7568 | 4.2774 | 5 | 10.25 | 14 | 16.75 | 23 |
weight | Weight (lbs.) | 74 | NaN | NaN | NaN | 3019.46 | 777.194 | 1760 | 2250 | 3190 | 3600 | 4840 |
length | Length (in.) | 74 | NaN | NaN | NaN | 187.932 | 22.2663 | 142 | 170 | 192.5 | 203.75 | 233 |
turn | Turn Circle (ft.) | 74 | NaN | NaN | NaN | 39.6486 | 4.39935 | 31 | 36 | 40 | 43 | 51 |
displacement | Displacement (cu. in.) | 74 | NaN | NaN | NaN | 197.297 | 91.8372 | 79 | 119 | 196 | 245.25 | 425 |
gear_ratio | Gear Ratio | 74 | NaN | NaN | NaN | 3.01486 | 0.456287 | 2.19 | 2.73 | 2.955 | 3.3525 | 3.89 |
foreign | Car type | 74 | 2 | Domestic | 52 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
This is an implementation of 10 Simple Hacks To Speed... Data Analysis... but with Stata's auto.dta
. The article's implementation was pre version 2.0.0. The implementation below uses 2.0.0+ syntax.
pip install pandas-profiling
# or
conda install -c anaconda pandas-profiling
import pandas as pd
import pandas_profiling
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
profile = df.profile_report()
rejected = profile.get_rejected_variables()
df.profile_report(title='Stata Auto.dta Pandas Profiled',
correlation_overrides=[rejected])
profile.to_file(output_file='Stata_Auto.dta_Profile.html')
The Stata_Auto.dta_Profile.html
output is available here.
- Pandas cheatsheet
- Stata cheatsheets
- Guide to Encoding Categorical Values in Python
- Ordinary Least Squares Regression Starting Points
Send me your questions, comments, contributions, and tell me what I did wrong.
Fork and pull requests welcome.