Though the above may be a great way to do it, it certainly seems complex. There is a much easier way to achieve similar results that is easier on the eyes (and brain!): pandas’ aggregate() method with tuple assignments. Using aggregate after grouping is arguably the easiest approach to follow, since it lets us use the very simple format new_column_name = ('old_column', 'agg_func'). So, for example:
+import pandas as pd
+
+# Pull in data on storms
+storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv')
+
+# Use groupby to group the columns and perform group calculations
+
+# The below calculations aren't particularly indicative of a good analysis,
+# but give a quick look at a few of the calculations you can do
+df = (
+    storms
+    .groupby(by=['name', 'year', 'month', 'day'])  # group
+    .aggregate(
+        avg_wind=('wind', 'mean'),
+        max_wind=('wind', 'max'),
+        med_wind=('wind', 'median'),
+        std_pressure=('pressure', 'std'),
+        first_year=('year', 'first')
+    )
+    .reset_index()  # Somewhat similar to ungroup. Removes the grouping from the index
+)
+
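These tuple pairs are what pandas calls “named aggregation”. If you prefer something more explicit, the same call can be written with pd.NamedAgg; here is a small sketch reusing the storms data loaded above (the name df_named is just for illustration):
+# Equivalent named aggregation spelled out with pd.NamedAgg
+df_named = (
+    storms
+    .groupby(by=['name', 'year', 'month', 'day'])
+    .aggregate(avg_wind=pd.NamedAgg(column='wind', aggfunc='mean'))
+    .reset_index()
+)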
R
diff --git a/Data_Manipulation/creating_categorical_variables.html b/Data_Manipulation/creating_categorical_variables.html
index 48f7cb71..fb5696e4 100644
--- a/Data_Manipulation/creating_categorical_variables.html
+++ b/Data_Manipulation/creating_categorical_variables.html
@@ -262,6 +262,59 @@
axis=1)
There’s quite a bit to unpack here! .apply(lambda x: ..., axis=1) applies a lambda function rowwise to the entire dataframe, with individual columns accessed by, for example, x['mpg']. (You can apply functions on an index using axis=0.) The next function returns the next entry in an iterator that evaluates to true or exists (so in this case it will just return the first entry that exists). Finally, key for key, value in conds_dict.items() if value(x) iterates over the pairs in the dictionary and returns only the condition names (the ‘keys’ in the dictionary) for conditions (the ‘values’ in the dictionary) that evaluate to true.
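If the next(...) pattern is unfamiliar, here is a minimal toy illustration (with made-up conditions, not the mtcars data) of how it returns the key of the first condition that evaluates to true:
+toy_conds = {
+    'small': lambda x: x < 10,
+    'large': lambda x: x >= 10,
+}
+# next() walks the generator and stops at the first key whose condition is True
+print(next(key for key, value in toy_conds.items() if value(42)))  # 'large'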
+
Once again, just like R, Python has many ways of doing the same thing. Some are more complex but efficient at runtime, while others are slightly slower but far easier to understand and follow along with because their syntax is closer to natural language. For this example, we will use numpy and pandas together to achieve both an efficient runtime and a relatively simple syntax.
+
+from seaborn import load_dataset
+import pandas as pd
+import numpy as np
+
+mtcars = load_dataset('mpg')
+
+# Create our list of index selections
+conditionList = [
+    (mtcars['mpg'] <= 19) & (mtcars['horsepower'] <= 123),
+    (mtcars['mpg'] > 19) & (mtcars['horsepower'] <= 123),
+    (mtcars['mpg'] <= 19) & (mtcars['horsepower'] > 123),
+    (mtcars['mpg'] > 19) & (mtcars['horsepower'] > 123)
+]
+
+# Create the results we will pair with the above index selections
+resultList = [
+    'Efficient and Non-powerful',
+    'Inefficient and Non-powerful',
+    'Efficient and Powerful',
+    'Inefficient and Powerful'
+]
+
+
+df = (
+    mtcars
+    .assign(
+        # Run the numpy select
+        classification=np.select(condlist=conditionList,
+                                 choicelist=resultList,
+                                 default='Not Considered')
+    )
+    # Convert from object to categorical
+    .astype({'classification': 'category'})
+)
+
+
+
+"""
+Be a more purposeful programmer/analyst/data scientist:
+
+Using the default parameter in np.select() lets you fill in that
+specific text wherever none of your criteria are met. For example,
+if you inspect this data, you will see there are a few rows where
+horsepower is null. The original criteria we built do not consider
+null, so those rows are populated with "Not Considered", allowing
+you to find those values and correct them, or set checks for them
+in a pipeline.
+"""
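As a quick sketch of that idea (reusing the df built above), you can surface the rows that fell through to the default label and inspect them:
+# Rows that did not match any condition -- here, those with missing horsepower
+not_considered = df[df['classification'] == 'Not Considered']
+print(not_considered[['mpg', 'horsepower', 'classification']])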
+
+
A very simple way to get a basic KNN working in Python is to leverage the knowledge of the many smart people who contribute to the scikit-learn library (sklearn). It is a powerhouse of machine learning models and also offers other very useful tools like data splitting, model evaluation, and feature selection.
+
+# Import Libraries
+from seaborn import load_dataset
+import seaborn as sns
+from sklearn.model_selection import train_test_split
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.metrics import accuracy_score
+
+# Load a sample dataset
+iris_df = load_dataset('iris')
+
+
+# Quick and rough sketch comparing the petal features to species
+sns.scatterplot(data=iris_df, x='petal_length', y='petal_width', hue='species')
+
+
+# Quick and rough sketch comparing the sepal features to species
+sns.scatterplot(data=iris_df, x='sepal_length', y='sepal_width', hue='species')
+
+
+
+# Let's separate the data into X and Y (features and target)
+X = iris_df.drop(columns='species')
+Y = iris_df['species']
+
+
+# Split the data into training and testing for model evaluation
+X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=.70, shuffle=True,
+                                                     random_state=777)
+
+
+# Iterate through different numbers of neighbors to find the best accuracy with N neighbors.
+accuracies = {}
+for i in range(1, 15):
+    clf = KNeighborsClassifier(n_neighbors=i)
+
+    clf.fit(X=X_train, y=y_train)
+    y_pred = clf.predict(X_test)
+
+    accu_score = accuracy_score(y_true=y_test, y_pred=y_pred)
+    accuracies[i] = accu_score
+
+sns.lineplot(x=list(accuracies.keys()), y=list(accuracies.values())).set_title('Accuracies by N-Neighbors')
+
+# Looks like about 8 is the first best accuracy, so we'll go with that.
+print(f"{accuracies[8]:.1%}")  # 100% accuracy for 8 neighbors.
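Having settled on 8 neighbors, a short follow-up (the variable names here are just illustrative) is to refit the classifier with that setting and confirm its accuracy on the held-out test set:
+# Refit with the chosen number of neighbors and score on the test set
+final_clf = KNeighborsClassifier(n_neighbors=8)
+final_clf.fit(X=X_train, y=y_train)
+final_accuracy = accuracy_score(y_true=y_test, y_pred=final_clf.predict(X_test))
+print(f"Final model accuracy: {final_accuracy:.1%}")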
+
R
diff --git a/Other/import_a_foreign_data_file.html b/Other/import_a_foreign_data_file.html
index efd548b8..0e968dd5 100644
--- a/Other/import_a_foreign_data_file.html
+++ b/Other/import_a_foreign_data_file.html
@@ -271,6 +271,30 @@
usingXLSX# load data from the specified sheet in the file and convert it to a DataFrame for analysisdf=DataFrame(XLSX.readtable("filename.xlsx","mysheet"))
+
+
+ Python
+
+
You’ll most often be relying on pandas to read in data. Though many other formats exist, the reason you’re pulling in data is usually to work with it, transform it, and manipulate it, and pandas lends itself extremely well to this purpose. Sometimes you may have to work with messier data from APIs, where you’ll navigate through hierarchies of dictionaries using the .keys() method and selecting levels, but that is handled on a case-by-case basis and can’t be covered exhaustively here. However, some of the most common formats are covered below: csv, Excel (xlsx), and .RData files.
+
You, of course, always have the default open() function, but that can get much more complex.
+
+# Reading .RData files
+import pyreadr
+
+rds_data = pyreadr.read_r('sales_data.Rdata')  # Object is a dictionary
+
+# 'sales' is the name of the dataframe; if unnamed, you may have to pass None as the name (no quotes)
+df_r = rds_data['sales']
+df_r.head()
+
+
+# Other common file reads all use pandas. The two most common are shown (csv/xlsx)
+import pandas as pd
+
+csv_file = pd.read_csv('filename.csv')
+xlsx_file = pd.read_excel('filename.xlsx', sheet_name='Sheet1')
+
+# Pandas can also read html, json, etc.
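For the API case mentioned above, here is one minimal sketch (using a made-up nested response, since every API differs) of inspecting a dictionary with .keys() and then flattening the level you want, reusing the pandas import from the block above:
+# Hypothetical nested API response
+response = {'results': [{'id': 1, 'info': {'name': 'A', 'value': 10}},
+                        {'id': 2, 'info': {'name': 'B', 'value': 20}}]}
+
+print(response.keys())  # dict_keys(['results'])
+
+# Flatten the chosen level into a dataframe
+df_api = pd.json_normalize(response['results'])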
+
R
diff --git a/assets/js/search-data.json b/assets/js/search-data.json
index ce6b1b76..a2f39cc2 100644
--- a/assets/js/search-data.json
+++ b/assets/js/search-data.json
@@ -871,7 +871,7 @@
},"145": {
"doc": "K-Nearest Neighbor Matching",
"title": "Python",
- "content": "For KNN, it is not required to import packages other than numpy. You can basically do KNN with one package because it is mostly about computing distance and normalization. You would need TensorFlow and Keras as you try more advanced algorithms such as convolutional neural network. import argparse import numpy as np from collections import Counter # Process arguments for k-NN classification def handle_args(): parser = argparse.ArgumentParser(description= 'Make predictions using the k-NN algorithms.') parser.add_argument('-k', type=int, default=1, help='Number of nearest neighbors to consider') parser.add_argument('--varnorm', action='store_true', help='Normalize features to zero mean and unit variance') parser.add_argument('--rangenorm', action='store_true', help='Normalize features to the range [-1,+1]') parser.add_argument('--exnorm', action='store_true', help='Normalize examples to unit length') parser.add_argument('train', help='Training data file') parser.add_argument('test', help='Test data file') return parser.parse_args() # Load data from a file def read_data(filename): data = np.genfromtxt(filename, delimiter=',', skip_header=1) x = data[:, 0:-1] y = data[:, -1] return (x,y) # Distance between instances x1 and x2 def dist(x1, x2): euclidean_distance = np.linalg.norm(x1 - x2) return euclidean_distance # Predict label for instance x, using k nearest neighbors in training data def classify(train_x, train_y, k, x): dists = np.sqrt(np.sum((x - train_x) ** 2, axis=1)) idx = np.argsort(dists, 0)[:k] k_labels = [train_y[index] for index in idx] prediction = list() prediction.append(max(k_labels, key=k_labels.count)) prediction = np.array(prediction) return prediction # Process the data to normalize features and/or examples. # NOTE: You need to normalize both train and test data the same way. def normalize_data(train_x, test_x, rangenorm, varnorm, exnorm): if rangenorm: train_x = 2 * (train_x - np.min(train_x, axis=0)) / np.nan_to_num(np.ptp(train_x, axis=0)) - 1 test_x = 2 * (test_x - np.min(test_x, axis=0)) / np.nan_to_num(np.ptp(train_x, axis=0)) - 1 pass if varnorm: train_x = (train_x - np.mean(train_x, axis=0)) / np.nan_to_num(np.std(train_x, axis=0)) test_x = (test_x - np.mean(test_x, axis=0)) / np.nan_to_num(np.std(test_x, axis=0)) pass if exnorm: for i in train_x: train_x = i / np.linalg.norm(i) for k in test_x: test_x = k / np.linalg.norm(k) pass return train_x, test_x # Run classifier and compute accuracy def runTest(test_x, test_y, train_x, train_y, k): correct = 0 for (x,y) in zip(test_x, test_y): if classify(train_x, train_y, k, x) == y: correct += 1 acc = float(correct)/len(test_x) return acc # Load train and test data. Learn model. Report accuracy. def main(): args = handle_args() # Read in lists of examples. Each example is a list of attribute values, # where the last element in the list is the class value. (train_x, train_y) = read_data(args.train) (test_x, test_y) = read_data(args.test) # Normalize the training data (train_x, test_x) = normalize_data(train_x, test_x, args.rangenorm, args.varnorm, args.exnorm) acc = runTest(test_x, test_y,train_x, train_y,args.k) print(\"Accuracy: \",acc) if __name__ == \"__main__\": main() . ",
+ "content": "For KNN, it is not required to import packages other than numpy. You can basically do KNN with one package because it is mostly about computing distance and normalization. You would need TensorFlow and Keras as you try more advanced algorithms such as convolutional neural network. import argparse import numpy as np from collections import Counter # Process arguments for k-NN classification def handle_args(): parser = argparse.ArgumentParser(description= 'Make predictions using the k-NN algorithms.') parser.add_argument('-k', type=int, default=1, help='Number of nearest neighbors to consider') parser.add_argument('--varnorm', action='store_true', help='Normalize features to zero mean and unit variance') parser.add_argument('--rangenorm', action='store_true', help='Normalize features to the range [-1,+1]') parser.add_argument('--exnorm', action='store_true', help='Normalize examples to unit length') parser.add_argument('train', help='Training data file') parser.add_argument('test', help='Test data file') return parser.parse_args() # Load data from a file def read_data(filename): data = np.genfromtxt(filename, delimiter=',', skip_header=1) x = data[:, 0:-1] y = data[:, -1] return (x,y) # Distance between instances x1 and x2 def dist(x1, x2): euclidean_distance = np.linalg.norm(x1 - x2) return euclidean_distance # Predict label for instance x, using k nearest neighbors in training data def classify(train_x, train_y, k, x): dists = np.sqrt(np.sum((x - train_x) ** 2, axis=1)) idx = np.argsort(dists, 0)[:k] k_labels = [train_y[index] for index in idx] prediction = list() prediction.append(max(k_labels, key=k_labels.count)) prediction = np.array(prediction) return prediction # Process the data to normalize features and/or examples. # NOTE: You need to normalize both train and test data the same way. def normalize_data(train_x, test_x, rangenorm, varnorm, exnorm): if rangenorm: train_x = 2 * (train_x - np.min(train_x, axis=0)) / np.nan_to_num(np.ptp(train_x, axis=0)) - 1 test_x = 2 * (test_x - np.min(test_x, axis=0)) / np.nan_to_num(np.ptp(train_x, axis=0)) - 1 pass if varnorm: train_x = (train_x - np.mean(train_x, axis=0)) / np.nan_to_num(np.std(train_x, axis=0)) test_x = (test_x - np.mean(test_x, axis=0)) / np.nan_to_num(np.std(test_x, axis=0)) pass if exnorm: for i in train_x: train_x = i / np.linalg.norm(i) for k in test_x: test_x = k / np.linalg.norm(k) pass return train_x, test_x # Run classifier and compute accuracy def runTest(test_x, test_y, train_x, train_y, k): correct = 0 for (x,y) in zip(test_x, test_y): if classify(train_x, train_y, k, x) == y: correct += 1 acc = float(correct)/len(test_x) return acc # Load train and test data. Learn model. Report accuracy. def main(): args = handle_args() # Read in lists of examples. Each example is a list of attribute values, # where the last element in the list is the class value. (train_x, train_y) = read_data(args.train) (test_x, test_y) = read_data(args.test) # Normalize the training data (train_x, test_x) = normalize_data(train_x, test_x, args.rangenorm, args.varnorm, args.exnorm) acc = runTest(test_x, test_y,train_x, train_y,args.k) print(\"Accuracy: \",acc) if __name__ == \"__main__\": main() . A very simple way to also get a very basic KNN down in Python is leverage the knowledge of the many smart people that contribute to sci-kit learn library (sklean) as it is a powerhouse of machine learning models, as well as other very useful tools like data splitting, model evaluation, and feature selections. 
#Import Libraries from seaborn import load_dataset import seaborn as sns from sklearn.model_selection import train_test_split from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import accuracy_score # Load a sample dataset iris_df = load_dataset('iris') # Quick and rough sketch comparing the petal feature to species sns.scatterplot(data=iris_df, x='petal_length', y='petal_width', hue='species') # Quick and rough sketch comparing the sepals feature to species sns.scatterplot(data=iris_df, x='sepal_length', y='sepal_width', hue='species') # Let's seperate the data into X and Y (features and target) X = iris_df.drop(columns='species') Y = iris_df['species'] # Split the data into training and testing for model evaluations X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=.70, shuffle=True, random_state=777) # Iterate through different neighbors to find the best accuracy with N neighbors. accuracies = {} errors = {} for i in range(1, 15): clf = KNeighborsClassifier(n_neighbors=i) clf.fit(X=X_train, y=y_train) y_pred = clf.predict(X_test) accu_score = accuracy_score(y_true=y_test, y_pred=y_pred) accuracies[i] = accu_score sns.lineplot(x=accuracies.keys(), y=accuracies.values()).set_title('Accuracies by N-Neighbors') # Looks like about 8 is the first best accuracy, so we'll go with that. print(f\"{accuracies[8]:.1%}\") #100% accuracy for 8 neighbors. ",
"url": "/Machine_Learning/Nearest_Neighbor.html#python",
"relUrl": "/Machine_Learning/Nearest_Neighbor.html#python"
},"146": {
@@ -1789,7 +1789,7 @@
},"298": {
"doc": "Creating a Variable with Group Calculations",
"title": "Python",
- "content": "pandas accomplishes this by using the groupby-transform approach. We can either call numpy’s mean function or use a lambda and apply the .mean() method to each group . import pandas as pd # Pull in data on storms storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv') # Use groupby and agg to perform a group calculation # Here it's a mean, but it could be any function storms['mean_wind'] = storms.groupby(['name','year','month','day'])['wind'].transform(lambda x: x.mean()) # this tends to be a bit faster because it uses an existing function instead of a lambda import numpy as np storms['mean_wind'] = storms.groupby(['name','year','month','day'])['wind'].transform(np.mean) . ",
+ "content": "pandas accomplishes this by using the groupby-transform approach. We can either call numpy’s mean function or use a lambda and apply the .mean() method to each group . import pandas as pd # Pull in data on storms storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv') # Use groupby and agg to perform a group calculation # Here it's a mean, but it could be any function storms['mean_wind'] = storms.groupby(['name','year','month','day'])['wind'].transform(lambda x: x.mean()) # this tends to be a bit faster because it uses an existing function instead of a lambda import numpy as np storms['mean_wind'] = storms.groupby(['name','year','month','day'])['wind'].transform(np.mean) . Though the above may be a great way to do it, it certainly seems complex. There is a much easier way to achieve similar results that is easier on the eyes (and brain!). This is using panda’s aggregate() method with tuple assignments. This results in the most easy-to-understand way, by using the aggregate method after grouping since this would allow us to follow a very simple format of new_column_name = ('old_column', 'agg_funct'). So, for example: . import pandas as pd # Pull in data on storms storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv') # Use groupby and group the columns and perform group calculations # The below calculations aren't particularly indicative of a good analysis, # but give a quick look at a few of the calculations you can do df = ( storms .groupby(by=['name', 'year', 'month', 'day']) #group .aggregate( avg_wind = ('wind', 'mean'), max_wind = ('wind', 'max'), med_wind = ('wind', 'median'), std_pressure = ('pressure', 'std'), first_year = ('year', 'first') ) .reset_index() # Somewhat similar to ungroup. Removes the grouping from the index ) . ",
"url": "/Data_Manipulation/creating_a_variable_with_group_calculations.html#python",
"relUrl": "/Data_Manipulation/creating_a_variable_with_group_calculations.html#python"
},"299": {
@@ -1825,7 +1825,7 @@
},"304": {
"doc": "Creating Categorical Variables",
"title": "Python",
- "content": "We can use the filtering operation in pandas to only assign the categorical value to the rows that satisfy the condition. import pandas as pd # and purely for the dataset import statsmodels.api as sm mtcars = sm.datasets.get_rdataset('mtcars').data # Now we go through each pair of conditions and group assignments, # using loc to only send that group assignment to observations # satisfying the given condition mtcars.loc[(mtcars.mpg <= 19) & (mtcars.hp <= 123), 'classification'] = 'Efficient and Non-Powerful' mtcars.loc[(mtcars.mpg > 19) & (mtcars.hp <= 123), 'classification'] = 'Inefficient and Non-Powerful' mtcars.loc[(mtcars.mpg <= 19) & (mtcars.hp > 123), 'classification'] = 'Efficient and Powerful' mtcars.loc[(mtcars.mpg > 19) & (mtcars.hp > 123), 'classification'] = 'Inefficient and Powerful' . There’s another way to achieve the same outcome using lambda functions. In this case, we’ll create a dictionary of pairs of classification names and conditions, for example 'Efficient': lambda x: x['mpg'] <= 19. We’ll then find the first case where the condition is true for each row and create a new column with the paired classification name. # Dictionary of classification names and conditions expressed as lambda functions conds_dict = { 'Efficient and Non-Powerful': lambda x: (x['mpg'] <= 19) & (x['hp'] <= 123), 'Inefficient and Non-Powerful': lambda x: (x['mpg'] > 19) & (x['hp'] <= 123), 'Efficient and Powerful': lambda x: (x['mpg'] <= 19) & (x['hp'] > 123), 'Inefficient and Powerful': lambda x: (x['mpg'] > 19) & (x['hp'] > 123), } # Find name of first condition that evaluates to True mtcars['classification'] = mtcars.apply(lambda x: next(key for key, value in conds_dict.items() if value(x)), axis=1) . There’s quite a bit to unpack here! .apply(lambda x: ..., axis=1) applies a lambda function rowwise to the entire dataframe, with individual columns accessed by, for example, x['mpg']. (You can apply functions on an index using axis=0.) The next keyword returns the next entry in a list that evaluates to true or exists (so in this case it will just return the first entry that exists). Finally, key for key, value in conds_dict.items() if value(x) iterates over the pairs in the dictionary and returns only the condition names (the ‘keys’ in the dictionary) for conditions (the ‘values’ in the dictionary) that evaluate to true. ",
+ "content": "We can use the filtering operation in pandas to only assign the categorical value to the rows that satisfy the condition. import pandas as pd # and purely for the dataset import statsmodels.api as sm mtcars = sm.datasets.get_rdataset('mtcars').data # Now we go through each pair of conditions and group assignments, # using loc to only send that group assignment to observations # satisfying the given condition mtcars.loc[(mtcars.mpg <= 19) & (mtcars.hp <= 123), 'classification'] = 'Efficient and Non-Powerful' mtcars.loc[(mtcars.mpg > 19) & (mtcars.hp <= 123), 'classification'] = 'Inefficient and Non-Powerful' mtcars.loc[(mtcars.mpg <= 19) & (mtcars.hp > 123), 'classification'] = 'Efficient and Powerful' mtcars.loc[(mtcars.mpg > 19) & (mtcars.hp > 123), 'classification'] = 'Inefficient and Powerful' . There’s another way to achieve the same outcome using lambda functions. In this case, we’ll create a dictionary of pairs of classification names and conditions, for example 'Efficient': lambda x: x['mpg'] <= 19. We’ll then find the first case where the condition is true for each row and create a new column with the paired classification name. # Dictionary of classification names and conditions expressed as lambda functions conds_dict = { 'Efficient and Non-Powerful': lambda x: (x['mpg'] <= 19) & (x['hp'] <= 123), 'Inefficient and Non-Powerful': lambda x: (x['mpg'] > 19) & (x['hp'] <= 123), 'Efficient and Powerful': lambda x: (x['mpg'] <= 19) & (x['hp'] > 123), 'Inefficient and Powerful': lambda x: (x['mpg'] > 19) & (x['hp'] > 123), } # Find name of first condition that evaluates to True mtcars['classification'] = mtcars.apply(lambda x: next(key for key, value in conds_dict.items() if value(x)), axis=1) . There’s quite a bit to unpack here! .apply(lambda x: ..., axis=1) applies a lambda function rowwise to the entire dataframe, with individual columns accessed by, for example, x['mpg']. (You can apply functions on an index using axis=0.) The next keyword returns the next entry in a list that evaluates to true or exists (so in this case it will just return the first entry that exists). Finally, key for key, value in conds_dict.items() if value(x) iterates over the pairs in the dictionary and returns only the condition names (the ‘keys’ in the dictionary) for conditions (the ‘values’ in the dictionary) that evaluate to true. Once again, just like R, Python has many ways of doing the same thing. Some with more complex, but efficient (runtime) manners, while others being slightly slower but many times easier to understand and follow-along with it’s closeness of natural-language syntax. So, for this example, we will use numpy and pandas together, to achieve both an efficient runtime and a relatively simple syntax. 
from seaborn import load_dataset import pandas as pd import numpy as np mtcars = load_dataset('mpg') # Create our list of index selections conditionList = [ (mtcars['mpg'] <= 19) & (mtcars['horsepower'] <= 123), (mtcars['mpg'] > 19) & (mtcars['horsepower'] <= 123), (mtcars['mpg'] <= 19) & (mtcars['horsepower'] > 123), (mtcars['mpg'] > 19) & (mtcars['horsepower'] > 123) ] # Create the results we will pair with the above index selections resultList = [ 'Efficient and Non-powerful', 'Inefficient and Non-powerful', 'Efficient and Powerful', 'Inefficient and Powerful' ] df = ( mtcars .assign( # Run the numpy select classification = np.select(condlist=conditionList, choicelist=resultList, default='Not Considered' ) ) # Convert from object to categorical .astype({'classification' :'category'}) ) \"\"\" Be a more purposeful programmer/analyst/data scientist: Using the default parameter in np.select() allows you to fill in the values with that specific text wherever your criteria is not considered. For example, if you search this data, you will see there are a few rows where horesepower is null. The original criteria we built does not considering null, so it would be populated with \"Not Considered\" allowing you to find those values and correct them, or set checks for them in a pipeline. \"\"\" . ",
"url": "/Data_Manipulation/creating_categorical_variables.html#python",
"relUrl": "/Data_Manipulation/creating_categorical_variables.html#python"
},"305": {
@@ -2819,1674 +2819,1680 @@
"url": "/Other/import_a_foreign_data_file.html#julia",
"relUrl": "/Other/import_a_foreign_data_file.html#julia"
},"470": {
+ "doc": "Import a Foreign Data File",
+ "title": "Python",
+ "content": "You’ll most often be relying on Pandas to read in data. Though many other forms exist, the reason you’ll be pulling in data is usually to work with the data, transform, and manipulate it. Panda lends itself extremely well for this purpose. Sometime you may have to work with much more messy data with APIs where you’ll navigate through hierarchies of dictionaries using the .keys() method and selecting levels, but that is handled on a case-by-case basis and impossible to cover here. However, some of the most common will be covered. Those are csv, excel (xlsx), and .RData files. You, of course, always have the default open() function, but that can get much more complex. # Reading .RData files import pyreadr rds_data = pyreadr.read_r('sales_data.Rdata') #Object is a dictionary #Sales is the name of the dataframe, if unnamed, you may have to pass \"None\" as the name (no quotes) df_r = rds_data['sales'] df_r.head() # Other common file reads, all use pandas. Most common two shown (csv/xlsx) import pandas as pd csv_file = pd.read_csv('filename.csv') xlsx_file = pd.read_excel('filename.xlsx', sheet_name='Sheet1') #Pandas can also read html, jsons, etc.... ",
+ "url": "/Other/import_a_foreign_data_file.html#python",
+ "relUrl": "/Other/import_a_foreign_data_file.html#python"
+ },"471": {
"doc": "Import a Foreign Data File",
"title": "R",
"content": "# Generally, you may use the rio package to import any tabular data type to be read in fluently without requiring a specification of the file type. library(rio) data <- import('filename.xlsx') data <- import('filename.dta') data <- import('filename.sav') library(readxl) data <- read_excel('filename.xlsx') # Read Stata, SAS, and SPSS files with the haven package # install.packages('haven') library(haven) data <- read_stata('filename.dta') data <- read_spss('filename.sav') # read_sas also supports .sas7bcat, or read_xpt supports transport files data <- read_sas('filename.sas7bdat') # Read lots of other types with the foreign package # install.packages('foreign') library(foreign) data <- read.arff('filename.arff') data <- read.dbf('filename.dbf') data <- read.epiinfo('filename.epiinfo') data <- read.mtb('filename.mtb') data <- read.octave('filename.octave') data <- read.S('filename.S') data <- read.systat('filename.systat') . ",
"url": "/Other/import_a_foreign_data_file.html#r",
"relUrl": "/Other/import_a_foreign_data_file.html#r"
- },"471": {
+ },"472": {
"doc": "Import a Foreign Data File",
"title": "Stata",
"content": "Stata can import foreign files using the File -> Import menu. Alternately, you can use the import command: . import type using filename . where type can be excel, spss, sas, haver, or dbase (import can also be used to download data directly from sources like FRED). ",
"url": "/Other/import_a_foreign_data_file.html#stata",
"relUrl": "/Other/import_a_foreign_data_file.html#stata"
- },"472": {
+ },"473": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Import a Delimited Data File (CSV, TSV)",
"content": "Often, data is stored in delimited files. In delimited files, each record has its own line, but the columns or variables are separated by a character, or delimiter. The most common delimiter used is a comma. As a result, you will often encounter comma-delimited files by their more common name, comma-separated values or CSV files. Importing these files is often the first step of any data analysis project, so we show you how to import CSVs (and other delimited files) below. ",
"url": "/Other/importing_delimited_files.html",
"relUrl": "/Other/importing_delimited_files.html"
- },"473": {
+ },"474": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Keep in Mind",
"content": ". | Sometimes delimiting characters also appear in strings in the data - this can cause your program to read the data improperly since it assumes that a new column is starting every time it sees that character. Good data stewards won’t let this happen, but when it does happen it can be a real headache. Be on the lookout for that if your data seems to be reading in improperly. | When starting out, it can be confusing to know that you are working with a CSV file because you can open CSVs in Excel and they look like normal spreadsheets. Because many software packages have different procedures for importing CSVs and Excel workbooks, the ability to open CSVs in Excel (and the fact that they often appear in your GUI with an Excel icon next to them because that is the default program used to open them) often leads users to want to use the import commands appropriate for Excel. Don’t be caught up by this pitfall; the failsafe way to look at the extension connected with your file name. CSV files will have a .csv extension, while Excel files end in .xls or .xlsx | Other common delimiters include tabs (TSV) and pipes: |. | . ",
"url": "/Other/importing_delimited_files.html#keep-in-mind",
"relUrl": "/Other/importing_delimited_files.html#keep-in-mind"
- },"474": {
+ },"475": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Also Consider",
"content": ". | Before doing this you will probably find it useful to Set a Working Directory | Import a foreign data file | Import a fixed-width data file | . ",
"url": "/Other/importing_delimited_files.html#also-consider",
"relUrl": "/Other/importing_delimited_files.html#also-consider"
- },"475": {
+ },"476": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Implementations",
"content": " ",
"url": "/Other/importing_delimited_files.html#implementations",
"relUrl": "/Other/importing_delimited_files.html#implementations"
- },"476": {
+ },"477": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Julia",
"content": "import Pkg; Pkg.add(\"CSV\") # This line and the next add the packages CSV and DataFrames to your Julia installation Pkg.add(\"DataFrames\") # They need to only be run once and not at all if you have previously installed the packages # Initialize the CSV and DataFrames packages (import also works in place of using, to make the analogy to Python's import more direct) using CSV, DataFrames # Import a CSV File from your local computer, if Scorecard.csv is in your working directory df = CSV.read(\"Scorecard.csv\", DataFrame) # Note, the DataFrame argument tells Julia to read the dataset into a DataFrame object # Read a CSV File from the web using HTTP # Bring in Julia's HTTP package to pull from the web df_web = CSV.read(HTTP.get(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv\").body, DataFrame) . ",
"url": "/Other/importing_delimited_files.html#julia",
"relUrl": "/Other/importing_delimited_files.html#julia"
- },"477": {
+ },"478": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Python",
"content": "The approach in Python uses pandas’s read_csv function and looks quite similar to Julia’s syntax. # Import a CSV File from your local machine df = pd.read_csv(\"Scorecard.csv\") # Import a CSV File from the web import pandas as pd # Make pandas available to your Python session df = pd.read_csv(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv\") . ",
"url": "/Other/importing_delimited_files.html#python",
"relUrl": "/Other/importing_delimited_files.html#python"
- },"478": {
+ },"479": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "R",
"content": "# Import a CSV file with the base-R (utils package) read.csv function df <- read.csv('Scorecard.csv') # If you are working in the tidyverse, there is the improved read_csv library(tidyverse) df <- read_csv('Scorecard.csv') # The fastest way to read in large CSV files is fread() in the data.table package library(data.table) df <- fread('Scorecard.csv') # In each of these cases you can open a CSV on the internet by just putting the URL in place of the file path df <- read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') df <- read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') df <- fread('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') . ",
"url": "/Other/importing_delimited_files.html#r",
"relUrl": "/Other/importing_delimited_files.html#r"
- },"479": {
+ },"480": {
"doc": "Import a Delimited Data File (CSV, TSV)",
"title": "Stata",
"content": "* Import a CSV File from your local machine import delimited Scorecard.csv, clear * Note that the \", clear\" option on all Stata import commands clears any data in memory before importing the dataset * Import a CSV File from the web import delimited \"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv\", clear . ",
"url": "/Other/importing_delimited_files.html#stata",
"relUrl": "/Other/importing_delimited_files.html#stata"
- },"480": {
+ },"481": {
"doc": "Instrumental Variables",
"title": "Instrumental Variables",
"content": "In the regression model . \\[Y = \\beta_0 + \\beta_1 X + \\epsilon\\] where \\(\\epsilon\\) is an error term, the estimated \\(\\hat{\\beta}_1\\) will not give the causal effect of \\(X\\) on \\(Y\\) if \\(X\\) is endogenous - that is, if \\(X\\) is related to \\(\\epsilon\\) and so determined by forces within the model (endogenous). One way to recover the causal effect of \\(X\\) on \\(Y\\) is to use instrumental variables. If there exists a variable \\(Z\\) that is related to \\(X\\) but is completely unrelated to \\(\\epsilon\\) (perhaps after adding some controls), then you can use instrumental variables estimation to isolate only the part of the variation in \\(X\\) that is explained by \\(Z\\). Naturally, then, this part of the variation is unrelated to \\(\\epsilon\\) because \\(Z\\) is unrelated to \\(\\epsilon\\), and you can get the causal effect of that part of \\(X\\). For more information, see Wikipedia: Instrumental variables estimation. ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html"
- },"481": {
+ },"482": {
"doc": "Instrumental Variables",
"title": "Keep in Mind",
"content": ". | Technically, all the variables in the model except for the dependent variable and the endogenous variables are “instruments”, including controls. However, it is also common to refer to only the excluded instruments (i.e., variables that are only used to predict the endogenous variable, not the dependent variable) as instruments. This page will follow that convention. | For instrumental variables to work, it must be the case that the instrument is only related to the outcome variable through other variables already included in the model like the endogenous variables or the controls. This is called the “validity” assumption and it cannot be verified in the data, only theoretically. Give serious consideration as to whether validity applies to your instrument before using instrumental variables. | You can check for the relevance of your instrument, which is how strongly related it is to your endogenous variable. A rule of thumb is that an joint F-test of the instruments should be at least 10, but this is only a rule of thumb, and imprecise (see Stock and Yogo 2005 for a more precise version of this test). In general, if the instruments are not very strong predictors of the endogenous variables, you should consider whether your analysis fits the assumptions necessary to run a weak-instrument-robust estimation method. See Hahn & Hausman 2003 for an overview. | Instrumental variables estimates a local average treatment effect - in other words, a weighted average of each individual observation’s treatment effect, where the weights are based on the strength of the effect of the instrument on the endogenous variable. Note both that this is not the same thing as an average treatment effect, which is an average of each individual’s treatment effect, which is usually what is desired, and also that if the instrumental variable has effects of different signs for different people (non-monotonicity), then the estimate isn’t really anything of interest. Be sure that monotonicity makes sense in your context before using instrumental variables. | Instrumental variables is a consistent estimator of a causal effect, but it is biased in finite samples. Be wary of using instrumental variables in small samples. | . ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html#keep-in-mind",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html#keep-in-mind"
- },"482": {
+ },"483": {
"doc": "Instrumental Variables",
"title": "Also Consider",
"content": ". | Instrumental variables methods generally rely on linearity assumptions, and if your dependent or endogenous variables are not continuous, their assumptions may not hold. Consider methods specially designed for nonlinear instrumental variables estimation. | There are many ways to estimate instrumental variables, not just two stage least squares. Different estimators such as GMM or k-class limited-information maximum likelihood estimators perform better or worse depending on heterogeneous treatment effects, heteroskedasticity, and sample size. Many instrumental variables estimation commands allow for multiple different estimation methods, described below. Note that in the just-identified case (where the number of instruments is the same as the number of endogenous variables), several common estimators produce identical results. | . ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html#also-consider",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html#also-consider"
- },"483": {
+ },"484": {
"doc": "Instrumental Variables",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html#implementations",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html#implementations"
- },"484": {
+ },"485": {
"doc": "Instrumental Variables",
"title": "Python",
"content": "The easiest way to run instrument variables regressions in Python is probably the linearmodels package, although there are other packages available. # Conda install linearmodels, pandas, and numpy, if you don't have them already from linearmodels.iv import IV2SLS import pandas as pd import numpy as np df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv', index_col=0) # We will use cigarette taxes as an instrument for cigarette prices # to evaluate the effect of cigarette price on log number of packs smoked # With income per capita as a control # Adjust everything for inflation df['rprice'] = df['price']/df['cpi'] df['rincome'] = df['income']/df['population']/df['cpi'] df['tdiff'] = (df['taxs'] - df['tax'])/df['cpi'] # Specify formula in format of 'y ~ exog + [endog ~ instruments]'. # The '1' on the right-hand side of the formula adds a constant. formula = 'np.log(packs) ~ 1 + np.log(rincome) + [np.log(rprice) ~ tdiff]' # Specify model and data mod = IV2SLS.from_formula(formula, df) # Fit model res = mod.fit() # Show model summary res.summary . ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html#python",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html#python"
- },"485": {
+ },"486": {
"doc": "Instrumental Variables",
"title": "R",
"content": "There are several ways to run instrumental variables in R. Here we will cover two - AER::ivreg(), which is probably the most common, and lfe::felm(), which is more flexible and powerful. You may also want to consider looking at estimatr::iv_robust, which combines much of the flexibility of lfe::felm() with the simple syntax of AER::ivreg(), although it is not as powerful. # If necessary, install both packages. # install.packages(c('AER','lfe')) # Load AER library(AER) # Load the Cigarettes data from ivreg, following the example data(CigarettesSW) # We will be using cigarette taxes as an instrument for cigarette prices # to evaluate the effect of cigarette price on log number of packs smoked # With income per capita as a control # Adjust everything for inflation CigarettesSW$rprice <- CigarettesSW$price/CigarettesSW$cpi CigarettesSW$rincome <- CigarettesSW$income/CigarettesSW$population/CigarettesSW$cpi CigarettesSW$tdiff <- (CigarettesSW$taxs - CigarettesSW$tax)/CigarettesSW$cpi # The regression formula takes the format # dependent.variable ~ endogenous.variables + controls | instrumental.variables + controls ivmodel <- ivreg(log(packs) ~ log(rprice) + log(rincome) | tdiff + log(rincome), data = CigarettesSW) summary(ivmodel) # Now we will run the same model with lfe::felm library(lfe) # The regression formula takes the format # dependent vairable ~ # controls | # fixed.effects | # (endogenous.variables ~ instruments) | # clusters.for.standard.errors # So if need be it is straightforward to adjust this example to account for # fixed effects and clustering. # Note the 0 indicating no fixed effects ivmodel2 <- felm(log(packs) ~ log(rincome) | 0 | (log(rprice) ~ tdiff), data = CigarettesSW) summary(ivmodel2) # felm can also use several k-class estimation methods; see help(felm) for the full list. # Let's run it with a limited-information maximum likelihood estimator with # the fuller adjustment set to minimize squared error (4). ivmodel3 <- felm(log(packs) ~ log(rincome) | 0 | (log(rprice) ~ tdiff), data = CigarettesSW, kclass = 'liml', fuller = 4) summary(ivmodel3) . ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html#r",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html#r"
- },"486": {
+ },"487": {
"doc": "Instrumental Variables",
"title": "Stata",
"content": "Instrumental variables estimation in Stata typically uses the built-in ivregress command. This command can be used to implement linear instrumental variables regression using two-stage least squares, GMM, or LIML . * Get Stock and Watson Cigarette data import delimited \"https://vincentarelbundock.github.io/Rdatasets/csv/Zelig/CigarettesSW.csv\", clear * Adjust everything for inflation g rprice = price/cpi g rincome = (income/population)/cpi g tdiff = (taxs - tax)/cpi * And take logs g lpacks = ln(packs) g lrincome = ln(rincome) g lrprice = ln(rprice) * The syntax for the regression is * name_of_estimator dependent_variable controls (endogenous_variable = instruments) * where name_of_estimator can be two stage least squares (2sls), * limited information maximum likelihood (liml, note that ivregress doesn't support k-class estimators), * or generalized method of moments (gmm) * Here we can run two stage least squares ivregress 2sls lpacks rincome (lrprice = tdiff) * Or gmm. ivregress gmm lpacks rincome (lrprice = tdiff) . ",
"url": "/Model_Estimation/Research_Design/instrumental_variables.html#stata",
"relUrl": "/Model_Estimation/Research_Design/instrumental_variables.html#stata"
- },"487": {
+ },"488": {
"doc": "Interaction Terms and Polynomials",
"title": "Interaction Terms and Polynomials",
"content": "Regression models generally assume that the outcome variable is a function of an index, which is a linear function of the independent variables, for example in ordinary least squares: . \\[Y = \\beta_0+\\beta_1X_1+\\beta_2X_2\\] However, if the independent variables have a nonlinear effect on the outcome, the model will be incorrectly specified. This is fine as long as that nonlinearity is modeled by including those nonlinear terms in the index. The two most common ways this occurs is by including interactions or polynomial terms. With an interaction, the effect of one variable varies according to the value of another: . \\[Y = \\beta_0+\\beta_1X_1+\\beta_2X_2 + \\beta_3X_1X_2\\] and with polynomial terms, the effect of one variable one the outcome is allowed to take a non-linear shape: . \\[Y = \\beta_0+\\beta_1X_1+\\beta_2X_2 + \\beta_3X_2^2 + \\beta_4X_2^3\\] ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html"
- },"488": {
+ },"489": {
"doc": "Interaction Terms and Polynomials",
"title": "Keep in Mind",
"content": ". | When you have interaction terms or polynomials, the effect of a variable can no longer be described with a single coefficient, and in some senses the individual coefficients lose meaning without the others. You can understand the effect of a single variable by taking the derivative of the index with respect to that variable. For example, in \\(Y = \\beta_0+\\beta_1X_1+\\beta_2X_2 + \\beta_3X_1X_2\\), the effect of \\(X_2\\) on \\(Y\\) is \\(\\partial Y/\\partial X_2 = \\beta_2 + \\beta_3X_1\\). You must plug in the value of \\(X_1\\) to get the effect of \\(X_2\\). Or in \\(Y = \\beta_0+\\beta_1X_1+\\beta_2X_2 + \\beta_3X_2^2 + \\beta_4X_2^3\\), the effect of \\(X_2\\) is \\(\\partial Y/\\partial X_2 = \\beta_2 + 2\\beta_3X_2 + 3\\beta_4X_2^2\\). You must plug in a value of \\(X_2\\) to get the marginal effect of \\(X_2\\) at that value. | In almost all cases, if you are including an interaction term, you should also include each of the interacted variables on their own. Otherwise, the coefficients become very difficult to interpret. | In almost all cases, if you are including a polynomial, you should include all terms of the polynomial. In other words, include the linear and squared term, not just the squared term. | . ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#keep-in-mind",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#keep-in-mind"
- },"489": {
+ },"490": {
"doc": "Interaction Terms and Polynomials",
"title": "Also Consider",
"content": ". | Interaction terms tend to have low statistical power. Consider performing a power analysis of interaction terms before running your analysis. | Polynomials are not the only way to model a nonlinear relationship. You could, for example, run one of many kinds of nonparametric regression. | You may want to get the average marginal effects or the marginal effects at the mean of your variables after running your model. | One common way to display the effects of a model with interactions is to graph them. See marginal effects plots for interactions with continuous variables and Marginal effects plots for interactions with continuous variables | . ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#also-consider",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#also-consider"
- },"490": {
+ },"491": {
"doc": "Interaction Terms and Polynomials",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#implementations",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#implementations"
- },"491": {
+ },"492": {
"doc": "Interaction Terms and Polynomials",
"title": "Julia",
"content": "Thanks to StatsModels.jl and GLM packages from the JuliaStats project we can match R and Python code very closely. using StatsModels, GLM, DataFrames, CSV # Load the R mtcars dataset from a URL mtcars = CSV.read(download(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Data/mtcars.csv\"), DataFrame) # Here we specify a model with linear, quadratic and cubic `hp` terms. # We can use any Julia functions and operators, including user-defined ones, # in a `@formula` expression. # We also specify `dropcollinear=false` otherwise `lm` function will drop # the intercept during fitting, as soon as the model's terms are not linearly # independent. That's a dubious thing to have in a presumably linear model, # but here we show only how to write down a particular model, and not what model # is the right one for the given data. :) model1 = lm(@formula(mpg ~ hp + hp^2 + hp^3 + cyl), mtcars, dropcollinear=false) print(model1) # Include an interaction term and the variables by themselves using `*` # The interaction term is represented by hp:cyl model2 = lm(@formula(mpg ~ hp * cyl), mtcars) print(model2) # Include only the interaction term and not the variables themselves with `&` # Hard to interpret! Occasionally useful though. model3 = lm(@formula(mpg ~ hp&cyl), mtcars) print(model3) . ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#julia",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#julia"
- },"492": {
+ },"493": {
"doc": "Interaction Terms and Polynomials",
"title": "Python",
"content": "Using the statsmodels package, we can use a similar formulation as the R example below. # Standard imports import numpy as np import pandas as pd import statsmodels.formula.api as sms from matplotlib import pyplot as plt # Load the R mtcars dataset from a URL df = pd.read_csv('https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv') # Include a linear, squared, and cubic term using the I() function. # N.B. Python uses ** for exponentiation (^ means bitwise xor) model1 = sms.ols('mpg ~ hp + I(hp**2) + I(hp**3) + cyl', data=df) print(model1.fit().summary()) # Include an interaction term and the variables by themselves using * # The interaction term is represented by hp:cyl model2 = sms.ols('mpg ~ hp * cyl', data=df) print(model2.fit().summary()) # Equivalently, you can request \"all quadratic interaction terms\" by doing model3 = sms.ols('mpg ~ (hp + cyl) ** 2', data=df) print(model3.fit().summary()) # Include only the interaction term and not the variables themselves with : # Hard to interpret! Occasionally useful though. model4 = sms.ols('mpg ~ hp : cyl', data=df) print(model4.fit().summary()) . ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#python",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#python"
- },"493": {
+ },"494": {
"doc": "Interaction Terms and Polynomials",
"title": "R",
"content": "# Load mtcars data data(mtcars) # Include a linear, squared, and cubic term using the I() function model1 <- lm(mpg ~ hp + I(hp^2) + I(hp^3) + cyl, data = mtcars) # Include a linear, squared, and cubic term using the poly() function # The raw = TRUE option will give the exact same result as model1 # Omitting this will give you orthogonal polynomial terms, # which are not correlated with each other but are more difficult to interpret model2 <- lm(mpg ~ poly(hp, 3, raw = TRUE) + cyl, data = mtcars) # Include an interaction term and the variables by themselves using * model3 <- lm(mpg ~ hp*cyl, data = mtcars) # Include only the interaction term and not the variables themselves with : # Hard to interpret! Occasionally useful though. model4 <- lm(mpg ~ hp:cyl, data = mtcars) . ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#r",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#r"
- },"494": {
+ },"495": {
"doc": "Interaction Terms and Polynomials",
"title": "Stata",
"content": "Stata allows interaction and polynomial terms using hashtags ## to join together variables to make interactions, or joining a variable with itself to get a polynomial. You must also specify whether each variable is continuous (prefix the variable with c.) or a factor (prefix with i.). * Load auto data sysuse auto.dta, clear * Use ## to interact variables together and also include the variables individually * foreign is a factor variable so we prefix it with i. * weight is continuous so we prefix it with c. reg mpg c.weight##i.foreign * Use # to include just the interaction term and not the variables themselves * If one is a factor, this will include the effect of the continuous variable * For each level of the factor reg mpg c.weight#i.foreign * Interact a variable with itself to create a polynomial term reg mpg c.weight##c.weight##c.weight foreign . It is also possible to use other type of functions and obtain correct marginal effects. For example: Say that you want to estimate the model: . \\[y = a_0 + a_1 * x + a_2 * 1/x + e\\] and you want to estimate the marginal effects with respect to \\(x\\). You can do this as follows: . * requires package f_able ssc install f_able * Load auto data sysuse auto.dta, clear * create function using \"fgen\" fgen _1_price = 1/price reg mpg _1_price price * indicates which variable is a \"constructed\" variable f_able _1_price, auto * estimate marginal effects margins, dydx(price) * How do you know it works? Use NL to verify nl (mpg = {a0} + {a1} * price + {a2}*1/price), var(price) margins, dydx(price) . ",
"url": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#stata",
"relUrl": "/Model_Estimation/OLS/interaction_terms_and_polynomials.html#stata"
- },"495": {
+ },"496": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "Line Graph with Labels at the Beginning or End of Lines",
"content": "A line graph is a common way of showing how a value changes over time (or over any other x-axis where there’s only one observation per x-axis value). It is also common to put several line graphs on the same set of axes so you can see how multiple values are changing together. When putting multiple line graphs on the same set of axes, a good idea is to label the different lines on the lines themselves, rather than in a legend, which generally makes things easier to read. ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html"
- },"496": {
+ },"497": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "Keep in Mind",
"content": ". | Check the resulting graph to make sure that labels are legible, visible in the graph area, and don’t overlap. | . ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#keep-in-mind",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#keep-in-mind"
- },"497": {
+ },"498": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "Also Consider",
"content": ". | More generally, see Line graph and Styling line graphs. In particular, consider Styling line graphs in order to distinguish the lines by color, pattern, etc. in addition to labels | If there are too many lines to be able to clearly follow them, labels won’t help too much. Instead, consider Faceted graphs. | . ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#also-consider",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#also-consider"
- },"498": {
+ },"499": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "Implementations",
"content": " ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#implementations",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#implementations"
- },"499": {
+ },"500": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "Python",
"content": "There isn’t a quick, declarative way to add text labels to lines with the most popular libraries. So, in the example below, we’ll add labels to lines using the imperative (build what you want) tools of plotting library matplotlib, creating the lines themselves with declarative plotting library seaborn. You may need to install the packages using pip install packagename or conda install packagename before you begin. import pandas as pd import seaborn as sns import matplotlib.pyplot as plt import numpy as np import matplotlib.dates as mdates # Read in the data df = pd.read_csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv', parse_dates=['date']) # Create the column we wish to plot title = 'Log of Google Trends Index' df[title] = np.log(df['hits']) # Set a style for the plot plt.style.use('ggplot') # Make a plot fig, ax = plt.subplots() # Add lines to it sns.lineplot(ax=ax, data=df, x=\"date\", y=title, hue=\"name\", legend=None) # Add the text--for each line, find the end, annotate it with a label, and # adjust the chart axes so that everything fits on. for line, name in zip(ax.lines, df.columns.tolist()): y = line.get_ydata()[-1] x = line.get_xdata()[-1] if not np.isfinite(y): y=next(reversed(line.get_ydata()[~line.get_ydata().mask]),float(\"nan\")) if not np.isfinite(y) or not np.isfinite(x): continue text = ax.annotate(name, xy=(x, y), xytext=(0, 0), color=line.get_color(), xycoords=(ax.get_xaxis_transform(), ax.get_yaxis_transform()), textcoords=\"offset points\") text_width = (text.get_window_extent( fig.canvas.get_renderer()).transformed(ax.transData.inverted()).width) if np.isfinite(text_width): ax.set_xlim(ax.get_xlim()[0], text.xy[0] + text_width * 1.05) # Format the date axis to be prettier. ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%d')) ax.xaxis.set_minor_locator(mdates.DayLocator()) ax.xaxis.set_major_locator(mdates.AutoDateLocator(interval_multiples=False)) plt.tight_layout() plt.show() . ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#python",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#python"
- },"500": {
+ },"501": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "R",
"content": "# If necessary, install ggplot2, lubridate, and directlabels # install.packages(c('ggplot2','directlabels', 'lubridate')) library(ggplot2) library(directlabels) # Load in Google Trends Nobel Search Data # Which contains the Google Trends global search popularity index for the four # research-based Nobel prizes over a month. df <- read.csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv') # Properly treat our date variable as a date # Not necessary in all applications of this technique. df$date <- lubridate::ymd(df$date) # Construct our standard ggplot line graph # Drawing separate lines by name # And using the log of hits for visibility ggplot(df, aes(x = date, y = log(hits), color = name)) + labs(x = \"Date\", y = \"Log of Google Trends Index\")+ geom_line()+ # Since we are about to add line labels, we don't need a legend theme(legend.position = \"none\") + # Add, from the directlabels package, # geom_dl, using method = 'last.bumpup' to put the # labels at the end, and make sure that if they intersect, # one is bumped up geom_dl(aes(label = name), method = 'last.bumpup') + # Extend the x axis so the labels are visible - # Try the graph a few times until you find a range that works scale_x_date(limits = c(min(df$date), lubridate::ymd('2019-10-25'))) . This results in: . ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#r",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#r"
- },"501": {
+ },"502": {
"doc": "Line Graph with Labels at the Beginning or End of Lines",
"title": "Stata",
"content": "Unfortunately, performing this technique in Stata requires placing each text() label on the graph. However, this can be automated with the use of a for loop to build the code using locals. * Load in Google Trends Nobel Search Data * Which contains the Google Trends global search popularity index for the four * research-based Nobel prizes over a month. import delimited \"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Presentation/Figures/Data/Line_Graph_with_Labels_at_the_Beginning_or_End_of_Lines/Research_Nobel_Google_Trends.csv\", clear * Convert the date variable to an actual date * (not necessary in all implementations) g ymddate = date(date, \"YMD\") * Format the new variable as a date so we see it properly on the x-axis format ymddate %td * Graph log(hits) for visibility g loghits = log(hits) * Get the different prize types to graph levelsof name, l(names) * Figure out the last time period in the data set quietly summarize ymddate local lastday = r(max) * Start constructing a local that contains all the line graphs to graph local lines * Start constructing a local that contains the text labels to add local textlabs * Loop through each one foreach n in `names' { * Add in the line graph code * by building on the local we already have (`lines') and adding a new twoway segment local lines `lines' (line loghits ymddate if name == \"`n'\") * Figure out the value this line hits on the last point on the graph quietly summ loghits if name == \"`n'\" & ymddate == `lastday' * The text command takes the y-value (from the mean we just took) * the x-value (the last day on the graph), * and the text label (the name we are working with) * Plus place(r) to put it to the RIGHT of that point local textlabs `textlabs' text(`r(mean)' `lastday' \"`n'\", place(r)) } * Finally, graph our lines * with the twoway lines we've specified, followed by the text labels * We're sure to remove the legend with legend(off) * and extend the x-axis so we can see the labels with xscale(range()) quietly summarize ymddate local start = r(min) local end = r(max) + 5 twoway `lines', `textlabs' legend(off) xscale(range(`start' `end')) xtitle(\"Date\") ytitle(\"Log of Google Trends Index\") . This results in: . ",
"url": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#stata",
"relUrl": "/Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html#stata"
- },"502": {
+ },"503": {
"doc": "Line Graphs",
"title": "Line Graphs",
"content": "A line graph is a visualization tool that shows how a value changes over time. A line graph can contain a single line or multiple lines in order to compare how multiple different values change over time. ",
"url": "/Presentation/Figures/line_graphs.html",
"relUrl": "/Presentation/Figures/line_graphs.html"
- },"503": {
+ },"504": {
"doc": "Line Graphs",
"title": "Keep in Mind",
"content": ". | Keep things simple. With line graphs, more is not always better. It’s important that line graphs are kept clean and concise so that they can be interpreted quickly and easily. Including too many lines or axis tick marks can make your graph messy and difficult to read. | The time variable should be on the x-axis for straightforward interpretation. | . ",
"url": "/Presentation/Figures/line_graphs.html#keep-in-mind",
"relUrl": "/Presentation/Figures/line_graphs.html#keep-in-mind"
- },"504": {
+ },"505": {
"doc": "Line Graphs",
"title": "Also Consider",
"content": ". | To enhance a basic line graph, see Styling Line Graphs and Line Graph with Labels at the Beginning or End of Lines. | . ",
"url": "/Presentation/Figures/line_graphs.html#also-consider",
"relUrl": "/Presentation/Figures/line_graphs.html#also-consider"
- },"505": {
+ },"506": {
"doc": "Line Graphs",
"title": "Implementations",
"content": " ",
"url": "/Presentation/Figures/line_graphs.html#implementations",
"relUrl": "/Presentation/Figures/line_graphs.html#implementations"
- },"506": {
+ },"507": {
"doc": "Line Graphs",
"title": "Python",
"content": "Here we will use seaborn.lineplot from the seaborn package, which builds on top of matplotlib. # Load packages import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # Load in data Orange = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Orange.csv') # Specify a line plot in Seaborn using # age and circumference on the x and y axis # and picking just Tree 1 from the data sns.lineplot(x = 'age', y = 'circumference', data = Orange.loc[Orange.Tree == 1]) # And title the axes plt.xlabel('Age (days since 12/31/1968)') plt.ylabel('Circumference') . The result is: . If we want to include all the trees on the graph, with color to distinguish them, we add a hue argument: . # Add on a hue axis to add objects of different color by tree # So we can graph all the trees sns.lineplot(x = 'age', y = 'circumference', hue = 'Tree', data = Orange) # And title the axes plt.xlabel('Age (days since 12/31/1968)') plt.ylabel('Circumference') . Which results in: . ",
"url": "/Presentation/Figures/line_graphs.html#python",
"relUrl": "/Presentation/Figures/line_graphs.html#python"
- },"507": {
+ },"508": {
"doc": "Line Graphs",
"title": "R",
"content": "Basic Line Graph in R . To make a line graph in R, we’ll be using a dataset that’s already built in to R, called ‘Orange’. This dataset tracks the growth in circumference of several trees as they age. library(dplyr) library(lubridate) library(ggplot2) #load in dataset data(Orange) . This dataset has measurements for four different trees. To start off, we’ll only be graphing the growth of Tree #1, so we first need to subset our data. #subset data to just tree #1 tree_1_df <- Orange %>% filter(Tree == 1) . Then we will construct our plot using ggplot(). We’ll create our line graph using the following steps: . | First, call ggplot() and specify the tree_1_df dataset. Next, we need to specify the aesthetics of our graph (what variables go where). Do so with the aes() function, setting x = age and y = circumference. | To make the actual line of the line graph, we will add the line geom_line() to our ggplot line using the + symbol. Using the + symbol allows us to add different lines of code to the same graph in order to create new elements within it. | Putting those steps together, we get the following code resulting in our first line graph: | . ggplot(tree_1_df, aes(x = age, y = circumference)) + geom_line() . This does show us how the tree grows over time, but it’s rather plain and lacks important identifying information like a title and units of measurements for the axes. In order to enhance our graph, we again use the + symbol to add additional elements like line color, titles etc. and to change things like axis labels and title/label position. | We can specify the color of our line within the geom_line() function. | The function labs() allows us to add a title and also change the labels for the axes | Using the function theme() allows us to manipulate the apperance of our labels through the element_text function | Let’s change the line color, add a title and center it, and also add more information to our axes labels. | . ggplot(tree_1_df, aes(x = age, y = circumference)) + geom_line(color = \"orange\") + labs(x = \"Age (days since 12/31/1968)\", y = \"Circumference (mm)\", title = \"Orange Tree Circumference Growth by Age\") + theme(plot.title = element_text(hjust = 0.5)) . Line Graph with Multiple Lines in R . A great way to employ line graphs is to compare the changes of different values over the same time period. For this instance, we’ll be looking at how the four trees differ in their growth over time. We will be employing the full Orange dataset for this graph. To add multiple lines using data from the same dataframe, simply add the color argument to the aes() function within our ggplot() line. Set the color argument to the identifying variable within your data set, here, that variable is Tree, so we will set color = Tree. ggplot(Orange, aes(x = age, y = circumference, color = Tree)) + geom_line() + labs(x = \"Age (days since 12/31/1968)\", y = \"Circumference (mm)\", title = \"Orange Tree Circumference Growth by Age\") + theme(plot.title = element_text(hjust = 0.5)) . The steps will get you started with creating graphs in R. For more information on styling your graphs, again, visit Styling Line Graphs and Line Graph with Labels at the Beginning or End of Lines. Another great resource for line graph styling tips is this blog post created by Jodie Burchell. ",
"url": "/Presentation/Figures/line_graphs.html#r",
"relUrl": "/Presentation/Figures/line_graphs.html#r"
- },"508": {
+ },"509": {
"doc": "Line Graphs",
"title": "Stata",
"content": "We can create a line graph in Stata using the twoway function with the line setting. * Load data on orange trees import delimited \"https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Orange.csv\", clear * Let's just graph the first tree using if Tree == 1 * We specify the y-axis variable circumference first followed by the x-axis variable age * We can add axis labels with xtitle and ytitle * And specify a color with lcolor (for line color) twoway line circumference age if tree == 1, xtitle(\"Age (days since 12/31/1968)\") ytitle(\"Circumference\") lcolor(red) . The result is: . We can also include all the trees on the same line graph: . * If we want all of our trees graphed on the same axis * We can specify each line separately using () * Use legend() so we know which line is which * Or label the lines directly using /Presentation/Figures/line_graph_with_labels_at_the_beginning_or_end.html twoway (line circumference age if tree == 1) (line circumference age if tree == 2) (line circumference age if tree == 3) (line circumference age if tree == 4) (line circumference age if tree == 5), xtitle(\"Age (days since 12/31/1968)\") ytitle(\"Circumference\") legend(lab(1 \"Tree 1\") lab(2 \"Tree 2\") lab(3 \"Tree 3\") lab(4 \"Tree 4\") lab(5 \"Tree 5\")) . The result is: . ",
"url": "/Presentation/Figures/line_graphs.html#stata",
"relUrl": "/Presentation/Figures/line_graphs.html#stata"
- },"509": {
+ },"510": {
"doc": "Linear Hypothesis Tests",
"title": "Linear Hypothesis Tests",
"content": "Most regression output will include the results of frequentist hypothesis tests comparing each coefficient to 0. However, in many cases, you may be interested in whether a linear sum of the coefficients is 0. For example, in the regression . \\[Outcome = \\beta_0 + \\beta_1\\times GoodThing + \\beta_2\\times BadThing\\] You may be interested to see if \\(GoodThing\\) and \\(BadThing\\) (both binary variables) cancel each other out. So you would want to do a test of \\(\\beta_1 - \\beta_2 = 0\\). Alternately, you may want to do a joint significance test of multiple linear hypotheses. For example, you may be interested in whether \\(\\beta_1\\) or \\(\\beta_2\\) are nonzero and so would want to jointly test the hypotheses \\(\\beta_1 = 0\\) and \\(\\beta_2=0\\) rather than doing them one at a time. Note the and here, since if either one or the other is rejected, we reject the null. ",
"url": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html",
"relUrl": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html"
- },"510": {
+ },"511": {
"doc": "Linear Hypothesis Tests",
"title": "Keep in Mind",
"content": ". | Be sure to carefully interpret the result. If you are doing a joint test, rejection means that at least one of your hypotheses can be rejected, not each of them. And you don’t necessarily know which ones can be rejected! | Generally, linear hypothesis tests are performed using F-statistics. However, there are alternate approaches such as likelihood tests or chi-squared tests. Be sure you know which on you’re getting. | Conceptually, what is going on with linear hypothesis tests is that they compare the model you’ve estimated against a more restrictive one that requires your restrictions (hypotheses) to be true. If the test you have in mind is too complex for the software to figure out on its own, you might be able to do it on your own by taking the sum of squared residuals in your original unrestricted model (\\(SSR_{UR}\\)), estimate the alternate model with the restriction in place (\\(SSR_R\\)) and then calculate the F-statistic for the joint test using \\(F_{q,n-k-1} = ((SSR_R - SSR_{UR})/q)/(SSR_{UR}/(n-k-1))\\). | . ",
"url": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#keep-in-mind",
"relUrl": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#keep-in-mind"
- },"511": {
+ },"512": {
"doc": "Linear Hypothesis Tests",
"title": "Also Consider",
"content": ". | The process for testing a nonlinear combination of your coefficients, for example testing if \\(\\beta_1\\times\\beta_2 = 1\\) or \\(\\sqrt{\\beta_1} = .5\\), is generally different. See Nonlinear hypothesis tests. | . ",
"url": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#also-consider",
"relUrl": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#also-consider"
- },"512": {
+ },"513": {
"doc": "Linear Hypothesis Tests",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#implementations",
"relUrl": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#implementations"
- },"513": {
+ },"514": {
"doc": "Linear Hypothesis Tests",
"title": "R",
"content": "Linear hypothesis test in R can be performed for most regression models using the linearHypothesis() function in the car package. See this guide for more information. # If necessary # install.packages('car') library(car) data(mtcars) # Run our model m1 <- lm(mpg ~ hp + disp + am + wt, data = mtcars) # Test a linear combination of coefficients linearHypothesis(m1, c('hp + disp = 0')) # Test joint significance of multiple coefficients linearHypothesis(m1, c('hp = 0','disp = 0')) # Test joint significance of multiple linear combinations linearHypothesis(m1, c('hp + disp = 0','am + wt = 0')) . ",
"url": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#r",
"relUrl": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#r"
- },"514": {
+ },"515": {
"doc": "Linear Hypothesis Tests",
"title": "Stata",
"content": "Tests of coefficients in Stata can generally be performed using the built-in test command. * Load data sysuse auto.dta reg mpg headroom trunk prince rep78 * Make sure to run tests while the previous regression is still in memory * Test joint significance of multiple coefficients test headroom trunk * testparm does the same thing but allows wildcards to select coefficients * this will test the joint significance of every variable with an e in it testparm *e* * Test a linear combination of the coefficients test headroom + trunk = 0 * Test multiple linear combinations by accumulating them one at a time test headroom + trunk = 0 test price + rep78 = 0, accumulate . ",
"url": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#stata",
"relUrl": "/Model_Estimation/Statistical_Inference/linear_hypothesis_tests.html#stata"
- },"515": {
+ },"516": {
"doc": "Linear Mixed-Effects Regression",
"title": "Linear Mixed-Effects Regression",
"content": "Mixed-effects regression goes by many names, including hierarchical linear model, random coefficient model, and random parameter models. In a mixed-effects regression, some of the parameters are “random effects” which are allowed to vary over the sample. Others are “fixed effects”, which are not. Note that this use of the term “fixed effects” is not the same as in fixed effects regression. For example, consider the model . \\[y_{ij} = \\beta_{0j} + \\beta_{1j}X_{1ij} + \\beta_{2}X_{2ij} + e_{ij}\\] The intercept \\(\\beta_{0j}\\) has a \\(j\\) subscript and is allowed to vary over the sample at the \\(j\\) level, where \\(j\\) may indicate individual or group, depending on context. The slope on \\(X_{1ij}\\), \\(\\beta_{1j}\\), is similarly allowed to vary over the sample. These are random effects. \\(\\beta_{2}\\) is not allowed to vary over the sample and so is fixed. The random parameters have their own “level-two” equations, which may or may not include level-two covariates. \\[\\beta_{0j} = \\gamma_{00} + \\gamma_{01}W_j + u_{0j}\\] \\[\\beta_{1j} = \\gamma_{10} + u_{1j}\\] For more information see Wikipedia. ",
"url": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html",
"relUrl": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html"
- },"516": {
+ },"517": {
"doc": "Linear Mixed-Effects Regression",
"title": "Keep in Mind",
"content": ". | The assumptions necessary to use a mixed-effects model in general are the same as for most linear models. However, in addition, mixed-effects models assume that the error terms at different levels are unrelated. | At the second level, statistical power depends on the number of different \\(j\\) values there are. Mixed-effects models may perform poorly if the coefficient is allowed to vary over only a few groups. | There’s no need to stop at two levels - the second-level coefficients can also be allowed to vary at a higher level. | . ",
"url": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#keep-in-mind",
"relUrl": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#keep-in-mind"
- },"517": {
+ },"518": {
"doc": "Linear Mixed-Effects Regression",
"title": "Also Consider",
"content": ". | There are many variations of mixed-effects models for working with non-linear data, see nonlinear mixed-effects models. | If the goal is making predictions within subgroups, you may want to consider multi-level regression with poststratification. | . ",
"url": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#also-consider",
"relUrl": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#also-consider"
- },"518": {
+ },"519": {
"doc": "Linear Mixed-Effects Regression",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#implementations",
"relUrl": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#implementations"
- },"519": {
+ },"520": {
"doc": "Linear Mixed-Effects Regression",
"title": "R",
"content": "One common way to fit mixed-effects models in R is with the lmer function in the lme4 package. To fit fully Bayesian models you may want to consider instead using STAN with the rstan package. See the multi-level regression with poststratification page for more information. # Install lme4 if necessary # install.packages('lme4') # Load up lme4 library(lme4) # Load up university instructor evaluations data from lme4 data(InstEval) # We'll be treating lecture age as a numeric variable InstEval$lectage <- as.numeric(InstEval$lectage) # Let's look at the relationship between lecture ratings andhow long ago the lecture took place # with a control for whether the lecture was a service lecture ols <- lm(y ~ lectage + service, data = InstEval) summary(ols) # Now we will use lmer to allow the intercept to vary at the department level me1 <- lmer(y ~ lectage + service + (1 | dept), data = InstEval) summary(me1) # Now we will allow the slope on lectage to vary at the department level me2 <- lmer(y ~ lectage + service + (-1 + lectage | dept), data = InstEval) summary(me2) # Now both the intercept and lectage slope will vary at the department level me3 <- lmer(y ~ lectage + service + (lectage | dept), data = InstEval) summary(me3) . ",
"url": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#r",
"relUrl": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#r"
- },"520": {
+ },"521": {
"doc": "Linear Mixed-Effects Regression",
"title": "Stata",
"content": "Stata has a family of functions based around the mixed command that can estimate mixed-effects models. * Load NLS-W data sysuse nlsw88.dta, clear * We are going to estimate the relationship between hourly wage and job tenure * with a contorl for marital status * Without mixed effects reg wage tenure married * Now we will allow the intercept to vary with occupation mixed wage tenure married || occupation: * Next we will allow the slope on tenure to vary with occupation mixed wage tenure married || occupation: tenure, nocons * Now, both! mixed wage tenure married || occupation: tenure * Finally we will allow the intercept and tenure slope to vary over both occupation * and age mixed wage tenure married || occupation: tenure || age: tenure . ",
"url": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#stata",
"relUrl": "/Model_Estimation/Multilevel_Models/linear_mixed_effects_regression.html#stata"
- },"521": {
+ },"522": {
"doc": "Logit Model",
"title": "Logit Regressions",
"content": "A logistical regression (Logit) is a statistical method for a best-fit line between a binary [0/1] outcome variable \\(Y\\) and any number of independent variables. Logit regressions follow a logistical distribution and the predicted probabilities are bounded between 0 and 1. For more information about Logit, see Wikipedia: Logit. ",
"url": "/Model_Estimation/GLS/logit_model.html#logit-regressions",
"relUrl": "/Model_Estimation/GLS/logit_model.html#logit-regressions"
- },"522": {
+ },"523": {
"doc": "Logit Model",
"title": "Keep in Mind",
"content": ". | The beta coefficients from a logit model are maximum likelihood estimations. They are not the marginal effect, as you would see in an OLS estimation. So you cannot interpret the beta coefficient as a marginal effect of \\(X\\) on \\(Y\\). | To obtain the marginal effect, you need to perform a post-estimation command to discover the marginal effect. In general, you can ‘eye-ball’ the marginal effect by dividing the logit beta coefficient by 4. | . ",
"url": "/Model_Estimation/GLS/logit_model.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/logit_model.html#keep-in-mind"
- },"523": {
+ },"524": {
"doc": "Logit Model",
"title": "Also Consider",
"content": ". | See Marginal Effects in Nonlinear Regression for more details on the different kinds of marginal effects. | . ",
"url": "/Model_Estimation/GLS/logit_model.html#also-consider",
"relUrl": "/Model_Estimation/GLS/logit_model.html#also-consider"
- },"524": {
+ },"525": {
"doc": "Logit Model",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/logit_model.html#implementations",
"relUrl": "/Model_Estimation/GLS/logit_model.html#implementations"
- },"525": {
+ },"526": {
"doc": "Logit Model",
"title": "Gretl",
"content": "# Load auto data open auto.gdt # Run logit using the auto data, with mpg as the outcome variable # and headroom, trunk, and weight as predictors logit mpg const headroom trunk weight . ",
"url": "/Model_Estimation/GLS/logit_model.html#gretl",
"relUrl": "/Model_Estimation/GLS/logit_model.html#gretl"
- },"526": {
+ },"527": {
"doc": "Logit Model",
"title": "Python",
"content": "There are a number of Python packages that can perform logit regressions but the most comprehensive is probably statsmodels. The code below is an example of how to use it. # Install pandas and statsmodels using pip or conda, if you don't already have them. import pandas as pd import statsmodels.formula.api as smf df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv', index_col=0) # Specify the model, regressing vs on mpg and cyl mod = smf.logit('vs ~ mpg + cyl', data=df) # Fit the model res = mod.fit() # Look at the results res.summary() # Compute marginal effects marg_effect = res.get_margeff(at='mean', method='dydx') # Show marginal effects marg_effect.summary() . ",
"url": "/Model_Estimation/GLS/logit_model.html#python",
"relUrl": "/Model_Estimation/GLS/logit_model.html#python"
- },"527": {
+ },"528": {
"doc": "Logit Model",
"title": "R",
"content": "R can run a logit regression using the glm() function. However, to get marginal effects you will need to calculate them by hand or use a package. We will use the mfx package, although the margins package is another good option, which produces tidy model output. # If necessary, install the mfx package # install.packages('mfx') # mfx is only needed for the marginal effect, not the regression itself library(mfx) # Load mtcars data data(mtcars) # Use the glm() function to run logit # Here we are predicting engine type using # miles per gallon and number of cylinders as predictors my_logit <- glm(vs ~ mpg + cyl, data = mtcars, family = binomial(link = 'logit')) # The family argument says we are working with binary data # and using a logit link function (rather than, say, probit) # The results summary(my_logit) # Marginal effects logitmfx(vs ~ mpg + cyl, data = mtcars) . ",
"url": "/Model_Estimation/GLS/logit_model.html#r",
"relUrl": "/Model_Estimation/GLS/logit_model.html#r"
- },"528": {
+ },"529": {
"doc": "Logit Model",
"title": "Stata",
"content": "* Load auto data sysuse auto.dta * Logit Estimation logit foreign mpg weight headroom trunk * Recover the Marginal Effects (Beta Coefficient in OLS) margins, dydx(*) . ",
"url": "/Model_Estimation/GLS/logit_model.html#stata",
"relUrl": "/Model_Estimation/GLS/logit_model.html#stata"
- },"529": {
+ },"530": {
"doc": "Logit Model",
"title": "Logit Model",
"content": " ",
"url": "/Model_Estimation/GLS/logit_model.html",
"relUrl": "/Model_Estimation/GLS/logit_model.html"
- },"530": {
+ },"531": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "Marginal effects plots for interactions with categorical variables",
"content": "In many contexts, the effect of one variable on another might be allowed to vary. For example, the relationship between income and mortality might be different between someone with no degree, a high school degree, or a college degree. A marginal effects plot for a categorical interaction displays the effect of $X$ on $Y$ on the y-axis for different values of a categorical variable $Z$ on the x-axis. The plot will often include confidence intervals as well. In some cases the categorical variable may be ordered, so you’d want the $Z$ values to show up in that order. ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html"
- },"531": {
+ },"532": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "Keep in Mind",
"content": ". | Some versions of these graphs normalize the effect of one of the categories to 0, and shows the effect for other values relative to that one. | . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#keep-in-mind",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#keep-in-mind"
- },"532": {
+ },"533": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "Also Consider",
"content": ". | Consider performing a power analysis of interaction terms before running your analysis to see whether you have the statistical power for your interactions | Marginal effects plots for interactions with continuous variables | . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#also-consider",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#also-consider"
- },"533": {
+ },"534": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "Implementations",
"content": "In each of these examples, we will be using data on organ donation rates by state from Kessler & Roth 2014. The example is of a 2x2 difference-in-difference model extended to estimate dynamic treatment effects, where treatment is interacted with the number of time periods until/since treatment goes into effect. All of these examples directly retrieve effect and confidence interval information from the regression by hand rather than relying on a package; packages for graphing interactions often focus on continuous interactions. The original code snippets for the Python, R, and Stata examples comes from the textbook The Effect. ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#implementations",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#implementations"
- },"534": {
+ },"535": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "Python",
"content": "# PYTHON CODE import pandas as pd import matplotlib.pyplot as plt import linearmodels as lm # Read in data od = pd.read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv') # Create Treatment Variable od['California'] = od['State'] == 'California' # PanelOLS requires a numeric time variable od['Qtr'] = 1 od.loc[od['Quarter'] == 'Q12011', 'Qtr'] = 2 od.loc[od['Quarter'] == 'Q22011', 'Qtr'] = 3 od.loc[od['Quarter'] == 'Q32011', 'Qtr'] = 4 od.loc[od['Quarter'] == 'Q42011', 'Qtr'] = 5 od.loc[od['Quarter'] == 'Q12012', 'Qtr'] = 6 # Create our interactions by hand, # skipping quarter 3, the last one before treatment for i in range(1, 7): name = f\"INX{i}\" od[name] = 1 * od['California'] od.loc[od['Qtr'] != i, name] = 0 # Set our individual and time (index) for our data od = od.set_index(['State','Qtr']) mod = lm.PanelOLS.from_formula('''Rate ~ INX1 + INX2 + INX4 + INX5 + INX6 + EntityEffects + TimeEffects''',od) # Specify clustering when we fit the model clfe = mod.fit(cov_type = 'clustered', cluster_entity = True) # Get coefficients and CIs res = pd.concat([clfe.params, clfe.std_errors], axis = 1) # Scale standard error to CI res['ci'] = res['std_error']*1.96 # Add our quarter values res['Qtr'] = [1, 2, 4, 5, 6] # And add our reference period back in reference = pd.DataFrame([[0,0,0,3]], columns = ['parameter', 'lower', 'upper', 'Qtr']) res = pd.concat([res, reference]) # For plotting, sort and add labels res = res.sort_values('Qtr') res['Quarter'] = ['Q42010','Q12011', 'Q22011','Q32011', 'Q42011','Q12012'] # Plot the estimates as connected lines with error bars plt.errorbar(x = 'Quarter', y = 'parameter', yerr = 'ci', data = res) # Add a horizontal line at 0 plt.axhline(0, linestyle = 'dashed') . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#python",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#python"
- },"535": {
+ },"536": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "R",
"content": "If you happen to be using the fixest package to run your model, there is actually a single convenient command coefplot that will make the graph for you. However, this requires your analysis to use some other tools from fixest too. So below I’ll show both the fixest approach as well as a more general approach (which also uses a fixest model but doesn’t need to). First, prepare the data: . library(tidyverse) library(fixest) library(broom) od <- read_csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv') # Treatment variable od <- od %>% mutate(Treated = State == 'California' & Quarter %in% c('Q32011','Q42011','Q12012')) %>% # Create an ordered version of Quarter so we can graph it # and make sure we drop the last pre-treatment interaction, # which is quarter 2 of 2011 mutate(Quarter = relevel(factor(Quarter), ref = 'Q22011')) %>% # The treated group is the state of California # The 1* is only necessary for the first fixest method below; optional for the second, more general method mutate(California = 1*(State == 'California')) . Next, our steps to do the fixest-specific method: . # in the *specific example* of fixest, there is a simple and easy method: od <- od %>% mutate(fQuarter = factor(Quarter, levels = c('Q42010','Q12011','Q22011', 'Q32011','Q42011','Q12012'))) femethod <- feols(Rate ~ i(California, fQuarter, drop = 'Q22011') | State + Quarter, data = od) coefplot(femethod, ref = c('Q22011' = 3), pt.join = TRUE) . However, for other packages this may not work, so I will also do it by hand in a way that will work with models more generally (even though we’ll still run the model in fixest): . # Interact quarter with being in the treated group clfe <- feols(Rate ~ California*Quarter | State, data = od) coefplot(clfe, ref = 'Q22011') # Use broom::tidy to get the coefficients and SEs res <- tidy(clfe) %>% # Keep only the interactions filter(str_detect(term, ':')) %>% # Pull the quarter out of the term mutate(Quarter = str_sub(term, -6)) %>% # Add in the term we dropped as 0 add_row(estimate = 0, std.error = 0, Quarter = 'Q22011') %>% # and add 95% confidence intervals mutate(ci_bottom = estimate - 1.96*std.error, ci_top = estimate + 1.96*std.error) %>% # And put the quarters in order mutate(Quarter = factor(Quarter, levels = c('Q42010','Q12011','Q22011', 'Q32011','Q42011','Q12012'))) # And graph # \"group = 1\" is necessary to get ggplot to add the line graph # when the x-axis is a factor ggplot(res, aes(x = Quarter, y = estimate, group = 1)) + # Add points for each estimate and connect them geom_point() + geom_line() + # Add confidence intervals geom_linerange(aes(ymin = ci_bottom, ymax = ci_top)) + # Add a line so we know where 0 is geom_hline(aes(yintercept = 0), linetype = 'dashed') + # Always label! labs(caption = '95% Confidence Intervals Shown') . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#r",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#r"
- },"536": {
+ },"537": {
"doc": "Marginal effects plots for interactions with categorical variables",
"title": "Stata",
"content": "* For running the model: * ssc install reghdfe import delimited using https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv, clear * Create value-labeled version of quarter * So we can easily graph it g Qtr = 1 replace Qtr = 2 if quarter == \"Q12011\" replace Qtr = 3 if quarter == \"Q22011\" replace Qtr = 4 if quarter == \"Q32011\" replace Qtr = 5 if quarter == \"Q42011\" replace Qtr = 6 if quarter == \"Q12012\" label def quarters 1 \"Q42010\" 2 \"Q12011\" 3 \"Q22011\" 4 \"Q32011\" 5 \"Q42011\" 6 \"Q12012\" label values Qtr quarters * Interact being in the treated group * with Qtr, using ib3 to drop the third * quarter (the last one before treatment) g California = state == \"California\" reghdfe rate California##ib3.Qtr, a(state Qtr) vce(cluster state) * Pull out the coefficients and SEs g coef = . g se = . forvalues i = 1(1)6 { replace coef = _b[1.California#`i'.Qtr] if Qtr == `i' replace se = _se[1.California#`i'.Qtr] if Qtr == `i' } * Make confidence intervals g ci_top = coef+1.96*se g ci_bottom = coef - 1.96*se * Limit ourselves to one observation per quarter keep Qtr coef se ci_* duplicates drop * Create connected scatterplot of coefficients * with CIs included with rcap * and a line at 0 from function twoway (sc coef Qtr, connect(line)) (rcap ci_top ci_bottom Qtr) (function y = 0, range(1 6)), xtitle(\"Quarter\") caption(\"95% Confidence Intervals Shown\") . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#stata",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html#stata"
- },"537": {
+ },"538": {
"doc": "Marginal Effects Plots for Interactions with Continuous Variables",
"title": "Marginal Effects Plots for Interactions with Continuous Variables",
"content": "In many contexts, the effect of one variable on another might be allowed to vary. For example, the relationship between income and mortality is nonlinear, so the effect of an additional dollar of income on mortality is different for someone earning $20,000/year than for someone earning $100,000/year. Or maybe the relationship between income and mortality differs depending on how many years of education you have. A marginal effects plot displays the effect of \\(X\\) on \\(Y\\) for different values of \\(Z\\) (or \\(X\\)). The plot will often include confidence intervals as well. The same code will often work if there’s not an explicit interaction, but you are, for example, estimating a logit model where the effect of one variable changes with the values of the others. ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html"
- },"538": {
+ },"539": {
"doc": "Marginal Effects Plots for Interactions with Continuous Variables",
"title": "Keep in Mind",
"content": ". | Interactions often have poor statistical power, and you will generally need a lot of observations to tell if the effect of $X$ on \\(Y\\) is different for two given different values of \\(Z\\). | Make sure your graph has clearly labeled axes, so readers can tell whether your y-axis is the predicted value of $Y$ or the marginal effect of \\(X\\) on \\(Y\\). | . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#keep-in-mind",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#keep-in-mind"
- },"539": {
+ },"540": {
"doc": "Marginal Effects Plots for Interactions with Continuous Variables",
"title": "Also Consider",
"content": ". | Consider performing a power analysis of interaction terms before running your analysis to see whether you have the statistical power for your interactions | Average marginal effects or marginal effects at the mean can be used to get a single marginal effect averaged over your sample, rather than showing how it varies across the sample. | Marginal effects plots for interactions with categorical variables | . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#also-consider",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#also-consider"
- },"540": {
+ },"541": {
"doc": "Marginal Effects Plots for Interactions with Continuous Variables",
"title": "Implementations",
"content": " ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#implementations",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#implementations"
- },"541": {
+ },"542": {
"doc": "Marginal Effects Plots for Interactions with Continuous Variables",
"title": "R",
"content": "The interplot package can plot the marginal effect of a variable \\(X\\) (y-axis) against different values of some variable. If instead you want the predicted values of \\(Y\\) on the y-axis, look at the ggeffects package. # Install relevant packages, if necessary: # install.packages(c('ggplot2', 'interplot')) # Load in ggplot2 and interplot library(ggplot2) library(interplot) # Load in the txhousing data data(txhousing) # Estimate a regression with a nonlinear term cubic_model <- lm(sales ~ listings + I(listings^2) + I(listings^3), data = txhousing) # Get the marginal effect of var1 (listings) # at different values of var2 (listings), with confidence ribbon. # This will return a ggplot object, so you can # customize using ggplot elements like labs(). interplot(cubic_model, var1 = \"listings\", var2 = \"listings\")+ labs(x = \"Number of Listings\", y = \"Marginal Effect of Listings\") # Try setting adding listings*date to the regression model # and then in interplot set var2 = \"date\" to get the effect of listings at different values of date . This results in: . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#r",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#r"
- },"542": {
+ },"543": {
"doc": "Marginal Effects Plots for Interactions with Continuous Variables",
"title": "Stata",
"content": "We will use the marginsplot command, which requires Stata 12 or higher. * Load in the National Longitudinal Survey of Youth - Women sample sysuse nlsw88.dta * Perform a regression with a nonlinear term regress wage c.tenure##c.tenure * Use margins to calculate the marginal effects * Put the variable we're interested in getting the effect of in dydx() * And the values we want to evaluate it at in at() margins, dydx(tenure) at(tenure = (0(1)26)) * (If we had interacted with another variable, say age, we would specify similarly, * with at(age = (start(count-by)end))) * Then, marginsplot * The recast() and recastci() options make the effect/CI show up as a line/area * Remove to get points/lines instead. marginsplot, xtitle(\"Tenure\") ytitle(\"Marginal Effect of Tenure\") recast(line) recastci(rarea) . This results in: . ",
"url": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#stata",
"relUrl": "/Presentation/Figures/marginal_effects_plots_for_interactions_with_continuous_variables.html#stata"
- },"543": {
+ },"544": {
"doc": "Matching",
"title": "Matching",
"content": " ",
"url": "/Model_Estimation/Matching/matching.html",
"relUrl": "/Model_Estimation/Matching/matching.html"
- },"544": {
+ },"545": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "McFadden’s Choice Model (Alternative-Specific Conditional Logit)",
"content": "Discrete choice models are a regression method used to predict a categorical dependent variable with more than two categories. For example, a discrete choice model might be used to predict whether someone is going to take a train, car, or bus to work. McFadden’s Choice Model is a discrete choice model that uses conditional logit, in which the variables that predict choice can vary either at the individual level (perhaps tall people are more likely to take the bus), or at the alternative level (perhaps the train is cheaper than the bus). For more information, see Wikipedia: Discrete Choice . ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html#mcfaddens-choice-model-alternative-specific-conditional-logit",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html#mcfaddens-choice-model-alternative-specific-conditional-logit"
- },"545": {
+ },"546": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "Keep in Mind",
"content": ". | Just like other regression methods, the McFadden model does not guarantee that the estimates will be causal. Similarly, while the McFadden model is designed so that the results can be interpreted in terms of a “random utility” function, making inferences about utility functions does require additional assumptions. | The standard McFadden model assumes that the choice follows the Independence of Irrelevant Alternatives, which may be a strong assumption. There are variants of the McFadden model that relax this assumption. | If you are working with an estimation command that only allows alternative-specific predictors and not case-specific predictors, you can add them yourself by interacting the case-specific predictors with binary variables for the different alternatives. If \\(Income\\) is your case-specific variable and your alternatives are “train”, “bus”, and “car”, you’d add \\(Income \\times (mode == \"train\")\\), \\(Income \\times (mode == \"bus\")\\), and \\(Income \\times (mode == \"car\")\\) to your model. These are your case-specific predictors. | Choice model regressions often have specific demands on how your data is structured. These vary across estimation commands and software packages. However, a common one is this (others will be pointed out in specific Implementations below): The data must contain a variable indicating the choice cases (i.e. you choose a car, that’s one case, then I choose a car, that’s a different case), a variable with the alternatives being chosen between, a binary variable equal to 1 for the alternative actually chosen (this should be 1 or TRUE exactly once within each choice case), and then variables that are case-specific or alternative-specific. | . In the below table, \\(I\\) gives the choice case, \\(Alts\\) gives the options, \\(Chose\\) gives the choice, \\(X\\) is a variable that varies at the alternative level, and \\(Y\\) is a variable that varies at the case level. | I | Alts | Chose | X | Y | . | 1 | A | 1 | 10 | 3 | . | 1 | B | 0 | 20 | 3 | . | 1 | C | 0 | 10.5 | 3 | . | 2 | A | 0 | 8 | 5 | . | 2 | B | 1 | 9 | 5 | . | 3 | C | 0 | 1 | 5 | . This might be referred to as “long” choice data. “Wide” choice data is also common, and looks like: . | I | Chose | Y | XA | XB | XC | . | 1 | A | 3 | 10 | 20 | 10.5 | . | 2 | B | 5 | 8 | 9 | 1 | . ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html#keep-in-mind"
- },"546": {
+ },"547": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "Also Consider",
"content": ". | In order to relax the independence of irrelevant alternatives assumption and/or more closely model individual preferences, consider the mixed logit, nested logit or hierarchical Bayes conditional logit models. | . ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html#also-consider",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html#also-consider"
- },"547": {
+ },"548": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html#implementations",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html#implementations"
- },"548": {
+ },"549": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "R",
"content": "We will implement McFadden’s choice model in R using the mlogit package, which can accept “wide” or “long” data in the mlogit.data function. library(mlogit) # Get Car data, in \"wide\" choice format data(Car) # If we look at the data, the choice-specific variables are named # e.g. \"speed1\" \"speed2\" \"speed3\" and so on. # So we need our choice variable to be 1, 2, 3 ,to match # Right now instead it's choice1, choice2, choice3. So we edit. Car$choice <- substr(Car$choice, 7, 7) # For this we need to specify the choice variable with choice # whether it's currently in wide or long format with shape # the column numbers of the alternative-specific variables with varying. # We need alt.levels to tell us what our alternatives are (1-6, as seen in choice). # We also need sep = \"\" since our wide-format variable names are type1, type2, etc. # If the variable names were type_1, type_2, etc., we'd need sep = \"_\". # If this were long data we'd also want: # the case identifier with id.var (for individuals) and/or chid.var # (for multiple choices within individuals) # And a variable indicating the alternatives with alt.var # But could skip the alt.levels and sep arguments mlogit.Car <- mlogit.data(Car, choice = 'choice', shape = 'wide', varying = 5:70, sep=\"\") # mlogit.Car is now in \"long\" format # Note that if we did start with \"long\" format we could probably skip the mlogit.data() step. # Now we can run the regression with mlogit(). # We \"regress\" the choice on the alternative-specific variables like type, fuel, and price # Then put a pipe separator | # and add our case-specific variables like college model <- mlogit(choice ~ type + fuel + price | college, data = mlogit.Car) # Look at the results summary(model) . ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html#r",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html#r"
- },"549": {
+ },"550": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "Stata",
"content": "Stata has the McFadden model built in. We will estimate the model using the older asclogit command as well as the cmclogit command that comes with Stata 16. These commands require “long” choice data, as described in the Keep in Mind section. * Load in car choice data webuse carchoice * To use asclogit, we \"regress\" our choice variable (purchase) * on any alternative-specific variables (dealers) * then we put our case ID variable consumerid in case() * and our variable specifying alternatives, car, in alternatives() * then finally we put any case-specific variables like gender and income, in casevars() asclogit purchase dealers, case(consumerid) alternatives(car) casevars(gender income) * To use cmclogit, we first declare our data to be choice data with cmset * specifying our case ID variable and then the set of alternatives cmset consumerid car * Now that Stata knows the structure, we can omit those parts from the asclogit * specification, but the rest stays the same! cmclogit purchase dealers, casevars(gender income) . Why bother with the cmclogit version? cmset gives you a lot more information about your data, and makes it easy to transition between different choice model types, including those incorporating panel data (each person makes multiple choices). ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html#stata",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html#stata"
- },"550": {
+ },"551": {
"doc": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"title": "McFadden's Choice Model (Alternative-Specific Conditional Logit)",
"content": " ",
"url": "/Model_Estimation/GLS/mcfaddens_choice_model.html",
"relUrl": "/Model_Estimation/GLS/mcfaddens_choice_model.html"
- },"551": {
+ },"552": {
"doc": "Merging Shape Files",
"title": "Merging Shape Files",
"content": "When we work with spatial anaylsis, it is quite often we need to deal with data in different format and at different scales. For example, I have nc data with global pm2.5 estimation with \\(0.01\\times 0.01\\) resolution. But I want to see the pm2.5 estimation in municipal level. I need to integrate my nc file into my municipality shp file so that I can group by the data into municipal level and calculate the mean. Then, I can make a map of it. In this page, I will use Brazil’s pm2.5 estimation and its shp file in municipal level. ",
"url": "/Geo-Spatial/merging_shape_files.html",
"relUrl": "/Geo-Spatial/merging_shape_files.html"
- },"552": {
+ },"553": {
"doc": "Merging Shape Files",
"title": "Keep in Mind",
"content": ". | It doesn’t have to be nc file to map into the shp file, any format that can read in and convert to a sf object works. But the data has to have geometry coordinates(longitude and latitude). | . ",
"url": "/Geo-Spatial/merging_shape_files.html#keep-in-mind",
"relUrl": "/Geo-Spatial/merging_shape_files.html#keep-in-mind"
- },"553": {
+ },"554": {
"doc": "Merging Shape Files",
"title": "Implementations",
"content": " ",
"url": "/Geo-Spatial/merging_shape_files.html#implementations",
"relUrl": "/Geo-Spatial/merging_shape_files.html#implementations"
- },"554": {
+ },"555": {
"doc": "Merging Shape Files",
"title": "R",
"content": "Unusually for LOST, the example data files cannot be accessed from the code directly. Please visit this page and download both files to your working directory before running this code. It is also strongly recommended that you find a high-powered computer or cloud service before attempting to run this code, as it requires a lot of memory. # If necesary # install.packages(c('ncdf4','sp','raster','dplyr','sf','ggplot2','reprex','ggsn')) # Load packages library(ncdf4) library(sp) library(raster) library(dplyr) library(sf) library(ggplot2) library(reprex) ### Step 1: Read in nc file as a dataframe* pm2010 = nc_open(\"https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/GlobalGWRwUni_PM25_GL_201001_201012-RH35_Median_NoDust_NoSalt.nc?raw=true\") nc.brick = brick(\"https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/GlobalGWRwUni_PM25_GL_201001_201012-RH35_Median_NoDust_NoSalt.nc?raw=true\") # Check the dimensions dim(nc.brick) # Turn into a data frame for use nc.df = as.data.frame(nc.brick[[1]], xy = T) head(nc.df) ### Step 2: Filter out a specific country. # Global data is very big. I am going to focus only on Brazil. nc.brazil = nc.df %>% filter(x >= -73.59 & x <= 34.47 & y >= -33.45 & y <= 5.16) rm(nc.df) head(nc.brazil) ### Step 3: Change the dataframe to a sf object using the st_as_sf function pm25_sf = st_as_sf(nc.brazil, coords = c(\"x\", \"y\"), crs = 4326, agr = \"constant\") rm(nc.brazil) head(pm25_sf) ### Step 4: Read in the Brazil shp file. we plan to merge to Brazil_map_2010 = st_read(\"https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Geo-Spatial/Data/Merging_Shape_Files/geo2_br2010.shp?raw=true\") head(Brazil_map_2010) ### Step 5: Intersect pm25 sf object with the shp file.* # Now let's use a sample from pm25 data and intersect it with the shp file. Since the sf object is huge, I recommend running the analysis on a cloud server pm25_sample = sample_n(pm25_sf, 1000, replace = FALSE) # Now look for the intersection between the pollution data and the Brazil map to merge them pm25_municipal_2010 = st_intersection(pm25_sample, Brazil_map_2010) head(pm25_municipal_2010) ### Step 6: Make a map using ggplot pm25_municipal_2010 = pm25_municipal_2010 %>% select(1,6) pm25_municipal_2010 = st_drop_geometry(pm25_municipal_2010) Brazil_pm25_2010 = left_join(Brazil_map_2010, pm25_municipal_2010) ggplot(Brazil_pm25_2010) + # geom_sf creates the map we need geom_sf(aes(fill = -layer), alpha=0.8, lwd = 0, col=\"white\") + # and we fill with the pollution concentration data scale_fill_viridis_c(option = \"viridis\", name = \"PM25\") + ggtitle(\"PM25 in municipals of Brazil\")+ ggsn::blank() . ",
"url": "/Geo-Spatial/merging_shape_files.html#r",
"relUrl": "/Geo-Spatial/merging_shape_files.html#r"
- },"555": {
+ },"556": {
"doc": "Mixed Logit Model",
"title": "Keep in Mind",
"content": ". | The mixed logit model estimates a distribution. Parameters are then generated from that distribution via a simulation with a specified number of draws. | The estimates from a mixed logit model cannot simply be interpreted as marginal effects, as they are maximum likelihood estimations. Further, the variation at the individual level means estimated effects are relative to the individual. | The estimation of mixed logit models is very difficult and there are quite a few details and different approaches. So you can’t really assume that one package will produce the same results as another. Read the documentation of the command you’re using so you at least know what paper produced the estimation method! | . ",
"url": "/Model_Estimation/Multilevel_Models/mixed_logit.html#keep-in-mind",
"relUrl": "/Model_Estimation/Multilevel_Models/mixed_logit.html#keep-in-mind"
- },"556": {
+ },"557": {
"doc": "Mixed Logit Model",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Multilevel_Models/mixed_logit.html#implementations",
"relUrl": "/Model_Estimation/Multilevel_Models/mixed_logit.html#implementations"
- },"557": {
+ },"558": {
"doc": "Mixed Logit Model",
"title": "R",
"content": "To estimate a mixed logit model in R, we will first transform the data using the dfidx package. Then we will use the mlogit package to carry out the estimation. # Install mlogit which also includes the Electricity dataset for the example. # The package dfidx will be used to transform our data # install.packages(\"mlogit\", \"dfidx\") library(mlogit) library(dfidx) # Load the Electricity dataset data(\"Electricity\", package = \"mlogit\") # First, we need to coerce the data to a dfidx object # This allows for a panel with multiple indices # For further documentation, see dfidx. Electricity$index <- 1:nrow(Electricity) elec = dfidx(Electricity, idx = list(c(\"index\", \"id\")), choice = \"choice\", varying = 3:26, sep = \"\") # We then estimate individual choice over electricity providers for # different cost and contract structures with a suppressed intercept my_mixed_logit = mlogit(data = elec, formula = choice ~ 0 + pf + cl + loc + wk + tod + seas, # Specify distributions for random parameter estimates # \"n\" indicates we have specified a normal distribution # note pf is omitted from rpar, so it will not be estimated as random rpar = c(cl = \"n\", loc = \"u\", wk = \"n\", tod = \"n\", seas = \"n\"), # R is the number of simulation draws R = 100, # For simplicity, we won't include correlated parameter estimates correlation = FALSE, # This data is from a panel panel = TRUE) # Results summary(my_mixed_logit) # Note that this output will include the simulated coefficient estimates, # simulated standard error estimates, and distributional details for the # random coefficients (all, in this case) # Note also that pf is given as a point estimate, and mlogit does not generate # a distribution for it as it does the others # You can extract and summarize coefficient estimates using the rpar function marg_loc = rpar(my_mixed_logit, \"loc\") summary(marg_loc) # You can also normalize coefficients and distributions by, say, price cl_by_pf = rpar(my_mixed_logit, \"cl\", norm = \"pf\") summary(cl_by_pf) . For further examples, visit the CRAN vignette here. For a very detailed example using the Electricity dataset, see here. ",
"url": "/Model_Estimation/Multilevel_Models/mixed_logit.html#r",
"relUrl": "/Model_Estimation/Multilevel_Models/mixed_logit.html#r"
- },"558": {
+ },"559": {
"doc": "Mixed Logit Model",
"title": "Stata",
"content": "As of Stata 17, there is the base-Stata xtmlogit command which is probably preferable to mixlogit. However, many people do not have Stata 17, so this example uses mixlogit, which requires installation from ssc install mixlogit. For more information on xtmlogit, see this page. mixlogit requires data of the form (although not necessarily with the variable names): . | choice | X | group | id | . | 1 | 10 | 1 | 1 | . | 0 | 12 | 1 | 1 | . | 0 | 11 | 2 | 1 | . | 1 | 14 | 2 | 1 | . | 1 | 9 | 3 | 2 | . | 0 | 11 | 3 | 2 | . where choice is the dependent variable and is binary, indicating which of the options was chosen. X is (one of) the predictors, group is an identifying variable for the different choice occasions, and id is a vector of individual-decision-maker identifiers, if this is panel data where the same decision-making makes multiple decisions. * If necessary: * ssc install mixlogit import delimited \"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Multilevel_Models/Data/Electricity.csv\", clear * Reshape data into \"long\" format like we need for mixlogit g decision_id = _n reshape long pf cl loc wk tod seas, i(choice id decision_id) j(option) * Remember, the dependent variable should be binary, indicating that this option * was chosen g chosen = choice == option * Let's fix the parameters on all the predictors * except for pf, which we'll allow to vary * (this is for speed in the example) mixlogit chose cl loc wk tod seas, /// group(decision_id) /// each individual choice is identified by decision_id id(id) /// each person is identified by id rand(pf) * Options to consider: * corr allows multiple parameter distributions to be correlated * ln() allows some of the parameter distributions to be log-normal * We can get individual parameter estimates with mixlbeta, which will * save the estimates to file mixlbeta pf, saving(pf_coefs.dta) . ",
"url": "/Model_Estimation/Multilevel_Models/mixed_logit.html#stata",
"relUrl": "/Model_Estimation/Multilevel_Models/mixed_logit.html#stata"
- },"559": {
+ },"560": {
"doc": "Mixed Logit Model",
"title": "Mixed Logit Model",
"content": "A mixed logit model (sometimes referred to as a random parameters logit model) estimates distributional parameters that allow for individual-level heterogeneity in tastes that are not compatible with a traditional logit framework. Mixed logit models can also provide for additional flexibility as it pertains to correlated random parameters and can be used with panel data. For more information about mixed logit models, see Wikipedia: Mixed Logit. ",
"url": "/Model_Estimation/Multilevel_Models/mixed_logit.html",
"relUrl": "/Model_Estimation/Multilevel_Models/mixed_logit.html"
- },"560": {
+ },"561": {
"doc": "Nested Logit Model",
"title": "Keep in Mind",
"content": ". | Returned beta coefficients are not the marginal effects normally returned from an OLS regression. They are maximum likelihood estimations. A beta coefficient can not be interpreted as “a unit increase in $X$ leads to a $\\beta$ unit change in the probability of $Y$.” . | The marginal effect can be obtained by performing a transformation after you estimate. A rough estimation technique is to divide the beta coefficient by 4. | Another transformation that may be helpful is the odds ratio. This value is found by raising $e$ to the power of the beta coefficient. $e^\\beta$ can be interpreted as : the percentage change in likelihood of $Y$, given a unit change in $X$. | . ",
"url": "/Model_Estimation/GLS/nested_logit.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/nested_logit.html#keep-in-mind"
- },"561": {
+ },"562": {
"doc": "Nested Logit Model",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/nested_logit.html#implementations",
"relUrl": "/Model_Estimation/GLS/nested_logit.html#implementations"
- },"562": {
+ },"563": {
"doc": "Nested Logit Model",
"title": "R",
"content": "R has multiple packages that can estimate a nested logit model. To show a simple example, we will use the mlogit package. # Install mlogit and AER packages and load them. Latter is just for a dataset we'll be using. # install.packages(\"mlogit\", \"AER\") library(\"mlogit\", \"AER\") # Load dataset TravelMode data(\"TravelMode\", package = \"AER\") # Use the mlogit() function to run a nested logit estimation # Here, we will predict what mode of travel individuals # choose using cost and wait times nestedlogit = mlogit( choice ~ gcost + wait, data = TravelMode, ##The variable from which our nests are determined alt.var = 'mode', #The variable that dictates the binary choice choice = 'choice', #List of nests as named vectors nests = list(Fast = c('air','train'), Slow = c('car','bus')) ) # The results summary(nestedlogit) # In this case, air travel is treated as the base level. # others maximum likelihood estimators relative # to air are reported as separate intercepts # The elasticities for each cluster are displayed # as iv:Fast and iv:Slow . Another set of more robust examples comes from Kenneth Train and Yves Croissant . ",
"url": "/Model_Estimation/GLS/nested_logit.html#r",
"relUrl": "/Model_Estimation/GLS/nested_logit.html#r"
- },"563": {
+ },"564": {
"doc": "Nested Logit Model",
"title": "Nested Logit Model",
"content": "A nested logistical regression (nested logit, for short) is a statistical method for finding a best-fit line when the the outcome variable $Y$ is a binary variable, taking values of 0 or 1. Logit regressions, in general, follow a logistical distribution and restrict predicted probabilities between 0 and 1. Traditional logit models require that the Independence of Irrelevant Alternatives(IIA) property holds for all possible outcomes of some process. Nested logit models differ by allowing ‘nests’ of outcomes that satisfy IIA, but not requiring that all outcomes jointly satisfy IIA. For an example of violating the IIA property, see Red Bus/Blue Bus Paradox. For a more thorough theoretical treatment, see SAS Documentation: Nested Logit. ",
"url": "/Model_Estimation/GLS/nested_logit.html",
"relUrl": "/Model_Estimation/GLS/nested_logit.html"
- },"564": {
+ },"565": {
"doc": "Nonlinear Hypothesis Tests",
"title": "Nonlinear Significance Tests",
"content": "Most regression output, or output from other methods that produce multiple coefficients, will include the results of frequentist hypothesis tests comparing each coefficient to 0. However, in many cases, you may be interested in a hypothesis test of a null restriction that involves a nonlinear combination of the coefficients, or producing an estimate and sampling distriubtion for that nonlinear combination. For example, in the model . \\[Y = \\beta_0 + \\beta_1X + \\beta_2Z + \\varepsilon\\] You may be interested in the ratio of the two effects, \\(\\beta_1/\\beta_2\\), and would want an estimate of that combination, along with a standard error, and a hypothesis test comparing that estimate to 0 or some other value. Estimates and tests of nonlinear combinations of coefficients are different than for linear combinations, because they imply restrictions on estimation that cannot be expressed in the form of a matrix of linear restrictions. The most common approach to producing a sampling distribution for a nonlinear combination of coefficients is the delta method and that is what all the commands on this page use. ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#nonlinear-significance-tests",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#nonlinear-significance-tests"
- },"565": {
+ },"566": {
"doc": "Nonlinear Hypothesis Tests",
"title": "Keep in Mind",
"content": ". | Depending on your goal, you may be able to avoid doing a test of nonlinear combinations of coefficients by converting the combination into a linear one. For example, if you do not want to estimate \\(\\beta_1/\\beta_2\\) itself, but instead are only interested in testing the null hypothesis \\(\\beta_1/\\beta_2 = 1\\), this null hypothesis can be manipulated to instead be \\(\\beta_1 = \\beta_2\\) or \\(\\beta_1 - \\beta_2 = 0\\), either of which can be evaluated as a hypothesis test on a linear combination of coefficients. | . ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#keep-in-mind",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#keep-in-mind"
- },"566": {
+ },"567": {
"doc": "Nonlinear Hypothesis Tests",
"title": "Also Consider",
"content": ". | Linear Hypothesis Tests. | . ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#also-consider",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#also-consider"
- },"567": {
+ },"568": {
"doc": "Nonlinear Hypothesis Tests",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#implementations",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#implementations"
- },"568": {
+ },"569": {
"doc": "Nonlinear Hypothesis Tests",
"title": "R",
"content": "In R, the marginaleffects package contains a number of useful functions for postestimation, including nonlinear combinations of coefficients via the deltamethod() function. It is used here with lm(), but is also compatible with regression output from many other packages and functions. library(marginaleffects) data(mtcars) # Run the model m = lm(mpg ~ hp + wt, data = mtcars) # Specify the combination of coefficients in the form of a null-hypothesis equation deltamethod(m, 'hp/wt = 1') # This produces an estimate, standard error, p-value, and confidence interval . ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#r",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#r"
- },"569": {
+ },"570": {
"doc": "Nonlinear Hypothesis Tests",
"title": "Stata",
"content": "Stata has the nlcom postestimation command for producing estimates and standard errors for nonlinear tests of coefficients. It will also produce the results of hypothesis tests comparing the combination to 0, so to compare to other values, subtract the desired value from the combination. * Load auto data sysuse https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Data/auto.dta regress mpg trunk weight nlcom _b[trunk]/_b[weight] - 1 . ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#stata",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html#stata"
- },"570": {
+ },"571": {
"doc": "Nonlinear Hypothesis Tests",
"title": "Nonlinear Hypothesis Tests",
"content": " ",
"url": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html",
"relUrl": "/Model_Estimation/Statistical_Inference/nonlinear_hypothesis_tests.html"
- },"571": {
+ },"572": {
"doc": "Nonstandard Errors",
"title": "Nonstandard errors",
"content": " ",
"url": "/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.html#nonstandard-errors",
"relUrl": "/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.html#nonstandard-errors"
- },"572": {
+ },"573": {
"doc": "Nonstandard Errors",
"title": "Nonstandard Errors",
"content": " ",
"url": "/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.html",
"relUrl": "/Model_Estimation/Statistical_Inference/Nonstandard_Errors/nonstandard_errors.html"
- },"573": {
+ },"574": {
"doc": "Ordered Probit/Logit",
"title": "Ordered Probit / Ordered Logit",
"content": "Ordered probit and ordered logit are regression methods intended for use when the dependent variable is ordinal. That is, there is a natural ordering to the different (discrete) values, but no cardinal value. So we might know \\(A > B\\) but not by how much \\(A\\) is greater than \\(B\\). Examples of ordinal data include responses on a Likert scale (“Very satisfied” is more satisfied than “Satisfied”, and “Satisfied” is more satisfied than “Not Satisfied”, but the difference between “Very satisfied” and “Satisfied” may not be the same as the difference between “Satisfied” and “Not Satisfied” but we may not know by how much) or education levels (a Master’s degree is more education than a Bachelor’s degree, but how much more?). When the dependent variable is ordinal, typical linear regression may not work well because it relies on absolute differences in value. Ordered probit and ordered logit take a latent-variable approach to this problem. They assume that the discrete dependent variable simply represents a continuous latent variable. In the Likert scale example this might be “satisfied-ness”. In ordered probit this latent variable is normally distributed, and in ordered logit it is distributed according to a logistic distribution. Then, the actual values just carve up the regions of that latent variable. So if satisfied-ness is distributed \\(S\\sim N(0,1)\\), then perhaps “very satisfied” is \\(S > .892\\), “satisfied” is \\(.321 < S \\leq .892\\), and so on. The .321 and .892 are “cutoff values” separating the categories. These cutoff values are estimated by ordered probit and ordered logit. These models assume that predictors affect the latent variable the same no matter which level you’re at. There isn’t a predictor that, for example, makes you more likely to be “satified” and less likely to be either “very satisfied” or “not satisfied” (or a predictor that has a slight positive effect going from “not satisfied” to “satisfied” but a huge effect going from “satisfied” to “very sastisfied”). You can imagine taking your ordinal variable and collapsing it into a binary one: comparing, say, “very not satisfied” and “not satisfied” as one group vs. “satisfied” and “very satisfied” as the other in a typical probit or logit. Ordered logit/probit assumes that this will give the same results as if you’d split somewhere else, comparing “very not satisfied”, “not satisfied”, and “satisfied” vs. “very satisfied”. This is the “parallel lines” or “parallel regression” assumption, or for ordered logit “proportional odds”. ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html#ordered-probit--ordered-logit",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html#ordered-probit--ordered-logit"
- },"574": {
+ },"575": {
"doc": "Ordered Probit/Logit",
"title": "Keep in Mind",
"content": ". | Coefficients on predictors are scaled in terms of the latent variable and in general are difficult to interpret. You can calculate marginal effects from ordered probit/logit results, which report how changes in a predictor are related to people moving from one category to another. For example, if the marginal effect of \\(X\\) is +.03 for “very not satisfied”, +.02 for “not satisfied”, .+.02 for “satisfied”, and -.07 for “very satisfied”, that means that a one-unit increase in \\(X\\) results in a drop in the proportion of the sample predicted to be “very satisfied” and that drop is reallocated across the other three levels, everyone shifting down a bit and some ending up in a new category. | To identify the model, one of the cutoff parameters (the lowest one, separating the lowest category and the second-lowest) is usually fixed at 0. The cutoff values are in general only meaningful relative to each other for this reason and don’t mean anything on their own. | It is a good idea to test the parallel lines assumption. This is commonly done using a Brant (1990) test, which basically checks the different above/below splits possible with the dependent variable and sees how much the coefficients differ (hoping they don’t differ a lot!). If the test fails, you may want to use a generalized ordered logit, which has less explanatory power but does not rely on the parallel trends assumption. Code for both these steps is below. Doing the test rather than just starting with generalized ordered logit is a good idea because you do lose power and interpretability with generalized ordered logit; see Williams 2016. | . ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html#keep-in-mind"
- },"575": {
+ },"576": {
"doc": "Ordered Probit/Logit",
"title": "Also Consider",
"content": ". | If the dependent variable is not ordered, consider a multinomial model instead. | . ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html#also-consider",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html#also-consider"
- },"576": {
+ },"577": {
"doc": "Ordered Probit/Logit",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html#implementations",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html#implementations"
- },"577": {
+ },"578": {
"doc": "Ordered Probit/Logit",
"title": "R",
"content": "The necessary tools to work with ordered probit and logit are unfortunately scattered across several packages in R. MASS contains the ordered probit/logit estimator, brant has the Brant test, and if that fails you’re off to VGAM for generalized ordered logit. # For the ordered probit/logit model library(MASS) # For the brant test library(brant) # For the generalized ordered logit library(VGAM) # For marginal effects library(erer) # Data on marital happiness and affairs # Documentation: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/Fair.html mar <- read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Fair.csv') # See how various factors predict marital happiness m <- polr(factor(rate) ~ age + child + religious + education + nbaffairs, data = mar, method = 'logistic' # change to 'probit' for ordered probit ) summary(m) # Brant test of proportional odds brant(m) # The \"Omnibus\" probability is .03, if we have alpha = .05 then we reject proportional odds # Specifically the test tells us that education is the problem. Dang. # We can use vglm for the generalized ordered logit gologit <- vglm(factor(rate) ~ age + child + religious + education + nbaffairs, cumulative(link = 'logitlink', parallel = FALSE), # parallel = FALSE tells it not to assume parallel lines data = mar) summary(gologit) # Notice how each predictor now has many coefficients - one for each level # and we have other problems denoted in its warnings! # If we want marginal effects for our original ordered logit... ocME(m) . ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html#r",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html#r"
- },"578": {
+ },"579": {
"doc": "Ordered Probit/Logit",
"title": "Stata",
"content": "Ordered logit / probit requires a few packages to be installed, including gologit2 for the generalized ordered logit, and for the Brant test spost13, which is not on ssc. * For the brant test we must install spost13 * which is not on ssc, so do \"findit spost13\" and install \"spost13_ado\" * for generalized ordered logit, do \"ssc install gologit2\" * Data on marital happiness and affairs * Documentation: https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/Fair.html import delimited \"https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Fair.csv\", clear * strings can't be factors encode child, g(child_n) * Run ologit or oprobit ologit rate age i.child_n religious education nbaffairs * Use the brant test brant * The \"All\" probability is .03, if we have alpha = .05 then we reject proportional odds * Specifically the test tells us that education is the problem. Dang. * Running generalized ordered logit instead gologit2 rate age i.child_n religious education nbaffairs * Notice how each predictor now has many coefficients - one for each level * and we have a negative predicted probability denoted in the warnings! * We can get marginal effects for either model using margins ologit rate age i.child_n religious education nbaffairs margins, dydx(*) gologit2 rate age i.child_n religious education nbaffairs margins, dydx(*) . ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html#stata",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html#stata"
- },"579": {
+ },"580": {
"doc": "Ordered Probit/Logit",
"title": "Ordered Probit/Logit",
"content": " ",
"url": "/Model_Estimation/GLS/ordered_probit_logit.html",
"relUrl": "/Model_Estimation/GLS/ordered_probit_logit.html"
- },"580": {
+ },"581": {
"doc": "Penalized Regression",
"title": "Penalized Regression",
"content": "When running a regression, especially one with many predictors, the results have a tendency to overfit the data, reducing out-of-sample predictive properties. Penalized regression eases this problem by forcing the regression estimator to shrink its coefficients towards 0 in order to avoid the “penalty” term imposed on the coefficients. This process is closely related to the idea of Bayesian shrinkage, and indeed standard penalized regression results are equivalent to regression performed using certain Bayesian priors. Regular OLS selects coefficients \\(\\hat{\\beta}\\) to minimize the sum of squared errors: . \\[\\min\\sum_i(y_i - X_i\\hat{\\beta})^2\\] Non-OLS regressions similarly select coefficients to minimize a similar objective function. Penalized regression adds a penalty term \\(\\lambda\\lVert\\beta\\rVert_p\\) to that objective function, where \\(\\lambda\\) is a tuning parameter that determines how harshly to penalize coefficients, and \\(\\lVert\\beta\\rVert_p\\) is the \\(p\\)-norm of the coefficients, or \\(\\sum_j\\lvert\\beta\\rvert^p\\). \\[\\min\\left(\\sum_i(y_i - X_i\\hat{\\beta})^2 + \\lambda\\left\\lVert\\beta\\right\\rVert_p \\right)\\] Typically \\(p\\) is set to 1 for LASSO regression (least absolute shrinkage and selection operator), which has the effect of tending to set coefficients to 0, i.e. model selection, or to 2 for Ridge Regression. Elastic net regression provides a weighted mix of LASSO and Ridge penalties, commonly referring to the weight as \\(\\alpha\\). ",
"url": "/Machine_Learning/penalized_regression.html",
"relUrl": "/Machine_Learning/penalized_regression.html"
- },"581": {
+ },"582": {
"doc": "Penalized Regression",
"title": "Keep in Mind",
"content": ". | To avoid being penalized for a constant term, or by differences in scale between variables, it is a very good idea to standardize each variable (subtract the mean and divide by the standard deviation) before running a penalized regression. | Penalized regression can be run for logit and other kinds of regression, not just linear regression. Using penalties with general linear models like logit is common. | Penalized regression coefficients are designed to improve out-of-sample prediction, but they are biased. If the goal is estimation of a parameter, rather than prediction, this should be kept in mind. A common procedure is to use LASSO to select variables, and then run regular regression models with the variables that LASSO has selected. | The \\(\\lambda\\) parameter is often chosen using cross-validation. Many penalized regression commands include an option to select \\(\\lambda\\) by cross-validation automatically. | LASSO models commonly include variables along with polynomial transformation of those variables and interactions, allowing LASSO to determine which transformations are worth keeping. | . ",
"url": "/Machine_Learning/penalized_regression.html#keep-in-mind",
"relUrl": "/Machine_Learning/penalized_regression.html#keep-in-mind"
- },"582": {
+ },"583": {
"doc": "Penalized Regression",
"title": "Also Consider",
"content": ". | If it is not important to estimate coefficients but the goal is simply to predict an outcome, then there are many other machine learning methods that do so, and in some cases can handle higher dimensionality or work with smaller samples. | . ",
"url": "/Machine_Learning/penalized_regression.html#also-consider",
"relUrl": "/Machine_Learning/penalized_regression.html#also-consider"
- },"583": {
+ },"584": {
"doc": "Penalized Regression",
"title": "Implementations",
"content": " ",
"url": "/Machine_Learning/penalized_regression.html#implementations",
"relUrl": "/Machine_Learning/penalized_regression.html#implementations"
- },"584": {
+ },"585": {
"doc": "Penalized Regression",
"title": "Python",
"content": "This is an example of running penalised regressions in Python. The main takeaways are that the ubiquitous machine learning package sklearn can perform lasso, ridge, and elastic net regressions. In the example below, we’ll see all three in action. The level of penalisation will be set automatically by cross-validation, although a user may also supply the number directly. This example will use the seaborn package (for data), the patsy package (to create matrices from formulae), the matplotlib package (for plotting), the pandas package (for data manipulation), and the sklearn package (for machine learning). To run the example below, you may need to first install these packages. First, we need to import these packages for use. import seaborn as sns from patsy import dmatrices, dmatrix from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LassoCV, ElasticNetCV, RidgeCV import matplotlib.pyplot as plt import pandas as pd . Now let’s load the data and transform it into a vector of endogeneous variables, and a matrix of exogenous variables. Using patsy, we’ll ask for all interaction variables among sepal width, petal length, and petal width (and exclude having an intercept). iris = sns.load_dataset(\"iris\") formula = (\"sepal_length ~ (sepal_width + petal_length + petal_width)**2 - 1\") y, X = dmatrices(formula, iris) . Some machine learning algorithms are more performant with data that are scaled before being used. One should be careful when scaling data if using test and training sets; here, we’re not worried about a test set though, so we just use the standard scaler (which transforms data to have 0 mean and unit standard deviation) on all of the \\(X\\) and \\(y\\) data. scale_X = StandardScaler().fit(X).transform(X) scale_y = StandardScaler().fit(y).transform(y) scale_y = scale_y.ravel() # ravel collapses a (150, 1) vector to (150,) . Now we run lasso with cross-validation. reg_lasso = LassoCV(cv=10).fit(scale_X, scale_y) . Let’s display the results so we can see for which value of \\(\\alpha\\) the lowest mean squared error occurred. Note that sklearn uses the convention that \\(\\alpha\\) (rather than \\(\\lambda\\)) is the shrinkage parameter. EPSILON = 1e-4 # This is to avoid division by zero while taking the base 10 logarithm plt.figure() plt.semilogx(reg_lasso.alphas_ + EPSILON, reg_lasso.mse_path_, ':') plt.plot(reg_lasso.alphas_ + EPSILON, reg_lasso.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2) plt.axvline(reg_lasso.alpha_ + EPSILON, linestyle='--', color='k', label=r'$\\alpha$: CV estimate') plt.legend() plt.xlabel(r'$\\alpha$') plt.ylabel('Mean square error') plt.title('Mean square error on each fold: coordinate descent ') plt.axis('tight') plt.show() . Let’s look at the coefficients that are selected with this optimal value of \\(\\alpha\\) (which you can access via reg_lasso.alpha_): . for coef, name in zip(reg_lasso.coef_, dmatrix(formula.split('~')[1], iris).design_info.term_names): print(f'Coeff {name} = {coef:.2f}') . Coeff sepal_width = 0.36 Coeff petal_length = 1.38 Coeff petal_width = -0.39 Coeff sepal_width:petal_length = -0.00 Coeff sepal_width:petal_width = -0.32 Coeff petal_length:petal_width = 0.33 . Now let’s see what coefficients we get with ridge regression and elastic net (a mixture between ridge and lasso; here we use the default setting of a half-mixture between the two). 
reg_elastic = ElasticNetCV(cv=10).fit(scale_X, scale_y) reg_ridge = RidgeCV(cv=10).fit(scale_X, scale_y) # For convenient comparison, let's pop these into a dataframe df = pd.DataFrame({'Lasso': reg_lasso.coef_, 'Elastic Net (0.5)': reg_elastic.coef_, 'Ridge': reg_ridge.coef_}, index=dmatrix(formula.split('~')[1], iris).design_info.term_names).T df[r'$\\alpha$'] = [reg_lasso.alpha_, reg_elastic.alpha_, reg_ridge.alpha_] df = df.T df . | | Lasso | Elastic Net (0.5) | Ridge | . | sepal_width | 0.362891 | 0.357877 | 0.288003 | . | petal_length | 1.383851 | 1.321840 | 0.931508 | . | petal_width | -0.386780 | -0.320669 | -0.148416 | . | sepal_width:petal_length | -0.000000 | 0.039810 | 0.363751 | . | sepal_width:petal_width | -0.322053 | -0.362515 | -0.497244 | . | petal_length:petal_width | 0.327846 | 0.321951 | 0.326384 | . | α | 0.000901 | 0.001802 | 1.000000 | . ",
"url": "/Machine_Learning/penalized_regression.html#python",
"relUrl": "/Machine_Learning/penalized_regression.html#python"
- },"585": {
+ },"586": {
"doc": "Penalized Regression",
"title": "R",
"content": "We will use the glmnet package. # Install glmnet and tidyverse if necessary # install.packages('glmnet', 'tidyverse') # Load glmnet library(glmnet) # Load iris data data(iris) # Create a matrix with all variables other than our dependent vairable, Sepal.Length # and interactions. # -1 to omit the intercept M <- model.matrix(lm(Sepal.Length ~ (.)^2 - 1, data = iris)) # Add squared terms of numeric variables numeric.var.names <- names(iris)[2:4] M <- cbind(M,as.matrix(iris[,numeric.var.names]^2)) colnames(M)[16:18] <- paste(numeric.var.names,'squared') # Create a matrix for our dependent variable too Y <- as.matrix(iris$Sepal.Length) # Standardize all variables M <- scale(M) Y <- scale(Y) # Use glmnet to estimate penalized regression # We pick family = \"gaussian\" for linear regression; # other families work for other kinds of data, like binomial for binary data # In each case, we use cv.glmnet to pick our lambda value using cross-validation # using nfolds folds for cross-validation # Note that alpha = 1 picks LASSO cv.lasso <- cv.glmnet(M, Y, family = \"gaussian\", nfolds = 20, alpha = 1) # We might want to see how the choice of lambda relates to out-of-sample error with a plot plot(cv.lasso) # After doing CV, we commonly pick the lambda.min for lambda, # which is the lambda that minimizes out-of-sample error # or lambda.1se, which is one standard error above lambda.min, # which penalizes more harshly. The choice depends on context. lasso.model <- glmnet(M, Y, family = \"gaussian\", alpha = 1, lambda = cv.lasso$lambda.min) # coefficients are shown in the beta element. means LASSO dropped it lasso.model$beta # Running Ridge, or mixing the two with elastic net, simply means picking # alpha = 0 (Ridge), or 0 < alpha < 1 (Elastic Net) cv.ridge <- cv.glmnet(M, Y, family = \"gaussian\", nfolds = 20, alpha = 0) ridge.model <- glmnet(M, Y, family = \"gaussian\", alpha = 0, lambda = cv.ridge$lambda.min) cv.elasticnet <- cv.glmnet(M, Y, family = \"gaussian\", nfolds = 20, alpha = .5) elasticnet.model <- glmnet(M, Y, family = \"gaussian\", alpha = .5, lambda = cv.elasticnet$lambda.min) . ",
"url": "/Machine_Learning/penalized_regression.html#r",
"relUrl": "/Machine_Learning/penalized_regression.html#r"
- },"586": {
+ },"587": {
"doc": "Penalized Regression",
"title": "Stata",
"content": "Penalized regression is one of the few machine learning algorithms that Stata does natively. This requires Stata 16. If you do not have Stata 16, you can alternately perform some forms of penalized regression by installing the lars package using ssc install lars. * Use NLSY-W data sysuse nlsw88.dta, clear * Construct all squared and interaction terms by loop so we don't have to specify them all * by hand in the regression function local numeric_vars = \"age grade hours ttl_exp tenure\" local factor_vars = \"race married never_married collgrad south smsa c_city industry occupation union\" * Add all squares foreach x in `numeric_vars' { g sq_`x' = `x'^2 } * Turn all factors into dummies so we can standardize them local faccount = 1 local dummy_vars = \"\" foreach x in `factor_vars' { xi i.`x', pre(f`count'_) local count = `count' + 1 } * Add all numeric-numeric interactions; these are easy * factor interactions would need a more thorough loop forvalues i = 1(1)5 { local next_i = `i'+1 forvalues j = `next_i'(1)5 { local namei = word(\"`numeric_vars'\",`i') local namej = word(\"`numeric_vars'\",`j') g interact_`i'_`j' = `namei'*`namej' } } * Standardize everything foreach var of varlist `numeric_vars' f*_* interact_* { qui summ `var' qui replace `var' = (`var' - r(mean))/r(sd) } * Use the lasso command to run LASSO * using sel(cv) to select lambda using cross-validation * we specify a linear model here, but logit/probit/poisson would work lasso linear wage `numeric_vars' f*_* interact_*, sel(cv) * get list of included coefficients lassocoef * We can use elasticnet to run Elastic Net * By default, alpha will be selected by cross-validation as well elasticnet linear wage `numeric_vars' f*_* interact_*, sel(cv) . ",
"url": "/Machine_Learning/penalized_regression.html#stata",
"relUrl": "/Machine_Learning/penalized_regression.html#stata"
- },"587": {
+ },"588": {
"doc": "Probit Model",
"title": "Probit Regressions",
"content": "A Probit regression is a statistical method for a best-fit line between a binary [0/1] outcome variable \\(Y\\) and any number of independent variables. Probit regressions follow a standard normal probability distribution and the predicted values are bounded between 0 and 1. For more information about Probit, see Wikipedia: Probit. ",
"url": "/Model_Estimation/GLS/probit_model.html#probit-regressions",
"relUrl": "/Model_Estimation/GLS/probit_model.html#probit-regressions"
- },"588": {
+ },"589": {
"doc": "Probit Model",
"title": "Keep in Mind",
"content": ". | The beta coefficients from a probit model are maximum likelihood estimations. They are not the marginal effect, as you would see in an OLS estimation. So you cannot interpret the beta coefficient as a marginal effect of \\(X\\) on \\(Y\\). | To obtain the marginal effect, you need to perform a post-estimation command to discover the marginal effect. In general, you can ‘eye-ball’ the marginal effect by dividing the probit beta coefficient by 2.5. | . ",
"url": "/Model_Estimation/GLS/probit_model.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/probit_model.html#keep-in-mind"
- },"589": {
+ },"590": {
"doc": "Probit Model",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/probit_model.html#implementations",
"relUrl": "/Model_Estimation/GLS/probit_model.html#implementations"
- },"590": {
+ },"591": {
"doc": "Probit Model",
"title": "Gretl",
"content": "# Load auto data open auto.gdt # Run probit using the auto data, with mpg as the outcome variable # and headroom, trunk, and weight as predictors probit mpg const headroom trunk weight . ",
"url": "/Model_Estimation/GLS/probit_model.html#gretl",
"relUrl": "/Model_Estimation/GLS/probit_model.html#gretl"
- },"591": {
+ },"592": {
"doc": "Probit Model",
"title": "Python",
"content": "The statsmodels package has methods that can perform probit regressions. # Use pip or conda to install pandas and statsmodels import pandas as pd import statsmodels.formula.api as smf # Read in the data df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv', index_col=0) # Specify the model mod = smf.probit('vs ~ mpg + cyl', data=df) # Fit the model res = mod.fit() # Look at the results res.summary() # Compute marginal effects marge_effect = res.get_margeff(at='mean', method='dydx') # Show marginal effects marge_effect.summary() . ",
"url": "/Model_Estimation/GLS/probit_model.html#python",
"relUrl": "/Model_Estimation/GLS/probit_model.html#python"
- },"592": {
+ },"593": {
"doc": "Probit Model",
"title": "R",
"content": "R can run a probit regression using the glm() function. However, to get marginal effects you will need to calculate them by hand or use a package. We will use the mfx package, although the margins package is another good option, which produces tidy model output. # If necessary, install the mfx package # install.packages('mfx') # mfx is only needed for the marginal effect, not the regression itself library(mfx) # Load mtcars data data(mtcars) # Use the glm() function to run probit # Here we are predicting engine type using # miles per gallon and number of cylinders as predictors my_probit <- glm(vs ~ mpg + cyl, data = mtcars, family = binomial(link = 'probit')) # The family argument says we are working with binary data # and using a probit link function (rather than, say, logit) # The results summary(my_probit) # Marginal effects probitmfx(vs ~ mpg + cyl, data = mtcars) . ",
"url": "/Model_Estimation/GLS/probit_model.html#r",
"relUrl": "/Model_Estimation/GLS/probit_model.html#r"
- },"593": {
+ },"594": {
"doc": "Probit Model",
"title": "Stata",
"content": "* Load auto data sysuse auto.dta * Probi Estimation probit foreign mpg weight headroom trunk * Recover the Marginal Effects (Beta Coefficient in OLS) margins, dydx(*) . ",
"url": "/Model_Estimation/GLS/probit_model.html#stata",
"relUrl": "/Model_Estimation/GLS/probit_model.html#stata"
- },"594": {
+ },"595": {
"doc": "Probit Model",
"title": "Probit Model",
"content": " ",
"url": "/Model_Estimation/GLS/probit_model.html",
"relUrl": "/Model_Estimation/GLS/probit_model.html"
- },"595": {
+ },"596": {
"doc": "Propensity Score Matching",
"title": "Propensity Score Matching",
"content": "Propensity Score Matching (PSM) is a non-parametric method of estimating a treatment effect in situations where randomization is not possible. This method comes from Rosenbaum & Rubin, 1983 and works by estimating a propensity score which is the predicted probability that someone received treatment based on the explanatory variables of interest. As long as all confounders are included in the propensity score estimation, this reduces bias in observational studies by controlling for variation in treatment that is driven by confounding, essentially attempting to replicate a randomized control trial. ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html"
- },"596": {
+ },"597": {
"doc": "Propensity Score Matching",
"title": "Inverse Probability Weighting",
"content": "The recommendation of the current literature, by King and Nielsen 2019, is that propensity scores should be used with inverse probability weighting (IPW) rather than matching. With this in mind, there will be examples of how to implement IPWs first followed by the process for implementing a matching method. ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#inverse-probability-weighting",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#inverse-probability-weighting"
- },"597": {
+ },"598": {
"doc": "Propensity Score Matching",
"title": "Workflow for Inverse Probability Weighting",
"content": ". | Run a logistic regression where the outcome variable is a binary indicator for whether or not someone received the treatment, and gather the predicted value of the propensity score. The explanatory variables in this case are the covariates that we might reasonably believe influence treatment | Filter out observations in our data that are not inside the range of our propensity scores, or that have extremely high or low values. | Create the inverse probability weights | Run a regression using the IPWs | . ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#workflow-for-inverse-probability-weighting",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#workflow-for-inverse-probability-weighting"
- },"598": {
+ },"599": {
"doc": "Propensity Score Matching",
"title": "Workflow for Matching",
"content": ". | The same as step one from the IPW section. | Match those that received treatment with those that did not based on propensity score. There are a number of different ways to perform this matching including, but not limited to : | . | Nearest neighbor matching | Exact matching | Stratification matching | . In this example we will focus on nearest neighbor matching. ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#workflow-for-matching",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#workflow-for-matching"
- },"599": {
+ },"600": {
"doc": "Propensity Score Matching",
"title": "Checking Balance",
"content": "Unlike methods like Entropy Balancing and Coarsened Exact Matching, propensity score approaches do not ensure that the covariates are balanced between the treated and control groups. It is a good idea to check whether decent balance has been achieved, and if it hasn’t, go back and modify the model, perhaps adding more matching variables or allowing polynomial terms in the logistic regression, until there is acceptable balance. | Check the balance of the matched sample. That is, see whether the averages (and perhaps variances and other summary statistics) of the covariates are similar in the matched/weighted treated and control groups. | In the case of inverse probability weighting, also check whether the post-weighting propensity score distributions are similar in the treated and control groups. | . Once the workflow is finished, the treatment effect can be estimated using the treated and matched sample with matching, or using the weighted sample with inverse probability weighting. ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#checking-balance",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#checking-balance"
- },"600": {
+ },"601": {
"doc": "Propensity Score Matching",
"title": "Keep in Mind",
"content": ". | Propensity Score Matching is based on selection on observable characteristics. This assume that the potential outcome is independent of the treatment D conditional on the covariates, or the Conditional Independence Assumption: | . \\[Y_i(1),Y_i(0)\\bot|X_i\\] . | Propensity Score Matching also requires us to make the Common Support or Overlap Assumption: | . \\[0<Pr(D_i = 1 | X_i = x)<1\\] The overlap assumption says that the probability that the treatment is equal to 1 for each level of x is between zero and one, or in other words there are both treated and untreated units for each level of x. | Treatment effect estimation will produce incorrcect standard errors unless they are specifically tailored for matching results, since they will not account for noise in the matching process. Use software designed for treatment effect estimates with matching. Or, for inverse probability weighting, you can bootstrap the entire procedure (from matching to estimation) and produce standard errors that way. | . ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#keep-in-mind",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#keep-in-mind"
- },"601": {
+ },"602": {
"doc": "Propensity Score Matching",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#implementations",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#implementations"
- },"602": {
+ },"603": {
"doc": "Propensity Score Matching",
"title": "R",
"content": "The matching implementation will use the MatchIt package. A great place to find more information about the MatchIt package is on the package’s github site or CRAN Page. The inverse probability implementation uses the causalweight package. Here is a handy link to get more information about the causalweight package which is an alternative way of creating inverse probability weights, courtesy of Hugo Bodory and Martin Huber. Inverse Probability Weights in R . Data comes from OpenIntro.org . # First follow basic workflow without causalweights package library(pacman) p_load(tidyverse, causalweight) #Load data on smoking in the United Kingdom. smoking = read_csv(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Matching/Data/smoking.csv\") #Turn smoking and married into numeric variables smoking = smoking %>% mutate(smoke = 1*(smoke == \"Yes\"), married = 1*(marital_status == \"Married\")) # Pull out the variables # Outcome Y = smoking %>% pull(married) # Treatment D <- smoking %>% pull(smoke) # Matching variables X <- model.matrix(~-1+gender+age+marital_status+ethnicity+region, data = smoking) # Note this estimats the propensity score for us, trims propensity # scores based on extreme values, # and then produces appropriate bootstrapped standard errors IPW <- treatweight(Y, D, X, trim = .001, logit = TRUE) # Estimate and SE IPW$effect IPW$se . Matching in R . ##load the packages and data we need. library(pacman) p_load(tidyverse, MatchIt) smoking = read_csv(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Model_Estimation/Matching/Data/smoking.csv\") smoking = smoking %>% mutate(smoke = 1*(smoke == \"Yes\")) # Mapping the categories to new categorical values 1 to 8 and giving NA to \"Refused\" and \"Unknown\" smoking$new_income <- NA smoking$new_income[smoking$gross_income == \"Under 2,600\"] <- 1 smoking$new_income[smoking$gross_income == \"2,600 to 5,200\"] <- 2 smoking$new_income[smoking$gross_income == \"5,200 to 10,400\"] <- 3 smoking$new_income[smoking$gross_income == \"10,400 to 15,600\"] <- 4 smoking$new_income[smoking$gross_income == \"15,600 to 20,800\"] <- 5 smoking$new_income[smoking$gross_income == \"20,800 to 28,600\"] <- 6 smoking$new_income[smoking$gross_income == \"28,600 to 36,400\"] <- 7 smoking$new_income[smoking$gross_income == \"Above 36,400\"] <- 8 smoking$new_income[smoking$gross_income == \"Refused\"] <- NA smoking$new_income[smoking$gross_income == \"Unknown\"] <- NA ##Step One: Run the logistic regression. ps_model = glm(smoke ~ gender+age+marital_status+ethnicity+region, data=smoking) ##Step Two: Match on propensity score. #Does not apply in this situation, but need to make sure there are no missing values in the covariates we are choosing. #In order to match use the matchit command, passing the function a formula, the data to use and the method, in this case, nearest neighor estimation. Match = matchit(smoke ~ gender+age+marital_status+ethnicity+region, method = \"nearest\", data =smoking) ##Step Three: Check for Balance. summary(match) ##Create a data frame from matches using the match.data function. match_data = match.data(match) #Check the dimensions. dim(match_data) ##Step Four: Conduct Analysis using the new sample. 
##We can now get the treatment effect of smoking on gross income with and without controls # Note these standard errors will be incorrect, see Caliendo and Kopeinig (2008) for fixes # https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-6419.2007.00527.x lm_nocontrols = lm(new_income ~ smoke, data= match_data) #With controls, standard errors also wrong here ##Turn marital status into a factor variable so that we can use it in our regression match_data = match_data %>% mutate(marital_status = as.factor(marital_status)) lm_controls =lm(new_income ~ smoke+age+gender+ethnicity+marital_status, data=match_data) . ",
"url": "/Model_Estimation/Matching/propensity_score_matching.html#r",
"relUrl": "/Model_Estimation/Matching/propensity_score_matching.html#r"
- },"603": {
+ },"604": {
"doc": "Quantile Regression",
"title": "Quantile Regression",
"content": "Quantile Regression is an extension of linear regression analysis. Quantile Regression differs from OLS in how it estimates the response variable. OLS estimates the conditional mean of \\(Y\\) across the predictor variables (\\(X_1, X_2, X_3...\\)), whereas quantile regression estimates the conditional median (or quantiles) of \\(Y\\) across the predictor variables (\\(X_1, X_2, X_3...\\)). It is useful in situations where OLS assumptions are not met (heteroskedasticity, bi-modal or skewed distributions). To specify the desired quantile, select a \\(\\tau\\) value between 0 to 1 (.5 gives the median). For more information on Quantile Regression, see Wikipedia: Quantile Regression . ",
"url": "/Model_Estimation/GLS/quantile_regression.html",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html"
- },"604": {
+ },"605": {
"doc": "Quantile Regression",
"title": "Keep in Mind",
"content": ". | This method allows for the dependent variable to have any distributional form, however it cannot be a dummy variable and must be continuous. | This method is robust to outliers, so there is no need to remove outlier observations. | Either the intercept term or at least one predictor is required to run an analysis. | LASSO regression cannot be used for feature selection in this framework due to it requiring OLS assumptions to be satisfied. | This method does not restrict the use of polynomial or interaction terms. A unique functional form can be specified. | . ",
"url": "/Model_Estimation/GLS/quantile_regression.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html#keep-in-mind"
- },"605": {
+ },"606": {
"doc": "Quantile Regression",
"title": "Also Consider",
"content": ". | While Quantile Regression can be useful in applications where OLS assumptions are not met, it can actually be used to detect heteroskedasticity. This makes is a useful tool to ensure this assumption is met for OLS. | Several different standard error calculations can be used with this method, however bootstrapped standard errors are generally the best for complex modeling situations. Clustered standard errors are also possible by estimating a quantile regression with pooled OLS clustered errors. | . ",
"url": "/Model_Estimation/GLS/quantile_regression.html#also-consider",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html#also-consider"
- },"606": {
+ },"607": {
"doc": "Quantile Regression",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/quantile_regression.html#implementations",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html#implementations"
- },"607": {
+ },"608": {
"doc": "Quantile Regression",
"title": "Python",
"content": "The quantreg function in statsmodels allows for quantile regression. import statsmodels.api as sm import statsmodels.formula.api as smf mtcars = sm.datasets.get_rdataset(\"mtcars\", \"datasets\").data mod = smf.quantreg('mpg ~ cyl + hp + wt', mtcars) # Specify the quantile when you fit res = mod.fit(q=.2) print(res.summary()) . ",
"url": "/Model_Estimation/GLS/quantile_regression.html#python",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html#python"
- },"608": {
+ },"609": {
"doc": "Quantile Regression",
"title": "R",
"content": "The main package to implement Quantile Regression in R is through the quantreg package. The main function in this package is qr(), which fits a Quantile Regression model with a default \\(\\tau\\) value of .5 but can be changed. # Load package library(quantreg) # Load data data(mtcars) # Run quantile regression with mpg as outcome variable # and cyl, hp, and wt as predictors # Using a tau value of .2 for quantiles quantreg_model = rq(mpg ~ cyl + hp + wt, data = mtcars, tau = .2) # Look at results summary(quantreg_model) . ",
"url": "/Model_Estimation/GLS/quantile_regression.html#r",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html#r"
- },"609": {
+ },"610": {
"doc": "Quantile Regression",
"title": "Stata",
"content": "Quantile regression can be performed in Stata using the qreg function. By default it fits a median (q(.5)). See help qreg for some variants, including a bootstrapped quantile regression bsqreg. sysuse auto qreg mpg price trunk weight, q(.2) . ",
"url": "/Model_Estimation/GLS/quantile_regression.html#stata",
"relUrl": "/Model_Estimation/GLS/quantile_regression.html#stata"
- },"610": {
+ },"611": {
"doc": "Random Forest",
"title": "Random Forest",
"content": "Random forest is one of the most popular and powerful machine learning algorithms. A random forest works by building up a number of decision trees, each built using a bootstrapped sample and a subset of the variables/features. Each node in each decision tree is a condition on a single feature, selecting a way to split the data so as to maximize predictive accuracy. Each individual tree gives a classification. The average, or vote-counting of that classification across trees provides an overall prediction. More trees in the forest are associated with higher accuracy. A random forest classifier can be used for both classification and regression tasks. In terms of regression, it takes the average of the outputs by different trees. Random forest can work with large datasets with multiple dimensions. However, it may overfit data, especially for regression problems. ",
"url": "/Machine_Learning/random_forest.html",
"relUrl": "/Machine_Learning/random_forest.html"
- },"611": {
+ },"612": {
"doc": "Random Forest",
"title": "Keep in Mind",
"content": ". | Individual features need to have low correlations with each other, and sometimes we may remove features that are strongly correlated with other features. | Random forest can deal with missing values, and may simply treat “missing” as another value that the variable can take. | . ",
"url": "/Machine_Learning/random_forest.html#keep-in-mind",
"relUrl": "/Machine_Learning/random_forest.html#keep-in-mind"
- },"612": {
+ },"613": {
"doc": "Random Forest",
"title": "Also Consider",
"content": ". | If you are not familiar with decision tree, please go to the decision tree page first as decision trees are building blocks of random forests. | . ",
"url": "/Machine_Learning/random_forest.html#also-consider",
"relUrl": "/Machine_Learning/random_forest.html#also-consider"
- },"613": {
+ },"614": {
"doc": "Random Forest",
"title": "Implementations",
"content": " ",
"url": "/Machine_Learning/random_forest.html#implementations",
"relUrl": "/Machine_Learning/random_forest.html#implementations"
- },"614": {
+ },"615": {
"doc": "Random Forest",
"title": "Python",
"content": "Random forests can be used to perform both regression and classification tasks. In the example below, we’ll use the RandomForestClassifier from the popular sklearn machine learning library. RandomForestClassifier is an ensemble function that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. We’ll use this classifier to predict the species of iris based on its properties, using data from the iris dataset. You may need to install packages on the command line, using pip install package-name or conda install package-name, to run these examples (if you don’t already have them installed). import pandas as pd from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import accuracy_score # Read data df = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv\") # Prepare data X = df[[\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\"]] y = df[[\"Species\"]] # Split data into training and test set X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=1996 ) # Creating model using random forest model = RandomForestClassifier(max_depth=2) model.fit(X_train, y_train) # Predict values for test data y_pred = model.predict(X_test) # Evaluate model prediction print(f\"Accuracy is {accuracy_score(y_pred, y_test)*100:.2f} %\") . ",
"url": "/Machine_Learning/random_forest.html#python",
"relUrl": "/Machine_Learning/random_forest.html#python"
- },"615": {
+ },"616": {
"doc": "Random Forest",
"title": "R",
"content": "There are a number of packages in R capable of training a random forest, including randomForest and ranger. Here we will use randomForest. We’ll be using a built-in dataset in R, called “Iris”. There are five variables in this dataset, including species, petal width and length as well as sepal length and width. #Load packages library(tidyverse) library(rvest) library(dplyr) library(caret) library(randomForest) library(Metrics) library(readr) #Read data in R data(iris) iris #Create features and target X <- iris %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) y <- iris$Species #Split data into training and test sets index <- createDataPartition(y, p=0.75, list=FALSE) X_train <- X[ index, ] X_test <- X[-index, ] y_train <- y[index] y_test<-y[-index] #Train the model iris_rf <- randomForest(x = X_train, y = y_train , maxnodes = 10, ntree = 10) print(iris_rf) #Make predictions predictions <- predict(iris_rf, X_test) result <- X_test result['Species'] <- y_test result['Prediction']<- predictions head(result) #Check the classification accuracy (number of correct predictions out of total datapoints used to test the prediction) print(sum(predictions==y_test)) print(length(y_test)) print(sum(predictions==y_test)/length(y_test)) . ",
"url": "/Machine_Learning/random_forest.html#r",
"relUrl": "/Machine_Learning/random_forest.html#r"
- },"616": {
+ },"617": {
"doc": "Random/Mixed Effects in Linear Regression",
"title": "Random/Mixed Effects in Linear Regression",
"content": "In panel data, we often have to deal with unobserved heterogeneity among the units of observation that are observed over time. If we assume that the unobserved heterogeneity is uncorrelated with the independent variables, we can use random effects model. Otherwise, we may consider fixed effects. In practice, random effects and fixed effects are often combined to implement a mixed effects model. Mixed refers to the fact that these models contain both fixed, and random effects. For more information, see Wikipedia: Random Effects Model . ",
"url": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html",
"relUrl": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html"
- },"617": {
+ },"618": {
"doc": "Random/Mixed Effects in Linear Regression",
"title": "Keep in Mind",
"content": ". | To use random effects model, you must observe the same person multiple times (panel data). | If unobserved heterogeneity is correlated with independent variables, the random effects estimator is biased and inconsistent. | However, even if unobserved heterogeneity is expected to be correlated with independent variables, the fixed effects model may have high standard errors if the number of observation per unit of observation is very small. Random effects maybe considered in such cases. | Additionally, modeling the correlation between the indepdendent variables and the random effect by using variables in predicting the random effect can account for this problem | . ",
"url": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#keep-in-mind",
"relUrl": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#keep-in-mind"
- },"618": {
+ },"619": {
"doc": "Random/Mixed Effects in Linear Regression",
"title": "Also Consider",
"content": ". | Consider Fixed effects if unobserved heterogeneity and independent variables are correlated or if only within-variation is desired. | Hauman Tests are often used to inform us about the appropiateness of fixed effects models vs. random effects models in which only the intercept is random. | Clustering your error | . ",
"url": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#also-consider",
"relUrl": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#also-consider"
- },"619": {
+ },"620": {
"doc": "Random/Mixed Effects in Linear Regression",
"title": "Implementations",
"content": "We continue from our the example in Fixed effects. In that example we estimated a fixed effect model of the form: . \\[earnings_{it} = \\beta_0 + \\beta_1 prop\\_ working_{it} + \\delta_t + \\delta_i + \\epsilon_{it}\\] That is, average earnings of graduates of an institution depends on proportion employed, after controlling for time and institution fixed effects. But, some institutions have one observation, and the average number of observations is 5.1. We may be worried about the precision of our estimates. So, we may choose to use random effects for intercepts by institution to estimate the model even if we think \\(corr(prop\\_ working_{it}, \\delta_{i}) \\ne 0\\). That is, we choose possiblity of bias over variance. ",
"url": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#implementations",
"relUrl": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#implementations"
- },"620": {
+ },"621": {
"doc": "Random/Mixed Effects in Linear Regression",
"title": "R",
"content": "Several packages can be used to implement a random effects model in R - such as lme4 and nlme. lme4 is more widely used. The example that follows uses the lme4 package. # If necessary, install lme4 if(!require(lme4)){install.packages(\"lme4\")} library(lme4) # Read in data from the College Scorecard df <- read.csv('https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fixed_Effects_in_Linear_Regression/Scorecard.csv') # Calculate proportion of graduates working df$prop_working <- df$count_working/(df$count_working + df$count_not_working) # We write the mixed effect formula for estimation in lme4 as: # dependent_var ~ # covariates (that can include fixed effects) + # random effects - we need to specify if our model is random effects in intercepts or in slopes. In our example, we suspect random effects in intercepts at institutions. So we write \"...+(1 | inst_name), ....\" If we wanted to specify a model where the coefficient on prop_working was also varying by institution - we would use (1 + open | inst_name). # Here we regress average earnings graduates in an institution on prop_working, year fixed effects and random effects in intercepts for institutions. relm_model <- lmer(earnings_med ~ prop_working + factor(df$year) + (1 | inst_name), data = df) # Display results summary(relm_model) # We note that comparing with the fixed effects model, our estimates are more precise. But, the correlation between X`s and errors suggest bias in our mixed effect model, and we do see a large increase in estimated beta. ",
"url": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#r",
"relUrl": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#r"
- },"621": {
+ },"622": {
"doc": "Random/Mixed Effects in Linear Regression",
"title": "Stata",
"content": "We will estimate a mixed effects model using Stata using the built in xtreg command. * Obtain same data from Fixed Effect tutorial import delimited \"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Fix ed_Effects_in_Linear_Regression/Scorecard.csv\", clear * Data cleaning * We are turning missings are written as \"NA\" into numeric destring count_not_working count_working earnings_med, replace force * Calculate the proportion working g prop_working = count_working/(count_working + count_not_working) * xtset requires that the individual identifier be a numeric variable encode inst_name, g(name_number) * Set the data as panel data with xtset xtset name_number * Use xtreg with the \"re\" option to run random effects on institution intercepts * Regressing earnings_med on prop_working * with random effects for name_number (implied by re) * and also year fixed effects (which we'll add manually with i.year) xtreg earnings_med prop_working i.year, re * We note that comparing with the fixed effects model, our estimates are more precise. But, correlation between X`s and errors suggest bias in our random effect model, and we do see a large increase in estimated beta. ",
"url": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#stata",
"relUrl": "/Model_Estimation/Multilevel_Models/random_mixed_effects_estimation.html#stata"
- },"622": {
+ },"623": {
"doc": "Regression Discontinuity Design",
"title": "Regression Discontinuity Design",
"content": "Regression discontinuity (RDD) is a research design for the purposes of causal inference. It can be used in cases where treatment is assigned based on a cutoff value of a “running variable”. For example, perhaps students in a school take a test in 8th grade. Students who score 30 or below are assigned to remedial classes, while students scoring above 30 stay in regular classes. Regression discontinuity could be applied to this setting with test score as a running variable and 30 as the cutoff to look at the effects of remedial classes. Regression discontinuity works by focusing on the cutoff. It makes an estimate of what the outcome is within a narrow bandwidth to the left of the cutoff, and also makes an estimate of what the outcome is to the right of the cutoff. Then it compares them to generate a treatment effect estimate. See Wikpedia: Regression Discontinuity Design for more information. Regression discontinuity receives a lot of attention because it relies on what some consider to be plausible assumptions. If the running variable is finely measured and is not being manipulated, then one can argue that being just to the left or the right of a cutoff is effectively random (someone getting a 30 or 31 on the test can basically be down to bad luck on the day) and so this approach by itself can remove confounding from lots of factors. ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html"
- },"623": {
+ },"624": {
"doc": "Regression Discontinuity Design",
"title": "Keep in Mind",
"content": ". | There are many, many options to choose when performing an RDD. Bandwidth selection procedure, polynomial terms, bias correction, etc. etc.. Please check the help file for your command of choice closely, and ensure you know what kind of analysis you’re about to run. Don’t assume the defaults are correct. | Regression discontinuity relies on the absence of manipulation of the running variable. In the test score example, if the teachers scoring the exam nudge a few students from 30 to 31 so they can avoid remedial classes, RDD doesn’t work any more. | Because the method relies on isolating a narrow bandwidth around the cutoff, RDD doesn’t work quite the same if the running variable is discrete and split into a small number of groups. You want a running variable with a lot of different values! See Kolesár and Rothe (2018) for more information. | In order to improve statistical performance, regression discontinuity designs often incorporate information from data points far away from the cutoff to improve the estimate of what the outcome is near the cutoff. This can be done nonparametrically, but is most often done by fitting a separate polynomial function for the running variable on either side of the cutoff. A temptation is to use a very high-order polynomial (say, \\(x, x^2, x^3, x^4\\) and \\(x^5\\)) to improve fit. However, in general a low-order polynomial is probably a better idea. See Gelman and Imbens 2019 for more information. | Regression discontinuity designs are very well-suited to graphical demonstrations of the method. Software packages designed for RDD specifically will almost always provide an easy method for creating these graphs, and it is rare that you will not want to do this. However, do keep in mind that graphs can sometimes obscure meaningfully large effects. See Kirabo Jackson for an explanation. | Regression discontinuities can be sharp, where everyone to one side of the cutoff is treated and nobody on the other side is, or fuzzy, where the probability of treatment changes across the cutoff but assignment isn’t perfect. Most RDD packages can handle both. The intuition for both is similar, but the statistical properties of sharp designs are generally stronger. Fuzzy RDD can be thought of as similar to using an instrumental variables estimator in a case of imperfect random assignment in an experiment. Covariates are generally not necessary in a sharp RDD but may be advisable in a fuzzy one. | . ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#keep-in-mind",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#keep-in-mind"
- },"624": {
+ },"625": {
"doc": "Regression Discontinuity Design",
"title": "Also Consider",
"content": ". | The Regression Kink Design is an extension of RDD that looks for a change in a relationship between the running variable and the outcome, i.e. the slope, at the cutoff, rather than a change in the predicted outcome. | It is common to run a Density Discontinuity Test to check for manipulation in the running vairiable before performing a regression discontinuity. | Regression discontinuity designs are often accompanied by placebo tests, where the same RDD is run again, but with a covariate or some other non-outcome measure used as the outcome. If the RDD shows a significant effect for the covariates, this suggests that balancing did not occur properly and there may be an issue with the RDD assumptions. | Part of performing an RDD is selecting a bandwidth around the cutoff to focus on. This can be done by context, but more commonly there are data-based methods for selecting a bandwidth Check your RDD command of choice to see what methods are available for selecting a bandwidth. | . ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#also-consider",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#also-consider"
- },"625": {
+ },"626": {
"doc": "Regression Discontinuity Design",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#implementations",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#implementations"
- },"626": {
+ },"627": {
"doc": "Regression Discontinuity Design",
"title": "Stata",
"content": "A standard package for performing regression discontinuity in Stata is rdrobust, installable from scc. * If necessary * ssc install rdrobust * Load RDD of house elections from the R package rddtools, * and originally from Lee (2008) https://www.sciencedirect.com/science/article/abs/pii/S0304407607001121 import delimited \"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Estimation/Data/Regression_Discontinuity_Design/house.csv\", clear * x is \"vote margin in the previous election\" and y is \"vote margin in this election\" * If we want to specify options for bandwidth selection, we can run rdbwselect directly. * Otherwise, rdrobust will run it with default options by itself * c(0) indicates that treatment is assigned at 0 (i.e. someone gets more votes than the opponent) rdbwselect y x, c(0) * Run a sharp RDD with a second-order polynomial term rdrobust y x, c(0) p(2) * Run a fuzzy RDD * We don't have a fuzzy RDD in this data, but let's create one, where * probability of treatment jumps from 20% to 60% at the cutoff g treatment = (runiform() < .2)*(x < 0) + (runiform() < .6)*(x >= 0) rdrobust y x, c(0) fuzzy(treatment) * Generate a standard RDD plot with a polynomial of 2 (default is 4) rdplot y x, c(0) p(2) . ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#stata",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#stata"
- },"627": {
+ },"628": {
"doc": "Regression Discontinuity Design",
"title": "R",
"content": "There are several packages in R designed for the estimation of RDD. Three prominent options are rdd, rddtools, and rdrobust. See this article for comparisons between them in terms of their strengths and weaknesses. The article, considering the verisons of the packages available in 2017, recommends rddtools for assumption and sensitivity checks, and rdrobust for bandwidth selection and treatment effect estimation. We will consider rdrobust here. See the rddtools walkthrough for a detailed example of the use of rddtools. # If necessary # install.packages('rdrobust') library(rdrobust) # Load RDD of house elections from the R package rddtools, # and originally from Lee (2008) https://www.sciencedirect.com/science/article/abs/pii/S0304407607001121 df <- read.csv(\"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Regression_Discontinuity_Design/house.csv\") # x is \"vote margin in the previous election\" and y is \"vote margin in this election\" # If we want to specify options for bandwidth selection, we can run rdbwselect directly. # Otherwise, rdrobust will run it with default options by itself # c(0) indicates that treatment is assigned at 0 (i.e. someone gets more votes than the opponent) bandwidth <- rdbwselect(df$y, df$x, c=0) # Run a sharp RDD with a second-order polynomial term rdd <- rdrobust(df$y, df$x, c=0, p=2) summary(rdd) # Run a fuzzy RDD # We don't have a fuzzy RDD in this data, but let's create one, where # probability of treatment jumps from 20% to 60% at the cutoff N <- nrow(df) df$treatment <- (runif(N) < .2)*(df$x < 0) + (runif(N) < .6)*(df$x >= 0) rddfuzzy <- rdrobust(df$y, df$x, c=0, p=2, fuzzy = df$treatment) summary(rddfuzzy) # Generate a standard RDD plot with a polynomial of 2 (default is 4) rdplot(df$y, df$x, c = 0, p = 2) . ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#r",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#r"
- },"628": {
+ },"629": {
"doc": "Regression Discontinuity Design",
"title": "Stata",
"content": "A standard package for performing regression discontinuity in Stata is rdrobust, installable from scc. * If necessary * ssc install rdrobust * Load RDD of house elections from the R package rddtools, * and originally from Lee (2008) https://www.sciencedirect.com/science/article/abs/pii/S0304407607001121 import delimited \"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Regression_Discontinuity_Design/house.csv\", clear * x is \"vote margin in the previous election\" and y is \"vote margin in this election\" * If we want to specify options for bandwidth selection, we can run rdbwselect directly. * Otherwise, rdrobust will run it with default options by itself * c(0) indicates that treatment is assigned at 0 (i.e. someone gets more votes than the opponent) rdbwselect y x, c(0) * Run a sharp RDD with a second-order polynomial term rdrobust y x, c(0) p(2) * Run a fuzzy RDD * We don't have a fuzzy RDD in this data, but let's create one, where * probability of treatment jumps from 20% to 60% at the cutoff g treatment = (runiform() < .2)*(x < 0) + (runiform() < .6)*(x >= 0) rdrobust y x, c(0) fuzzy(treatment) * Generate a standard RDD plot with a polynomial of 2 (default is 4) rdplot y x, c(0) p(2) . ",
"url": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#stata-1",
"relUrl": "/Model_Estimation/Research_Design/regression_discontinuity_design.html#stata-1"
- },"629": {
+ },"630": {
"doc": "Reshaping Data",
"title": "Reshaping Data",
"content": " ",
"url": "/Data_Manipulation/Reshaping/reshape.html",
"relUrl": "/Data_Manipulation/Reshaping/reshape.html"
- },"630": {
+ },"631": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "Reshape Panel Data from Long to Wide",
"content": "Panel data is data in which individuals are observed at multiple points in time. There are two standard ways of storing this data: . In wide format, there is one row per individual. Then, for each variable in the data set that varies over time, there is one column per time period. For example: . | Individual | FixedCharacteristic | TimeVarying1990 | TimeVarying1991 | TimeVarying1992 | . | 1 | C | 16 | 20 | 22 | . | 2 | H | 23.4 | 10 | 14 | . This format makes it easy to perform calculations across multiple years. In long format, there is one row per individual per time period: . | Individual | FixedCharacteristic | Year | TimeVarying | . | 1 | C | 1990 | 16 | . | 1 | C | 1991 | 20 | . | 1 | C | 1992 | 22 | . | 2 | H | 1990 | 23.4 | . | 2 | H | 1991 | 10 | . | 2 | H | 1992 | 14 | . This format makes it easy to run models like fixed effects. Reshaping is the method of converting wide-format data to long and vice versa. ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html"
- },"631": {
+ },"632": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "Keep in Mind",
"content": ". | If your data has multiple observations per individual/time, then standard reshaping techniques generally won’t work. | It’s a good idea to check your data by directly looking at it both before and after a reshape to check that it worked properly. | . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#keep-in-mind",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#keep-in-mind"
- },"632": {
+ },"633": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "Also Consider",
"content": ". | To go in the other direction, reshape from wide to long. | Determine the observation level of a data set. | . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#also-consider",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#also-consider"
- },"633": {
+ },"634": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "Implementations",
"content": " ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#implementations",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#implementations"
- },"634": {
+ },"635": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "Python",
"content": "The pandas package has several functions to reshape data. For going from long data to wide data, there’s pivot and pivot_table, both of which are demonstrated in the example below. # Install pandas using pip or conda, if you don't already have it installed. import pandas as pd # Load WHO data on population as an example, which has 'country', 'year', # and 'population' columns. df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/population.csv', index_col=0) # In this example, we would like to have one row per country but the data have # multiple rows per country, each corresponding with # a year-country value of population. # Let's take a look at the first 5 rows: print(df.head()) # To reshape this into a dataframe with one country per row, we can use # the pivot function and set 'country' as the index. As we'd like to # split out years into different columns, we set columns to 'years', and the # values within this new dataframe will be population: df_wide = df.pivot(index='country', columns='year', values='population') # What if there are multiple year-country pairs? Pivot can't work # because it needs unique combinations. In this case, we can use # pivot_table which can aggregate any duplicate year-country pairs. To test it, let's # create some synthetic duplicate data for France and add it to the original # data. We'll pretend there was a second count of population that came in with # 5% higher values for all years. # Copy the data for France synth_fr_data = df.loc[df['country'] == 'France'] # Add 5% for all years synth_fr_data['population'] = synth_fr_data['population']*1.05 # Append it to the end of the original data df = pd.concat([df, synth_fr_data], axis=0) # Compute the wide data - averaging over the two estimates for France for each # year. df_wide = df.pivot_table(index='country', columns='year', values='population', aggfunc='mean') . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#python",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#python"
- },"635": {
+ },"636": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "R",
"content": "There are many ways to reshape in R, including base-R reshape and the deprecated reshape2::melt and cast and tidyr::gather and spread. We will be using the tidyr package function pivot_wider, which requires tidyr version 1.0.0 or later. # install.packages('tidyr') library(tidyr) # Load in population, which has one row per country per year data(\"population\") # If we look at the data, we'll see that we have: # identifying information in \"country\", # a time indicator in \"year\", # and our values in \"population\" head(population) . Now we think: . | Think about the set of variables that contain the values we’re interested in reshaping. Here’s it’s population. This list of variable names will be our values_from argument. | Think about what we want the new variables to be called. The variable variable says which variable we’re looking at. So that will be our names_from argument. And we want to specify that each variable represents population in a given year (rather than some other variable, so we’ll add “pop_” as our names_prefix. | . pop_wide <- pivot_wider(population, names_from = year, values_from = population, names_prefix = \"pop_\") . Another way to do this is using data.table. #install.packages('data.table') library(data.table) # The second argument here is the formula describing the observation level of the data # The full set of variables together is the current observation level (one row per country and year) # The parts before the ~ are what we want the new observation level to be in the wide data (one row per country) # The parts after the ~ are for the variables we want to no longer be part of the observation level (we no longer want a row per year) population = as.data.table(population) pop_wide = dcast(population, country ~ year, value.var = \"population\" ) . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#r",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#r"
- },"636": {
+ },"637": {
"doc": "Reshape Panel Data from Long to Wide",
"title": "Stata",
"content": "* Load blood pressure data in long format, which contains * blood pressure both before and after a treatment for some patients sysuse bplong.dta . The next steps involve thinking: . | Think about the set of variables that identify individuals. Here it’s patient. This will go in i(), so we have i(patient). | Think about the set of variables that vary across time. Here’s it’s bp. This will be one of our “stub”s. | Think about which variable separates the different time periods within individual. Here we have “when”, and this goes in j(), so we have j(when). | . * Syntax is: * reshape wide stub, i(individualvars) j(newtimevar) * So we have reshape wide bp i(patient) j(when) * Note that simply typing reshape * will show the syntax for the function . With especially large datasets, the Gtools package provides a much faster version of reshape known as greshape. The syntax can function exactly the same, though they provide alternative syntax that you may find more intuitive. * First, we will create a toy dataset that is very large to demonstrate the speed gains * If necessary, first install gtools: * ssc install gtools * Clear memory clear all * Turn on return message to see command run time set rmsg on * Set data size to 15 million observations set obs 15000000 * Create ten observations per person generate person_id = floor((_n-1)/10) * Number time periods from 1 to 10 for each person generate time_id = mod((_n-1), 10) + 1 *Create an income in each period generate income = round(rnormal(100, 20)) * Demonstrate the comparative speed of these two reshape approaches. * preserve and restore aren't a part of the reshape command; * they just store the current state of the data and then restore it, * so we can try our different reshape commands on the same data. *The traditional reshape command preserve reshape wide income, i(person_id) j(time_id) restore *The Gtools reshape command preserve greshape wide income, i(person_id) j(time_id) restore *The Gtools reshape command, alternative syntax preserve greshape wide income, by(person_id) keys(time_id) restore . Note: there is much more guidance to the usage of greshape on the Gtools reshape page. ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#stata",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_long_to_wide.html#stata"
- },"637": {
+ },"638": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "Reshape Panel Data from Wide to Long",
"content": "Panel data is data in which individuals are observed at multiple points in time. There are two standard ways of storing this data: . In wide format, there is one row per individual. Then, for each variable in the data set that varies over time, there is one column per time period. For example: . | Individual | FixedCharacteristic | TimeVarying1990 | TimeVarying1991 | TimeVarying1992 | . | 1 | C | 16 | 20 | 22 | . | 2 | H | 23.4 | 10 | 14 | . This format makes it easy to perform calculations across multiple years. In long format, there is one row per individual per time period: . | Individual | FixedCharacteristic | Year | TimeVarying | . | 1 | C | 1990 | 16 | . | 1 | C | 1991 | 20 | . | 1 | C | 1992 | 22 | . | 2 | H | 1990 | 23.4 | . | 2 | H | 1991 | 10 | . | 2 | H | 1992 | 14 | . This format makes it easy to run models like fixed effects. Reshaping is the method of converting wide-format data to long and vice versa.. ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html"
- },"638": {
+ },"639": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "Keep in Mind",
"content": ". | If your data has multiple observations per individual/time, then standard reshaping techniques generally won’t work. | It’s a good idea to check your data by directly looking at it both before and after a reshape to check that it worked properly. | . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#keep-in-mind",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#keep-in-mind"
- },"639": {
+ },"640": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "Also Consider",
"content": ". | To go in the other direction, reshape from long to wide. | Determine the observation level of a data set. | . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#also-consider",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#also-consider"
- },"640": {
+ },"641": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "Implementations",
"content": " ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#implementations",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#implementations"
- },"641": {
+ },"642": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "Python",
"content": "The most user friendly ways to use Python to reshape data from wide to long formats come from the pandas data analysis package. Its wide_to_long function is relatively easy to use, the alternative melt function can handle more complex cases. In this example, we will download the billboard dataset, which has multiple columns for different weeks when a record was in the charts (with the values in each column giving the chart position for that week). All of the columns that we would like to convert to long format begin with the prefix ‘wk’. The wide_to_long function accepts this prefix (as the stubnames= keyword parameter) and uses it to work out which columns to transform into a single column. # Install pandas using pip or conda, if you don't have it already installed import pandas as pd df = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/tidyr/billboard.csv', index_col=0) # stubnames is the prefix for the columns we want to convert to long. i is the # unique id for each row, and j will be the name of the new column. Finally, # the values from the original wide columns (the chart position) adopt the # stubname, so we rename 'wk' to 'position' in the last step. long_df = (pd.wide_to_long(df, stubnames='wk', i=['artist', 'track', 'date.entered'], j='week') .rename(columns={'wk': 'position'})) # The wide_to_long function is a special case of the 'melt' function, which # can be used in more complex cases. Here we melt any columns that have the # string 'wk' in their names. In the final step, we extract the number of weeks # from the prefix 'wk' using regex. The final dataframe is the same as above. long_df = pd.melt(df, id_vars=['artist', 'track', 'date.entered'], value_vars=[x for x in df.columns if 'wk' in x], var_name='week', value_name='position') long_df['week'] = long_df['week'].str.extract(r'(\\d+)') # A more complex case taken from the pandas docs: import numpy as np # In this case, there are two different patterns in the many columns # that we want to convert to two different long columns. We can pass # stubnames a list of these prefixes. It then splits the columns that # have the year suffix into two different long columns depending on # their first letter (A or B) # Create some synthetic data df = pd.DataFrame({\"A1970\" : {0 : \"a\", 1 : \"b\", 2 : \"c\"}, \"A1980\" : {0 : \"d\", 1 : \"e\", 2 : \"f\"}, \"B1970\" : {0 : 2.5, 1 : 1.2, 2 : .7}, \"B1980\" : {0 : 3.2, 1 : 1.3, 2 : .1}, \"X\" : dict(zip(range(3), np.random.randn(3))) }) # Set an index df[\"id\"] = df.index # Wide to multiple long columns df_long = pd.wide_to_long(df, [\"A\", \"B\"], i=\"id\", j=\"year\") . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#python",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#python"
- },"642": {
+ },"643": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "R",
"content": "There are many ways to reshape in R, including base-R reshape and the deprecated reshape2::melt and cast and tidyr::gather and spread. There is also the incredibly fast data.table::melt(). We will be using the tidyr package function pivot_longer, which requires tidyr version 1.0.0 or later. # install.packages('tidyr') library(tidyr) # Load in billboard, which has one row per song # and one variable per week, for its chart position each week data(\"billboard\") # If we look at the data, we'll see that we have: # identifying information in \"artist\" and \"track\" # A variable consistent within individuals \"date.entered\" # and a bunch of variables containing position information # all named wk and then a number names(billboard) . Now we think: . | Think about the set of variables that contain time-varying information. Here’s it’s wk1-wk76. So we can give a list of all the variables we want to widen using the tidyselect helper function starts_with(): starts_with(\"wk\"). This list of variable names will be our col argument. | Think about what we want the new variables to be called. I’ll call the week time variable “week” (this will be the names_to argument), and the data values currently stored in wk1-wk76 is the “position” (values_to). | Think about the values you want to be in your new time variable. The column names are wk1-wk76 but we want the variable to have 1-76 instead, so we’ll take out the “wk” with names_prefix = \"wk\". | . billboard_long <- pivot_longer(billboard, col = starts_with(\"wk\"), names_to = \"week\", names_prefix = \"wk\", values_to = \"position\", values_drop_na = TRUE) # values_drop_na says to drop any rows containing missing values of position. # If reshaping to create multiple variables, see the names_sep or names_pattern options. This task can also be done through data.table. #install.packages('data.table') library(data.table) billboard = as.data.table(billboard) billboard_long = melt(billboard, id = 1:3, na.rm=TRUE, variable.names = \"Week\", value.name = \"Position\" ) . ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#r",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#r"
- },"643": {
+ },"644": {
"doc": "Reshape Panel Data from Wide to Long",
"title": "Stata",
"content": "* Load blood pressure data in wide format, which contains * bp_before and bp_after sysuse bpwide.dta . The next steps involve thinking: . | Think about the set of variables that identify individuals. Here it’s patient. This will go in i(), so we have i(patient). | Think about the set of variables that vary across time. Here’s it’s bp_. Note the inclusion of the _, so that “before” and “after” will be our time periods. This will be one of our “stub”s. | Think about what we want the new time variable to be called. I’ll just call it “time”, and this goes in j(), so we have j(time). | . * Syntax is: * reshape long stub, i(individualvars) j(newtimevar) * So we have reshape long bp_ i(patient) j(time) s * Where the s indicates that our time variable is a string (\"before\", \"after\") * Note that simply typing reshape * will show the syntax for the function . With especially large datasets, the Gtools package provides a much faster version of reshape known as greshape. The syntax can function exactly the same, though they provide alternative syntax that you may find more intuitive. * If necessary, install gtools * ssc install gtools * First, we will create a toy dataset that is very large to demonstrate the speed gains * Clear memory clear all * Turn on return message to see command run time set rmsg on * Set data size to 15 million observations set obs 15000000 * Create an ID variable generate person_id = _n * Create 4 separate fake test scores per student generate test_score1 = round(rnormal(180, 30)) generate test_score2 = round(rnormal(180, 30)) generate test_score3 = round(rnormal(180, 30)) generate test_score4 = round(rnormal(180, 30)) * Demonstrate the comparative speed of these two reshape approaches * preserve and restore aren't a part of the reshape command; * they just store the current state of the data and then restore it, * so we can try our different reshape commands on the same data. * The traditional reshape command preserve reshape long test_score, i(person_id) j(test_number) restore *The Gtools reshape command preserve greshape long test_score, i(person_id) j(test_number) restore *The Gtools reshape command, alternative syntax preserve greshape long test_score, by(person_id) keys(test_number) restore . Note: there is much more guidance to the usage of greshape on the Gtools reshape page. ",
"url": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#stata",
"relUrl": "/Data_Manipulation/Reshaping/reshape_panel_data_from_wide_to_long.html#stata"
- },"644": {
+ },"645": {
"doc": "Rowwise Calculations",
"title": "Rowwise Calculations",
"content": "When working with a table of data, it’s not uncommon to want to perform a calculations across many columns. For example, taking the mean of a bunch of columns for each row. This is generally not difficult to do by hand if the number of variables being handled is small. For example, in most software packages, you could take the mean of columns A and B for each row by just asking for (A+B)/2. This becomes more difficult, though, when the list of variables gets too long to type out by hand, or when the calculation doesn’t play nicely with being given columns. In these cases, approaches explicitly designed for rowwise calculations are necessary. ",
"url": "/Data_Manipulation/rowwise_calculations.html",
"relUrl": "/Data_Manipulation/rowwise_calculations.html"
- },"645": {
+ },"646": {
"doc": "Rowwise Calculations",
"title": "Keep in Mind",
"content": ". | When incorporating lots of variables, rowwise calculations often allow you to select those variables by group, such as “all variables starting with r_”. When doing this, check ahead of time to make sure you aren’t accidentally incorporating unintended variables. | . ",
"url": "/Data_Manipulation/rowwise_calculations.html#keep-in-mind",
"relUrl": "/Data_Manipulation/rowwise_calculations.html#keep-in-mind"
- },"646": {
+ },"647": {
"doc": "Rowwise Calculations",
"title": "Implementations",
"content": " ",
"url": "/Data_Manipulation/rowwise_calculations.html#implementations",
"relUrl": "/Data_Manipulation/rowwise_calculations.html#implementations"
- },"647": {
+ },"648": {
"doc": "Rowwise Calculations",
"title": "Python",
"content": "The pandas data analysis package provides several methods for performing row-wise (or column-wise) operations in Python. Many common operations, such as sum and mean, can be called directly (eg summing over multiple columns to create a new column). It’s useful to know the axis convention in pandas: operations that combine columns often require the user to pass axis=1 to the function, while operations that combine rows require axis=0. This convention follows the usual one for matrices of denoting individual elements first by the ith row and then by the jth column. Although not demonstrated in the example below, lambda functions can be used for more complex operations that aren’t built-in and apply to multiple rows or columns. # If necessary, install pandas using pip or conda import pandas as pd # Grab the data df = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv\", index_col=0) # Let's assume that we want to sum, row-wise, every column # that contains 'perc' in its column name and check that # the total is 300. Use a list comprehension to get only # relevant columns, sum across them (axis=1), and create a # new column to store them: df['perc_sum'] = df[[x for x in df.columns if 'perc' in x]].sum(axis=1) # We can now check whether, on aggregate, each row entry of this new column # is 300 (it's not!) df['perc_sum'].describe() . ",
"url": "/Data_Manipulation/rowwise_calculations.html#python",
"relUrl": "/Data_Manipulation/rowwise_calculations.html#python"
- },"648": {
+ },"649": {
"doc": "Rowwise Calculations",
"title": "R",
"content": "There are a few ways to perform rowwise operations in R. If you are summing the columns or taking their mean, rowSums and rowMeans in base R are great. For something more complex, apply in base R can perform any necessary rowwise calculation, but pmap in the purrr package is likely to be faster. In all cases, the tidyselect helpers in the dplyr package can help you to select many variables by name. # If necessary # install.packages(c('purrr','ggplot2','dplyr')) # ggplot2 is only for the data data(midwest, package = 'ggplot2') # dplyr is for the tidyselect functions, the pipe %>%, and select() to pick columns library(dplyr) # There are three sets of variables starting with \"perc\" - let's make sure they # add up to 300 as they maybe should # Use starts_with to select the variables # First, do it with rowSums, # either by picking column indices or using tidyselect midwest$rowsum_rowSums1 <- rowSums(midwest[,c(12:16,18:20,22:26)]) midwest$rowsum_rowSums2 <- midwest %>% select(starts_with('perc')) %>% rowSums() # Next, with apply - we're doing sum() here for the function # but it could be anything midwest$rowsum_apply <- apply( midwest %>% select(starts_with('perc')), MARGIN = 1, sum) # Next, two ways with purrr: library(purrr) # First, using purrr::reduce, which is good for some functions like summing # Note that . is the data set being sent by %>% midwest <- midwest %>% mutate(rowsum_purrrReduce = reduce(select(., starts_with('perc')), `+`)) # More flexible, purrr::pmap, which works for any function # using pmap_dbl here to get a numeric variable rather than a list midwest <- midwest %>% mutate(rowsum_purrrpmap = pmap_dbl( select(.,starts_with('perc')), sum)) # So do we get 300? summary(midwest$rowsum_rowSums2) # Uh-oh... looks like we didn't understand the data after all. ",
"url": "/Data_Manipulation/rowwise_calculations.html#r",
"relUrl": "/Data_Manipulation/rowwise_calculations.html#r"
- },"649": {
+ },"650": {
"doc": "Rowwise Calculations",
"title": "Stata",
"content": "Stata has a series of built-in row operations that use the egen command. See help egen for the full list, and look for functions beginning with row like rowmean. The full list includes: rowfirst and rowlast (first or last non-missing observation), rowmean, rowmedian, rowmax, rowmin, rowpctile, and rowtotal (the mean, median, max, min, given percentile, or sum of all the variables), and rowmiss and rownonmiss (the count of the number of missing or nonmissing observations across the variables). The egenmore package, which can be installed with ssc install egenmore, adds rall, rany, and rcount (checks a condition for each variable and returns whether all are true, any are true, or the number that are true), rownvals and rowsvals (number of unique values for numeric and string variables, respectively), and rsum2 (rowtotal with some additional options). * Get data on midwestern states import delimited using \"https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/midwest.csv\" * There are three sets of variables starting with \"perc\" - let's make sure they * add up to 300 as they should * Use * as a wildcard for variable names egen total_perc = rowtotal(perc*) summ total_perc * They don't! Uh oh. * Let's just check the education variables - should add up to 100 * Use - to include all variables from one to the other * based on their current order in the data egen total_ed = rowtotal(perchsd-percprof) * Oh that explains it... * These aren't exclusive categories (HSD, college overlap) * and also leaves out non-HS graduates. summ total_ed . ",
"url": "/Data_Manipulation/rowwise_calculations.html#stata",
"relUrl": "/Data_Manipulation/rowwise_calculations.html#stata"
- },"650": {
+ },"651": {
"doc": "Sankey Diagrams",
"title": "Sankey Diagrams",
"content": "Sankey diagrams are visual displays that represent a data flow across sequential points of change, sorting, or decision. They can be used to track decision-making, behavioral patterns, resource flow, or as a method to display time series data, among other uses. ",
"url": "/Presentation/Figures/sankey_diagrams.html",
"relUrl": "/Presentation/Figures/sankey_diagrams.html"
- },"651": {
+ },"652": {
"doc": "Sankey Diagrams",
"title": "Keep in Mind",
"content": ". | A Sankey diagram is comprised of stacked categorical variables, with each variable on its own vertical axis. | Categorical flow points are generally referred to as “nodes.” | Horizontal lines or bands show the density of variables at each node and the subsequent distribution onto the next variable. | . ",
"url": "/Presentation/Figures/sankey_diagrams.html#keep-in-mind",
"relUrl": "/Presentation/Figures/sankey_diagrams.html#keep-in-mind"
- },"652": {
+ },"653": {
"doc": "Sankey Diagrams",
"title": "Also Consider",
"content": ". | Variables should generally be categorical, as continuous values will typically not work in this setting. | Too few or too many categories can make a Sankey diagram less effective. Segmenting or grouping variables may be useful. | Sankey diagrams are sometimes known as alluvial diagrams, though the latter is often used to describe changes over time. | . ",
"url": "/Presentation/Figures/sankey_diagrams.html#also-consider",
"relUrl": "/Presentation/Figures/sankey_diagrams.html#also-consider"
- },"653": {
+ },"654": {
"doc": "Sankey Diagrams",
"title": "Implementations",
"content": " ",
"url": "/Presentation/Figures/sankey_diagrams.html#implementations",
"relUrl": "/Presentation/Figures/sankey_diagrams.html#implementations"
- },"654": {
+ },"655": {
"doc": "Sankey Diagrams",
"title": "R",
"content": "There are many excellent packages in R for making Sankey diagrams (networkD3, alluvial, and ggforce among them), but let’s begin by looking at the highcharter package. It is an R wrapper for the Highcharts Javascript library and a powerful tool. It’s also easy to get up and running quickly, while some other packages may require more preliminary data wrangling. We begin by loading pacman and dplyr. library(pacman) p_load(dplyr) . Next, we bring in the highcharter package and import a csv file that includes data from the 2020-2021 NBA season, including team, division, winning percentage, playoff seeding, and appearance in the conference semifinals. We change the winning percentage “win_perc” variable to a character so that it functions appropriately in this setting and take a look at the first few rows. p_load(highcharter) nba = read.csv(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Sankey_Diagrams/NBA.csv\") nba$win_perc <- as.character(nba$win_perc) head(nba) . Now we simply use “data_to_sankey” within the hchart function to create our Sankey diagram. We see that the data flows in the same order as our data frame, from individual team to conference, and then from winning percentage and playoff position to whether the team made the conference semifinals. I have chosen the theme ggplot2 but there are many nice options. hchart(data_to_sankey(nba), \"sankey\", name = \"Number of teams\") %>% hc_title(text= \"NBA 2020-2021 Season\") %>% hc_subtitle(text= \"Team --- Conference --- Winning Percentage --- Playoff Position --- Advancement to Conference Semifinals\") %>% hc_add_theme(hc_theme_ggplot2()) %>% hc_plotOptions(series = list(dataLabels = list( style = list(fontSize = \"10px\")))) . Dynamically hovering the cursor over each node or branch gives us a count of how many teams went to each of the next nodes. For instance, we see that 3 teams from the West had a winning percentage of 0.4. Also, between the last two nodes we see that one top 4 seed did not advance to the conference semifinals and one 5 to 8 seed did. Next, we look at the ggalluvial package, which is an extension for the ggplot2 package. This, too, is simple to get started. In fact, the bulk of the code here is manipulating the familiar mtcars data set such that hp, wt, mpg, and qsec are made categorical from their original numeric values. This fact underscores one way the Sankey diagram is useful. Namely, that values can be essentially binned in order to see trends in data flow. We load the package and mtcars, do our data wrangling, and check out the first few rows. p_load(ggplot2, ggalluvial) data(mtcars) mtc = mtcars %>% select(cyl, hp, wt, qsec, mpg) %>% mutate( hp = case_when( hp <= 100 ~ \"0-100\", hp <= 150 ~ \"100-150\", hp <= 200 ~ \"150-200\", hp <= 500 ~ \"200-350\"), wt = case_when( wt <= 2 ~ \"1-2\", wt <= 3 ~ \"2-3\", wt <= 4 ~ \"3-4\", wt <= 7 ~ \"4-6\"), mpg = case_when( mpg <= 20 ~ \"10-20 mpg\", mpg <= 30 ~ \"20-30 mpg\", mpg <= 50 ~ \"30-40 mpg\"), qsec = case_when( qsec <= 16 ~ \"14-16\", qsec <= 17 ~ \"16-17\", qsec <= 18 ~ \"17-18\", qsec <= 23 ~ \"18-23\" )) head(mtc) . Next, we use the familiar ggplot and include the line “geom_alluvium” to induce an alluvial diagram. We can interpret that weight and number of cylinders are highly correlated but that horsepower and the quarter-mile time are less so. 
ggplot(data = mtc, aes(axis1 = wt, axis2 = cyl, axis3 = hp, axis4 = qsec)) + scale_x_discrete(limits = c(\"Weight (1,000 lbs)\", \"Cylinders\", \"Horsepower\", \"1/4 mile time (seconds\"), expand = c(.05, .05)) + geom_alluvium(aes(fill = mpg)) + geom_stratum(color = \"grey\") + geom_text(stat = \"stratum\", aes(label = after_stat(stratum))) + theme_minimal() + ggtitle(\"Miles per Gallon\", \"Stratified by weight, cylinders, horsepower, & 1/4 mile time (n = 32 car models)\") . We see four variables (wt, cyl, hp, and qsec) in columns, with the proportion of each category represented by the height of the node. In this package, it is easier to see the distribution of each variable because columns are all the same height and frequency of categorical values is proportional. The y axis is a measure of the number of observations in our sample. Additionally, our fifth variable, mpg, is color coded in bands across the diagram, allowing us to highlight a particular aspect of this data set. These are relatively basic examples, but in a few lines of code demonstrate the usefulness of a Sankey diagram to track the flow and distribution of variables in a data set. ",
"url": "/Presentation/Figures/sankey_diagrams.html#r",
"relUrl": "/Presentation/Figures/sankey_diagrams.html#r"
- },"655": {
+ },"656": {
"doc": "Sankey Diagrams",
"title": "Stata",
"content": "They sankey package can be used to easily conduct sankey plots in Stata. For further vignettes to master the subcommands, please reference Asjad Naqvi’s Github repository on the matter. First, install the package through SSC and be sure to replace in case of updates to the package. A dependency is the palettes package as well. ssc install sankey, replace ssc install palettes, replace ssc install colrspace, replace . While you can use the subcommands to elaborate on the process, the basic commands for the sankey plot is shown below. For this vingette, we will use the Sankey example dataset from Asjad Naqvi. # Import Data import excel using \"https://github.com/asjadnaqvi/stata-sankey/blob/main/data/sankey_example2.xlsx?raw=true\", clear first # Basic Sankey plot sankey value, from(source) to(destination) by(layer) . ",
"url": "/Presentation/Figures/sankey_diagrams.html#stata",
"relUrl": "/Presentation/Figures/sankey_diagrams.html#stata"
- },"656": {
+ },"657": {
"doc": "Scatterplot by Group on Shared Axes",
"title": "Scatterplot by Group on Shared Axes",
"content": "Scatterplots are a standard data visualization tool that allows you to look at the relationship between two variables \\(X\\) and \\(Y\\). If you want to see how the relationship between \\(X\\) and \\(Y\\) might be different for Group A as opposed to Group B, then you might want to plot the scatterplot for both groups on the same set of axes, so you can compare them. ",
"url": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html",
"relUrl": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html"
- },"657": {
+ },"658": {
"doc": "Scatterplot by Group on Shared Axes",
"title": "Keep in Mind",
"content": ". | Scatterplots may not work well if the data is discrete, or if there are a large number of data points. | . ",
"url": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#keep-in-mind",
"relUrl": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#keep-in-mind"
- },"658": {
+ },"659": {
"doc": "Scatterplot by Group on Shared Axes",
"title": "Also Consider",
"content": ". | Sometimes, instead of putting both Group A and Group B on the same set of axes, it makes more sense to plot them separately, and put the plots next to each other. See Faceted Graphs. | There are many ways to make the scatterplots of the two groups distinct. See Styling Scatterplots. | . ",
"url": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#also-consider",
"relUrl": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#also-consider"
- },"659": {
+ },"660": {
"doc": "Scatterplot by Group on Shared Axes",
"title": "Implementations",
"content": " ",
"url": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#implementations",
"relUrl": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#implementations"
- },"660": {
+ },"661": {
"doc": "Scatterplot by Group on Shared Axes",
"title": "R",
"content": "library(ggplot2) # Load auto data data(mtcars) # Make sure that our grouping variable is a factor # and labeled properly mtcars$Transmission <- factor(mtcars$am, labels = c(\"Automatic\", \"Manual\")) # Put wt on the x-axis, mpg on the y-axis, ggplot(mtcars, aes(x = wt, y = mpg, # distinguish the Transmission values by color, color = Transmission)) + # make it a scatterplot with geom_point() geom_point()+ # And label properly labs(x = \"Car Weight\", y = \"MPG\") . This results in: . ",
"url": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#r",
"relUrl": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#r"
- },"661": {
+ },"662": {
"doc": "Scatterplot by Group on Shared Axes",
"title": "Stata",
"content": "* Load auto data sysuse auto.dta * Start a twoway command * Then, for each group, put its scatter command in () * Using if to plot each group separately * And specifying mcolor or msymbol (etc.) to differentiate them twoway (scatter weight mpg if foreign == 0, mcolor(black)) (scatter weight mpg if foreign == 1, mcolor(blue)) * Add a legend option so you know what the colors mean twoway (scatter weight mpg if foreign == 0, mcolor(black)) (scatter weight mpg if foreign == 1, mcolor(blue)), legend(lab(1 Domestic) lab(2 Foreign)) xtitle(\"Weight\") ytitle(\"MPG\") . This results in: . ",
"url": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#stata",
"relUrl": "/Presentation/Figures/scatterplot_by_group_on_shared_axes.html#stata"
- },"662": {
+ },"663": {
"doc": "Set a Working Directory",
"title": "Set a Working Directory",
"content": "When you want to refer to files on your computer in your code, for example to load a data file, you can usually refer to them in one of two ways: . | Using an absolute path, which starts from the root of the computer. On Windows it would look something like C:/Users/Name/Documents/MyFile.csv | Using a relative path which starts in the working directory and works from there. For example, if your workind directory were C:/Users/Name/ then you could refer to that same MyFile.csv file using Documents/MyFile.csv as a relative path. | . Using absolute paths is generally frowned upon because it makes it very difficult for anyone to run your code on their computer, since they won’t have the same folder structure. So, you want to use relative paths in your code. This means you need to know how to set the working directory so you know where your file searching starts from. ",
"url": "/Other/set_a_working_directory.html",
"relUrl": "/Other/set_a_working_directory.html"
- },"663": {
+ },"664": {
"doc": "Set a Working Directory",
"title": "Keep in Mind",
"content": ". | If you set a working directory in your code, that’s basically the same as using an absolute path. Your code won’t work on anyone else’s computer! Setting a working directory is generally something you’ll do interactively (by hand, either using a menu or some code you type directly in the console) when you start your software package. | Once you are in a working directory, you can explore your folder structure using your filepath. As above, if your working directory is C:/Users/Name/, you can get to the file MyFile.csv in the C:/Users/Name/Documents/ folder with Documents/MyFile.csv. You can also go up folders with ... You can get at image.png in the Users folder with ../image.png Or if you want the file C:/Users/Admin/passwords.txt you could do ../Admin/passwords.txt. This means you can set your working directory once, and reach for files anywhere you like without having to change it again. Or if you got the working directory wrong, you can get at a new one with a relative filepath! If cd() is your language’s working-directory-setting command, you can go from C:/Users/Name/Documents/ to C:/Users/Name/ with cd('..') to go up one folder. | Because setting the working directory is often done by hand anyway, it’s common for it to be a point-and-click or menu feature in your software, even in software designed for use with text code. Some examples of this will be in the Implementations section. | Many editors and IDEs come with project managers. Most project managers have you designate a folder as being that project’s home. Then, when you open that project, most managers will automatically set the working directory to that home folder. | In Windows, if you copy a filepath in, it will often use \\ instead of / between folders. Many programming languages don’t like this. You may have to change them manually. | . ",
"url": "/Other/set_a_working_directory.html#keep-in-mind",
"relUrl": "/Other/set_a_working_directory.html#keep-in-mind"
- },"664": {
+ },"665": {
"doc": "Set a Working Directory",
"title": "Also Consider",
"content": ". | Get a list of files from a directory. | . ",
"url": "/Other/set_a_working_directory.html#also-consider",
"relUrl": "/Other/set_a_working_directory.html#also-consider"
- },"665": {
+ },"666": {
"doc": "Set a Working Directory",
"title": "Implementations",
"content": " ",
"url": "/Other/set_a_working_directory.html#implementations",
"relUrl": "/Other/set_a_working_directory.html#implementations"
- },"666": {
+ },"667": {
"doc": "Set a Working Directory",
"title": "Julia",
"content": "In Julia, you can use the cd() function to change the working directory. cd(\"C:/My/New/Working/Directory/\") . You may use the pwd() function to check the current working directory. pwd() . ",
"url": "/Other/set_a_working_directory.html#julia",
"relUrl": "/Other/set_a_working_directory.html#julia"
- },"667": {
+ },"668": {
"doc": "Set a Working Directory",
"title": "Python",
"content": "In Python, the os.chdir() function will let you change working directories. import os os.chdir('C:/My/New/Working/Directory/') # Or if you want to change the directory to your \"Home\" directory, you can use os.path.expanduser(\"~\") os.chdir(os.path.expanduser(\"~\")) . In the Spyder IDE, the working directory is listed by default in the top-right, and you can edit it directly. ",
"url": "/Other/set_a_working_directory.html#python",
"relUrl": "/Other/set_a_working_directory.html#python"
- },"668": {
+ },"669": {
"doc": "Set a Working Directory",
"title": "R",
"content": "In R, the setwd() function can change the working directory. setwd('C:/My/New/Working/Directory/') . If you are working in an R project, there is also the here package. library(here) here() . here() will start in whatever your current working directory and look upwards into parent folders until it finds something that indicates that it’s found a folder containing a project: an .Rproj (R Project) file, a .git or .svn folder, or any of the files .here, .projectile, remake.yml, or DESCRIPTION, and will set the working directory to that folder. This won’t work if you haven’t set up a proper project folder structure. If you are using RStudio, there are several other ways to set the working directory. In the Session menu, you can choose to set the working directory to the Source File location (whatever folder the active code tab file is saved in), to the File Pane location (whatever folder the Files pane, in the bottom-right by default, has navigated to), or you can choose it using your standard operating system folder-picker. You can also navigate to the folder you want in the Files pane (which is in the bottom-right by default) and select More \\(\\rightarrow\\) Set as Working Directory. ",
"url": "/Other/set_a_working_directory.html#r",
"relUrl": "/Other/set_a_working_directory.html#r"
- },"669": {
+ },"670": {
"doc": "Set a Working Directory",
"title": "Stata",
"content": "In Stata, you can use the cd command to change working directories. cd \"C:/My/New/Working/Directory/\" . You can also change the working directory in the File \\(\\rightarrow\\) Change Working Directory menu, which will pull up your standard operating system folder-picker. Additionally, if you open Stata by clicking on a .do file saved on your computer, the working directory will automatically be set to whatever folder that .do file is saved in. ",
"url": "/Other/set_a_working_directory.html#stata",
"relUrl": "/Other/set_a_working_directory.html#stata"
- },"670": {
+ },"671": {
"doc": "Simple Linear Regression",
"title": "Simple Linear Regression",
"content": "Ordinary Least Squares (OLS) is a statistical method that produces a best-fit line between some outcome variable \\(Y\\) and any number of predictor variables \\(X_1, X_2, X_3, ...\\). These predictor variables may also be called independent variables or right-hand-side variables. For more information about OLS, see Wikipedia: Ordinary Least Squares. ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html"
- },"671": {
+ },"672": {
"doc": "Simple Linear Regression",
"title": "Keep in Mind",
"content": ". | OLS assumes that you have specified a true linear relationship. | OLS results are not guaranteed to have a causal interpretation. Just because OLS estimates a positive relationship between \\(X_1\\) and \\(Y\\) does not necessarily mean that an increase in \\(X_1\\) will cause \\(Y\\) to increase. | OLS does not require that your variables follow a normal distribution. | . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#keep-in-mind",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#keep-in-mind"
- },"672": {
+ },"673": {
"doc": "Simple Linear Regression",
"title": "Also Consider",
"content": ". | OLS standard errors assume that the model’s error term is IID, which may not be true. Consider whether your analysis should use heteroskedasticity-robust standard errors or cluster-robust standard errors. | If your outcome variable is discrete or bounded, then OLS is by nature incorrectly specified. You may want to use probit or logit instead for a binary outcome variable, or ordered probit or ordered logit for an ordinal outcome variable. | If the goal of your analysis is predicting the outcome variable and you have a very long list of predictor variables, you may want to consider using a method that will select a subset of your predictors. A common way to do this is a penalized regression method like LASSO. | In many contexts, you may want to include interaction terms or polynomials in your regression equation. | . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#also-consider",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#also-consider"
- },"673": {
+ },"674": {
"doc": "Simple Linear Regression",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#implementations",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#implementations"
- },"674": {
+ },"675": {
"doc": "Simple Linear Regression",
"title": "Gretl",
"content": "# Load auto data open https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Data/auto.gdt # Run OLS using the auto data, with mpg as the outcome variable # and headroom, trunk, and weight as predictors ols mpg const headroom trunk weight . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#gretl",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#gretl"
- },"675": {
+ },"676": {
"doc": "Simple Linear Regression",
"title": "Julia",
"content": "# Uncomment the next line to install all the necessary packages # import Pkg; Pkg.add([\"CSV\", \"DataFrames\", \"GLM\", \"StatsModels\"]) # We tap into JuliaStats ecosystem to solve our data and regression problems :) # In particular, DataFrames package provides dataset handling functions, # StatsModels gives us the `@formula` macro to specify our model in a concise and readable form, # while GLM implements (Generalized) Linear Models fitting and analysis. # And all these packages work together seamlessly. using StatsModels, GLM, DataFrames, CSV # Here we download the data set, parse the file with CSV and load into a DataFrame mtcars = CSV.read(download(\"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Data/mtcars.csv\"), DataFrame) # The following line closely follows the R and Python syntax, thanks to GLM and StatModels packages # Here we specify a linear model and fit it to our data set in one go ols = lm(@formula(mpg ~ cyl + hp + wt), mtcars) # This will print out the summary of the fitted model including # coefficients' estimates, standard errors, confidence intervals and p-values print(ols) . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#julia",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#julia"
- },"676": {
+ },"677": {
"doc": "Simple Linear Regression",
"title": "Matlab",
"content": "% Load auto data load('https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Data/auto.mat') % Run OLS using the auto data, with mpg as the outcome variable % and headroom, trunk, and weight as predictors intercept = ones(length(headroom),1); X = [intercept headroom trunk weight]; [b,bint,r,rint,stats] = regress(mpg,X); . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#matlab",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#matlab"
- },"677": {
+ },"678": {
"doc": "Simple Linear Regression",
"title": "Python",
"content": "# Use 'pip install statsmodels' or 'conda install statsmodels' # on the command line to install the statsmodels package. # Import the relevant parts of the package: import statsmodels.api as sm import statsmodels.formula.api as smf # Get the mtcars example dataset mtcars = sm.datasets.get_rdataset(\"mtcars\").data # Fit OLS regression model to mtcars ols = smf.ols(formula='mpg ~ cyl + hp + wt', data=mtcars).fit() # Look at the OLS results print(ols.summary()) . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#python",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#python"
- },"678": {
+ },"679": {
"doc": "Simple Linear Regression",
"title": "R",
"content": "# Load Data # data(mtcars) ## Optional: automatically loaded anyway # Run OLS using the mtcars data, with mpg as the outcome variable # and cyl, hp, and wt as predictors olsmodel <- lm(mpg ~ cyl + hp + wt, data = mtcars) # Look at the results summary(olsmodel) . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#r",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#r"
- },"679": {
+ },"680": {
"doc": "Simple Linear Regression",
"title": "SAS",
"content": "/* Load Data */ proc import datafile=\"C:mtcars.dbf\" out=fromr dbms=dbf; run; /* OLS regression */ proc reg; model mpg = cyl hp wt; run; . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#sas",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#sas"
- },"680": {
+ },"681": {
"doc": "Simple Linear Regression",
"title": "Stata",
"content": "* Load auto data sysuse https://github.com/LOST-STATS/lost-stats.github.io/blob/master/Data/auto.dta * Run OLS using the auto data, with mpg as the outcome variable * and headroom, trunk, and weight as predictors regress mpg headroom trunk weight . ",
"url": "/Model_Estimation/OLS/simple_linear_regression.html#stata",
"relUrl": "/Model_Estimation/OLS/simple_linear_regression.html#stata"
- },"681": {
+ },"682": {
"doc": "Simple Web Scraping",
"title": "Introduction",
"content": "Webscraping is the processs of programmatically extracting information from the internet that was intended to be displayed in a browser. But it should only be used as a last resort; generally an API (appplication programming interface) is a much better way to obtain information, if one is available. If you do find yourself in a scraping situation, be really sure to check it’s legally allowed and that you are not violating the website’s robots.txt rules. robots.txt is a special file on almost every website that sets out what’s fair play to crawl (conditional on legality) and what your webscraper should not go poking around in. ",
"url": "/Other/simple_web_scrape.html#introduction",
"relUrl": "/Other/simple_web_scrape.html#introduction"
- },"682": {
+ },"683": {
"doc": "Simple Web Scraping",
"title": "Keep in Mind",
"content": " Remember that webscraping is an art as much a science so play around with a problem and figure out creative ways to solve issues, it might not pop out at you immediately. ",
"url": "/Other/simple_web_scrape.html#keep-in-mind",
"relUrl": "/Other/simple_web_scrape.html#keep-in-mind"
- },"683": {
+ },"684": {
"doc": "Simple Web Scraping",
"title": "Implementation",
"content": " ",
"url": "/Other/simple_web_scrape.html#implementation",
"relUrl": "/Other/simple_web_scrape.html#implementation"
- },"684": {
+ },"685": {
"doc": "Simple Web Scraping",
"title": "Python",
"content": "Five of the most well-known and powerful libraries for webscraping in Python, which between them cover a huge range of needs, are requests, lxml, beautifulsoup, selenium, and scrapy. Broadly, requests is for downloading webpages via code, beautifulsoup and lxml are for parsing webpages and extracting info, and scrapy and selenium are full web-crawling solutions. For the special case of scraping table from websites, pandas is the best option. For quick and simple webscraping of individual HTML tags, a good combo is requests, which does little more than go and grab the HTML of a webpage, and beautifulsoup, which then helps you to navigate the structure of the page and pull out what you’re actually interested in. For dynamic webpages that use javascript rather than just HTML, you’ll need selenium. To scale up and hit thousands of webpages in an efficient way, you might try scrapy, which can work with the other tools and handle multiple sessions, and all other kinds of bells and whistles… it’s actually a “web scraping framework”. Let’s see a simple example using requests and beautifulsoup, followed by an example of extracting a table using pandas. First we need to import the packages; remember you may need to install these first by running pip install packagename on your computer’s command line. import requests from bs4 import BeautifulSoup import pandas as pd . Now we’ll specify a URL to scrape, download it as a page, and show some of the HTML as downloaded (here, the first 500 characters) . url = \"https://blackopaldirect.com/product/black-opals/2-86-ct-black-opal-11-6x9-7x3-9mm/\" page = requests.get(url) print(page.text[:500]) . <!DOCTYPE html> <html lang=\"en-US\"> <head> <meta charset=\"UTF-8\"> <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, maximum-scale=1.0, user-scalable=no\"> <link rel=\"profile\" href=\"http://gmpg.org/xfn/11\"> <link rel=\"pingback\" href=\"https://blackopaldirect.com/xmlrpc.php\"> <!-- Facebook Pixel Code --> <script> !function(f,b,e,v,n,t,s) {if(f.fbq)return;n=f.fbq=function(){n.callMethod? n.callMethod.apply(n,arguments):n.queue.push(arguments)}; if(!f._fbq)f._fbq=n;n.push=n;n.loaded= . That’s a bit tricky to read, let alone get any useful data out of. So let’s now use beautifulsoup, which parses extracted HTML. To pretty print the page use .text. In the example below, we’ll just show the first 100 characters and we’ll also use rstrip and lstrip to trim leading and trailing whitespace: . soup = BeautifulSoup(page.text, 'html.parser') print(soup.text[:100].lstrip().rstrip()) . 2.86 ct black opal 11.6x9.7x3.9mm - Black Opal Direct . There are lots of different elements, with tags, that make up a page of HTML. For example, a title might have a tag ‘h1’ and a class ‘product_title’. Let’s see how we can retrieve anything with a class that is ‘price’ and a tag that is ‘p’ as these are the characteristics of prices displayed on the URL we are scraping. price_html = soup.find(\"p\", {\"class\": \"price\"}) print(price_html) . <p class=\"price\"><span class=\"woocommerce-Price-amount amount\"><bdi><span class=\"woocommerce-Price-currencySymbol\">US$</span>2,500.00</bdi></span></p> . This returns the first tag found that satisfies the conditions (to get all tags matching the criteria use soup.find_all). To extract the value, just use .text: . price_html.text . 'US$2,500.00' . Now let’s see an example of reading in a whole table of data. For this, we’ll use pandas, the ubiquitous Python library for working with data. 
We will read data from the first table on ‘https://simple.wikipedia.org/wiki/FIFA_World_Cup’ using pandas. The function we’ll use is read_html, which returns a list of dataframes of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the match= keyword argument with text that only appears in the table(s) you’re interested in. The example below shows how this works; looking at the website, we can see that the table we’re interested in (of past world cup results), has a ‘fourth place’ column while other tables on the page do not. Therefore we run: . df_list = pd.read_html('https://simple.wikipedia.org/wiki/FIFA_World_Cup', match='Fourth Place') # Retrieve first and only entry from list of dataframes df = df_list[0] df.head() . | | Year | Host | Winner | Score | Runners-up | Third Place | Score.1 | Fourth Place | . | 0 | 1930 Details | Uruguay | Uruguay | 4 - 2 | Argentina | United States | [note 1] | Yugoslavia | . | 1 | 1934 Details | Italy | Italy | 2 - 1(a.e.t.) | Czechoslovakia | Germany | 3 - 2 | Austria | . | 2 | 1938 Details | France | Italy | 4 - 2 | Hungary | Brazil | 4 - 2 | Sweden | . | 3 | 1950 Details | Brazil | Uruguay | [note 2] | Brazil | Sweden | [note 2] | Spain | . | 4 | 1954 Details | Switzerland | West Germany | 3 - 2 | Hungary | Austria | 3 - 1 | Uruguay | . This delivers the table neatly loaded into a pandas dataframe ready for further use. ",
"url": "/Other/simple_web_scrape.html#python",
"relUrl": "/Other/simple_web_scrape.html#python"
- },"685": {
+ },"686": {
"doc": "Simple Web Scraping",
"title": "R",
"content": "The “rvest” package is a webscraping package in R that provides a tremendous amount of versatility, as well as being easy to use. For this specific task, of web scrapping pages on a website, we will be using read_html(), html_node(), html_table(), html_elements(), and html_text(). I will also make use of the selector gadget tool,(link for the download:Selector Gaget, as well as F12, to find the html paths. html_node and html_text . library(rvest) black_opals = read_html(\"https://blackopaldirect.com/product/black-opals/2-86-ct-black-opal-11-6x9-7x3-9mm/\") # Website of interest price = black_opals %>% html_node(\"#product-103505 > div.summary.entry-summary > p.price > span > bdi\") %>% # Find the exact element's node for the price html_text() # Convert it to text price # print the price ## [1] \"US$2,500.00\" . html_table . world_cup = read_html(\"https://simple.wikipedia.org/wiki/FIFA_World_Cup\") # Past_World_Cup_results cup_table = world_cup %>% html_elements(xpath = \"/html/body/div[3]/div[3]/div[5]/div[1]/table[2]\") %>% html_table() # Extract html elements cup_table = cup_table[[1]] # Assign the table from the lists cup_table %>% head(5) # First 5 obs ## # A tibble: 5 x 8 ## Year Host Winner Score `Runners-up` `Third Place` Score `Fourth Place` ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1930 D~ Uruguay Uruguay 4 - 2 Argentina United States [not~ Yugoslavia ## 2 1934 D~ Italy Italy 2 - 1~ Czechoslova~ Germany 3 - 2 Austria ## 3 1938 D~ France Italy 4 - 2 Hungary Brazil 4 - 2 Sweden ## 4 1950 D~ Brazil Uruguay [note~ Brazil Sweden [not~ Spain ## 5 1954 D~ Switze~ West G~ 3 - 2 Hungary Austria 3 - 1 Uruguay . Another good tool is html_element. Also here is the rvest website for more information. rvest . ",
"url": "/Other/simple_web_scrape.html#r",
"relUrl": "/Other/simple_web_scrape.html#r"
- },"686": {
+ },"687": {
"doc": "Simple Web Scraping",
"title": "Simple Web Scraping",
"content": " ",
"url": "/Other/simple_web_scrape.html",
"relUrl": "/Other/simple_web_scrape.html"
- },"687": {
+ },"688": {
"doc": "Spatial Joins",
"title": "Spatial Joins",
"content": "Spatial joins are crucial for merging different types of data in geospatial analysis. For example, if you want to know how many libraries (points) are in a city, county, or state (polygon). This skill allows you to take data from different types of spatial data (vector data like points, lines, and polygons, and raster data (with a little more work)) sets and merge them together using unique identifiers. Joins are typically interesections of objects, but can be expressed in different ways. These include: equals, covers, covered by, within, touches, near, crosses, and more. These are all functions within the sf function in R or the geopandas package in Python. For more on the different types of intersections in 2D projections, see the Wikipedia page on spatial relations. ",
"url": "/Geo-Spatial/spatial_joins.html",
"relUrl": "/Geo-Spatial/spatial_joins.html"
- },"688": {
+ },"689": {
"doc": "Spatial Joins",
"title": "Keep in Mind",
"content": ". | Geospatial packages in R and Python tend to have a large number of complex dependencies, which can make installing them painful. Best practice is to install geospatial packages in a new virtual environment. | When it comes to the package we are using in R for the US boundaries, it is much easier to install via the devtools. This will save you the trouble of getting errors when installing the data packages for the boundaries. Otherwise, your mileage may vary. When I installed USAboundariesData via USAboundaries, I received errors. | . devtools::install_github(\"ropensci/USAboundaries\") devtools::install_github(\"ropensci/USAboundariesData\") . | Note: Even with the R installation via devtools, you may be prompted to install the “USAboundariesData” package and need to restart your session. | . ",
"url": "/Geo-Spatial/spatial_joins.html#keep-in-mind",
"relUrl": "/Geo-Spatial/spatial_joins.html#keep-in-mind"
- },"689": {
+ },"690": {
"doc": "Spatial Joins",
"title": "Implementations",
"content": " ",
"url": "/Geo-Spatial/spatial_joins.html#implementations",
"relUrl": "/Geo-Spatial/spatial_joins.html#implementations"
- },"690": {
+ },"691": {
"doc": "Spatial Joins",
"title": "Python",
"content": "The geopandas package is the easiest way to start doing geo-spatial analysis in Python. This example of a spatial merge closely follows one from the documentation for geopandas. # Geospatial packages tend to have many elaborate dependencies. The quickest # way to get going is to use a clean virtual environment and then # 'conda install geopandas' followed by # 'conda install -c conda-forge descartes' # descartes is what allows geopandas to plot data. import geopandas as gpd # Grab a world map world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Plot the map of the world world.plot() # Grab data on cities cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities')) # We can plot the cities too - but they're just dots of lat/lon without any # context for now cities.plot() # The data don't actually need to be combined to be viewed on a map as long as # they are using the same 'crs', or coordinate reference system. # Force cities and world to share crs: cities = cities.to_crs(world.crs) # Combine them on a plot: base = world.plot(color='white', edgecolor='black') cities.plot(ax=base, marker='o', color='red', markersize=5) # We want to perform a spatial merge, but there are many kinds in 2D # projections, including withins, touches, crosses, and overlaps. We want to # use an intersects spatial join - ie we want to combine each city (a lat/lon # point) with the shapes of countries and determine which city goes in which # country (even if it's on the boundary). We use the 'sjoin' function: cities_with_country = gpd.sjoin(cities, world, how=\"inner\", op='intersects') cities_with_country.head() # name_left geometry pop_est continent \\ # Vatican City POINT (12.45339 41.90328) 62137802 Europe # San Marino POINT (12.44177 43.93610) 62137802 Europe # Rome POINT (12.48131 41.89790) 62137802 Europe # Vaduz POINT (9.51667 47.13372) 8754413 Europe # Vienna POINT (16.36469 48.20196) 8754413 Europe # name_right iso_a3 gdp_md_est # Italy ITA 2221000.0 # Italy ITA 2221000.0 # Italy ITA 2221000.0 # Austria AUT 416600.0 # Austria AUT 416600.0 . ",
"url": "/Geo-Spatial/spatial_joins.html#python",
"relUrl": "/Geo-Spatial/spatial_joins.html#python"
- },"691": {
+ },"692": {
"doc": "Spatial Joins",
"title": "R",
"content": "Acknowledgments to Ryan A. Peek for his guide that I am reimagining for LOST. We will need a few packages to do our analysis. If you need to install any packages, do so with install.packages(‘name_of_package’), then load it if necessary. library(sf) library(dplyr) library(viridis) library(ggplot2) library(USAboundaries) library(GSODR) . | We will work with polygon data from the USA boundaries initially, then move on to climate data point data via the Global Surface Summary of the Day (gsodr) package and join them together. | We start with the boundaries of the United States to get desirable polygons to work with for our analysis. To pay homage to the states of my alma maters, we will do some analysis with Oregon, Ohio, and Michigan. | . #Selecting the United States Boundaries, but omitting Alaska, Hawaii, and Puerto Rico for it to be scaled better usa <- us_boundaries(type=\"state\", resolution = \"low\") %>% filter(!state_abbr %in% c(\"PR\", \"AK\", \"HI\")) #Ohio with high resolution oh <- USAboundaries::us_states(resolution = \"high\", states = \"OH\") #Oregon with high resolution or <- USAboundaries::us_states(resolution = \"high\", states = \"OR\") #Michigan with high resolution mi <- USAboundaries::us_states(resolution = \"high\", states = \"MI\") #Insets for the identified states #Oregon or_box <- st_make_grid(or, n = 1) #Ohio oh_box <- st_make_grid(oh, n = 1) #Michigan mi_box <- st_make_grid(mi, n = 1) #We can also include the counties boundaries within the state too! #Oregon or_co <- USAboundaries::us_counties(resolution = \"high\", states = \"OR\") #Ohio oh_co <- USAboundaries::us_counties(resolution = \"high\", states = \"OH\") #Michigan mi_co <- USAboundaries::us_counties(resolution = \"high\", states = \"MI\") . Now we can plot it out. Oregon highlighted . plot(usa$geometry) plot(or$geometry, add=T, col=\"gray50\", border=\"black\") plot(or_co$geometry, add=T, border=\"green\", col=NA) plot(or_box, add=T, border=\"yellow\", col=NA, lwd=2) . Ohio highlighted . plot(usa$geometry) plot(oh$geometry, add=T, col=\"gray50\", border=\"black\") plot(oh_co$geometry, add=T, border=\"yellow\", col=NA) plot(oh_box, add=T, border=\"blue\", col=NA, lwd=2) . Michigan highlighted . plot(usa$geometry) plot(mi$geometry, add=T, col=\"gray50\", border=\"black\") plot(mi_co$geometry, add=T, border=\"gray\", col=NA) plot(mi_box, add=T, border=\"green\", col=NA, lwd=2) . All three highlighted at once. plot(usa$geometry) plot(mi$geometry, add=T, col=\"gray50\", border=\"black\") plot(mi_co$geometry, add=T, border=\"gray\", col=NA) plot(mi_box, add=T, border=\"green\", col=NA, lwd=2) plot(oh$geometry, add=T, col=\"gray50\", border=\"black\") plot(oh_co$geometry, add=T, border=\"yellow\", col=NA) plot(oh_box, add=T, border=\"blue\", col=NA, lwd=2) plot(or$geometry, add=T, col=\"gray50\", border=\"black\") plot(or_co$geometry, add=T, border=\"green\", col=NA) plot(or_box, add=T, border=\"yellow\", col=NA, lwd=2) . Now that there are polygons established and identified, we can add in some point data to join to our currently existing polygon data and do some analysis with it. To do this we will use the Global Surface Summary of the Day (gsodr) package for climate data. We will take the metadata from the GSODR package via ‘isd_history’, make it spatial data, then filter out only those observations in our candidate states of Oregon, Ohio, and Michigan. 
load(system.file(\"extdata\", \"isd_history.rda\", package = \"GSODR\")) #We want this to be spatial data isd_history <- as.data.frame(isd_history) %>% st_as_sf(coords=c(\"LON\",\"LAT\"), crs=4326, remove=FALSE) #There are many observations, so we want to narrow it to our three candidate states isd_history_or <- dplyr::filter(isd_history, CTRY==\"US\", STATE==\"OR\") isd_history_oh <- dplyr::filter(isd_history, CTRY==\"US\", STATE==\"OH\") isd_history_mi <- dplyr::filter(isd_history, CTRY==\"US\", STATE==\"MI\") . This filtering should take you from around 26,700 observation sites around the world to approximately 200 in Michigan, 85 in Ohio, and 100 in Oregon. These numbers may vary based on when you independently do your analysis. Let’s see these stations plotted in each state individually: . Note: the codes in the ‘border’ and ‘bg’ identifiers are from the viridis package. You can get some awesome color scales using that package. You can also use standard names. Oregon . plot(isd_history_or$geometry, cex=0.5) plot(or$geometry, col=alpha(\"gray\", 0.5), border=\"#1F968BFF\", lwd=1.5, add=TRUE) plot(isd_history_or$geometry, add=T, pch=21, bg=\"#FDE725FF\", cex=0.7, col=\"black\") title(\"Oregon GSOD Climate Stations\") . Ohio . plot(isd_history_oh$geometry, cex=0.5) plot(oh$geometry, col=alpha(\"red\", 0.5), border=\"gray\", lwd=1.5, add=TRUE) plot(isd_history_oh$geometry, add=T, pch=21, bg=\"black\", cex=0.7, col=\"black\") title(\"Ohio GSOD Climate Stations\") . Michigan . plot(isd_history_mi$geometry, cex=0.5) plot(mi$geometry, col=alpha(\"green\", 0.5), border=\"blue\", lwd=1.5, add=TRUE) plot(isd_history_mi$geometry, add=T, pch=21, bg=\"white\", cex=0.7, col=\"black\") title(\"Michigan GSOD Climate Stations\") . ",
"url": "/Geo-Spatial/spatial_joins.html#r",
"relUrl": "/Geo-Spatial/spatial_joins.html#r"
- },"692": {
+ },"693": {
"doc": "Spatial Joins",
"title": "Now, for the magic:",
"content": "We are going to start with selecting polygons from points. This is not necessarily merging the data together, but using a spatial join to filter out polygons (counties, states, etc.) from points (climate data stations) . We will start by selecting the Oregon counties that have climate data stations within their boundaries: . or_co_isd_poly <- or_co[isd_history, ] plot(or_co_isd_poly$geometry, col=alpha(\"green\",0.7)) title(\"Oregon Counties with GSOD Climate Stations\") . Now for all of our three candidate states: . cand_co <- USAboundaries::us_counties(resolution = \"high\", states = c(\"OR\", \"OH\", \"MI\")) cand_co_isd_poly <- cand_co[isd_history, ] plot(cand_co_isd_poly$geometry, col=alpha(\"blue\",0.7)) title(\"Counties in Candidate States with GSOD Climate Stations\") . We see how we can filter out polygons from attributes or intersecting relationships with points, but what if we want to merge data from the points into the polygon or vice versa? . We will use the data set for Oregon for the join example. Notice in our point dataset that there are no county names. Only station/city names. Let us join the county polygons with the climate station points and add the county names to the station data. We do this using the st_join function, which comes from the sf package. isd_or_co_pts <- st_join(isd_history, left = FALSE, or_co[\"name\"]) #Rename the county name variable county instead of name, since we already have NAME for the station location colnames(isd_or_co_pts)[which(names(isd_or_co_pts) == \"name\")] <- \"county\" plot(isd_or_co_pts$geometry, pch=21, cex=0.7, col=\"black\", bg=\"orange\") plot(or_co$geometry, border=\"gray\", col=NA, add=T) . You now have successfully joined the county name data into your new point data set! Those points in the plot now contain the county information for data analysis purposes. You can join in any attribute you would like, or by leaving it as: . isd_or_co_pts <- st_join(isd_history, left = FALSE, or_co) . You add all attributes from the polygon into the point data frame! . Also note that st_join is the default function that joins any type of intersection. You can be more precise our particular about your conditions with the other spatial joins: . st_within only joins elements that are completely within the defined area . st_equal only joins elements that are spatially equal. Meaning that A is within B and B is within A. You can use these to pare down your selections and joins to specific relationships. Good luck with your geospatial analysis! . ",
"url": "/Geo-Spatial/spatial_joins.html#now-for-the-magic",
"relUrl": "/Geo-Spatial/spatial_joins.html#now-for-the-magic"
- },"693": {
+ },"694": {
"doc": "Spatial Lag Model",
"title": "Spatial Lag Model",
"content": "Data that is to some extent geographical in nature often displays spatial autocorrelation. Outcome variables and explanatory variables both tend to be clustered geographically, which can drive spurious correlations, or upward-biased treatment effect estimates (Ploton et al. 2020). One way to account for this spatial dependence is to model the autocorrelation directly, as would be done with autocorrated time-series data. One such model is the spatial lag model, in which a dependent variable is predicted using the value of the dependent variable of an observation’s “neighbors.” . \\[Y_i = \\rho W Y_j + \\beta X_i + \\varepsilon_i\\] Where $Y_j$ is the set of $Y$ values from observations other than $i$, and $W$ is a matrix of spatial weights, which are higher for $j$s that are spatially closer to $i$. This process requires estimation of which observations constitute neighbors, and generally the estimation of $\\rho$ is performed using a separate process from how $\\beta$ is estimated. More estimation details are in Darmofal (2015). ",
"url": "/Geo-Spatial/spatial_lag_model.html",
"relUrl": "/Geo-Spatial/spatial_lag_model.html"
- },"694": {
+ },"695": {
"doc": "Spatial Lag Model",
"title": "Keep in Mind",
"content": ". | There is more than one way to create the weighting matrix, and also more than one way to estimate the spatial lag model. Be sure to read the documentation to see what model and method your command is estimating, and that it’s the one you want. | Some approaches select a list of “neighbor” observations, such that each observation $j$ either is or is not a neighbor of $i$ (note that non-neighbors can still affect $i$ if they are neighbors-of-neighbors, and so on) | The effect of a given predictor in a spatial lag model is not just given by its coefficient, but should also include its spillover effects via $\\rho$. | . ",
"url": "/Geo-Spatial/spatial_lag_model.html#keep-in-mind",
"relUrl": "/Geo-Spatial/spatial_lag_model.html#keep-in-mind"
- },"695": {
+ },"696": {
"doc": "Spatial Lag Model",
"title": "Also Consider",
"content": ". | There are other ways of modeling spatial dependence, such as the Spatial Moving-Average Model | A common test to determine whether there is spatial dependence that needs to be modeled is the Moran Test | . ",
"url": "/Geo-Spatial/spatial_lag_model.html#also-consider",
"relUrl": "/Geo-Spatial/spatial_lag_model.html#also-consider"
- },"696": {
+ },"697": {
"doc": "Spatial Lag Model",
"title": "Implementations",
"content": "These examples will use some data on US colleges from IPEDS, including their latitude, longitude, and the extent of distance learning they offered in 2018. It will then see if this distance learning predicts (and perhaps reduces?) the prevalence of COVID in the college’s county by July 2020. ",
"url": "/Geo-Spatial/spatial_lag_model.html#implementations",
"relUrl": "/Geo-Spatial/spatial_lag_model.html#implementations"
- },"697": {
+ },"698": {
"doc": "Spatial Lag Model",
"title": "Python",
"content": "import pandas as pd from libpysal.cg import KDTree, RADIUS_EARTH_MILES from libpysal.weights import KNN from spreg import ML_Lag url = ('https://github.com/LOST-STATS/lost-stats.github.io/raw/source' '/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv') # specify index cols we need only for identification -- not modeling df = pd.read_csv(url, index_col=['unitid', 'instnm']) # we'll `pop` renaming columns so they're no longer in our dataframe x = df.copy().dropna(how='any') # tree object is the main input to nearest neighbors tree = KDTree( data=zip(x.pop('longitude'), x.pop('latitude')), # default is euclidean, but we want to use arc or haversine distance distance_metric='arc', radius=RADIUS_EARTH_MILES ) nn = KNN(tree, k=5) y = x.pop('covid_cases_per_cap_jul312020') # spreg only accepts numpy arrays or lists as arguments mod = ML_Lag( y=y.to_numpy(), x=x.to_numpy(), w=nn, name_y=y.name, name_x=x.columns.tolist() ) # results print(mod.summary) . ",
"url": "/Geo-Spatial/spatial_lag_model.html#python",
"relUrl": "/Geo-Spatial/spatial_lag_model.html#python"
- },"698": {
+ },"699": {
"doc": "Spatial Lag Model",
"title": "R",
"content": "# if necessary # install.packages(c('spatialreg', 'spdep')) # Library for calculating neighbors library(spdep) # And for the spatial lag model library(spatialreg) # Load data df <- read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv') # Use latitude and longitude to determine the list of neighbors # Here we're using K-nearest-neighbors to find 5 neighbors for each college # But there are othe rmethods available # Get latitude and longitude into a matrix # Make sure longitude comes first loc_matrix <- as.matrix(df[, c('longitude','latitude')]) # Get 5 nearest neighbors kn <- knearneigh(loc_matrix, 5) # Turn the k-nearest-neighbors object into a neighbors object nb <- knn2nb(kn) # Turn the nb object into a listw object # Which is a list of spatial weights for the neighbors listw <- nb2listw(nb) # Use a spatial regression # This uses the method from Bivand & Piras (2015) https://www.jstatsoft.org/v63/i18/. m <- lagsarlm(covid_cases_per_cap_jul312020 ~ pctdesom + pctdenon, data = df, listw = listw) # Note that, whlie summary(m) will show rho below the regression results, # most regression-table functions like modelsummary::msummary() or jtools::export_summs() # will include it as a coefficient along with the others and report its standard error summary(m) . ",
"url": "/Geo-Spatial/spatial_lag_model.html#r",
"relUrl": "/Geo-Spatial/spatial_lag_model.html#r"
- },"699": {
+ },"700": {
"doc": "Spatial Lag Model",
"title": "Stata",
"content": "Stata has a suite of built-in spatial analysis commands, which we will be using here. A more thorough description of using Stata for spatial autocorrelation models (and perhaps using shapefiles to start with) can be found here. * Import data import delimited using \"https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Geo-Spatial/Data/Merging_Shape_Files/colleges_covid.csv\", clear * This process requires full data drop if missing(pctdesom) | missing(pctdenon) * Get Stata to recognize this is a spatial dataset * with longitude and latitude spset unitid, coord(longitude latitude) * Create matrix of inverse distance weights spmatrix create idistance M * Note that Stata doesn't have an automatic process for selecting a set of neighbors * Unless you are working with a shapefile * Run spatial regression model * This uses a maximum likelihood estimator, but GS2SLS is also available spregress covid_cases_per_cap_jul312020 pctdesom pctdenon, ml dvarlag(M) * Get impact of each predictor, including spillovers, with estat impact estat impact . ",
"url": "/Geo-Spatial/spatial_lag_model.html#stata",
"relUrl": "/Geo-Spatial/spatial_lag_model.html#stata"
- },"700": {
+ },"701": {
"doc": "Stepwise Regression",
"title": "Stepwise Regression",
"content": "When we use multiple explanatory variables to perform regression analysis on a dependent variable, there is a possibility that the problem of multicollinearity will occur. However, multiple linear regression requires that the correlation between the independent variables is not too high, so there is value in a method to eliminate multicollinearity and select the “optimal” regression equation. Stepwise regression is one approach to this. It can automatically help us retain the most important explanatory variables and remove relatively unimportant variables from the model. The idea of stepwise regression is to introduce independent variables one by one, and after each independent variable is introduced, the selected variables are tested one by one. If the originally introduced variable is no longer significant due to the introduction of subsequent variables, then delete it. Repeat this process until the regression equation does not introduce insignificant independent variables and does not remove significant independent variables, then the optimal regression equation can be obtained. ",
"url": "/Model_Estimation/OLS/stepwise_regression.html",
"relUrl": "/Model_Estimation/OLS/stepwise_regression.html"
- },"701": {
+ },"702": {
"doc": "Stepwise Regression",
"title": "Keep in Mind",
"content": ". | The purpose of stepwise regression is to find which combination of variables can explain more changes in dependent variables. | Stepwise regression uses statistical measures such as R-square, t-stats, and AIC indicators to identify important variables. | There are three methods of stepwise regression: Forward Selection, Backward Elimination and Stepwise Selection. | Forward selection starts from the most important independent variable in the model, and then increases the variable in each step. | Backward elimination starts with all the independent variables of the model, and then removes the least significant variable at each step. | The standard stepwise selection combines the above two methods, adding or removing independent variables in each step. | Standard stepwise regression approaches use statistical significance to make decisions about model design, which is not the typical purpose of statistical significance | . ",
"url": "/Model_Estimation/OLS/stepwise_regression.html#keep-in-mind",
"relUrl": "/Model_Estimation/OLS/stepwise_regression.html#keep-in-mind"
- },"702": {
+ },"703": {
"doc": "Stepwise Regression",
"title": "Also Consider",
"content": ". | Penalized regression, specifically the LASSO approach to model selection. | . ",
"url": "/Model_Estimation/OLS/stepwise_regression.html#also-consider",
"relUrl": "/Model_Estimation/OLS/stepwise_regression.html#also-consider"
- },"703": {
+ },"704": {
"doc": "Stepwise Regression",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/OLS/stepwise_regression.html#implementations",
"relUrl": "/Model_Estimation/OLS/stepwise_regression.html#implementations"
- },"704": {
+ },"705": {
"doc": "Stepwise Regression",
"title": "R",
"content": "We will use the built-in mtcars dataset. The step() function in package stats can perform the stepwise regression. Set up . # Load package library(stats) library(broom) # Load data and take a look at this dataset data(mtcars) head(mtcars) # mpg cyl disp hp drat wt qsec vs am gear carb # Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 # Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 # Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 # Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 # Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 # Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 # Define a regression model mpg ~ all other independent variables. reg_mpg <- lm(mpg ~ ., data=mtcars) # Define intercept model intercept <- lm(mpg ~ 1, data=mtcars) . Stepwise Selection . # Stepwise selection # The direction argument can be changed to perform forwards or backwards selection stepwise <- step(intercept, direction = c(\"both\"), scope=formula(reg_mpg)) # Start: AIC=115.94 # mpg ~ 1 # Df Sum of Sq RSS AIC # + wt 1 847.73 278.32 73.217 # + cyl 1 817.71 308.33 76.494 # + disp 1 808.89 317.16 77.397 # + hp 1 678.37 447.67 88.427 # + drat 1 522.48 603.57 97.988 # + vs 1 496.53 629.52 99.335 # + am 1 405.15 720.90 103.672 # + carb 1 341.78 784.27 106.369 # + gear 1 259.75 866.30 109.552 # + qsec 1 197.39 928.66 111.776 # <none> 1126.05 115.943 # Omit the filter in the middle... # Step: AIC=62.66 # mpg ~ wt + cyl + hp # Df Sum of Sq RSS AIC # <none> 176.62 62.665 # - hp 1 14.551 191.17 63.198 # + am 1 6.623 170.00 63.442 # + disp 1 6.176 170.44 63.526 # - cyl 1 18.427 195.05 63.840 # + carb 1 2.519 174.10 64.205 # + drat 1 2.245 174.38 64.255 # + qsec 1 1.401 175.22 64.410 # + gear 1 0.856 175.76 64.509 # + vs 1 0.060 176.56 64.654 # - wt 1 115.354 291.98 76.750 . # Result tidy(stepwise) # A tibble: 4 x 5 # term estimate std.error statistic p.value # <chr> <dbl> <dbl> <dbl> <dbl> # 1 (Intercept) 38.8 1.79 21.7 4.80e-19 # 2 wt -3.17 0.741 -4.28 1.99e- 4 # 3 cyl -0.942 0.551 -1.71 9.85e- 2 # 4 hp -0.0180 0.0119 -1.52 1.40e- 1 . The optimal equation we get from stepwise selection is . \\[mpg = 38.752 - 3.167*wt - 0.942*cyl - 0.018*hyp\\] ",
"url": "/Model_Estimation/OLS/stepwise_regression.html#r",
"relUrl": "/Model_Estimation/OLS/stepwise_regression.html#r"
- },"705": {
+ },"706": {
"doc": "Styling Line Graphs",
"title": "Styling Line Graphs",
"content": "There are several ways of styling line graphs. The following examples demonstrate how to modify the appearances of the lines (type and sizes), as well chart titles and axes labels. ",
"url": "/Presentation/Figures/styling_line_graphs.html",
"relUrl": "/Presentation/Figures/styling_line_graphs.html"
- },"706": {
+ },"707": {
"doc": "Styling Line Graphs",
"title": "Keep in Mind",
"content": ". | To get started on how to plot line graphs, see here. | Elements for customization include line thickness, line type (solid, dashed, etc.), shade, transparency, and color. | Color is one of the easiest ways to distinguish a large number of line graphs. If you have many line graphs overlaid and have to use black-and-white, consider different shades of black/gray. | . ",
"url": "/Presentation/Figures/styling_line_graphs.html#keep-in-mind",
"relUrl": "/Presentation/Figures/styling_line_graphs.html#keep-in-mind"
- },"707": {
+ },"708": {
"doc": "Styling Line Graphs",
"title": "Implementation",
"content": " ",
"url": "/Presentation/Figures/styling_line_graphs.html#implementation",
"relUrl": "/Presentation/Figures/styling_line_graphs.html#implementation"
- },"708": {
+ },"709": {
"doc": "Styling Line Graphs",
"title": "R",
"content": "## If necessary ## install.packages(c('ggplot2','cowplot')) ## load packages library(ggplot2) ## Cowplot is just to join together the four graphs at the end library(cowplot) ## load data (the Economics dataset comes with ggplot2) eco_df <- economics ## basic plot p1 <- ggplot() + geom_line(aes(x=date, y = uempmed), data = eco_df) p1 ## Change line color and chart labels ## Note here that color is *outside* of the aes() argument, and so this will color the line ## If color were instead *inside* aes() and set to a factor variable, ggplot would create ## a different line for each value of the factor variable, colored differently. p2 <- ggplot() + ## choose a color of preference geom_line(aes(x=date, y = uempmed), color = \"navyblue\", data = eco_df) + ## add chart title and change axes labels labs( title = \"Median Duration of Unemployment\", x = \"Date\", y = \"\") + ## Add a ggplot theme theme_light() ## center the chart title theme(plot.title = element_text(hjust = 0.5)) + p2 ## plotting multiple charts (of different line types and sizes) p3 <-ggplot() + geom_line(aes(x=date, y = uempmed), color = \"navyblue\", size = 1.5, data = eco_df) + geom_line(aes(x=date, y = psavert), color = \"red2\", linetype = \"dotted\", size = 0.8, data = eco_df) + labs( title = \"Unemployment Duration (Blue) and Savings Rate (Red)\", x = \"Date\", y = \"\") + theme_light() + theme(plot.title = element_text(hjust = 0.5)) p3 ## Plotting a different line type for each group ## There isn't a natural factor in this data so let's just duplicate the data and make one up eco_df$fac <- factor(1, levels = c(1,2)) eco_df2 <- eco_df eco_df2$fac <- 2 eco_df2$uempmed <- eco_df2$uempmed - 2 + rnorm(nrow(eco_df2)) eco_df <- rbind(eco_df, eco_df2) p4 <- ggplot() + ## This time, color goes inside aes geom_line(aes(x=date, y = uempmed, color = fac), data = eco_df) + ## add chart title and change axes labels labs( title = \"Median Duration of Unemployment\", x = \"Date\", y = \"\") + ## Add a ggplot theme theme_light() + ## center the chart title theme(plot.title = element_text(hjust = 0.5), ## Move the legend onto some blank space on the diagram legend.position = c(.25,.8), ## And put a box around it legend.background = element_rect(color=\"black\")) + ## Retitle the legend that pops up to explain the discrete (factor) difference in colors ## (note if we just want a name change we could do guides(color = guide_legend(title = 'Random Factor')) instead) scale_color_manual(name = \"Random Factor\", # And specify the colors for the factor levels (1 and 2) by hand if we like values = c(\"1\" = \"red\", \"2\" = \"blue\")) p4 # Put them all together with cowplot for LOST upload plot_grid(p1,p2,p3,p4, nrow=2) . The four plots generated by the code are (in order p1, p2, then p3 and p4): . ",
"url": "/Presentation/Figures/styling_line_graphs.html#r",
"relUrl": "/Presentation/Figures/styling_line_graphs.html#r"
- },"709": {
+ },"710": {
"doc": "Styling Line Graphs",
"title": "Stata",
"content": "In Stata, one can create plot lines using the command line, which in combination with twoway allows you to modify components of sub-plots individually. In this demonstration, I will use minimal formatting, but will apply minimal modifications using Ben Jann’s grstyle. ** Setup: Install grstyle ssc install grstyle grstyle init grstyle color background white grstyle set legend, nobox . Setup . First, you need to load the data into Stata. The data is a copy from the data economics available within ggplot package, and translated using foreign. use https://friosavila.github.io/playingwithstata/rnd_dta/economics, clear ** Since this was taken directly from R, the date variable will not be formatted. ** We can format the date using the following. format date %tdCCYY ** This indicates to create a _mask_, to put on top of \"data\" ** but only display the \"year\" . Simple line plot . Now, For a simple plot, we could use the following syntax: . line yvar1 [yvar2 yvar3 ...] xvar1 . This requests plotting all variables yvarX against xvar1 (horizontal axis). Internally, the command connects every pair of data [yvar1,xvar1] sequentially, based on the order they appear in the dataset. Below, we can do that, plotting unemployment duration uempmed vs date. line uempmed date . Something to keep in mind. If the dataset is not sorted by date, you may end up with a lineplot that is all over the place. For example: . sort uempmed line uempmed date . To avoid this, it is recommended to use the option sort. line uempmed date, sort . Adding titles, and axis titles . The next thing you may want to do is add information to the plot, so its easier to understand what the figure is showing. Specifically, we can add information on the vertical axis using ytitle(). I will also use xtitle() to drop the horizontal axis information, and add a title title(). line uempmed date, sort /// ytitle(\"# of weeks\") xtitle(\"\") /// title(Unemployment Duration) . Changing Line characteristics. It is also possible to modify the line width lwidth(), line color lcolor(), and line pattern lpattern(). To show how this can affect the plot, below 4 examples are provided. Notice that each plot is saved in memory using name(), and all are combined using graph combine. line uempmed date, sort /// ytitle(\"# of weeks\") xtitle(\"\") /// title(Unemployment Duration 1) /// lwidth(.5) lcolor(red) lpattern(solid) name(m1,replace) line uempmed date, sort /// ytitle(\"# of weeks\") xtitle(\"\") /// title(Unemployment Duration 2) /// lwidth(.25) lcolor(gold) lpattern(dash) name(m2,replace) line uempmed date, sort /// ytitle(\"# of weeks\") xtitle(\"\") /// title(Unemployment Duration 3) /// lwidth(1) lcolor(\"68 170 153\") lpattern(dot) name(m3,replace) line uempmed date, sort /// ytitle(\"# of weeks\") xtitle(\"\") /// title(Unemployment Duration 4) /// lwidth(.5) lcolor(navy%50) lpattern(dash_dot) name(m4,replace) graph combine m1 m2 m3 m4 . Ploting Multiple Lines, and different axis . You may also want to plot multiple variables in the same figure. There are two ways to do this: . twoway (line uempmed date, sort lwidth(.75) lpattern(solid) ) /// (line psavert date, sort lwidth(.25) lpattern(dash) ), /// legend (order(1 \"Unemployment duration\" 2 \"Saving rate\")) line uempmed psavert date, sort lwidth(0.75 .25) lpattern(solid dash) /// legend(order(1 \"Unemployment duration\" 2 \"Saving rate\")) . Both options provide the same figure, however, I prefer the first option since that allows for more flexibility. 
You can also choose to plot each variable on a different axis. Each axis can have its own title. twoway (line uempmed date, sort lwidth(.75) lpattern(solid) yaxis(1)) /// (line psavert date, sort lwidth(.25) lpattern(dash) yaxis(2)), /// legend(order(1 \"Unemployment duration\" 2 \"Saving rate\")) /// ytitle(Weeks ,axis(1) ) ytitle(Interest rate,axis(2) ) . Adding informative Vertical lines. Finally, it is possible to add vertical lines. This may be useful, for example, to differentiate the Great Recession period. Additionally, in this plot, I add a note. twoway (line uempmed date, sort lwidth(.75) lpattern(solid) yaxis(1)) /// (line psavert date, sort lwidth(.25) lpattern(dash) yaxis(2)), /// legend(order(1 \"Unemployment duration\" 2 \"Saving rate\")) /// ytitle(Weeks ,axis(1) ) ytitle(Interest rate,axis(2) ) /// xline(`=td(1dec2007)'/`=td(30jun2008)', lcolor(gs8)) /// note(\"Note: Grey area marks the Great Recession period\") /// title(\"Unemployment Duration and\" \"Saving Rate\") . ",
"url": "/Presentation/Figures/styling_line_graphs.html#stata",
"relUrl": "/Presentation/Figures/styling_line_graphs.html#stata"
- },"710": {
+ },"711": {
"doc": "Graphing a By-Group or Over-Time Summary Statistic",
"title": "Graphing a By-Group or Over-Time Summary Statistic",
"content": "A common task in exploring or presenting data is looking at by-group summary statistics. This commonly takes the form of a graph where the group is along the x-axis and the summary statistic is on the y-axis. Often this group might be a time period so as to look at changes over time. Producing such a graph requires three things: . | A decision of what kind of graph will be produced (line graph, bar graph, scatterplot, etc.) | The creation of the grouped summary statistic | The creation of the graph itself | . ",
"url": "/Presentation/Figures/summary_graphs.html",
"relUrl": "/Presentation/Figures/summary_graphs.html"
- },"711": {
+ },"712": {
"doc": "Graphing a By-Group or Over-Time Summary Statistic",
"title": "Keep in Mind",
"content": ". | Line graphs are only intended for use in cases where the x-axis variable (1) is ordinal (one value is “more” than another), and (2) takes consistently-sized jumps from one observation to the next. A time-based x-axis is a good candidate for use with line graphs. If your group is categorical and doesn’t follow a natural ordering, then do not use a line graph. Consider a bar graph or some other kind of graph instead. | If you are making a graph for presentation rather than exploration, and your x-axis variable is categorical and doesn’t have a natural ordering, your graph will often be easier to read if the x-axis is sorted by the height of the y-axis. The way to do this will be demonstrated in the code examples below. | . ",
"url": "/Presentation/Figures/summary_graphs.html#keep-in-mind",
"relUrl": "/Presentation/Figures/summary_graphs.html#keep-in-mind"
- },"712": {
+ },"713": {
"doc": "Graphing a By-Group or Over-Time Summary Statistic",
"title": "Also Consider",
"content": ". | This page will cover how to calculate the summary statistic in the graph code itself. However, an alternate approach that provides a bit more control and flexibility is to calculate the by-group summary statistic by collapsing the data set so there is only one observation per group in the data. Then, just make a regular graph of whatever kind you like, with the group along the x-axis, and the summary statistic on the y-axis. See Line Graphs or Bar Graphs. | If you want a version of these graphs that has two groupings - one group along the x-axis and with different bars or lines for another group, see how to graph multiple lines on Line Graphs or multiple bars per x-axis point on Bar Graphs. | . ",
"url": "/Presentation/Figures/summary_graphs.html#also-consider",
"relUrl": "/Presentation/Figures/summary_graphs.html#also-consider"
- },"713": {
+ },"714": {
"doc": "Graphing a By-Group or Over-Time Summary Statistic",
"title": "Implementations",
"content": " ",
"url": "/Presentation/Figures/summary_graphs.html#implementations",
"relUrl": "/Presentation/Figures/summary_graphs.html#implementations"
- },"714": {
+ },"715": {
"doc": "Graphing a By-Group or Over-Time Summary Statistic",
"title": "R",
"content": "# We want ggplot2 for graphing and dplyr for the storms data library(tidyverse) data(storms) # First, a line graph with time on the x-axis # This uses stat_summary # Note that stat_summary_bin is also available, # which first bins the x-axis, if desired # Put the time variable in the x aesthetic, and the # variable to be summarized in y ggplot(storms, aes(x = year, y = wind)) + stat_summary(geom = 'line', # Do we want a line graph? Point? fun = mean) + # What function should be used to summarize? # Note another good option for geom is 'pointrange', the default # which you can get from just stat_summary(), # which also shows the range of data # Just decoration: labs(x = 'Year', y = 'Average Wind Speed', title = 'Average Wind Speed of Storms by Year') + theme_minimal() # Second, a bar graph with a category on the x-axis # Use reorder() to sort by which status has the most wind ggplot(storms, aes(x = reorder(status,-wind), y = wind)) + stat_summary(geom = 'bar', # Do we want a line graph? Point? fun = mean) + # Decoration: scale_x_discrete(labels = c('Hurricane','Tropical Storm','Tropical Depression')) + # make the labels more presentable # Decoration: labs(x = NULL, y = 'Average Wind Speed', title = 'Average Wind Speed by Storm Type') + theme_minimal() . This code produces: . ",
"url": "/Presentation/Figures/summary_graphs.html#r",
"relUrl": "/Presentation/Figures/summary_graphs.html#r"
- },"715": {
+ },"716": {
"doc": "Graphing a By-Group or Over-Time Summary Statistic",
"title": "Stata",
"content": "In Stata there is not a single graph command that will graph a summary statistic line graph for us (although there is for bar graphs). Instead, for line graphs, we must collapse the data set and graph the result. You could avoid collapsing by instead using bysort group: egen newvar = mean(oldvar) (or some egen function from help egen other than mean) to create by-group statistics in the original data, use egen tag = tag(group) to select only one observation per group, and then do the below graphing commands while adding if tag == 1 to them. ** Read in the data import delimited \"https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv\", clear * Keep the original data to return to after collapsing preserve * First, a line graph with time on the x-axis and average wind on y collapse (mean) wind, by(year) * Then, a line graph tw line wind year, xtitle(\"Year\") ytitle(\"Average Wind Speed\") restore * Now, a bar graph with a category on the x-axis graph bar (mean) wind, over(status, relabel(1 \"Hurricane\" 2 \"Tropical Depression\" 3 \"Tropical Storm\") /// Relabel the statuses to capitalize sort((mean) wind)) /// Put in height order automatically ytitle(\"Average Wind Speed\") . This code produces: . ",
"url": "/Presentation/Figures/summary_graphs.html#stata",
"relUrl": "/Presentation/Figures/summary_graphs.html#stata"
- },"716": {
+ },"717": {
"doc": "Support Vector Machine",
"title": "Support Vector Machine",
"content": "A support vector machine (hereinafter, SVM) is a supervised machine learning algorithm in that it is trained by a set of data and then classifies any new input data depending on what it learned during the training phase. SVM can be used both for classification and regression problems but here we focus on its use for classification. The idea is to separate two distinct groups by maximizing the distance between those points that are most hard to classify. To put it more formally, it maximizes the distance or margin between support vectors around the separating hyperplane. Support vectors here imply the data points that lie closest to the hyperplane. Hyperplanes are decision boundaries that are represented by a line (in two dimensional space) or a plane (in three dimensional space) that separate the two groups. Suppose a hypothetical problem of classifying apples from lemons. Support vectors in this case are apples that look closest to lemons and lemons that look closest to apples. They are the most difficult ones to classify. SVM draws a separating line or hyperplane that maximizes the distance or margin between support vectors, in this case the apples that look closest to the lemons and lemons that look closest to apples. Therefore support vectors are critical in determining the position as well as the slope of the hyperplane. For additional information about the support vector regression or support vector machine, refer to Wikipedia: Support-vector machine. ",
"url": "/Machine_Learning/support_vector_machine.html",
"relUrl": "/Machine_Learning/support_vector_machine.html"
- },"717": {
+ },"718": {
"doc": "Support Vector Machine",
"title": "Keep in Mind",
"content": ". | Note that optimization problem to solve for a linear separator is maximizing the margin which could be calculated as \\(\\frac{2}{\\lVert w \\rVert}\\). This could then be rewritten as minimizing \\(\\lVert w \\rVert\\), or minimizing a monotonic transformation version of it expressed as \\(\\frac{1}{2}\\lVert w \\rVert^2\\). Additional constraint of \\(y_i(w^T x_i + b) \\geq 1\\) needs to be imposed to ensure that the data points are still correctly classified. As such, the constrained optimization problem for SVM looks as the following: | . \\[\\text{min} \\frac{\\lVert w \\rVert ^2}{2}\\] s.t. \\(y_i(w^T x_i + b) \\geq 1\\), . where \\(w\\) is a weight vector, \\(x_i\\) is each data point, \\(b\\) is bias, and \\(y_i\\) is each data point’s corresponding label that takes the value of either \\(\\{-1, 1\\}\\). For detailed information about derivation of the optimization problem, refer to MIT presentation slides, The Math Behind Support Vector Machines, and Demystifying Maths of SVM - Part1. | If data points are not linearly separable, non-linear SVM introduces higher dimensional space that projects data points from original finite-dimensional space to gain linearly separation. Such process of mapping data points into a higher dimensional space is known as the Kernel Trick. There are numerous types of Kernels that can be used to create higher dimensional space including linear, polynomial, Sigmoid, and Radial Basis Function. | Setting the right form of Kernel is important as it determines the structure of the separator or hyperplane. | . ",
"url": "/Machine_Learning/support_vector_machine.html#keep-in-mind",
"relUrl": "/Machine_Learning/support_vector_machine.html#keep-in-mind"
- },"718": {
+ },"719": {
"doc": "Support Vector Machine",
"title": "Also Consider",
"content": ". | See the alternative classification method described on the K-Nearest Neighbor Matching. | . ",
"url": "/Machine_Learning/support_vector_machine.html#also-consider",
"relUrl": "/Machine_Learning/support_vector_machine.html#also-consider"
- },"719": {
+ },"720": {
"doc": "Support Vector Machine",
"title": "Implementations",
"content": " ",
"url": "/Machine_Learning/support_vector_machine.html#implementations",
"relUrl": "/Machine_Learning/support_vector_machine.html#implementations"
- },"720": {
+ },"721": {
"doc": "Support Vector Machine",
"title": "Python",
"content": "In this example, we will use scikit-learn, which is a very popular Python library for machine learning. We will look at two support vector machine models: LinearSVC, which performs linear support vector classification (example 1); and SVC, which can accept several different kernels (including non-linear ones). For the latter case, we’ll use the non-linear radial basis function kernel (example 2 below). The last part of the code example plots the decision boundary, ie the support vectors, for the second example. from sklearn.datasets import make_classification, make_gaussian_quantiles from sklearn.svm import LinearSVC, SVC from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt import numpy as np ########################### # Example 1: Linear SVM ### ########################### # Generate linearly separable data: X, y = make_classification(n_features=2, n_redundant=0, n_informative=1, n_clusters_per_class=1) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2) # Train linear SVM model svm = LinearSVC(tol=1e-5) svm.fit(X_train, y_train) # Test model test_score = svm.score(X_test, y_test) print(f'The test score is {test_score}') ############################### # Example 2: Non-linear SVM ### ############################### # Generate non-linearly separable data X, y = make_gaussian_quantiles(n_features=2, n_classes=2) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2) # Train non-linear SVM model nl_svm = SVC(kernel='rbf', C=50) nl_svm.fit(X_train, y_train) # Test model test_score = nl_svm.score(X_test, y_test) print(f'The non-linear test score is {test_score}') #################################### # Plot non-linear SVM boundaries ### #################################### plt.figure() decision_function = nl_svm.decision_function(X) support_vector_indices = np.where( np.abs(decision_function) <= 1 + 1e-15)[0] support_vectors = X[support_vector_indices] plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired) ax = plt.gca() xlim = ax.get_xlim() ylim = ax.get_ylim() xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50), np.linspace(ylim[0], ylim[1], 50)) Z = nl_svm.decision_function(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--']) plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=100, linewidth=1, facecolors='none', edgecolors='k') plt.tight_layout() plt.show() . ",
"url": "/Machine_Learning/support_vector_machine.html#python",
"relUrl": "/Machine_Learning/support_vector_machine.html#python"
- },"721": {
+ },"722": {
"doc": "Support Vector Machine",
"title": "R",
"content": "There are a couple of ways to implement SVM in R. Here we’ll demonstrate using the e1071 package. To learn more about the package, check out its CRAN page, as well as this vignette. Note that we’ll also load the tidyverse to help with some data wrangling and plotting. Two examples are shown below that use linear SVM and non-linear SVM respectively. The first example shows how to implement linear SVM. We start by constructing data, separating them into training and test set. Using the training set, we fit the data using the svm() function. Notice that kernel argument for svm() function is specified as linear for our first example. Next, we predict the test data based on the model estimates using the predict() function. The first example result suggests that only one out of 59 data points is incorrectly classified. The second example shows how to implement non-linear SVM. The data in example two is generated in a way to have data points of one class centered around the middle whereas data points of the other class spread on two sides. Notice that kernel argument for the svm() function is specified as radial for our second example, based on the shape of the data. The second example result suggests that only two out of 58 data points are incorrectly classified. # Install and load the packages if (!require(\"tidyverse\")) install.packages(\"tidyverse\") if (!require(\"e1071\")) install.packages(\"e1071\") library(tidyverse) # package for data manipulation library(e1071) # package for SVM ########################### # Example 1: Linear SVM ### ########################### # Construct a completely separable data set ## Set seed for replication set.seed(0715) ## Make variable x x = matrix(rnorm(200, mean = 0, sd = 1), nrow = 100, ncol = 2) ## Make variable y that labels x by either -1 or 1 y = rep(c(-1, 1), c(50, 50)) ## Make x to have unilaterally higher value when y equals 1 x[y == 1,] = x[y == 1,] + 3.5 ## Construct data set d1 = data.frame(x1 = x[,1], x2 = x[,2], y = as.factor(y)) ## Split it into training and test data flag = sample(c(0,1), size = nrow(d1), prob=c(0.5,0.5), replace = TRUE) d1 = setNames(split(d1, flag), c(\"train\", \"test\")) # Plot ggplot(data = d1$train, aes(x = x1, y = x2, color = y, shape = y)) + geom_point(size = 2) + scale_color_manual(values = c(\"darkred\", \"steelblue\")) # SVM classification svmfit1 = svm(y ~ ., data = d1$train, kernel = \"linear\", cost = 10, scale = FALSE) print(svmfit1) plot(svmfit1, d1$train) # Predictability pred.d1 = predict(svmfit1, newdata = d1$test) table(pred.d1, d1$test$y) ############################### # Example 2: Non Linear SVM ### ############################### # Construct less separable data set ## Make variable x x = matrix(rnorm(200, mean = 0, sd = 1), nrow = 100, ncol = 2) ## Make variable y that labels x by either -1 or 1 y <- rep(c(-1, 1) , c(50, 50)) ## Make x to have extreme values when y equals 1 x[y == 1, ][1:25,] = x[y==1,][1:25,] + 3.5 x[y == 1, ][26:50,] = x[y==1,][26:50,] - 3.5 ## Construct data set d2 = data.frame(x1 = x[,1], x2 = x[,2], y = as.factor(y)) ## Split it into training and test data d2 = setNames(split(d2, flag), c(\"train\", \"test\")) # Plot data ggplot(data = d2$train, aes(x = x1, y = x2, color = y, shape = y)) + geom_point(size = 2) + scale_color_manual(values = c(\"darkred\", \"steelblue\")) # SVM classification svmfit2 = svm(y ~ ., data = d2$train, kernel = \"radial\", cost = 10, scale = FALSE) print(svmfit2) plot(svmfit2, d2$train) # Predictability pred.d2 = predict(svmfit2, newdata = 
d2$test) table(pred.d2, d2$test$y) . ",
"url": "/Machine_Learning/support_vector_machine.html#r",
"relUrl": "/Machine_Learning/support_vector_machine.html#r"
- },"722": {
+ },"723": {
"doc": "Support Vector Machine",
"title": "Stata",
"content": "The below code shows how to implement support vector machines in Stata using the svmachines command. To learn more about this community contriuted command, you can read this Stata Journal article. clear all set more off *Install svmachines ssc install svmachines *Import Data with a binary outcome for classification use http://www.stata-press.com/data/r16/fvex.dta, clear *First try logistic regression to benchmark the prediction quality of SVM against logit outcome group sex arm age distance y // Run the regression predict outcome_predicted // Generate predictions from the regression *Calculate the log loss - see https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html for more info gen log_loss = outcome*log(outcome_predicted)+(1-outcome)*log(1-outcome_predicted) *Run SVM svmachines outcome group sex arm age distance y, prob // Specifiying the prob option to generate predicted probabilities in the next line predict sv_outcome_predicted, probability . Next we will Calculate the log loss (or cross-entropy loss) for SVM. Note: Predictions following svmachines generate three variables from the stub you provide in the predict command (in this case sv_outcome_predicted). The first is just the same as the stub and stores the best-guess classification (the group with the highest probability out of the possible options). The next n variables store the probability that the given observation will fall into each of the possible classes (in the binary case, this is just n=2 possible classes). These new variables are the stub + the value of each class. In the case below, the suffixes are _0 and _1. We use sv_outcome_predicted_1 because it produces probabilities that are equivalent in their intepretation (probability of having a class of 1) to the probabilities produced by the logit model and that can be used in calculating the log loss. Calculating loss functions for multi-class classifiers is more complicated, and you can read more about that at the link above. gen log_loss_svm = outcome*log(sv_outcome_predicted_1)+(1-outcome)*log(1-sv_outcome_predicted_1) *Show log loss for both logit and SVM, remember lower is better sum log_loss log_loss_svm . ",
"url": "/Machine_Learning/support_vector_machine.html#stata",
"relUrl": "/Machine_Learning/support_vector_machine.html#stata"
- },"723": {
+ },"724": {
"doc": "Synthetic Control",
"title": "Synthetic Control Method (SCM)",
"content": "Synthetic Control Method is a way of estimating the causal effect of an intervention in comparative case studies. It is typically used with a small number of large units (e.g. countries, states, counties) to estimate the effects of aggregate interventions. The idea is to construct a convex combination of similar untreated units (often referred to as the “donor pool”) to create a synthetic control that closely resembles the treatment subject and conduct counterfactual analysis with it. We have \\(j = 1, 2, ..., J+1\\) units, assuming without loss of generality that the first unit is the treated unit, \\(Y_{1t}\\). Denoting the potential outcome without intervention as \\(Y_{1t}^N\\), our goal is to estimate the treatment effect: . \\[\\tau_{1t} = Y_{1t} - Y_{1t}^N\\] We won’t have data for \\(Y_{1t}^N\\) but we can use synthetic controls to estimate it. Let the \\(k\\) x \\(J\\) matrix \\(X_0 = [X_2 ... X_{J+1}]\\) represent characteristics for the untreated units and the \\(k\\)-length vector \\(X_1\\) represent characteristics for the treatment unit. Last, define our \\(J\\times 1\\) vector of weights as \\(W = (w_2, ..., w_{J+1})'\\). Recall, these weights are used to form a convex combination of the untreated units. Now we have our estimate for the treatment effect: . \\[\\hat{\\tau_{1t}} = Y_{1t} - \\hat{Y_{1t}^N}\\] where \\(\\hat{Y_{1t}^N} = \\sum_{j=2}^{J+1} w_j Y_{jt}\\). The matrix of weights is found by choosing \\(W*\\) to minimize \\(\\|X_1 - X_0W\\|\\) such that \\(W >> 0\\) and \\(\\sum_2^{J+2} w_j = 1\\). Once you’ve found the \\(W*\\), you can put together an estimated \\(\\hat{Y_{1t}}\\) (synthetic control) for all time periods \\(t\\). Because our synthetic control was constructed from untreated units, when the intervention occurs at time \\(T_0\\), the difference between the synthetic control and the treated unit gives us our estimated treatment effect. As a last bit of intuition, below is a graph depicting the upshot of the method. The synthetic control follows a very similar path to the treated unit pre-intervention. The difference between the two curves, post-intervention, gives us our estimated treatment effect. Here is an excellent resource by Alberto Abadie (the economist who developed the method) if you’re interested in getting a more comprehensive overview of synthetic controls. ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html#synthetic-control-method-scm",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html#synthetic-control-method-scm"
- },"724": {
+ },"725": {
"doc": "Synthetic Control",
"title": "Keep in Mind",
"content": ". | Unlike the difference-in-difference method, parallel trends aren’t a necessary assumption. However, the donor pool must still share similar characteristics to the treatment unit in order to construct an accurate estimate. | Panel data is necessary for the synthetic control method and, typically, requires observations over many time periods. Specifically, the pre-intervention time frame ought to be large enough to form an accurate estimate. | Aggregate data is required for this method. Examples include state-level per-capita GDP, country-level crime rates, and state-level alcohol consumption statistics. Additionally, if aggregate data doesn’t exist, you can sometimes aggregate micro-level data to estimate aggregate values. | As a caveat to the previous bullet point, be wary of structural breaks when using large pre-intervention periods. | Abadie and L’Hour (2020) also proposes a penalization method for performing the synthetic control method on disaggregated data. | . ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html#keep-in-mind",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html#keep-in-mind"
- },"725": {
+ },"726": {
"doc": "Synthetic Control",
"title": "Also Consider",
"content": ". | As stated before, this technique can be compared to difference-in-difference. If you don’t have aggregate data or don’t have sufficient data for the pre-intervention window and you have a control that you can confidently assume has a parallel trend to the treatment unit, then diff-in-diff might be better suited than SCM. | . ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html#also-consider",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html#also-consider"
- },"726": {
+ },"727": {
"doc": "Synthetic Control",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html#implementations",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html#implementations"
- },"727": {
+ },"728": {
"doc": "Synthetic Control",
"title": "R",
"content": "To implement the synthetic control method in R, we will be using the package Synth. While not used here, the SynthTools package also has a number of functions for making it easier to work with the Synth package. As stated above, the key part of the synthetic control method is to estimate the weight matrix \\(W*\\) in order to form accurate estimates of the treatment unit. The Synth package provides you with the tools to find the weight matrix. From there, you can construct the synthetic control by interacting the \\(W*\\) and the \\(Y\\) values from the donor pool. # First we will load Synth and dplyr. # If you haven't already installed Synth, now would be the time to do so library(dplyr) library(Synth) # We're going to use simulated data included in the Synth package for our example. # This dataframe consists of panel data including 1 outcome variable and 3 predictor variables for 1 treatment unit and 7 control units (donor pool) over 21 years data(\"synth.data\") # The primary function that we will use is the synth() function. # However, this function needs four particularly formed matrices as inputs, so it is highly recommended that you use the dataprep() function to generate the inputs. # Once we've gathered our dataprep() output, we can just use that as our sole input for synth() and we'll be good to go. # One important note is that your data must be in long format with id variables (integers) and name variables (character) for each unit. dataprep_out = dataprep( foo = synth.data, # first input is our data predictors = c(\"X1\", \"X2\", \"X3\"), # identify our predictor variables predictors.op = \"mean\", # operation to be performed on the predictor variables for when we form our X_1 and X_0 matrices. time.predictors.prior = c(1984:1989), # pre-intervention window dependent = \"Y\", # outcome variable unit.variable = \"unit.num\", # identify our id variable unit.names.variable = \"name\", # identify our name variable time.variable = \"year\", # identify our time period variable treatment.identifier = 7, # integer that indicates the id variable value for our treatment unit controls.identifier = c(2, 13, 17, 29, 32, 36, 38), # vector that indicates the id variable values for the donor pool time.optimize.ssr = c(1984:1990), # identify the time period you want to optimize over to find the W*. Includes pre-treatment period and the treatment year. time.plot = c(1984:1996) # periods over which results are to be plotted with Synth's plot functions ) # Now we have our data ready in the form of a list. We have all the matrices we need to run synth() # Our output from the synth() function will be a list that includes our optimal weight matrix W* synth_out = dataprep_out %>% synth() # From here, we can plot the treatment variable and the synthetic control using Synth's plot function. # The variable tr.intake is an optional variable if you want a dashed vertical line where the intervention takes place. synth_out %>% path.plot(dataprep.res = dataprep_out, tr.intake = 1990) # Finally, we can construct our synthetic control variable if we wanted to conduct difference-in-difference analysis on it to estimate the treatment effect. synth_control = dataprep_out$Y0plot %*% synth_out$solution.w . ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html#r",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html#r"
- },"728": {
+ },"729": {
"doc": "Synthetic Control",
"title": "Stata",
"content": "To implement the synthetic control method in Stata, we will be using the synth and synth_runner packages. For a short tutorial on how to carry out the synthetic control method in Stata by Jens Hainmueller, there is a useful video here. *Install plottig graph scheme used below ssc install blindschemes *Install synth and synth_runner if they're not already installed (uncomment these to install) * ssc install synth, all * cap ado uninstall synth_runner //in-case already installed * net install synth_runner, from(https://raw.github.com/bquistorff/synth_runner/master/) replace *Import Dataset sysuse synth_smoking.dta, clear *Need to set the data as time series, using tsset tsset state year . Next we will run the synthetic control analysis using synth_runner, which adds some useful options for estimation. Note that this example uses the pre-treatment outcome for just three years (1988, 1980, and 1975), but any combination of pre-treatment outcome years can be specified. The nested option specifies a more computationally intensive but comprehensive method for estimating the synthetic control. The trunit() option specifies the ID of the treated entity (in this case, the state of California has an ID of 3). synth cigsale beer lnincome retprice age15to24 cigsale(1988) /// cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) fig /// nested keep(synth_results_data.dta) replace /*Keeping the synth_results_data.dta stores a dataset of all the time series values of cigsale for each year for California (observed) and synthetic California (constructed using a weighted average of observed data from donor states). We can then import this dataset to create a synth plot whose attributes we can control. */ use synth_results_data.dta, clear drop _Co_Number _W_Weight // Drops the columns of the data that store the donor state weights twoway line (_Y_treated _Y_synthetic _time), scheme(plottig) xline(1989) /// xtitle(Year) ytitle(Cigarette Sales) legend(pos(6) rows(1)) ** Run the analysis using synth_runner *Import Dataset sysuse synth_smoking.dta, clear *Need to set the data as time series, using tsset tsset state year *Estimate Synthetic Control using synth_runner synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) /// cigsale(1975), trunit(3) trperiod(1989) gen_vars . We can plot the effects in two ways: displaying both the treated and synthetic time series together and displaying the difference between the two over the time series. The first plot is equivalent to the plot produced by specifying the fig option for synth, except you can control aspects of the figure. For both plots you can control the plot appearence by specifying effect_options() or tc_options(), depending on which plot you would like to control. effect_graphs, trlinediff(-1) effect_gname(cigsale1_effect) tc_gname(cigsale1_tc) /// effect_options(scheme(plottig)) tc_options(scheme(plottig)) /*Graph the outcome paths of all units and (if there is only one treated unit) a second graph that shows prediction differences for all units */ single_treatment_graphs, trlinediff(-1) raw_gname(cigsale1_raw) /// effects_gname(cigsale1_effects) effects_ylabels(-30(10)30) /// effects_ymax(35) effects_ymin(-35) . ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html#stata",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html#stata"
- },"729": {
+ },"730": {
"doc": "Synthetic Control",
"title": "Synthetic Control",
"content": " ",
"url": "/Model_Estimation/Research_Design/synthetic_control_method.html",
"relUrl": "/Model_Estimation/Research_Design/synthetic_control_method.html"
- },"730": {
+ },"731": {
"doc": "Task Scheduling with Github Actions",
"title": "The Problem We’ll Solve",
"content": "The United States Substance Abuse and Mental Health Services Administration (SAMHSA) is an agency inside the U.S. Department of Health and Human Services tasked with overseeing the country’s substance abuse and mental health initiatives. A major one of these initiatives is maintaining the list of “waived providers” who can prescribe opioids, something that is typically prohibited under the federal Controlled Substances Act. SAMHSA makes available a list of currently waived providers, but does not publish (at least easily) historical lists of providers. As such, we’ll write a small web scraper that pulls all the data from their website and writes it out to a CSV. This article, however, is not about web scrapers. Instead, our problem is that SAMHSA seems to update the list without fanfare at irregular intervals. So we would like to scrape their website every day. This article demonstrates how set up a Github repo to do just that. ",
"url": "/Other/task_scheduling_with_github_actions.html#the-problem-well-solve",
"relUrl": "/Other/task_scheduling_with_github_actions.html#the-problem-well-solve"
- },"731": {
+ },"732": {
"doc": "Task Scheduling with Github Actions",
"title": "Requirements",
"content": "You’ll need: . | A Github account and some familiarity with git | A program that can be run on the command line that accomplishes your data gathering task | The requirements for that program enumerated in one of several standard ways | . For the rest of this section, we’ll focus a bit on requirements (2) and (3). Requirement (2): A command line program . What you’ll be able to tell Github to do is run a series of commands. It is best to package these up into one command that will do everything for you. For instance, if you’re using python, you will probably want to have a file called main.py that looks something like this: . import csv import sys from datetime import datetime from typing import List, Union import requests URL = \"https://whereveryourdatais.com/\" def process_page(html: str) -> List[List[Union[int, str]]]: \"\"\" This is the meat of your web scraper: Pulling out the data you want from the HTML of the web page \"\"\" def pull_data(url: str) -> List[List[Union[int, str]]]: resp = requests.get(url) resp.raise_for_status() content = resp.content.decode('utf8') return process_page(content) def main(): # The program takes 1 optional argument: an output filename. If not present, # we will write the output a default filename, which is: filename = f\"data/output-{datetime.utcnow().strftime('%Y-%m-%d').csv\" if len(sys.argv) > 1: filename = sys.argv[1] print(f\"Will write data to {filename}\") print(f\"Pulling data from {URL}...\") data = pull_data(URL) print(f\"Done pulling data.\") print(\"Writing data...\") with open(filename, 'wt') as outfile: writer = csv.writer(outfile) writer.writerows(data) print(\"Done writing data.\") if __name__ == \"__main__\": main() . Here the meat of your web scraper goes into the pull_data and the process_page functions. These are then wrapped into the main function which you can call on the command line as: . python3 main.py . Similarly, if you’re using R, you’ll want to create a main.R file to similar effect. For instance, it might look something like: . library(readr) library(httr) URL <- \"https://whereveryourdatais.com/\" #' This hte meat of your web scraper: #' Pulling out the data you want from the HTML of the web page process_page <- function(html) { # Process html } #' Pull data from a single URL and return a tibble with it nice and ordered pull_data <- function(url) { resp <- GET(url) if (resp$status_code >= 400) { stop(paste0(\"Something bad occurred in trying to pull \", URL)) } return(process_page(content(resp))) } main <- function() { # The program takes 1 optional argument: an output filename. If not present, # we will write the output a default filename, which is: date <- Sys.time() attr(date, \"tzone\") <- \"UTC\" filename <- paste0(\"data/output-\", as.Date(date, format = \"%Y-%m-%d\")) args <- commandArgs(trailingOnly = TRUE) if (length(args) > 0) { filename <- args[1] } print(paste0(\"Will write data to \", filename)) print(paste0(\"Pulling data from \", URL)) data <- pull_data(URL) print(\"Done pulling data\") print(\"Writing data...\") write_csv(data, filename) print(\"Done writing data.\") } . Here the meat of your web scraper goes into the pull_data and the process_page functions. These are then wrapped into the main function which you can call on the command line as (note the --vanilla): . Rscript --vanilla main.R . Requirement (3): Enumerated lists of requirements . In order for Github to run your command, it will need to know what dependencies it needs to install. 
For experts, using a tool like poetry in Python or renv in R is probably what you actually want to do. However, for the purposes of this article, we’ll stick to a simple list. As such, you should create a file entitled requirements.txt in your project’s main folder. In this you should list, one requirement per line, the requirements of your script. For instance, in the python example above, your requirements.txt should look like . requests . The R example should have . httr readr . If you’re using R, you’ll also need to add the following script in a file called install.R to your project: . CRAN <- \"https://mirror.las.iastate.edu/CRAN/\" process_file <- function(filepath) { con <- file(filepath, \"r\") while (TRUE) { line <- trimws(readLines(con, n = 1)) if (length(line) == 0) { break } install.packages(line, repos = CRAN) } close(con) } process_file(\"requirements.txt\") . ",
"url": "/Other/task_scheduling_with_github_actions.html#requirements",
"relUrl": "/Other/task_scheduling_with_github_actions.html#requirements"
- },"732": {
+ },"733": {
"doc": "Task Scheduling with Github Actions",
"title": "Setting up the Action",
"content": "With all of the above accomplished, you should have a main.py or a main.R file and a requirements.txt file setup in your repository. If you’re using R, you’ll also have an install.R script present. With that, we move to setting up the Github Action! . In this section, we assume that your repository is already on Github. Throughout, we’ll assume that the repository is hosted at USERNAME/REPO, e.g., lost-stats/lost-stats.github.io. Telling it to run . Now you just need to add a file called .github/workflows/schedule.yml to your repo. Its contents should look like this: . name: Run scheduled action on: schedule: # You need to set your schedule here - cron: CRON_SCHEDULE jobs: pull_data: runs-on: ubuntu-20.04 steps: - name: Checkout code uses: actions/checkout@v2 with: persist-credentials: false fetch-depth: 0 # If using Python: - name: Set up Python 3.8 uses: actions/setup-python@v2 with: python-version: \"3.8\" # If using R: - name: Set up R 4.0.3 uses: r-lib/actions/setup-r@v1 with: r-version: \"4.0.3\" # If using Python: - name: Install dependencies run: pip install -r requirements.txt # If using R: - name: Install dependencies run: Rscript --vanilla install.R # If using Python: - name: Pull data run: python3 main.py # If using R: - name: Pull data run: Rscript --vanilla main.R # NOTE: This commits everything in the `data` directory. Make sure this matches your needs - name: Git commit run: | git add data git config --local user.email \"action@github.com\" git config --local user.name \"GitHub Action\" git commit -m \"Commiting data\" # NOTE: Check that your branch name is correct here - name: Git push run: | git push \"https://${GITHUB_ACTOR}:${TOKEN}@github.com/${GITHUB_REPOSITORY}.git\" HEAD:main env: TOKEN: ${{ secrets.GITHUB_TOKEN }} . You’ll need to edit this file and retain only the stanzas that pertain to whether you’re using Python or R. However, you’ll need to make a few adjustments. Let’s go through the file stanza by stanza to explain what it is doing: . name: Run scheduled action . This is just a descriptive name. Everything after the : is decorative. Name it whatever you like! . on: . This section describes when the action should run. Github actions supports several potential events, including push, pull_request, and repository_dispatch. However, since this is a scheduled action, we’re going to use the schedule event. The next line - cron: CRON_SCHEDULE tells Github how frequently to run the action. You need to replace CRON_SCHEDULE with your preferred frequency. You need to write this in “cron syntax,” which is an arcane but pretty universally recognized format for specifying event schedules. I recommend using a helper like this one to write this expression. For instance, let’s say we want to run this job at noon UTC every day. Then this line should become - cron: \"0 12 * * *\". jobs: . This tells us that we’re about to begin specifying the list of jobs to be run on the schedule described above. pull_data: . This is also just a descriptive name. It is best that it follow snake_casing, in particular, it should have no spaces or strange characters. runs-on: ubuntu-20.04 . This specifies which operating system to run your code on. Github supports a lot of choices, but generally, ubuntu-20.04 or ubuntu-latest is what you’ll want. steps: . In what follows, we list out the individual steps Github should take. Each step consists of several components: . | name: A descriptive name. Can be anything you’d like. It’s also optional, but I find it useful. 
| uses: Optionally reference an series of steps somebody else has already specified. | with: If using uses:, specificy any variables in calling that action. | run: Just simply run a (series of) commands in the shell, one per line. | env: Specify envrionment variables for use in the shell. | . We’ll see several examples of this below. Checkout code . This stanza tells the action to checkout this repository’s code. This will begin basically every Github action you build. Note that it uses: a standard action that is maintained by Github itself. Setup Python or R . These are actions that tell Github to make a specific version of Python or R available in your envrionment. You probably only need one, but you can use both if you need. Specify the exact version you want in the with: section. Install dependencies . This runs a script that installs all the dependencies you enumerated earlier in requirements.txt. Python comes with a built in dependency manager called pip, so we just point it to our list of dependencies. On the other hand, we tell R to execute our dependency installation script install.R. In either case, we’re using run: as we’re telling Github to execute a command in its own shell. Pull data . This is the task we’re actually going to run! Note that we’re calling either the main.py or main.R file we built before. After this is done, we assume there will be a new file in the data/ directory. Git commit . This stanza commits the new data to this repository and sets up the required git variables. Note that here we’re using run: |. In YAML, ending a line with | indicates that all the following lines that are at the same tab depth should be used as a single value. So here, we’re telling Github to run the commands, git add data, git config --local user.email \"action@github.com\", etc in order. Git push . This pushes the commit back up to the repository using git push. Note that if the name of your main branch is not main (for instance, it may be master), you will need to change HEAD:main to whatever your main branch is called (e.g., HEAD:master). Also note that we are setting an environment variable here. Specfically, in the env: section we’re setting the TOKEN environment variable to ${{ secrets.GITHUB_TOKEN }}. This is a a special value that Github generates for each run of your action that allows your action to manipulate its own repository. In this case, it’s allowing it to push a commit back to the central repository. ",
"url": "/Other/task_scheduling_with_github_actions.html#setting-up-the-action",
"relUrl": "/Other/task_scheduling_with_github_actions.html#setting-up-the-action"
- },"733": {
+ },"734": {
"doc": "Task Scheduling with Github Actions",
"title": "And that’s all!",
"content": "And that’s it! With that file commited, you Github action should run every day at noon UTC. From here, there are a lot of simple extensions to be made and tried. Here are some challenges to make sure you know what’s going on above: . | Instead making the job run every day at noon UTC, make it run on Wednesdays at 4pm UTC. | Instead of returning at tibble, return a data.frame in R. Note that you’ll need to expand the collection of requirements! | Instead of returning a list of lists in Python, return a pandas data frame. Note that you’ll need to expand the collection of requirements! | . ",
"url": "/Other/task_scheduling_with_github_actions.html#and-thats-all",
"relUrl": "/Other/task_scheduling_with_github_actions.html#and-thats-all"
- },"734": {
+ },"735": {
"doc": "Task Scheduling with Github Actions",
"title": "One final note: API keys",
"content": "A very common need to pull data is some sort of API key. Your cron job will need access to your API key. Conveniently, Github has provided a nice functionality to do exactly this: Secrets. To get your API key to your script, follow these steps: . | Setup your secret according to the above instructions. Let’s give it the name API_KEY for convenience. | Modify your main.py or main.R file to look for the API_KEY environemnt variable. For instance, in Python you might do: | . import os api_key = os.environ.get(\"API_KEY\", \"some_other_way\") . or in R you might do . api_key <- Sys.getenv(\"API_KEY\", unset = \"some_other_way\") . | Amend the Pull data step in your action to set the API_KEY environment variable. For instance, it might look like: | . - name: Pull data run: python3 main.py env: API_KEY: ${{ secrets.API_KEY }} . ",
"url": "/Other/task_scheduling_with_github_actions.html#one-final-note-api-keys",
"relUrl": "/Other/task_scheduling_with_github_actions.html#one-final-note-api-keys"
- },"735": {
+ },"736": {
"doc": "Task Scheduling with Github Actions",
"title": "Task Scheduling with Github Actions",
"content": "Typically when performing statistical analyses, we write code to be run approximately once. But software more generally is frequently run multiple times. Web servers run constantly, executing the same code over and over in response to user commands. A video game is rerun on demand, each time you turn it on. In statistical analyses, though, if code is to be run multiple times, it often needs to be run on a schedule. For instance, you may want to scrape weather data every hour to build an archive for later analysis. Or perhaps you want to perform the same statistical analyses each week on new data as it comes in. In our experience, this is the worst kind of tasks for humans to do: They have to reliably remember to run a piece of code at a specified time, aggregate the results in a consistent format, and then walk away. One mistimed meeting or baby feeding and it’s likely the reseaercher will forget to hit “go.” . Thankfully, in addition to doing things over and over or on demand, computers are also reasonably good at keeping time. In this article, we’ll describe the role of a task scheduler and demonstrate how to use Github Actions to run a simple data gathering task at regular intervals and commit that data to a repository. ",
"url": "/Other/task_scheduling_with_github_actions.html",
"relUrl": "/Other/task_scheduling_with_github_actions.html"
- },"736": {
+ },"737": {
"doc": "Tobit Regression",
"title": "Tobit Regression",
"content": "If you have ever encountered data that is censored in some way, then the Tobit method is worth a detailed look. Perhaps the measurement tools only detect at a minimum threashold or up until some maximum threshold, or there’s a physical limitation or natural constraint that cuts off the range of outcomes. If the dependent variable has a limited range in any way, then an OLS regression will capture the relationship between variables with a cluster of zeros or maximums distorting the relationship. James Tobin’s big idea was essentially to modify the likelihood function to represent the unequal sampling probability of observations depending if a latent dependent variable is smaller than or larger than that range. The Tobit model is also called a Censored Regression Model for this reason, as it allows flexility to account of either left or right side censorship. There is flexibility in the mathematics depending on how the censoring occurs. To learn more to match the mathematics/functional form to your practical application, wikipedia has a great page here along with links to outside practical applications. Para Español, dale click en el siguiente enlance aqui. Estas notas tienen las lecciones importantes de esta pagina en Ingles. ",
"url": "/Model_Estimation/GLS/tobit.html",
"relUrl": "/Model_Estimation/GLS/tobit.html"
- },"737": {
+ },"738": {
"doc": "Tobit Regression",
"title": "Keep in Mind",
"content": ". | Tobit is used with Censored Data, which IS NOT the same as Truncated Data (see next section) | Tobit can produce a kinked relationship after a zero cluster | Tobit can find the correct relationship underneath a maximum cluster | For non-parametric tobit, you will need a CLAD operator (see link in next section) | . ",
"url": "/Model_Estimation/GLS/tobit.html#keep-in-mind",
"relUrl": "/Model_Estimation/GLS/tobit.html#keep-in-mind"
- },"738": {
+ },"739": {
"doc": "Tobit Regression",
"title": "Also Consider",
"content": ". | If you are new to the concept of limited dependent variables or OLS Regression, click these links. | Deciphering whether data is censored or truncated is important. If all observations are observed in “X” but the true value of “Y” isn’t known outside some range, then it is Censored. At the Chernobyl disaster the radioactive isotope meter only read up until a maximum threshold, all areas (“X”) are observed but the true value of the radioactive level (“Y”) is right censored at a maximum. When there is not a full set of “X” observed, then data is truncated, or in other words, a censored Y value does not get it’s input x observed thus the set {Y,X} is not complete. For more info try these UH slides from Bauer School of Business (they also have relatively easily digestable theory). | The Tobit model type I (the main one people are talking about without specification) is really a morphed maximum likelihood estimation of a probit, more background from those links. | If you find yourself needing non-parametric form, you will need to use a CLAD operator as well as new variance estimation techniques, I recommend Bruce Hansen’s from University of Wisconsin, notes here. | . ",
"url": "/Model_Estimation/GLS/tobit.html#also-consider",
"relUrl": "/Model_Estimation/GLS/tobit.html#also-consider"
- },"739": {
+ },"740": {
"doc": "Tobit Regression",
"title": "Implementations",
"content": " ",
"url": "/Model_Estimation/GLS/tobit.html#implementations",
"relUrl": "/Model_Estimation/GLS/tobit.html#implementations"
- },"740": {
+ },"741": {
"doc": "Tobit Regression",
"title": "R",
"content": "We can use the AER package (link) to run a tobit model in R. # install.packages(\"AER\") # Install first if you don't have it yet library(AER) data(\"Affairs\") # Use the \"Affairs\" dataset provided with AER # Aside: this example replicates Table 22.4 in Greene (2003) tob_mod1 = tobit(affairs ~ age + yearsmarried + religiousness + occupation + rating, data = Affairs) summary(tob_mod1) # The default left- and right-hand side limts for the censored dependent variable # are 0 and Inf, respectively. You might want to change these after inspecting your # data. hist(Affairs$affairs tob_mod2 = tobit(affairs ~ age + yearsmarried + religiousness + occupation + rating, data = Affairs, right = 4) # RHS censored now at 4 summary(tob_mod2) . For another example check out M Clark’s Models by Example Page. ",
"url": "/Model_Estimation/GLS/tobit.html#r",
"relUrl": "/Model_Estimation/GLS/tobit.html#r"
- },"741": {
+ },"742": {
"doc": "2x2 Difference in Difference",
"title": "2X2 Difference-in-Differences",
"content": "Causal inference with cross-sectional data is fundamentally tricky. | People, firms, etc. are different from one another in lots of ways. | Can only get a clean comparison when you have a (quasi-)experimental setup, such as an experiment or an regression discontinuity. | . Difference-in-difference makes use of a treatment that was applied to one group at a given time but not another group. It compares how each of those groups changed over time (comparing them to themselves to eliminate between-group differences) and then compares the treatment group difference to the control group difference (both of which contain the same time gaps, eliminating differences over time). ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#2x2-difference-in-differences",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#2x2-difference-in-differences"
- },"742": {
+ },"743": {
"doc": "2x2 Difference in Difference",
"title": "KEEP IN MIND",
"content": ". | For Difference-in-differences to work, parallel trends must hold. That is, nothing else should be changing the gap between treated and control states at the same time as the treatment. While it is not a formal test of parallel trends, researchers often look at whether the gap between treated and control states is constant in pre-treatment years. | Suppose in \\(t = 0\\) (“Pre-period”), and \\(t = 1\\) (“Post-period”). We want to estimate \\(\\tau = Post - Pre\\), or \\(Y(post)-Y(pre)= Y(t=1)-Y(t=0)=\\tau\\). | . ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#keep-in-mind",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#keep-in-mind"
- },"743": {
+ },"744": {
"doc": "2x2 Difference in Difference",
"title": "ALSO CONSIDER",
"content": ". | This page discusses “2x2” difference-in-difference design, meaning there are two groups, and treatment occurs at a single point in time. Many difference-in-difference applications instead use many groups, and treatments that are implemented at different times (a “rollout” design). Traditionally these models have been estimated using fixed effects for group and time period, i.e. “two-way” fixed effects. However, this approach with difference-in-difference can heavily bias results if treatment effects differ across groups, and alternate estimators are preferred. See Goodman-Bacon 2018 and Callaway and Sant’anna 2019. | . ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#also-consider",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#also-consider"
- },"744": {
+ },"745": {
"doc": "2x2 Difference in Difference",
"title": "IMPLEMENTATIONS",
"content": " ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#implementations",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#implementations"
- },"745": {
+ },"746": {
"doc": "2x2 Difference in Difference",
"title": "Python",
"content": "# Step 1: Load libraries and import data import pandas as pd import statsmodels.api as sm # for certain versions of jupyter: # %matplotlib inline url = ( \"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS\" \".github.io/master/Model_Estimation/Data/\" \"Two_by_Two_Difference_in_Difference/did_crime.xlsx\" ) df = pd.read_excel(url) # Step 2: indicator variables # whether treatment has occured at all df['after'] = df['year'] >= 2014 # whether it has occurred to this entity df['treatafter'] = df['after'] * df['treat'] # Step 3: # use pandas basic built in plot functionality to get a visual # perspective of our parallel trends assumption ax = df.pivot(index='year', columns='treat', values='murder').plot( figsize=(20, 10), marker='.', markersize=20, title='Murder and Time', xlabel='Year', ylabel='Murder Rate', # to make sure each year is displayed on axis xticks=df['year'].drop_duplicates().sort_values().astype('int') ) # the function returns a matplotlib.pyplot.Axes object # we can use this axis to add additional decoration to our plot ax.axvline(x=2014, color='gray', linestyle='--') # treatment year ax.legend(loc='upper left', title='treat', prop={'size': 20}) # move and label legend # Step 4: # statsmodels has two separate APIs # the original API is more complete both in terms of functionality and documentation X = sm.add_constant(df[['treat', 'treatafter', 'after']].astype('float')) y = df['murder'] sm_fit = sm.OLS(y, X).fit() # the formula API is more familiar for R users # it can be accessed through an alternate constructor bound to each model class smff_fit = sm.OLS.from_formula('murder ~ 1 + treat + treatafter + after', data=df).fit() # it can also be accessed through a separate namespace import statsmodels.formula.api as smf smf_fit = smf.ols('murder ~ 1 + treat + treatafter + after', data=df).fit() # if using jupyter, rich output is displayed without the print function # we should see three identical outputs print(sm_fit.summary()) print(smff_fit.summary()) print(smf_fit.summary()) . ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#python",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#python"
- },"746": {
+ },"747": {
"doc": "2x2 Difference in Difference",
"title": "R",
"content": "In this case, we need to discover whether legalized marijuana could change the murder rate. Some states legalized marijuana in 2014. So we measure the how the murder rate changes from before 2014 to after between legalized states and states without legalization. Step 1: . | First of all, we need to load Data and Package, we call this data set “DiD”. | . library(tidyverse) library(broom) library(readxl) library(httr) # Download the Excel file from a URL tf <- tempfile(fileext = \".xlsx\") GET( \"https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Two_by_Two_Difference_in_Difference/did_crime.xlsx\", write_disk(tf) ) DiD <- read_excel(tf) . Step 2: . Notice that the data has already been collapsed to the treated-year level. That is, there is one observation of the murder rate for each year for the treated states (all averaged together), and one observation of the murder rate for each year for the untreated states (all averaged together). We create the indicator variable called after to indicate whether it is in the treated period of being after the year of 2014 (1), or the before period of between 2000-2013 (0). The variable treat indicates that the state legalizes marijuana in 2014. Notice that treat = 1 in these states even before 2014. If the year is after 2014 and the state decided to legalize marijuana, the indicator variable “treatafter” is “1” . DiD <- DiD %>% mutate(after = year >= 2014) %>% mutate(treatafter = after*treat) . Step 3: . Then we need to plot the graph to visualize the impact of legalize marijuana on murder rate by using ggplot. mt <- ggplot(DiD,aes(x=year, y=murder, color = treat)) + geom_point(size=3)+geom_line() + geom_vline(xintercept=2014,lty=4) + labs(title=\"Murder and Time\", x=\"Year\", y=\"Murder Rate\") mt . It looks like, before the legalization occurred, murder rates in treated and untreated states were very similar, lending plausibility to the parallel trends assumption. Step 4: . We need to measure the impact of impact of legalize marijuana. If we include treat, after, and treatafter in a regression, the coefficient on treatafter can be interpreted as “how much bigger was the before-after difference for the treated group?” which is the DiD estimate. reg<-lm(murder ~ treat+treatafter+after, data = DiD) summary(reg) . After legalization, the murder rate dropped by 0.3% more in treated than untreated states, suggesting that legalization reduced the murder rate. ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#r",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html#r"
- },"747": {
+ },"748": {
"doc": "2x2 Difference in Difference",
"title": "2x2 Difference in Difference",
"content": " ",
"url": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html",
"relUrl": "/Model_Estimation/Research_Design/two_by_two_difference_in_difference.html"
- },"748": {
+ },"749": {
"doc": "Home",
"title": "Home",
"content": "# Home Welcome to the **Library of Statistical Techniques** (LOST)! LOST is a publicly-editable website with the goal of making it easy to execute statistical techniques in statistical software. Each page of the website contains a statistical technique — which may be an estimation method, a data manipulation or cleaning method, a method for presenting or visualizing results, or any of the other kinds of things that statistical software typically does. For each of those techniques, the LOST page will contain code for performing that method in a variety of packages and languages. It may also contain information (or links) with thorough descriptions of the method, but the focus here is on implementation. How can you do it in your language of choice? If there are multiple ways, how are those ways different? Is the way you used to do it outdated, or does it do something unexpected? What's the `R` equivalent of that command you know about in `Stata` or `SAS`, or vice versa? In short, LOST is a Rosetta Stone for statistical software. If you are interested in contributing to LOST, please see the [Contributing](https://lost-stats.github.io/Contributing/Contributing.html) page. LOST was originated in 2019 by Nick Huntington-Klein and is maintained by volunteer contributors. The project's GitHub page is [here](https://github.com/LOST-STATS/lost-stats.github.io). ",
diff --git a/feed.xml b/feed.xml
index e8943962..686a601e 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1 +1 @@
-Jekyll2024-08-15T21:20:07+00:00/feed.xmlLOST
\ No newline at end of file
+Jekyll2024-08-19T18:43:02+00:00/feed.xmlLOST
\ No newline at end of file