diff --git a/Data_Manipulation/creating_a_variable_with_group_calculations.md b/Data_Manipulation/creating_a_variable_with_group_calculations.md
index be28dabf..15aec074 100644
--- a/Data_Manipulation/creating_a_variable_with_group_calculations.md
+++ b/Data_Manipulation/creating_a_variable_with_group_calculations.md
@@ -76,6 +76,34 @@ storms['mean_wind'] = storms.groupby(['name','year','month','day'])['wind'].tran
```

Though the above works, it certainly seems complex. There is a much easier way to achieve similar results that is easier on the eyes (and brain!): pandas' `aggregate()` method with named-aggregation tuples. Calling `aggregate()` after grouping is arguably the easiest approach to understand, since it lets us follow a very simple format of `new_column_name = ('old_column', 'agg_func')`. So, for example:

```python
import pandas as pd

# Pull in data on storms
storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv')

# Use groupby to group the columns, then perform group calculations

# The below calculations aren't particularly indicative of a good analysis,
# but they give a quick look at a few of the calculations you can do
df = (
    storms
    .groupby(by=['name', 'year', 'month', 'day'])  # group
    .aggregate(
        avg_wind=('wind', 'mean'),
        max_wind=('wind', 'max'),
        med_wind=('wind', 'median'),
        std_pressure=('pressure', 'std'),
        first_year=('year', 'first')
    )
    .reset_index()  # Somewhat similar to ungroup. Removes the grouping from the index
)
```

## R

In R, we can use either the **dplyr** or **data.table** package to do this.

diff --git a/Data_Manipulation/creating_categorical_variables.md b/Data_Manipulation/creating_categorical_variables.md
index 625d6a18..43c50810 100644
--- a/Data_Manipulation/creating_categorical_variables.md
+++ b/Data_Manipulation/creating_categorical_variables.md
@@ -60,6 +60,64 @@ mtcars['classification'] = mtcars.apply(lambda x: next(key for
```

There's quite a bit to unpack here! `.apply(lambda x: ..., axis=1)` applies a lambda function rowwise to the entire dataframe, with individual columns accessed by, for example, `x['mpg']`. (You can apply functions on an index using `axis=0`.) The `next` function returns the next entry in a list that evaluates to true or exists (so in this case it will just return the first entry that exists). Finally, `key for key, value in conds_dict.items() if value(x)` iterates over the pairs in the dictionary and returns only the condition names (the 'keys' in the dictionary) for conditions (the 'values' in the dictionary) that evaluate to true.

Once again, just like R, Python has *many* ways of doing the same thing. Some are more complex but efficient at runtime, while others are slightly slower but many times easier to understand and follow along with, thanks to their closeness to natural-language syntax. So, for this example, we will use numpy and pandas together to achieve both an efficient runtime and a relatively simple syntax.
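If `np.select()` is unfamiliar, a minimal sketch of the core idea may help first (the tiny Series and labels below are made up purely for illustration): it checks a list of boolean conditions in order and returns the choice paired with the first condition that matches, falling back to `default` when none do.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 15, 25])  # toy data, for illustration only
np.select(
    condlist=[s < 10, s < 20],  # conditions, checked in order
    choicelist=['low', 'mid'],  # value used for the first True condition
    default='high'              # used when no condition matches
)
# array(['low', 'mid', 'high'], dtype='<U4')
```

The same pattern scales up to the full example:
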
```python
from seaborn import load_dataset
import pandas as pd
import numpy as np

mtcars = load_dataset('mpg')

# Create our list of boolean conditions
condition_list = [
    (mtcars['mpg'] <= 19) & (mtcars['horsepower'] <= 123),
    (mtcars['mpg'] > 19) & (mtcars['horsepower'] <= 123),
    (mtcars['mpg'] <= 19) & (mtcars['horsepower'] > 123),
    (mtcars['mpg'] > 19) & (mtcars['horsepower'] > 123)
]

# Create the results we will pair with the conditions above
result_list = [
    'Efficient and Non-powerful',
    'Inefficient and Non-powerful',
    'Efficient and Powerful',
    'Inefficient and Powerful'
]


df = (
    mtcars
    .assign(
        # Run the numpy select
        classification=np.select(condlist=condition_list,
                                 choicelist=result_list,
                                 default='Not Considered')
    )
    # Convert from object to categorical
    .astype({'classification': 'category'})
)


"""
Be a more purposeful programmer/analyst/data scientist:

Using the default parameter in np.select() lets you fill in that
specific text wherever your criteria are not met. For example, if
you search this data, you will see there are a few rows where
horsepower is null. The original criteria we built do not consider
nulls, so those rows are populated with "Not Considered", allowing
you to find and correct them, or to set checks for them in a pipeline.
"""

```

## R

We will create a categorical variable in two ways, first using `case_when()` from the **dplyr** package, and then using the faster `fcase()` from the **data.table** package.

diff --git a/Machine_Learning/Nearest_Neighbor.md b/Machine_Learning/Nearest_Neighbor.md
index 4fc2842e..294152ab 100644
--- a/Machine_Learning/Nearest_Neighbor.md
+++ b/Machine_Learning/Nearest_Neighbor.md
@@ -158,6 +158,60 @@ if __name__ == "__main__":
    main()
```

A *very* simple way to get a very basic KNN working in Python is to leverage the knowledge of the many smart people who contribute to the scikit-learn library (sklearn), a powerhouse of machine learning models as well as other very useful tools like data splitting, model evaluation, and feature selection.

```python
# Import libraries
from seaborn import load_dataset
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a sample dataset
iris_df = load_dataset('iris')


# Quick and rough sketch comparing the petal features to species
sns.scatterplot(data=iris_df, x='petal_length', y='petal_width', hue='species')


# Quick and rough sketch comparing the sepal features to species
sns.scatterplot(data=iris_df, x='sepal_length', y='sepal_width', hue='species')


# Let's separate the data into X and Y (features and target)
X = iris_df.drop(columns='species')
Y = iris_df['species']


# Split the data into training and testing sets for model evaluation
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.70, shuffle=True,
                                                    random_state=777)


# Iterate through different numbers of neighbors to find the best accuracy with N neighbors.
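# Note (an aside, not part of the original example): accuracy from a single
# train/test split can be noisy. For a more robust choice of k, you could
# instead use cross-validation, e.g.
#     from sklearn.model_selection import cross_val_score
#     scores = cross_val_score(KNeighborsClassifier(n_neighbors=i), X, Y, cv=5)
# and compare the mean score for each candidate k.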
accuracies = {}
for i in range(1, 15):
    clf = KNeighborsClassifier(n_neighbors=i)

    clf.fit(X=X_train, y=y_train)
    y_pred = clf.predict(X_test)

    accu_score = accuracy_score(y_true=y_test, y_pred=y_pred)
    accuracies[i] = accu_score

sns.lineplot(x=list(accuracies.keys()),
             y=list(accuracies.values())).set_title('Accuracies by N-Neighbors')

# Looks like about 8 is the first best accuracy, so we'll go with that.
print(f"{accuracies[8]:.1%}")  # 100% accuracy for 8 neighbors.

```

## R

The simplest way to perform KNN in R is with the package **class**. It has a KNN function that is rather user friendly and does not require you to compute distances yourself, as it runs everything with Euclidean distance. For more advanced types of nearest-neighbor matching it would be best to use the `matchit` function from the [**MatchIt** package](https://kosukeimai.github.io/MatchIt/reference/matchit.html). To verify results, this example also used the `confusionMatrix` function from the package **caret**.

diff --git a/Other/import_a_foreign_data_file.md b/Other/import_a_foreign_data_file.md
index a81f325e..5a674906 100644
--- a/Other/import_a_foreign_data_file.md
+++ b/Other/import_a_foreign_data_file.md
@@ -70,6 +70,34 @@ using XLSX
df = DataFrame(XLSX.readtable("filename.xlsx", "mysheet"))
```

## Python

You'll most often be relying on pandas to read in data. Though many other methods exist, the reason you're pulling in data is usually to work with it, transform it, and manipulate it, and pandas lends itself extremely well to this purpose. Sometimes you may have to work with much messier data from APIs, where you'll navigate through hierarchies of dictionaries using the `.keys()` method and selecting levels, but that is handled on a case-by-case basis and is impossible to cover here. However, some of the most common formats will be covered: csv, excel (xlsx), and .RData files.

You, of course, always have the default `open()` function, but that can get much more complex.

```python
# Reading .RData files
import pyreadr

rds_data = pyreadr.read_r('sales_data.Rdata')  # The object returned is a dictionary

# 'sales' is the name of the dataframe; if it is unnamed, you may have to
# pass None as the key (no quotes)
df_r = rds_data['sales']
df_r.head()


# Other common file reads all use pandas. The two most common (csv/xlsx) are shown
import pandas as pd

csv_file = pd.read_csv('filename.csv')
xlsx_file = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

# Pandas can also read html, json, etc.

```

## R

```r?skip=true&skipReason=files_dont_exist