Some python additions and/or much simpler ways of doing things. #216

Merged 4 commits on Aug 19, 2024
28 changes: 28 additions & 0 deletions Data_Manipulation/creating_a_variable_with_group_calculations.md
@@ -76,6 +76,34 @@ storms['mean_wind'] = storms.groupby(['name','year','month','day'])['wind'].tran

```

Though the above may be a great way to do it, it certainly seems complex. There is a much easier way to achieve similar results that is easier on the eyes (and brain!): pandas' `aggregate()` method with named-aggregation tuples. Calling `aggregate()` after grouping lets us follow the very simple format `new_column_name = ('old_column', 'agg_func')`. So, for example:

```python
import pandas as pd

# Pull in data on storms
storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv')

# Group by the identifying columns and perform group calculations.
# The calculations below aren't particularly indicative of a good analysis,
# but they give a quick look at a few of the calculations you can do.
df = (
    storms
    .groupby(by=['name', 'year', 'month', 'day'])  # group
    .aggregate(
        avg_wind=('wind', 'mean'),
        max_wind=('wind', 'max'),
        med_wind=('wind', 'median'),
        std_pressure=('pressure', 'std'),
        first_year=('year', 'first')
    )
    .reset_index()  # Somewhat similar to ungroup; removes the grouping from the index
)
```
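If the tuple syntax seems terse, note that each `new_column = ('old_column', 'agg_func')` pair is just shorthand for pandas' `pd.NamedAgg`; a quick sketch of the equivalent spelled-out form:

```python
import pandas as pd

storms = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv')

# Equivalent to avg_wind=('wind', 'mean') above, written out with pd.NamedAgg
df = (
    storms
    .groupby(by=['name', 'year', 'month', 'day'])
    .aggregate(avg_wind=pd.NamedAgg(column='wind', aggfunc='mean'))
    .reset_index()
)
```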


## R

In R, we can use either the **dplyr** or **data.table** package to do this.
58 changes: 58 additions & 0 deletions Data_Manipulation/creating_categorical_variables.md
@@ -60,6 +60,64 @@ mtcars['classification'] = mtcars.apply(lambda x: next(key for
```
There's quite a bit to unpack here! `.apply(lambda x: ..., axis=1)` applies a lambda function rowwise to the entire dataframe, with individual columns accessed by, for example, `x['mpg']`. (You can apply functions column-wise using `axis=0`.) The built-in `next()` function returns the first entry of the generator that evaluates to true (so in this case it will just return the first matching entry). Finally, `key for key, value in conds_dict.items() if value(x)` iterates over the pairs in the dictionary and yields only the condition names (the 'keys' in the dictionary) whose conditions (the 'values' in the dictionary) evaluate to true.
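If the `next(...)` pattern is hard to picture, here is a minimal standalone sketch of just that piece, using a hypothetical `conds_dict` of label-to-condition functions and a single row:

```python
# Hypothetical condition dictionary: label -> function of a row
conds_dict = {
    'efficient': lambda x: x['mpg'] > 19,
    'inefficient': lambda x: x['mpg'] <= 19,
}

row = {'mpg': 25}

# next(...) returns the first key whose condition evaluates to true
label = next(key for key, value in conds_dict.items() if value(row))
print(label)  # 'efficient'
```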

Once again, just like R, Python has *many* ways of doing the same thing. Some are more complex but efficient at runtime, while others are slightly slower but many times easier to understand and follow along with, thanks to their closeness to natural-language syntax. So, for this example, we will use numpy and pandas together to achieve both an efficient runtime and a relatively simple syntax.

```python
from seaborn import load_dataset
import pandas as pd
import numpy as np

mtcars = load_dataset('mpg')

# Create our list of boolean conditions
conditionList = [
    (mtcars['mpg'] <= 19) & (mtcars['horsepower'] <= 123),
    (mtcars['mpg'] > 19) & (mtcars['horsepower'] <= 123),
    (mtcars['mpg'] <= 19) & (mtcars['horsepower'] > 123),
    (mtcars['mpg'] > 19) & (mtcars['horsepower'] > 123)
]

# Create the labels we will pair with the above conditions
resultList = [
    'Efficient and Non-powerful',
    'Inefficient and Non-powerful',
    'Efficient and Powerful',
    'Inefficient and Powerful'
]

df = (
    mtcars
    .assign(
        # Run the numpy select
        classification=np.select(condlist=conditionList,
                                 choicelist=resultList,
                                 default='Not Considered')
    )
    # Convert from object to categorical
    .astype({'classification': 'category'})
)

"""
Be a more purposeful programmer/analyst/data scientist:

Using the default parameter in np.select() lets you fill in that
specific text wherever your criteria are not met. For example, if
you search this data, you will see there are a few rows where
horsepower is null. The original criteria we built do not consider
nulls, so those rows are populated with "Not Considered", allowing
you to find those values and correct them, or set checks for them
in a pipeline.
"""
```
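As a short follow-up, assuming the `df` built above, surfacing those fall-through rows is a one-liner:

```python
# Rows that matched none of the conditions (e.g. null horsepower)
flagged = df[df['classification'] == 'Not Considered']
print(flagged[['mpg', 'horsepower', 'classification']])
```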



## R

We will create a categorical variable in two ways, first using `case_when()` from the **dplyr** package, and then using the faster `fcase()` from the **data.table** package.
54 changes: 54 additions & 0 deletions Machine_Learning/Nearest_Neighbor.md
@@ -158,6 +158,60 @@ if __name__ == "__main__":
main()
```

A *very* simple way to get a basic KNN running in Python is to leverage the knowledge of the many smart people who contribute to the scikit-learn library (`sklearn`), as it is a powerhouse of machine learning models, as well as other very useful tools like data splitting, model evaluation, and feature selection.

```python
# Import libraries
from seaborn import load_dataset
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a sample dataset
iris_df = load_dataset('iris')


# Quick and rough sketch comparing the petal features to species
sns.scatterplot(data=iris_df, x='petal_length', y='petal_width', hue='species')
plt.show()


# Quick and rough sketch comparing the sepal features to species
sns.scatterplot(data=iris_df, x='sepal_length', y='sepal_width', hue='species')
plt.show()


# Let's separate the data into X and Y (features and target)
X = iris_df.drop(columns='species')
Y = iris_df['species']


# Split the data into training and testing for model evaluation
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=.70, shuffle=True,
                                                    random_state=777)


# Iterate through different neighbor counts to find the best accuracy with N neighbors
accuracies = {}
for i in range(1, 15):
    clf = KNeighborsClassifier(n_neighbors=i)

    clf.fit(X=X_train, y=y_train)
    y_pred = clf.predict(X_test)

    accuracies[i] = accuracy_score(y_true=y_test, y_pred=y_pred)

# Pass the dict's keys and values as lists so seaborn treats them as vectors
sns.lineplot(x=list(accuracies.keys()), y=list(accuracies.values())).set_title('Accuracies by N-Neighbors')
plt.show()

# Looks like about 8 is the first best accuracy, so we'll go with that.
print(f"{accuracies[8]:.1%}")  # 100% accuracy for 8 neighbors

```
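Rather than eyeballing the plot, you can also pull the best neighbor count straight from the `accuracies` dict built above; a small sketch:

```python
# max() with key=accuracies.get returns the first (and, since the dict
# preserves insertion order, smallest) k achieving the highest accuracy
best_k = max(accuracies, key=accuracies.get)
print(f"Best k: {best_k} ({accuracies[best_k]:.1%} accuracy)")
```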



## R

The simplest way to perform KNN in R is with the package **class**. It has a KNN function that is rather user friendly and does not require you to do any distance computing, as it runs everything with Euclidean distance. For more advanced types of nearest-neighbor matching it is best to use the `matchit` function from the [**MatchIt** package](https://kosukeimai.github.io/MatchIt/reference/matchit.html). To verify results, this example also uses the `confusionMatrix` function from the package **caret**.
28 changes: 28 additions & 0 deletions Other/import_a_foreign_data_file.md
@@ -70,6 +70,34 @@ using XLSX
df = DataFrame(XLSX.readtable("filename.xlsx", "mysheet"))
```

## Python
You'll most often rely on pandas to read in data. Many other options exist, but the reason you're pulling in data is usually to work with it, transform it, and manipulate it, and pandas lends itself extremely well to that purpose. Sometimes you may have to work with much messier data from APIs, where you'll navigate through hierarchies of dictionaries using the `.keys()` method and selecting levels; that is handled on a case-by-case basis and impossible to cover fully here (though see the `json_normalize` sketch after the code block below). However, some of the most common formats will be covered: csv, excel (xlsx), and .RData files.

You, of course, always have the built-in `open()` function, but that can get much more complex.
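For a sense of why, here is a minimal sketch of reading a simple comma-separated file by hand (assuming a plain `filename.csv` with no quoted fields), re-implementing a small piece of what `pd.read_csv` handles for you:

```python
# Manual csv read with the built-in open(); no type inference,
# quoting rules, or missing-value handling -- pandas does all of that
with open('filename.csv') as f:
    header = f.readline().strip().split(',')
    rows = [line.strip().split(',') for line in f]
```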

```python
# Reading .RData files
import pyreadr

rds_data = pyreadr.read_r('sales_data.Rdata')  # The result is a dictionary of dataframes

# 'sales' is the name of the dataframe inside the file; if the stored object
# is unnamed (e.g. an .Rds file), you may have to use None as the key (no quotes)
df_r = rds_data['sales']
df_r.head()


# Other common file reads all use pandas. The two most common are shown (csv/xlsx)
import pandas as pd

csv_file = pd.read_csv('filename.csv')
xlsx_file = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

# pandas can also read HTML, JSON, and more

```
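And for the API case mentioned above, `pd.json_normalize` can flatten a hierarchy of dictionaries so you don't have to walk `.keys()` level by level; a sketch with a hypothetical nested response:

```python
import pandas as pd

# Hypothetical nested API response
response = {
    'store': 'North',
    'sales': [
        {'item': 'widget', 'detail': {'price': 9.99, 'qty': 3}},
        {'item': 'gadget', 'detail': {'price': 24.50, 'qty': 1}},
    ],
}

# record_path picks the list of records; meta carries top-level fields along
df_api = pd.json_normalize(response, record_path='sales', meta=['store'])
print(df_api)  # columns: item, detail.price, detail.qty, store
```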



## R

```r?skip=true&skipReason=files_dont_exist