Commit

Deploying to master from @ c2ffb9b 🚀
NickCH-K committed Aug 19, 2024
1 parent 242eba4 commit f0f7f05
Showing 6 changed files with 438 additions and 282 deletions.
24 changes: 24 additions & 0 deletions Data_Manipulation/creating_a_variable_with_group_calculations.html
@@ -329,6 +329,30 @@ <h2 id="python">
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">storms</span><span class="p">[</span><span class="s">'mean_wind'</span><span class="p">]</span> <span class="o">=</span> <span class="n">storms</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'name'</span><span class="p">,</span><span class="s">'year'</span><span class="p">,</span><span class="s">'month'</span><span class="p">,</span><span class="s">'day'</span><span class="p">])[</span><span class="s">'wind'</span><span class="p">].</span><span class="n">transform</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">)</span>

</code></pre></div></div>
<p>The approach above works, but it can be hard to read. An alternative that is easier on the eyes (and brain!) is pandas’ <code class="language-plaintext highlighter-rouge">aggregate()</code> method with named aggregation. After grouping, each new column is written in the simple format <code class="language-plaintext highlighter-rouge">new_column_name = ('old_column', 'agg_funct')</code>. Note that, unlike <code class="language-plaintext highlighter-rouge">transform()</code>, <code class="language-plaintext highlighter-rouge">aggregate()</code> returns one row per group rather than one value per original row. So, for example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="c1"># Pull in data on storms
</span><span class="n">storms</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv'</span><span class="p">)</span>

<span class="c1"># Use groupby and group the columns and perform group calculations
</span>
<span class="c1"># The below calculations aren't particularly indicative of a good analysis,
# but give a quick look at a few of the calculations you can do
</span><span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">storms</span>
    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'year'</span><span class="p">,</span> <span class="s">'month'</span><span class="p">,</span> <span class="s">'day'</span><span class="p">])</span> <span class="c1">#group
</span>    <span class="p">.</span><span class="n">aggregate</span><span class="p">(</span>
        <span class="n">avg_wind</span> <span class="o">=</span> <span class="p">(</span><span class="s">'wind'</span><span class="p">,</span> <span class="s">'mean'</span><span class="p">),</span>
        <span class="n">max_wind</span> <span class="o">=</span> <span class="p">(</span><span class="s">'wind'</span><span class="p">,</span> <span class="s">'max'</span><span class="p">),</span>
        <span class="n">med_wind</span> <span class="o">=</span> <span class="p">(</span><span class="s">'wind'</span><span class="p">,</span> <span class="s">'median'</span><span class="p">),</span>
        <span class="n">std_pressure</span> <span class="o">=</span> <span class="p">(</span><span class="s">'pressure'</span><span class="p">,</span> <span class="s">'std'</span><span class="p">),</span>
        <span class="n">first_year</span> <span class="o">=</span> <span class="p">(</span><span class="s">'year'</span><span class="p">,</span> <span class="s">'first'</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span> <span class="c1"># Somewhat similar to ungroup. Removes the grouping from the index
</span><span class="p">)</span>

</code></pre></div></div>
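<p>As a minimal sketch of how the two approaches differ, the toy example below shows that <code class="language-plaintext highlighter-rouge">transform()</code> returns one value per original row, while <code class="language-plaintext highlighter-rouge">aggregate()</code> returns one row per group, which can then be merged back if a row-level variable is needed. The <code class="language-plaintext highlighter-rouge">toy</code> data frame and its values are made up for illustration and are not taken from the storms data.</p>

```python
import pandas as pd

# Hypothetical toy data (not the storms data set)
toy = pd.DataFrame({
    "name": ["Amy", "Amy", "Bob", "Bob", "Bob"],
    "wind": [25, 35, 40, 50, 60],
})

# transform(): the group mean is repeated on every original row
toy["mean_wind"] = toy.groupby("name")["wind"].transform("mean")

# aggregate() with named aggregation: one row per group
summary = (
    toy
    .groupby("name")
    .aggregate(
        avg_wind=("wind", "mean"),
        max_wind=("wind", "max"),
    )
    .reset_index()
)

# If the group-level results are needed on every row, merge them back
merged = toy.merge(summary, on="name", how="left")

print(toy)
print(summary)
print(merged)
```

<p>Which form to prefer depends on the goal: <code class="language-plaintext highlighter-rouge">transform()</code> is the direct route to a new variable on the original data, while <code class="language-plaintext highlighter-rouge">aggregate()</code> is convenient when several group summaries are wanted at once.</p>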
<h2 id="r">
<a href="#r" aria-labelledby="r" class="anchor-heading"><svg viewBox="0 0 16 16" aria-hidden="true"><use xlink:href="#svg-link"></use></svg></a> R
53 changes: 53 additions & 0 deletions Data_Manipulation/creating_categorical_variables.html
@@ -262,6 +262,59 @@ <h2 id="python">
<span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>There’s quite a bit to unpack here! <code class="language-plaintext highlighter-rouge">.apply(lambda x: ..., axis=1)</code> applies a lambda function to each row of the dataframe, with individual columns accessed by, for example, <code class="language-plaintext highlighter-rouge">x['mpg']</code>. (With <code class="language-plaintext highlighter-rouge">axis=0</code>, the function is applied to each column instead.) The built-in <code class="language-plaintext highlighter-rouge">next</code> function returns the next item from an iterator, so in this case it returns the name of the first condition that evaluates to true. Finally, <code class="language-plaintext highlighter-rouge">key for key, value in conds_dict.items() if value(x)</code> iterates over the pairs in the dictionary and yields only the condition names (the ‘keys’ in the dictionary) for conditions (the ‘values’ in the dictionary) that evaluate to true.</p>
<p>Once again, just like R, Python has <em>many</em> ways of doing the same thing. Some are more complex but run faster, while others are slightly slower but much easier to read and follow because their syntax is closer to natural language. For this example, we will use numpy and pandas together to get both an efficient runtime and a relatively simple syntax.</p>
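<p>To see the <code class="language-plaintext highlighter-rouge">next</code> pattern in isolation, here is a small sketch in plain Python. The names <code class="language-plaintext highlighter-rouge">conds_demo</code> and <code class="language-plaintext highlighter-rouge">first_match</code>, the thresholds, and the sample rows are hypothetical and only serve the illustration.</p>

```python
# Hypothetical conditions; dictionaries preserve insertion order,
# so the conditions are checked from top to bottom.
conds_demo = {
    "low mpg": lambda row: row["mpg"] <= 19,
    "high mpg": lambda row: row["mpg"] > 19,
}


def first_match(row, conds, default=None):
    """Return the name of the first condition that is true for `row`."""
    # The generator expression lazily yields the names of matching
    # conditions; next() takes the first one, or `default` if none match.
    return next((key for key, cond in conds.items() if cond(row)), default)


# Made-up rows; the last one has a missing value, so no condition is true
rows = [{"mpg": 15.0}, {"mpg": 30.0}, {"mpg": float("nan")}]
labels = [first_match(row, conds_demo, default="Not Considered") for row in rows]
print(labels)
```

<p>Without a default, <code class="language-plaintext highlighter-rouge">next</code> raises <code class="language-plaintext highlighter-rouge">StopIteration</code> when no condition matches, so supplying one is a reasonable safeguard when some rows may fall outside every condition.</p>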
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">seaborn</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">mtcars</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s">'mpg'</span><span class="p">)</span>

<span class="c1"># Create our list of index selections
</span><span class="n">conditionList</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'mpg'</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mi">19</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'horsepower'</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mi">123</span><span class="p">),</span>
    <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'mpg'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">19</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'horsepower'</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mi">123</span><span class="p">),</span>
    <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'mpg'</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mi">19</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'horsepower'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">123</span><span class="p">),</span>
    <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'mpg'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">19</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">mtcars</span><span class="p">[</span><span class="s">'horsepower'</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">123</span><span class="p">)</span>
<span class="p">]</span>

<span class="c1"># Create the results we will pair with the above index selections
</span><span class="n">resultList</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">'Inefficient and Non-powerful'</span><span class="p">,</span>
    <span class="s">'Efficient and Non-powerful'</span><span class="p">,</span>
    <span class="s">'Inefficient and Powerful'</span><span class="p">,</span>
    <span class="s">'Efficient and Powerful'</span>
<span class="p">]</span>


<span class="n">df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">mtcars</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span>
        <span class="c1"># Run the numpy select
</span>        <span class="n">classification</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">condlist</span><span class="o">=</span><span class="n">conditionList</span><span class="p">,</span>
                                   <span class="n">choicelist</span><span class="o">=</span><span class="n">resultList</span><span class="p">,</span>
                                   <span class="n">default</span><span class="o">=</span><span class="s">'Not Considered'</span>
                                   <span class="p">)</span>
    <span class="p">)</span>
    <span class="c1"># Convert from object to categorical
</span>    <span class="p">.</span><span class="n">astype</span><span class="p">({</span><span class="s">'classification'</span> <span class="p">:</span><span class="s">'category'</span><span class="p">})</span>
<span class="p">)</span>



<span class="s">"""
Be a more purposeful programmer/analyst/data scientist:

Using the default parameter in np.select() fills in that specific
text wherever none of your criteria are met. For example, if you
search this data, you will see there are a few rows where
horsepower is null. The criteria we built do not account for nulls,
so those rows are populated with "Not Considered", allowing you to find
those values and correct them, or set checks for them in a pipeline.

"""</span>

</code></pre></div></div>
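<p>The role of <code class="language-plaintext highlighter-rouge">default</code> can be checked on a tiny example. The sketch below uses a made-up three-row data frame (the name <code class="language-plaintext highlighter-rouge">cars</code> and its values are hypothetical, not the seaborn data) in which one row has a missing horsepower. Comparisons against a missing value are false, so that row matches no condition and receives the default label.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
cars = pd.DataFrame({
    "mpg": [15.0, 30.0, 25.0],
    "horsepower": [150.0, 70.0, np.nan],
})

# Only two conditions are needed for this illustration
conditions = [
    (cars["mpg"] > 19) & (cars["horsepower"] <= 123),
    (cars["mpg"] <= 19) & (cars["horsepower"] > 123),
]
labels = ["Efficient and Non-powerful", "Inefficient and Powerful"]

# Rows that satisfy none of the conditions get the default label
cars["classification"] = np.select(conditions, labels, default="Not Considered")

# Pull out the rows that fell through so they can be inspected or fixed
unclassified = cars[cars["classification"] == "Not Considered"]
print(cars)
print(unclassified)
```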
<h2 id="r">
<a href="#r" aria-labelledby="r" class="anchor-heading"><svg viewBox="0 0 16 16" aria-hidden="true"><use xlink:href="#svg-link"></use></svg></a> R
</h2>
49 changes: 49 additions & 0 deletions Machine_Learning/Nearest_Neighbor.html
@@ -394,6 +394,55 @@ <h2 id="python">

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<p>A <em>very</em> simple way to get a basic KNN running in Python is to leverage the work of the many smart people who contribute to the scikit-learn library (sklearn). It is a powerhouse of machine learning models, and it also provides other very useful tools for data splitting, model evaluation, and feature selection.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Import Libraries
</span><span class="kn">from</span> <span class="nn">seaborn</span> <span class="kn">import</span> <span class="n">load_dataset</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsClassifier</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span>

<span class="c1"># Load a sample dataset
</span><span class="n">iris_df</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s">'iris'</span><span class="p">)</span>


<span class="c1"># Quick and rough sketch comparing the petal feature to species
</span><span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">iris_df</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">'petal_length'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'petal_width'</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="s">'species'</span><span class="p">)</span>


<span class="c1"># Quick and rough sketch comparing the sepals feature to species
</span><span class="n">sns</span><span class="p">.</span><span class="n">scatterplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">iris_df</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">'sepal_length'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'sepal_width'</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="s">'species'</span><span class="p">)</span>



<span class="c1"># Let's separate the data into X and Y (features and target)
</span><span class="n">X</span> <span class="o">=</span> <span class="n">iris_df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="s">'species'</span><span class="p">)</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">iris_df</span><span class="p">[</span><span class="s">'species'</span><span class="p">]</span>


<span class="c1"># Split the data into training and testing for model evaluations
</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">train_size</span><span class="o">=</span><span class="p">.</span><span class="mi">70</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">777</span><span class="p">)</span>


<span class="c1"># Iterate through different neighbors to find the best accuracy with N neighbors.
</span><span class="n">accuracies</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">errors</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">15</span><span class="p">):</span>
    <span class="n">clf</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="n">i</span><span class="p">)</span>

    <span class="n">clf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="o">=</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">y_train</span><span class="p">)</span>
    <span class="n">y_pred</span> <span class="o">=</span> <span class="n">clf</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>

    <span class="n">accu_score</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">y_pred</span><span class="p">)</span>
    <span class="n">accuracies</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">accu_score</span>

<span class="n">sns</span><span class="p">.</span><span class="n">lineplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">accuracies</span><span class="p">.</span><span class="n">keys</span><span class="p">()),</span> <span class="n">y</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">accuracies</span><span class="p">.</span><span class="n">values</span><span class="p">())).</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Accuracies by N-Neighbors'</span><span class="p">)</span>

<span class="c1"># Looks like about 8 is the first best accuracy, so we'll go with that.
</span><span class="k">print</span><span class="p">(</span><span class="s">f"</span><span class="si">{</span><span class="n">accuracies</span><span class="p">[</span><span class="mi">8</span><span class="p">]:.</span><span class="mi">1</span><span class="o">%</span><span class="si">}</span><span class="s">"</span><span class="p">)</span> <span class="c1">#100% accuracy for 8 neighbors.
</span>
</code></pre></div></div>
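<p>Rather than reading the best number of neighbors off the plot, it can also be picked programmatically. The following is a rough sketch under a few assumptions: it loads iris from scikit-learn's bundled copy (<code class="language-plaintext highlighter-rouge">load_iris</code>) instead of seaborn so that it runs without a download, and it uses the classifier's <code class="language-plaintext highlighter-rouge">score</code> method, which reports mean accuracy. The names <code class="language-plaintext highlighter-rouge">best_k</code> and <code class="language-plaintext highlighter-rouge">best_accuracy</code> are illustrative.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled iris data (no download needed)
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, shuffle=True, random_state=777
)

# Fit one model per candidate k and record its test-set accuracy
accuracies = {}
for k in range(1, 15):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    accuracies[k] = clf.score(X_test, y_test)  # mean accuracy

# Choose the smallest k that reaches the highest accuracy
best_accuracy = max(accuracies.values())
best_k = min(k for k, acc in accuracies.items() if acc == best_accuracy)

print(f"Best k: {best_k} with accuracy {best_accuracy:.1%}")
```

<p>Accuracy measured on a single train/test split depends on the <code class="language-plaintext highlighter-rouge">random_state</code>, so the chosen value should be treated as a starting point rather than a definitive answer.</p>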
<h2 id="r">
<a href="#r" aria-labelledby="r" class="anchor-heading"><svg viewBox="0 0 16 16" aria-hidden="true"><use xlink:href="#svg-link"></use></svg></a> R