diff --git a/README.md b/README.md index fb3cfec..40437ef 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,8 @@ jupyter notebook ,numpy,pandas,matplotlib - [Series indexing and basic operations](datahandling/22-SerieIndexAndOperation/22-seriesIndexAndOperation.ipynb) - pandas - [DataFrame creation, basic attributes, and index slicing](datahandling/23-PandasDataframeBasic/dataframeBasic.ipynb) - - [DataFrame methods and indexing techniques] + - [DataFrame methods and indexing techniques](datahandling/24-PandasDataframeMethodAndIndex/dataframeMethodAndIndex.ipynb) + - [DataFrame statistical and logical operations](datahandling/25-PandasDataframeStatAndLogic/dataframeStatAndLogic.ipynb) diff --git a/datahandling/24-PandasDataframeMethodAndIndex/dataframeMethodAndIndex.ipynb b/datahandling/24-PandasDataframeMethodAndIndex/dataframeMethodAndIndex.ipynb new file mode 100644 index 0000000..2a750db --- /dev/null +++ b/datahandling/24-PandasDataframeMethodAndIndex/dataframeMethodAndIndex.ipynb @@ -0,0 +1,652 @@ +{ + "cells": [ + { + "cell_type": "code", + "id": "initial_id", + "metadata": { + "collapsed": true, + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.107478Z", + "start_time": "2024-08-29T06:59:13.739479Z" + } + }, + "source": [ + "import pandas as pd \n", + "res=pd.read_csv(\"../data/score.csv\")\n", + "print(res)  # columns: 姓名 (name), 语文 (Chinese), 数学 (math), 英语 (English), 物理 (physics), 生物 (biology), 化学 (chemistry)" + ], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n" + ] + } + ], + "execution_count": 1 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### describe()\n", + "This method returns summary statistics for each numeric column." + ], + "id": "dd9e6aed7e050fa4" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.123811Z", + "start_time": "2024-08-29T06:59:14.109152Z" + } + }, + "cell_type": "code", + "source": "print(res.describe())", + "id": "87bd596957ed9cbf", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 语文 数学 英语 物理 生物 化学\n", + "count 4.000000 
4.000000 4.000000 4.00000 4.000000 4.000000\n", + "mean 80.750000 72.250000 69.000000 68.75000 69.500000 71.000000\n", + "std 15.840349 21.700614 15.383974 22.14159 5.744563 9.416298\n", + "min 59.000000 46.000000 48.000000 39.00000 62.000000 59.000000\n", + "25% 74.000000 61.750000 62.250000 59.25000 68.000000 65.750000\n", + "50% 85.500000 72.500000 73.500000 73.00000 70.000000 73.000000\n", + "75% 92.250000 83.000000 80.250000 82.50000 71.500000 78.250000\n", + "max 93.000000 98.000000 81.000000 90.00000 76.000000 79.000000\n" + ] + } + ], + "execution_count": 2 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### The info() method \n", + "\n", + "Method: info(). It gives a quick overview of a DataFrame's key properties: the index range, column names, non-null counts, dtypes, and memory usage." + ], + "id": "31927393fa12d6d2" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.134038Z", + "start_time": "2024-08-29T06:59:14.125315Z" + } + }, + "cell_type": "code", + "source": "print(res.info())", + "id": "9247710eb6546a0", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 4 entries, 0 to 3\n", + "Data columns (total 7 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 姓名 4 non-null object \n", + " 1 语文 4 non-null int64 \n", + " 2 数学 4 non-null float64\n", + " 3 英语 4 non-null int64 \n", + " 4 物理 4 non-null int64 \n", + " 5 生物 4 non-null int64 \n", + " 6 化学 4 non-null int64 \n", + "dtypes: float64(1), int64(5), object(1)\n", + "memory usage: 352.0+ bytes\n", + "None\n" + ] + } + ], + "execution_count": 3 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### head() and tail()", + "id": "731c472f9b75813f" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.144410Z", + "start_time": "2024-08-29T06:59:14.137731Z" + } + }, + "cell_type": "code", + "source": [ + "print(res.head(2))  # show the first two rows\n", + "print(res.tail(2))  # show the last two rows" + ], + "id": "5aabc31328fe6d16", + "outputs": [ + { + "name": "stdout", 
+ "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + " 姓名 语文 数学 英语 物理 生物 化学\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n" + ] + } + ], + "execution_count": 4 + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.152171Z", + "start_time": "2024-08-29T06:59:14.148075Z" + } + }, + "cell_type": "code", + "source": "res2=pd.read_csv(\"../data/score1.csv\")", + "id": "3d38ad90595dae5c", + "outputs": [], + "execution_count": 5 + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.157839Z", + "start_time": "2024-08-29T06:59:14.153255Z" + } + }, + "cell_type": "code", + "source": "print(res2)", + "id": "c97149859ac76964", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 貂蝉 59 46.0 67 90 76 79\n", + "1 刘备 98 32.0 43 35 40 70\n", + "2 曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 6 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Combining data\n", + "concat()\n", + "\n", + "- Method: concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)\n", + "\n", + "Parameters: axis selects the axis to concatenate along. The default, axis=0, stacks the objects vertically (rows are appended and columns are aligned by name); axis=1 places them side by side (columns are appended and rows are aligned by index). join specifies how the other axis is handled: 'outer' (the default) keeps the union of labels and 'inner' keeps the intersection. The remaining parameters are used less often; consult the documentation when you need them.\n", + "\n" + ], + "id": "1e6ff6c58617a292" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.163825Z", + "start_time": "2024-08-29T06:59:14.159324Z" + } + }, + "cell_type": "code", + "source": [ + "Score_all=pd.concat([res,res2],ignore_index=True,axis=0) \n", + "print(Score_all)" + ], + "id": "6a61052c4b230bc9", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n", + "4 貂蝉 59 46.0 67 90 76 79\n", + "5 刘备 98 32.0 43 35 40 
70\n", + "6 曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 7 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### Data cleaning methods", + "id": "5a8d566a6bf5cb31" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "#### Removing duplicates: drop_duplicates()", + "id": "6013058cd0c046bb" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "The drop_duplicates method removes duplicate rows from a DataFrame. Duplicate data increases processing time and cost, takes up extra storage, and can reduce model accuracy: a model trained on such data may predict well in some cases and poorly in others, which hurts generalization. That is why we deduplicate the data.\n", + "\n", + "Method: drop_duplicates(subset, keep, inplace)\n", + "\n", + "\n", + "Parameters: subset is the column or columns used to identify duplicates (all columns by default); keep takes one of three values: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops all duplicates; inplace controls whether the original data is modified. By default keep is 'first' and inplace is False." + ], + "id": "2a7d6049dc2aefc1" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.173627Z", + "start_time": "2024-08-29T06:59:14.165048Z" + } + }, + "cell_type": "code", + "source": [ + "Score_all.drop_duplicates(inplace=False)  # returns a new DataFrame; Score_all itself is unchanged\n", + "print(res)\n", + "print(res2)" + ], + "id": "2f35dcee2aabb15b", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n", + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 貂蝉 59 46.0 67 90 76 79\n", + "1 刘备 98 32.0 43 35 40 70\n", + "2 曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 8 + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.181919Z", + "start_time": "2024-08-29T06:59:14.175885Z" + } + }, + "cell_type": "code", + "source": "print(Score_all)", + "id": "668042c91265cb2", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n", + "4 貂蝉 59 46.0 67 90 76 79\n", + "5 刘备 98 32.0 43 35 40 70\n", + "6 曹操 59 98.0 98 
100 76 32\n" + ] + } + ], + "execution_count": 9 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### Checking for missing values: isna()", + "id": "4347c6a685e3b89" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.196444Z", + "start_time": "2024-08-29T06:59:14.186355Z" + } + }, + "cell_type": "code", + "source": "print(Score_all.isna())\n", + "id": "ef3f17fab7d59c34", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 False False False False False False False\n", + "1 False False False False False False False\n", + "2 False False False False False False False\n", + "3 False False False False False False False\n", + "4 False False False False False False False\n", + "5 False False False False False False False\n", + "6 False False False False False False False\n" + ] + } + ], + "execution_count": 10 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### any(): checks whether each column contains at least one True", + "id": "164beac9edbf335e" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.204111Z", + "start_time": "2024-08-29T06:59:14.198389Z" + } + }, + "cell_type": "code", + "source": [ + "Score_bool=Score_all.isna() \n", + "print(Score_bool.any())" + ], + "id": "4ad23a7700b2e677", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "姓名 False\n", + "语文 False\n", + "数学 False\n", + "英语 False\n", + "物理 False\n", + "生物 False\n", + "化学 False\n", + "dtype: bool\n" + ] + } + ], + "execution_count": 11 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Handling missing values\n", + "Either drop them with dropna() or fill them with fillna()." + ], + "id": "aec16abadad8e40d" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "#### Dropping data with missing values: dropna()\n", + "Method: dropna(axis=, how=, inplace=)\n", + "\n", + "Parameters: axis chooses whether rows or columns are dropped: axis=0 drops rows and axis=1 drops columns. With how='any', a row or column is dropped if it contains at least one missing value; with how='all', it is dropped only if every value is missing. inplace controls whether the original object is modified." + ], + "id": 
"eec3c54c53fe9c75" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.219995Z", + "start_time": "2024-08-29T06:59:14.208802Z" + } + }, + "cell_type": "code", + "source": "print(Score_all.dropna(axis=0,how='any'))\n", + "id": "dc9295349b3d0cf1", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n", + "4 貂蝉 59 46.0 67 90 76 79\n", + "5 刘备 98 32.0 43 35 40 70\n", + "6 曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 12 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "#### Filling missing values: fillna()\n", + "Method: fillna(value=None, method=None, axis=None, inplace=False, limit=None)\n", + "\n", + "- value: the value, or a dict of per-column values, used to fill missing entries.\n", + "- method: the fill method, either ffill (forward fill: propagate the last valid value) or bfill (backward fill: use the next valid value); defaults to None. Recent pandas releases deprecate this parameter in favor of the ffill() and bfill() methods.\n", + "- axis: the axis along which to fill, either 0 (the default, fill down each column) or 1 (fill across each row).\n", + "- inplace: whether to fill the original DataFrame in place; defaults to False.\n", + "- limit: the maximum number of consecutive missing values to forward or backward fill.\n" + ], + "id": "3c5f4291a64a7bfd" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.228664Z", + "start_time": "2024-08-29T06:59:14.222300Z" + } + }, + "cell_type": "code", + "source": [ + "Score_all.fillna(axis=0,value={'数学':80},inplace=True) \n", + "print(Score_all)" + ], + "id": "616e99acdee2c1e2", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 张飞 93 67.0 80 66 70 78\n", + "1 关羽 92 78.0 81 80 62 59\n", + "2 赵云 79 98.0 48 39 70 68\n", + "3 貂蝉 59 46.0 67 90 76 79\n", + "4 貂蝉 59 46.0 67 90 76 79\n", + "5 刘备 98 32.0 43 35 40 70\n", + "6 曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 13 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Sorting by values\n", + "Method: sort_values(by, ascending, axis, inplace) sorts the data by the values in the column or columns you name.\n", + "\n", + "- by: the column or columns to sort by.\n", + "- ascending: sort in ascending or descending order; ascending (smallest at the top) by default.\n", + "- axis: 0 sorts the rows (by names column labels), 1 sorts the columns (by names row labels); defaults to 0.\n", + 
"- inplace: whether to modify the original DataFrame or return a new one." + ], + "id": "283af6f8fcc2c892" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.237354Z", + "start_time": "2024-08-29T06:59:14.232894Z" + } + }, + "cell_type": "code", + "source": [ + "Score_all.sort_values(by='语文',inplace=True,ascending=False,ignore_index=True) \n", + "print(Score_all)" + ], + "id": "67850f57bccf797a", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 姓名 语文 数学 英语 物理 生物 化学\n", + "0 刘备 98 32.0 43 35 40 70\n", + "1 张飞 93 67.0 80 66 70 78\n", + "2 关羽 92 78.0 81 80 62 59\n", + "3 赵云 79 98.0 48 39 70 68\n", + "4 貂蝉 59 46.0 67 90 76 79\n", + "5 貂蝉 59 46.0 67 90 76 79\n", + "6 曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 14 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## Setting the index\n", + "\n", + "Method: set_index(keys, drop=True)\n", + "\n", + "Parameters: keys is a column label or a list of column labels; drop, True by default, removes those columns from the data once they become the index.\n", + "\n" + ], + "id": "c42f3e64d867aeee" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.244858Z", + "start_time": "2024-08-29T06:59:14.238726Z" + } + }, + "cell_type": "code", + "source": [ + "Score_all.set_index('姓名',drop=True,inplace=True) \n", + "print(Score_all)" + ], + "id": "448d2be6a755f5d2", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 语文 数学 英语 物理 生物 化学\n", + "姓名 \n", + "刘备 98 32.0 43 35 40 70\n", + "张飞 93 67.0 80 66 70 78\n", + "关羽 92 78.0 81 80 62 59\n", + "赵云 79 98.0 48 39 70 68\n", + "貂蝉 59 46.0 67 90 76 79\n", + "貂蝉 59 46.0 67 90 76 79\n", + "曹操 59 98.0 98 100 76 32\n" + ] + } + ], + "execution_count": 15 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## Dropping rows or columns\n", + "Method: drop(labels, axis)\n", + "\n", + "Parameters: labels is the row or column label (or list of labels) to remove; axis says where to look for those labels: axis=0 drops rows and axis=1 drops columns.\n", + "\n" + ], + "id": "69f2b3405b2ab058" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.254858Z", + 
"start_time": "2024-08-29T06:59:14.246340Z" + } + }, + "cell_type": "code", + "source": [ + "Score_all.drop('曹操',axis=0,inplace=True) \n", + "print(Score_all)" + ], + "id": "d997b17b70325211", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 语文 数学 英语 物理 生物 化学\n", + "姓名 \n", + "刘备 98 32.0 43 35 40 70\n", + "张飞 93 67.0 80 66 70 78\n", + "关羽 92 78.0 81 80 62 59\n", + "赵云 79 98.0 48 39 70 68\n", + "貂蝉 59 46.0 67 90 76 79\n", + "貂蝉 59 46.0 67 90 76 79\n" + ] + } + ], + "execution_count": 16 + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:14.257264Z", + "start_time": "2024-08-29T06:59:14.255732Z" + } + }, + "cell_type": "code", + "source": "", + "id": "b507e34d063a32be", + "outputs": [], + "execution_count": 16 + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/datahandling/25-PandasDataframeStatAndLogic/dataframeStatAndLogic.ipynb b/datahandling/25-PandasDataframeStatAndLogic/dataframeStatAndLogic.ipynb new file mode 100644 index 0000000..3cae58d --- /dev/null +++ b/datahandling/25-PandasDataframeStatAndLogic/dataframeStatAndLogic.ipynb @@ -0,0 +1,575 @@ +{ + "cells": [ + { + "metadata": {}, + "cell_type": "markdown", + "source": "# DataFrame statistical and logical operations", + "id": "4e4c4883eb5ff523" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "## Statistical operations", + "id": "3efc9e6e27ccdf4e" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### mean(): the mean\n", + "\n", + "Method: mean(axis). axis is the direction of the computation: axis=0 computes one statistic per column and axis=1 computes one per row." + ], + "id": "dab0241d16c5c4cc" + }, + { + "metadata": { + "ExecuteTime": { + 
"end_time": "2024-08-29T06:59:27.496350Z", + "start_time": "2024-08-29T06:59:27.136947Z" + } + }, + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "Score=pd.read_csv(\"../data/score.csv\")\n", + "Score.set_index(['姓名'],inplace=True)\n", + "print(Score.mean(axis=0))" + ], + "id": "6c5e258f48aa45fd", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "语文 80.75\n", + "数学 72.25\n", + "英语 69.00\n", + "物理 68.75\n", + "生物 69.50\n", + "化学 71.00\n", + "dtype: float64\n" + ] + } + ], + "execution_count": 1 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### max(): the maximum\n", + "\n", + "Method: max(axis). axis is the direction of the computation: axis=0 computes one statistic per column and axis=1 computes one per row.\n", + "\n" + ], + "id": "3e88749b743dc492" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.502413Z", + "start_time": "2024-08-29T06:59:27.498287Z" + } + }, + "cell_type": "code", + "source": "print(Score.max(axis=0))\n", + "id": "f26b026ce552488e", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "语文 93.0\n", + "数学 98.0\n", + "英语 81.0\n", + "物理 90.0\n", + "生物 76.0\n", + "化学 79.0\n", + "dtype: float64\n" + ] + } + ], + "execution_count": 2 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### min(): the minimum\n", + "id": "fd03736ebcb159cf" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.507292Z", + "start_time": "2024-08-29T06:59:27.503740Z" + } + }, + "cell_type": "code", + "source": "print(Score.min(axis=0))\n", + "id": "1ad797f5db73f4ef", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "语文 59.0\n", + "数学 46.0\n", + "英语 48.0\n", + "物理 39.0\n", + "生物 62.0\n", + "化学 59.0\n", + "dtype: float64\n" + ] + } + ], + "execution_count": 3 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### var(): the variance, and std(): the standard deviation\n", + "\n" + ], + "id": "944409becf1b785c" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": 
"2024-08-29T06:59:27.512798Z", + "start_time": "2024-08-29T06:59:27.508549Z" + } + }, + "cell_type": "code", + "source": [ + "print(Score.var(axis=0)) \n", + "print(Score.std(axis=0))" + ], + "id": "315fb3c080d6cec7", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "语文 250.916667\n", + "数学 470.916667\n", + "英语 236.666667\n", + "物理 490.250000\n", + "生物 33.000000\n", + "化学 88.666667\n", + "dtype: float64\n", + "语文 15.840349\n", + "数学 21.700614\n", + "英语 15.383974\n", + "物理 22.141590\n", + "生物 5.744563\n", + "化学 9.416298\n", + "dtype: float64\n" + ] + } + ], + "execution_count": 4 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### cov(): the covariance between two sets of data\n", + "\n", + "Covariance is a statistic describing the linear relationship between two variables: it measures how much they vary together. A positive covariance means the two variables tend to move in the same direction, a negative one means they tend to move in opposite directions, and a larger absolute value indicates stronger joint variation (although the magnitude depends on the units of the data).\n", + "\n" + ], + "id": "e0c14f75c51d59b2" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.518664Z", + "start_time": "2024-08-29T06:59:27.515041Z" + } + }, + "cell_type": "code", + "source": "print(Score['数学'].cov(Score['物理']))\n", + "id": "c64ca702e9b8a481", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-414.91666666666663\n" + ] + } + ], + "execution_count": 5 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### nunique(): counts the distinct values\n", + "id": "299c6db0b114108d" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.522621Z", + "start_time": "2024-08-29T06:59:27.519995Z" + } + }, + "cell_type": "code", + "source": "print(Score['数学'].nunique())\n", + "id": "3f596af192a66ed7", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "4\n" + ] + } + ], + "execution_count": 6 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### value_counts(): counts the occurrences of each value\n", + "\n", + "value_counts(values,sort=True,ascending=False,normalize=False,bins=None,dropna=True)\n", + "\n", + "- sort=True: whether to sort the result; sorted by default.\n", + "- 
ascending=False: descending order by default.\n", + "- normalize=False: whether to return relative frequencies (proportions) instead of counts; False by default.\n", + "- bins=None: optionally groups numeric values into custom intervals; disabled by default.\n", + "- dropna=True: whether to exclude missing values (NaN) from the counts; excluded by default.\n", + "\n" + ], + "id": "33b69be092a7072c" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.527410Z", + "start_time": "2024-08-29T06:59:27.523684Z" + } + }, + "cell_type": "code", + "source": "print(Score['数学'].value_counts())\n", + "id": "a2224dd7fb9a085c", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "数学\n", + "67.0 1\n", + "78.0 1\n", + "98.0 1\n", + "46.0 1\n", + "Name: count, dtype: int64\n" + ] + } + ], + "execution_count": 7 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### describe(): overall summary statistics\n", + "id": "945b8c69bef53dde" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.540883Z", + "start_time": "2024-08-29T06:59:27.528532Z" + } + }, + "cell_type": "code", + "source": "print(Score.describe())\n", + "id": "c67237ac425a4a4f", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 语文 数学 英语 物理 生物 化学\n", + "count 4.000000 4.000000 4.000000 4.00000 4.000000 4.000000\n", + "mean 80.750000 72.250000 69.000000 68.75000 69.500000 71.000000\n", + "std 15.840349 21.700614 15.383974 22.14159 5.744563 9.416298\n", + "min 59.000000 46.000000 48.000000 39.00000 62.000000 59.000000\n", + "25% 74.000000 61.750000 62.250000 59.25000 68.000000 65.750000\n", + "50% 85.500000 72.500000 73.500000 73.00000 70.000000 73.000000\n", + "75% 92.250000 83.000000 80.250000 82.50000 71.500000 78.250000\n", + "max 93.000000 98.000000 81.000000 90.00000 76.000000 79.000000\n" + ] + } + ], + "execution_count": 8 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "```text\n", + "df.count() # number of non-null elements\n", + "df.min() # minimum\n", + "df.max() # maximum\n", + "df.idxmin() # index label of the minimum\n", + "df.idxmax() # index label of the maximum\n", + "df.sum() # sum\n", + "df.mean() # mean\n", + "df.median() # median\n", + "df.mode() 
# mode\n", + "df.var() # variance\n", + "df.std() # standard deviation\n", + "df.mad() # mean absolute deviation (removed in pandas 2.0)\n", + "df.describe() # several descriptive statistics at once\n", + "df.abs() # absolute value\n", + "df.prod() # product of the elements\n", + "df.cumsum() # cumulative sum\n", + "```\n" + ], + "id": "670871e41d00bd57" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "## DataFrame logical operations\n", + "\n", + "DataFrame supports the following comparison and logical operators: <, >, ==, !=, <=, >=, |, &, ~." + ], + "id": "40df9cee66b792e" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "```text\n", + "Operator\tMeaning\n", + "< \tless than\n", + "> \tgreater than\n", + "==\tequal to\n", + "!=\tnot equal to\n", + "<=\tless than or equal to\n", + ">=\tgreater than or equal to\n", + "|\telement-wise OR\n", + "&\telement-wise AND\n", + "~\telement-wise NOT\n", + "```" + ], + "id": "9775802e721d9219" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.548279Z", + "start_time": "2024-08-29T06:59:27.542176Z" + } + }, + "cell_type": "code", + "source": [ + "print((Score[\"数学\"]<=50) & (Score[\"物理\"]<=50))\n", + "Score_shuwu = Score[(Score[\"数学\"]<=50) & (Score[\"物理\"]<=50)]\n", + "print(Score_shuwu)" + ], + "id": "13f1e3a3d2865d72", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "姓名\n", + "张飞 False\n", + "关羽 False\n", + "赵云 False\n", + "貂蝉 False\n", + "dtype: bool\n", + "Empty DataFrame\n", + "Columns: [语文, 数学, 英语, 物理, 生物, 化学]\n", + "Index: []\n" + ] + } + ], + "execution_count": 9 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### Logical functions\n", + "\n", + "pandas mainly provides three methods for logical filtering: query(), isin(), and between(). Note that between() is a Series method, so it is applied to a single column." + ], + "id": "d22b01d20c0ca2af" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.555601Z", + "start_time": "2024-08-29T06:59:27.549949Z" + } + }, + "cell_type": "code", + "source": [ + "# Find students who scored 50 or below in both 数学 (math) and 物理 (physics)\n", + "Score_shuwu = Score.query(\"数学<=50 & 物理<=50\")\n", + "print(Score_shuwu)" + ], + "id": "ca7af1770995c171", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Empty DataFrame\n", + "Columns: [语文, 数学, 英语, 物理, 生物, 化学]\n", + "Index: 
[]\n" + ] + } + ], + "execution_count": 10 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "### The isin() function", + "id": "b9f84275f4908906" + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": "isin() checks, element by element, whether each value in a DataFrame is among a given set of values.", + "id": "65e37f71d457131b" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T06:59:27.563288Z", + "start_time": "2024-08-29T06:59:27.557183Z" + } + }, + "cell_type": "code", + "source": [ + "Score_100 = Score.isin([100])\n", + "print(Score_100)" + ], + "id": "3f5a65c73094cd87", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 语文 数学 英语 物理 生物 化学\n", + "姓名 \n", + "张飞 False False False False False False\n", + "关羽 False False False False False False\n", + "赵云 False False False False False False\n", + "貂蝉 False False False False False False\n" + ] + } + ], + "execution_count": 11 + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T07:00:55.066491Z", + "start_time": "2024-08-29T07:00:55.060796Z" + } + }, + "cell_type": "code", + "source": [ + "Score_math = Score[\"数学\"].isin([98,80])\n", + "print(Score_math)\n", + "print(Score[Score_math])" + ], + "id": "76fcdcb5798b9084", + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "姓名\n", + "张飞 False\n", + "关羽 False\n", + "赵云 True\n", + "貂蝉 False\n", + "Name: 数学, dtype: bool\n", + " 语文 数学 英语 物理 生物 化学\n", + "姓名 \n", + "赵云 79 98.0 48 39 70 68\n" + ] + } + ], + "execution_count": 17 + }, + { + "metadata": {}, + "cell_type": "markdown", + "source": [ + "### The between() function\n", + "\n", + "between: the interval is closed on both ends by default." + ], + "id": "4c1c9f1c67e6bbd9" + }, + { + "metadata": { + "ExecuteTime": { + "end_time": "2024-08-29T07:02:31.026325Z", + "start_time": "2024-08-29T07:02:31.021805Z" + } + }, + "cell_type": "code", + "source": [ + "Score_shuwu = Score[Score[\"数学\"].between(0,50)&Score[\"物理\"].between(0,50)]\n", + "print(Score_shuwu)" + ], + "id": "4cf6a8edac472f7a", + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + "Empty DataFrame\n", + "Columns: [语文, 数学, 英语, 物理, 生物, 化学]\n", + "Index: []\n" + ] + } + ], + "execution_count": 20 + }, + { + "metadata": {}, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": "", + "id": "f63335aa71b6dd01" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/datahandling/data/score.csv b/datahandling/data/score.csv new file mode 100644 index 0000000..efea8f9 --- /dev/null +++ b/datahandling/data/score.csv @@ -0,0 +1,6 @@ +姓名,语文,数学,英语,物理,生物,化学 +张飞,93,67.0,80,66,70,78 +关羽,92,78.0,81,80,62,59 +赵云,79,98,48,39,70,68 +貂蝉,59,46.0,67,90,76,79 + diff --git a/datahandling/data/score1.csv b/datahandling/data/score1.csv new file mode 100644 index 0000000..c1f5bae --- /dev/null +++ b/datahandling/data/score1.csv @@ -0,0 +1,4 @@ +姓名,语文,数学,英语,物理,生物,化学 +貂蝉,59,46.0,67,90,76,79 +刘备,98,32,43,35,40,70 +曹操,59,98.0,98,100,76,32 \ No newline at end of file diff --git "a/machinelearning/04\351\200\273\350\276\221\345\233\236\345\275\222.md" "b/machinelearning/04\351\200\273\350\276\221\345\233\236\345\275\222.md" index 86fbccc..f46e83c 100644 --- "a/machinelearning/04\351\200\273\350\276\221\345\233\236\345\275\222.md" +++ "b/machinelearning/04\351\200\273\350\276\221\345\233\236\345\275\222.md" @@ -2,7 +2,7 @@ Logistic regression is a powerful tool for binary classification problems -Typical use cases: +### Use cases - Whether a disease test is positive - Whether a bank should approve a loan - Predicting ad click-through (clicked or not) @@ -11,7 +11,9 @@ -How logistic regression works +### How logistic regression works + + - What is the input to logistic regression? - The input to logistic regression is a linear equation - h(w) = w1x1 + w2x2 + .... 
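+ b @@ -32,3 +34,7 @@ The sigmoid function is differentiable and monotonically increasing. Derivative formula: f'(x) = f(x) (1-f(x)) + +Logistic regression makes its final classification from the probability that a sample belongs to a given class. By default that class is labeled 1 (the positive class) and the other class is labeled 0 (the negative class), which makes the loss convenient to compute. + +Linear regression measures its loss with mean squared error. In logistic regression, how should we measure the loss when a prediction is wrong?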
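
The closing question is commonly answered with the log-likelihood (cross-entropy) loss. The block below is a minimal sketch and is not part of the patch itself: the function names `sigmoid` and `log_loss`, the clipping constant `eps`, and the sample scores and labels are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Map a real-valued linear output h(w) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true, y_prob, eps=1e-12):
    """Average cross-entropy loss for binary labels (1 = positive, 0 = negative)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        # Only one of the two terms is non-zero for each sample.
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical linear outputs and their true labels.
scores = [2.0, -1.0, 0.5]
labels = [1, 0, 0]

probs = [sigmoid(z) for z in scores]
loss = log_loss(labels, probs)
print(probs)
print(loss)
```

With this loss, a confident but wrong prediction (for example a probability near 1 for a sample labeled 0) is penalized far more heavily than an uncertain one, which mean squared error does not capture as sharply for probabilities.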