xiao7462/python-for-data-analyse

Notes on problems encountered while learning Python

Contents
- Environment setup
- python_learning_note
- numpy
- pandas
- matplotlib-seaborn
- Web scraping

Environment setup




python_learning_note

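  1. String formatting

Regular expressions on strings

Functions

  1. The difference between referencing a function with and without parentheses
  2. Default parameters of functions
  3. enumerate() -- takes an iterable and yields each element together with its index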
>>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
>>> list(enumerate(seasons, start=1))       # indices start at 1
[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]

>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print(i, element)
... 
0 one
1 two
2 three
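
The first two items in the list above are only named, so here is a minimal sketch of both. The functions `greet` and `append_item` are hypothetical names made up for illustration.

```python
def greet(name, greeting="Hello"):
    """Return a greeting; `greeting` falls back to its default when omitted."""
    return f"{greeting}, {name}!"

# Without parentheses: a reference to the function object itself.
say = greet
# With parentheses: the function is called and its return value is used.
default_result = say("Ann")
custom_result = say("Ann", greeting="Hi")

def append_item(item, bucket=None):
    # Default values are evaluated once, at definition time, so a mutable
    # default such as [] would be shared across calls; None avoids that.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Using `None` as the default and creating the list inside the function is the usual way to keep separate calls from sharing state.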

Object-oriented programming

  1. Introduction to object-oriented programming
  2. Single and double underscores in Python
  3. self
  4. __init__
  5. public and private
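
The topics above can be illustrated with one small class. This is only a sketch: the class `Account` and its attributes are hypothetical and do not come from the original notes.

```python
class Account:
    def __init__(self, owner, balance=0):
        # __init__ runs when Account(...) is called; self is the new instance.
        self.owner = owner          # public attribute
        self._note = "internal"     # single underscore: private by convention only
        self.__balance = balance    # double underscore: name-mangled to _Account__balance

    def deposit(self, amount):
        # Inside the class body, self.__balance resolves to the mangled name.
        self.__balance += amount
        return self.__balance

acct = Account("Ann", 100)
new_balance = acct.deposit(50)
```

Python does not enforce privacy: the mangled attribute is still reachable as `acct._Account__balance`, which is why "private" here is a naming convention and not access control.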



numpy

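  • To display an entire array without truncation, call np.set_printoptions(threshold=np.inf)

  • Data loading

  1. Loading an npz file: data = np.load('file.npz') # loading directly from a web URL sometimes fails to download; in that case, download the file locally by other means and then load it

Here data is an NpzFile object, so its contents cannot be viewed directly.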

# Inspect the arrays stored in data
>>> data.files
['y', 'x']
>>> data.f.x   # or data['x']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # the array stored under 'x'
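
A self-contained round trip may make the NpzFile behaviour easier to try out. The sketch below writes to an in-memory buffer instead of a real file, and the keys `x` and `y` are chosen only to mirror the snippet above.

```python
import io
import numpy as np

# Save two arrays under the keys 'x' and 'y' into an in-memory buffer.
buffer = io.BytesIO()
np.savez(buffer, x=np.arange(10), y=np.arange(10) ** 2)
buffer.seek(0)

data = np.load(buffer)      # NpzFile: a lazy, dict-like container
keys = sorted(data.files)   # names of the stored arrays
x = data["x"]               # same array as data.f.x
```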
  • Matrix indexing and slicing

  • random module

  1. permutation
  2. seed
  3. uniform
  4. randint basic usage:
    np.random.randint(1, 5, (3, 3))   (low, high (exclusive), tuple giving the output shape)
    array([[4, 1, 4],
       [2, 3, 1],
       [2, 2, 1]])   # sample output; values vary per run and always lie in 1..4
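
The four functions listed above can be tried together. This is a sketch using the legacy `np.random` functions that the notes refer to; the seed value 0 is arbitrary.

```python
import numpy as np

np.random.seed(0)                         # fix the seed so results are reproducible
ints = np.random.randint(1, 5, (3, 3))    # integers in [1, 5): 5 itself never appears
floats = np.random.uniform(0.0, 1.0, 4)   # floats in [0.0, 1.0)
perm = np.random.permutation(5)           # shuffled copy of [0, 1, 2, 3, 4]

np.random.seed(0)
ints_again = np.random.randint(1, 5, (3, 3))  # same seed, same draw
```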
    
  • linalg module
  1. norm
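
As a brief illustration of norm, the sketch below computes a few common norms; the input values are made up.

```python
import numpy as np

v = np.array([3.0, 4.0])
l2 = np.linalg.norm(v)           # Euclidean (L2) norm, the default for vectors
l1 = np.linalg.norm(v, ord=1)    # L1 norm: sum of absolute values

m = np.array([[1.0, 2.0], [3.0, 4.0]])
fro = np.linalg.norm(m)                 # Frobenius norm, the default for matrices
row_norms = np.linalg.norm(m, axis=1)   # L2 norm of each row
```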



pandas

  • Getting the column names of a DataFrame
  1. df.columns.values
  2. list(df)
  • df.values returns the values of df as a NumPy array
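
A small sketch of the three accessors above, on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
names_array = df.columns.values   # NumPy array of column labels
names_list = list(df)             # iterating a DataFrame yields its column labels
values = df.values                # 2-D NumPy array of the data
```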

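  • pd.cut & pd.qcut: cut splits the range of values into equal-width bins, whereas qcut splits at quantiles (quartiles, the median, etc.), so each bin holds roughly the same number of values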
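
The difference shows up most clearly on skewed data. In this sketch the series is made up, and `labels=False` is used so that each value is reported as its integer bin number.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins over [1, 100]: the outlier stretches the range, so
# almost every value lands in the first bin.
width_codes = pd.cut(s, bins=4, labels=False)

# Quantile bins: edges are placed at the quartiles, so each bin gets
# the same number of values here.
quantile_codes = pd.qcut(s, q=4, labels=False)
```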

  • Checking for null values
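
The note gives no code for this item; one common approach, sketched here on made-up data, is to build a boolean mask with `isnull()` and then count it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})
null_mask = df.isnull()               # boolean DataFrame, True where a value is missing
nulls_per_column = df.isnull().sum()  # count of missing values in each column
has_any_null = bool(df.isnull().values.any())
```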

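  • df.groupby

  1. Example: the effect of the as_index parameter (see "What is as_index in groupby in pandas?")

    With as_index=True, the group keys become the index, so df.loc[] takes a label such as 'bk1'.

    With as_index=False, the result keeps a default integer index, so df.loc[] takes 0, 1, 2, ...

    In both cases df.iloc[1] works and selects the same group by position.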
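
The as_index behaviour described above can be checked on a tiny frame. The column names `book` and `price` are hypothetical; only the label 'bk1' comes from the note.

```python
import pandas as pd

df = pd.DataFrame({"book": ["bk1", "bk1", "bk2"], "price": [10, 20, 5]})

keyed = df.groupby("book", as_index=True)["price"].sum()   # Series indexed by 'book'
flat = df.groupby("book", as_index=False)["price"].sum()   # DataFrame with a 0..n-1 index

by_label = keyed.loc["bk1"]          # label lookup works because 'book' is the index
by_position = flat.loc[0, "price"]   # here the row labels are the integers 0, 1, ...
second_keyed = keyed.iloc[1]         # positional access works in both cases
second_flat = flat.iloc[1]["price"]
```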

  2. agg vs filter vs transform (link)

df.groupby('day')['total_bill'].mean()
df.groupby('day').filter(lambda x : x['total_bill'].mean() > 20)
df.groupby('day')['total_bill'].transform(lambda x : x/x.mean())

if we want to get a single value for each group -> use aggregate()
if we want to get a subset of the input rows -> use filter()
if we want to get a new value for each input row -> use transform()
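
To see the three result shapes side by side, here is a runnable sketch of the calls above on a small made-up frame; only the column names `day` and `total_bill` come from the note.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})

# Aggregate: one value per group.
per_group = df.groupby("day")["total_bill"].mean()
# Filter: keeps all rows of the groups that pass the test.
kept_rows = df.groupby("day").filter(lambda g: g["total_bill"].mean() > 20)
# Transform: one value per input row, aligned with df's index.
per_row = df.groupby("day")["total_bill"].transform(lambda x: x / x.mean())
```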

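  • df.drop: dropping rows or columns
    1. Dropping columns: df.drop(['label'], axis=1, inplace=True). axis=1 selects columns; inplace=True modifies df itself instead of returning a new DataFrame
    2. Dropping rows by position (see "why can't pd.drop() by index number row"): drop expects index labels, so convert positions to labels first:
      df.drop(df.index[[0, 2]]) or df.drop(df.index[np.arange(0, 2)])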
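
Why positions have to go through `df.index` is easier to see when the index is not made of integers. A sketch with made-up row labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "label": [0, 1, 0]}, index=["r1", "r2", "r3"])

no_label = df.drop(["label"], axis=1)               # returns a new frame without the column
no_first_last = df.drop(df.index[[0, 2]])           # positions 0 and 2 -> labels 'r1', 'r3'
no_first_two = df.drop(df.index[np.arange(0, 2)])   # positions 0 and 1 -> labels 'r1', 'r2'

df.drop(["label"], axis=1, inplace=True)            # modifies df itself and returns None
```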

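Why sort_values() is different from sort_values().values
  1. df = df.apply(lambda x: x.sort_values()): each sorted column keeps its index, so the columns are realigned on the index when they are combined
  2. df.apply(lambda x: x.sort_values().values): each column is first returned as a NumPy array, and the arrays are then combined into a DataFrame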
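
A small made-up frame shows the effect of the two variants:

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [1, 3, 2]})

# Each sorted column is a Series that still carries its original index, so
# combining the columns realigns them and the sort is effectively undone.
aligned = df.apply(lambda x: x.sort_values())

# .values strips the index, so each column stays sorted independently.
independent = df.apply(lambda x: x.sort_values().values)
```

Note that the second form breaks the row-wise relationship between columns, which is only appropriate when the columns are meant to be sorted independently.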

Find maximum value in col C in pandas dataframe while group by both col A and B

  1. df.groupby(['RT','Similarity','Name'],as_index=False)['Quality'].sum()
  2. How to replace one col values with another col values in conditions [duplicate]: select by condition with mask, which keeps the values where the condition is False and replaces the rest: df['RT'] = df['RT'].mask(df['similarity'] > 0.99, df['patch']) (see also "Pandas mask / where methods versus NumPy np.where")
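
The mask call can be compared with its `where` and `np.where` counterparts on made-up data. The column names follow the note; the values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RT": [1.0, 2.0, 3.0],
    "similarity": [0.50, 0.995, 1.00],
    "patch": [10.0, 20.0, 30.0],
})

cond = df["similarity"] > 0.99
masked = df["RT"].mask(cond, df["patch"])           # replace where cond is True
where_equiv = df["RT"].where(~cond, df["patch"])    # where() keeps values where its condition is True
np_equiv = np.where(cond, df["patch"], df["RT"])    # NumPy: pick patch if cond else RT
```

All three produce the same values; `mask` and `where` return a Series that keeps the index, while `np.where` returns a plain array.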


  • np.c_ : stacks 1-D arrays as column vectors and concatenates the inputs along the second axis
Examples
--------
>>> np.c_[np.array([1,2,3]), np.array([4,5,6])]
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
array([[1, 2, 3, 0, 0, 4, 5, 6]])



matplotlib-seaborn

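  1. Showing countplots for several features at once
  • scatter
  • bar
  1. Official barplot example
  1. The bins parameter sets how many rectangles (bars) the histogram uses; to draw only the kernel density curve, turn the histogram off (hist=False in seaborn's distplot)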



Web scraping
