-
Step 1: Ask questions
Ask questions based on the data, or before gathering data, to focus on the relevant parts of the data and direct the analysis toward meaningful insights.
-
Step 2: Wrangle data
- gather: retrieve and read the data
- assess: identify problems in the data's quality or structure
- clean: fix those problems by modifying, replacing, or removing data (a minimal pandas sketch follows)
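A minimal sketch of these three steps in pandas; 'data.csv' and 'col' are placeholder names, not from the notes:
import pandas as pd

# gather: retrieve and read the data
df = pd.read_csv('data.csv')

# assess: look for quality and structure problems
df.info()                 # dtypes and non-null counts
df.duplicated().sum()     # number of duplicated rows

# clean: modify, replace, or remove data
df = df.drop_duplicates()
df['col'] = df['col'].astype(float)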
-
Step 3: Perform EDA (Exploratory Data Analysis). Explore and augment the data to maximize the potential of analyses, visualizations, and models, including
- finding patterns
- visualizing relationships
- building intuition about the data
- feature engineering (removing outliers and creating better features; a sketch follows this list)
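A minimal sketch of the outlier-removal / feature-creation step; df, 'weight', and 'height' are hypothetical names used only for illustration:
# drop outliers more than 3 standard deviations from the mean
mean, std = df['weight'].mean(), df['weight'].std()
df = df[(df['weight'] - mean).abs() <= 3 * std]

# create a new feature from existing columns
df['bmi'] = df['weight'] / df['height'] ** 2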
-
Step 4: Draw conclusions (or even make predictions). Typically approached with machine learning or inferential statistics.
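A hypothetical example of the inferential-statistics route, assuming a df with 'group' and 'score' columns (names invented for illustration):
from scipy import stats

group_a = df.query('group == "A"')['score']
group_b = df.query('group == "B"')['score']
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# a small p_value is evidence that the two group means differ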
-
Step 5: Communicate your results
- Justify and convey the meaning of the insights found.
- Or, if the end goal is to build a system, share what was built, explain how the design decisions were reached, and report how well it performs.
-
glob
From file handling: open files that share a similar path structure.
import glob
for file_path in glob.glob('subfolder/*.txt'):
    with open(file_path, encoding='utf-8') as file:
        # read only the first line
        first_line = file.readline()
        # or read the whole content
        review_text = file.read()
-
Request online
import requests
import os

folder_name = "online_content"
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

url = "http: xxx"
response = requests.get(url)
# the downloaded content is in response.content
response.content
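A common follow-up (an assumption here, not part of the snippet above) is to write the downloaded bytes into the folder; the file name is a placeholder:
with open(os.path.join(folder_name, 'downloaded_file'), mode='wb') as file:
    file.write(response.content)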
-
read HTML
- workbook with bs4 (link)
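A minimal sketch of reading HTML with bs4, assuming a local file 'page.html' and the lxml parser (both are illustrative assumptions):
from bs4 import BeautifulSoup

with open('page.html', encoding='utf-8') as file:
    soup = BeautifulSoup(file, 'lxml')

title = soup.find('title').contents[0]                        # text of the first <title> tag
links = [a['href'] for a in soup.find_all('a', href=True)]    # all link targets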
-
unzip
import zipfile
with zipfile.ZipFile('file.zip', 'r') as myzip:
    myzip.extractall()
-
Read
pd.read_csv parameters:
- names : list of labels to use as the column names
- header : 0 means the first line of the file is the header; None means no header; 1 uses the second line
- index_col : column to use as the row index (note: index=False belongs to to_csv, to skip writing the index)
- sep : specify the delimiter
labels = ['col1', 'col2', 'col3']
df = pd.read_csv('data.csv', names=labels)
-
Data quality dimensions
- Completeness : do we have all the records we should have?
- Validity : are the records valid, i.e. do they conform to a defined schema? (e.g. a 4-digit zip code is invalid)
- Accuracy : valid but wrong data (e.g. a 10-pound adult)
- Consistency : valid and accurate, and represented the same way everywhere (e.g. "NY" vs "New York")
-
Goal: detect and document the issues that need to be addressed. Possible documenting phrases:
Erroneous datatype (zip_code, ); multiple data formats in column xxx; x is a float, not a string; full state names sometimes, abbreviations other times
- basic functions
shape | dtypes | describe() | info() | unique() | value_counts() | sort_values()
(shape and dtypes are attributes; the rest are methods)
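Typical calls (df and 'col' are placeholders):
df.shape                   # (rows, columns)
df.dtypes                  # dtype of each column
df.describe()              # summary statistics for numeric columns
df.info()                  # dtypes, non-null counts, memory usage
df['col'].unique()         # distinct values in a column
df['col'].value_counts()   # frequency of each value
df.sort_values('col')      # sort rows by a column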
Steps
- Define: convert each assessment into a defined cleaning task, starting with an action word, like a how-to guide or pseudocode:
"Remove xx before every animal name using string slicing"; "Replace ! with . in the body weight and brain weight columns"; Recalculate | Melt | Extract | Isolate
- Code: convert the definition to code
- Test: test the dataset (e.g. using assert statements)
- assert : check that no phrase in asap_list appears in df.startDate.values
for phrase in asap_list:
    assert phrase not in df.startDate.values
-
rename columns
# work from df
df.rename(columns={'A': 'A1'}, inplace=True)
df.rename(index={0: 'A1'})
- replace space with underscores and lowercase label
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
- check whether two DataFrames' columns are identical
(df.columns == df2.columns).all()
-
incorrect data types
df['timestamp'] = pd.to_datetime(df['timestamp'])
df[col] = df[col].astype(str)   # or int, float, 'category'
select rows where a string column contains a pattern: df[df.col.str.contains('pattern')]
-
Missing values
-
check the number of rows with missing values
df.isnull().any(axis=1).sum()
number of columns with missing values
df.isnull().any(axis=0).sum()
# example: impute with the mean
mean = df.col.mean()
df['col'] = df['col'].fillna(mean)
-
drop missing data
- drop whole columns
df.drop(['col1', 'col2'], axis=1, inplace=True)
- drop rows with NA
df.dropna(axis=0, inplace=True)
or drop a row by index: df.drop(df.query('age == -1').index[0])
- drop columns with NA
df.dropna(axis=1, inplace=True)
-
Imputing
- filling missing data values with other values.
- pad zip codes to 5 digits
df.zip_code.astype(str).str.pad(5,fillchar='0')
-
-
duplicates
df.duplicated()                       # boolean Series marking duplicated rows
df.duplicated().sum()                 # number of duplicated records
df.drop_duplicates()                  # drop them
df[df.duplicated("col", keep=False)]  # show every row duplicated on col
# with the default keep='first', only the later occurrences are marked as duplicates
-
query
- select rows with a boolean expression (like indexing with a mask)
df.query('col == "Y"')
- easy to subset
low = df.query('alcohol < {}'.format(df['alcohol'].median()))
- select by membership
df.query('col in ["val1", "val2"]')
-
String extraction (review: regex reference below)
series.str.extract(r'([ab])?(\d)')
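For example, on a throwaway Series (values made up for illustration), each capture group becomes a column:
s = pd.Series(['a1', 'b2', '3'])
s.str.extract(r'([ab])?(\d)')
#      0  1
# 0    a  1
# 1    b  2
# 2  NaN  3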
-
Reshape
- melt()
pd.melt(df,
        id_vars=[],      # column(s) to use as identifier variables
        value_vars=[],   # column(s) to unpivot
        var_name='',     # name to use for the variable column
        value_name='')   # name to use for the value column
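A small usage example on a hypothetical wide table (column names invented for illustration):
wide = pd.DataFrame({'id': [1, 2], '2019': [10, 20], '2020': [30, 40]})
long = pd.melt(wide, id_vars=['id'], value_vars=['2019', '2020'],
               var_name='year', value_name='amount')
#    id  year  amount
# 0   1  2019      10
# 1   2  2019      20
# 2   1  2020      30
# 3   2  2020      40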
-
Current working directory
pwd
-
List files in current dir
os.listdir()
-
List files in subdir
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk(subdir):
    listOfFiles += [os.path.join(dirpath, f) for f in filenames]
-
Make a new directory in the current dir
os.makedirs(folder_name)
-
Make a directory if it doesn't already exist
folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
-
-
Full Path
-
Loop over files in a folder
- use os
for file_ele in os.listdir(folder):
    with open(os.path.join(folder, file_ele)) as file:
        soup = BeautifulSoup(file, 'lxml')
- use glob
import glob
for file in glob.glob('subfolder/*.txt'):
    ...
- SQLAlchemy
from sqlalchemy import create_engine

# create engine
engine = create_engine('sqlite:///bestofrt.db')
# store data in the database
df.to_sql('master', engine, index=False)
# read data back
df_gather = pd.read_sql('SELECT * FROM master', engine)
Expression | Character | Notes |
---|---|---|
Letters | abc | |
Digit | \d | any digit from 0 to 9 |
Non-digit character | \D | |
Wildcard | . | matches any single character (letter, digit, whitespace, ...) |
Only a, b, or c | [abc] | matches a single a, b, or c and nothing else |
Not a, b, nor c | [^abc] | matches any single character except a, b, or c |
Characters a to z | [a-z] | |
Numbers 0 to 9 | [0-9] | |
Any alphanumeric character | \w | |
Any non-alphanumeric character | \W | |
m repetitions | {m} | a{3} matches a exactly three times; [xyz]{5} matches five characters, each of which can be x, y, or z |
Zero or more repetitions | * | .* matches zero or more of any character |
One or more repetitions | + | x+ matches one or more x |
Optional character | ? | ab?c matches "abc" or "ac" |
Any whitespace | \s | |
Any non-whitespace | \S | |
Starts and ends | ^...$ | ^success only matches a string starting with success; [^..] inside brackets is for excluding characters |
Capture group | (...) | ^(file.+)\.pdf$ matches file_record_transcript.pdf and file_099.pdf |
Capture subgroup | (a(bc)) | |
Capture all | (.*) | |
Matches abc or def | (abc\|def) | ([cb]ats*\|[dh]ogs?) would match cats or bats, or dogs or hogs |
-
Match all numbers
numberset = [ 3.145, -255.31, 128, 1.9e1, 123,340.0 ]
Code = ^-?\d+(,\d+)*(\.\d+(e\d+)?)?$
-
Capture HTML tags
Text = [ <a>This is a link</a>, <a href='https://regexone.com'>Link</a>, <div class='test_style'>Test</div> ]
Code = (\w+)
-
Phone numbers
Format = xxx-xxx-xxxx, +1 (xxx)xxx-xxxx, xxx xxx xxxx
Code = ((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})
-
Email address
([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])
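These patterns can be applied with re or pandas; the sample strings below are made up for illustration:
import re

phone_pattern = r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})'
re.findall(phone_pattern, 'call 415-555-1234 or +1 (415) 555 9999')
# ['415-555-1234', '+1 (415) 555 9999']

emails = pd.Series(['ann@example.com', 'no email here'])
emails.str.extract(r'([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])')
#                  0
# 0  ann@example.com
# 1              NaN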
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.