Skip to content

Latest commit

 

History

History
244 lines (202 loc) · 6.18 KB

01-starting-with-data.md

File metadata and controls

244 lines (202 loc) · 6.18 KB

##Pandas

We've talked a whole bunch about looping and opening files and reading them. So far we've worked with plain-text files. But some files have unusual dialects. One common example of such a file is an Excel file.

You should have downloaded a file called 'Sites_simple.xls'. This is the same file as we used earlier, but written in Excel format. To use it, we need to parse that file. The library 'pandas' can do this.

We will first import pandas.

>>> import pandas

Many users have written their own libraries for doing fun things. These users may choose to distribute these functions for general use by the community. Pandas (http://pandas.pydata.org/) is such a library.

When we use a library that we downloaded from the internet, it is not part of Python's base functionality. Therefore, we tell Python that we'd like to use this extra library with the import command.

Now we're going to take our spreadsheet and use Pandas to import it.

>>> xl = pandas.ExcelFile('surveys.xlsx')
>>> xl
<pandas.io.parsers.ExcelFile object at 0x102a39610>

This should seem familiar - much like with our initial exercises in opening files, we are referencing the file and not the data within it. So let's liberate that data!

Excel documents are made up of sheets, which are linked documents containing different subsets of data. We're keeping it simple - all our data is in one sheet.

>>> xl.sheet_names
[u'Sheet1', u'Sheet2', u'Sheet3']
>>> df = xl.parse('Sheet1')

       record_id  month  day  year  plot species  sex  wgt
0              1      7   16  1977     2     NaN    M  NaN
1              2      7   16  1977     3     NaN    M  NaN
2              3      7   16  1977     2      DM    F  NaN
3              4      7   16  1977     7      DM    M  NaN
4              5      7   16  1977     3      DM    M  NaN
5              6      7   16  1977     1      PF    M  NaN
6              7      7   16  1977     2      PE    F  NaN
7              8      7   16  1977     1      DM    M  NaN
8              9      7   16  1977     1      DM    F  NaN
[35549 rows x 8 columns]

If I wanted to see a specific observation in the zeroeth row, say, the third one, I could index like so:

>>> df.ix[0,3]
180.0

What happens if we try the following:

df[0,3]

How about:

df[:3]

Alternatively, I could choose to view the third column for all rows:

>>> df.ix[:,3]
0     1977
1     1977
2     1977
3     1977
4     1977
5     1977
6     1977
7     1977
8     1977
9     1977
10    1977
11    1977
12    1977
13    1977
14    1977
...
35534    2002
35535    2002
35536    2002
35537    2002
35538    2002
35539    2002
35540    2002
35541    2002
35542    2002
35543    2002
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64

Try slicing up your data in different ways. Does everything give you the expected output? What does the dtype keyword mean?

That's all well and good, but we have these nice row and column names! Let's use them!

>>> df = xl.parse('Sheet1', index_col = 0,header=0)
>>> df.head

<bound method DataFrame.head of            month  day  year  plot species  sex  wgt
record_id                                          
1              7   16  1977     2     NaN    M  NaN
2              7   16  1977     3     NaN    M  NaN
3              7   16  1977     2      DM    F  NaN
4              7   16  1977     7      DM    M  NaN
5              7   16  1977     3      DM    M  NaN
6              7   16  1977     1      PF    M  NaN
7              7   16  1977     2      PE    F  NaN
8              7   16  1977     1      DM    M  NaN
9              7   16  1977     1      DM    F  NaN
10             7   16  1977     6      PF    F  NaN
11             7   16  1977     5      DS    F  NaN
12             7   16  1977     7      DM    M  NaN
13             7   16  1977     3      DM    M  NaN
14             7   16  1977     8      DM  NaN  NaN
15             7   16  1977     6      DM    F  NaN
16             7   16  1977     4      DM    F  NaN
17             7   16  1977     3      DS    F  NaN

df['month']
Out[26]: 
record_id
1            7
2            7
3            7
4            7
5            7
6            7
7            7
8            7
9            7
10           7
11           7
12           7
13           7
14           7
15           7
...
35535        12
35536        12
35537        12
35538        12
35539        12
35540        12
35541        12
35542        12
35543        12
35544        12
35545        12
35546        12
35547        12
35548        12
35549        12
Name: month, Length: 35549, dtype: int64

By specifying an index column as the zeroeth column, we are now able to access the data by the name of the column.

When we specify the header as the zeroeth row, we get to do fun things like so:

>>> df.ix[0:5,'month']

record_id
1            7
2            7
3            7
4            7
5            7
Name: month, dtype: int64

In this way, we can break our data down for manipulation. We can use this to build more complex functions:

>>> a = df.ix[:,'wgt']
>>> a
df['wgt']

record_id
1           NaN
2           NaN
3           NaN
4           NaN
5           NaN
6           NaN
7           NaN
8           NaN
9           NaN
10          NaN
11          NaN
12          NaN
13          NaN
14          NaN
15          NaN
...
35535        53
35536        42
35537        46
35538        31
35539        68
35540        23
35541        31
35542        29
35543        34
35544       NaN
35545       NaN
35546       NaN
35547        14
35548        51
35549       NaN
Name: wgt, Length: 35549, dtype: float64

a behaves like a list, so we can do list-based operations on it.

Because Excel is kind of terrible (and writing it out requires another library), you can output your data to a csv like so:

>>> outfile = open('output.csv', 'w')
>>> df.to_csv(outfile)
>>> outfile.close()

##Exercise

I've attached a sample of different numpy functions in a file called 'functions.md'. Have a look at them. Try writing a two function set: one function that returns a slice of data, and another that does a function of your choice on that slice.