It isn't often that your dataset is "clean" enough to run one of our toolkits on it in a meaningful way right after import. Such is life...data is messy. SFrames enable you to complete data cleaning tasks in a scalable way, even on datasets that are much larger than your computer's memory.
A problem you may
have noticed in the song metadata is that some songs' year value is 0.
Suppose we want to change those to a missing value, so that they do not skew a
summary statistic over the column (e.g. mean
or min
). If we knew about this
before parsing, or are willing to parse the file again, we could add 0 to the
na_values
option of read_csv
. Alternatively, we can apply an arbitrary
function to one or multiple columns of an
SFrame. Here's how to replace those zeroes with missing values with a Python
lambda function:
songs['year'] = songs['year'].apply(lambda x: None if x == 0 else x)
songs.head(5)
+--------------------+----------------------+--------------------------------+
| song_id | title | release |
+--------------------+----------------------+--------------------------------+
| SOQMMHC12AB0180CB8 | Silent Night | Monster Ballads X-Mas |
| SOVFVAK12A8C1350D9 | Tanssi vaan | Karkuteill\xc3\xa4 |
| SOGTUKN12AB017F4F1 | No One Could Ever | Butter |
| SOBNYVR12A8C13558C | Si Vos Quer\xc3\xa9s | De Culo |
| SOHSBXH12A8C13B0DF | Tangle Of Aspens | Rene Ablaze Presents Winte ... |
+--------------------+----------------------+--------------------------------+
+------------------+------+
| artist_name | year |
+------------------+------+
| Faster Pussy cat | 2003 |
| Karkkiautomaatti | 1995 |
| Hudson Mohawke | 2006 |
| Yerba Brava | 2003 |
| Der Mystic | None |
+------------------+------+
[5 rows x 5 columns]
Notice we had to reassign the resulting column back to our SFrame. This is because the content of the SFrame's columns (a separate data structure called an SArray, is immutable. SFrames can add and subtract columns liberally though, as it essentially is just a carrier of references to SArrays.
We used a lambda function here because it is the simplest way to instantiate a
small function like that. The apply
method will take a named function (a normal Python
function that starts with def
) as well. As long as the function takes one
parameter and returns one value, it can be applied to a column.
We can also apply a function to multiple columns. Suppose we want
to add a column of the number of times the word 'love' is used in the title
and artist
column:
songs['love_count'] = songs[['title', 'artist_name']].apply(
lambda row: sum(x.lower().split(' ').count('love') for x in row.values()))
songs.topk('love_count', k=5)
+--------------------+--------------------------------+
| song_id | title |
+--------------------+--------------------------------+
| SOMYDCX12A8AE4836B | The Love Story (Part 1) In ... |
| SONXAVM12AB017AA1D | Document 15 |
| SOAWJOC12A8C1367A7 | Black Black Window |
| SOXAVWF12A8AE4922C | One Piece |
| SOAJRDR12A8C1383B1 | Love Love Love |
+--------------------+--------------------------------+
+--------------------------------+--------------------------------+------+------------+
| release | artist_name | year | love_count |
+--------------------------------+--------------------------------+------+------------+
| Skid Row / 34 Hours | Skid Row | None | 4 |
| Feels_ Feathers_ Bog and Bees | Low Low Low La La La Love ... | None | 3 |
| Ends Of June | Low Low Low La La La Love ... | 2007 | 3 |
| Low Low Low La La La Love ... | Low Low Low La La La Love ... | 2007 | 3 |
| Radio Hitz | Spider Murphy Gang | None | 3 |
+--------------------------------+--------------------------------+------+------------+
[5 rows x 6 columns]
A few things to note here. We first select a subset of columns using a list within square brackets. This is useful in general, but it helps us with performance in this case, as fewer values are scanned. Also, when apply is called on an SFrame instead of an SArray as shown here, the input of the lambda function is a dictionary where the keys are the column names, and the values correspond to that row's values.
Another very useful and common operation is the ability to select columns based on types. For instance, this extracts all the columns containing strings.
song[str]
Data:
+--------------------+-------------------------------+
| song_id | title |
+--------------------+-------------------------------+
| SOQMMHC12AB0180CB8 | Silent Night |
| SOVFVAK12A8C1350D9 | Tanssi vaan |
| SOGTUKN12AB017F4F1 | No One Could Ever |
| SOBNYVR12A8C13558C | Si Vos Querés |
| SOHSBXH12A8C13B0DF | Tangle Of Aspens |
| SOZVAPQ12A8C13B63C | Symphony No. 1 G minor "Si... |
| SOQVRHI12A6D4FB2D7 | We Have Got Love |
| SOEYRFT12AB018936C | 2 Da Beat Ch'yall |
| SOPMIYT12A6D4F851E | Goodbye |
| SOJCFMH12A8C13B0C2 | Mama_ mama can't you see ? |
+--------------------+-------------------------------+
+-------------------------------+-------------------------------+
| release | artist_name |
+-------------------------------+-------------------------------+
| Monster Ballads X-Mas | Faster Pussy cat |
| Karkuteillä | Karkkiautomaatti |
| Butter | Hudson Mohawke |
| De Culo | Yerba Brava |
| Rene Ablaze Presents Winte... | Der Mystic |
| Berwald: Symphonies Nos. 1... | David Montgomery |
| Strictly The Best Vol. 34 | Sasha / Turbulence |
| Da Bomb | Kris Kross |
| Danny Boy | Joseph Locke |
| March to cadence with the ... | The Sun Harbor's Chorus-Do... |
+-------------------------------+-------------------------------+
Now, you may not feel comfortable transforming a column without inspecting more
than the first 10 rows of it, as we did with the year
column. To quickly get
a summary of the column, we can do:
songs['year'].sketch_summary()
+--------------------+---------------+----------+
| item | value | is exact |
+--------------------+---------------+----------+
| Length | 1000000 | Yes |
| Min | 0.0 | Yes |
| Max | 2011.0 | Yes |
| Mean | 1030.325652 | Yes |
| Sum | 1030325652.0 | Yes |
| Variance | 997490.582407 | Yes |
| Standard Deviation | 998.744503067 | Yes |
| # Missing Values | 0 | Yes |
| # unique values | 90 | No |
+--------------------+---------------+----------+
Most frequent items:
+-------+--------+-------+-------+-------+-------+-------+-------+-------+-------+
| value | 0 | 2007 | 2006 | 2005 | 2008 | 2009 | 2004 | 2003 | 2002 |
+-------+--------+-------+-------+-------+-------+-------+-------+-------+-------+
| count | 484424 | 39414 | 37546 | 34960 | 34770 | 31051 | 29618 | 27389 | 23472 |
+-------+--------+-------+-------+-------+-------+-------+-------+-------+-------+
+-------+
| 2001 |
+-------+
| 21604 |
+-------+
Quantiles:
+-----+-----+-----+-----+--------+--------+--------+--------+--------+
| 0% | 1% | 5% | 25% | 50% | 75% | 95% | 99% | 100% |
+-----+-----+-----+-----+--------+--------+--------+--------+--------+
| 0.0 | 0.0 | 0.0 | 0.0 | 1970.0 | 2002.0 | 2008.0 | 2009.0 | 2011.0 |
+-----+-----+-----+-----+--------+--------+--------+--------+--------+
It appears that there are no other bogus years than 0. This summary splits its values into "exact" and "approximate". The approximate values could certainly be returned as exact values, but for the summary we use approximate values to make sure exploring large datasets is scalable. The methods we use only do a single pass of the data in the column, and each operation has well- defined bounds on how wrong the answer will be, which are listed in our API Reference.
Using the most frequent items and quantiles described here, you can probably almost picture the distribution of years, where the tallest part is squarely within the 2000s. Fortunately, we don't have to just picture it in our heads. GraphLab Canvas provides visualizations of SFrames, as well as other data structures. GraphLab Canvas is covered in depth in the Visualization section. To view a histogram of these approximate quantiles, we run:
songs['year'].show()
We have done some exploration, transformation, and feature generation. Let's spend some time filtering values we won't care about later. For example, perhaps we'll need an SFrame with only dated songs. This basic filter operation looks like this:
dated_songs = songs[songs['year'] != None]
dated_songs
+--------------------+--------------------------------+
| song_id | title |
+--------------------+--------------------------------+
| SOQMMHC12AB0180CB8 | Silent Night |
| SOVFVAK12A8C1350D9 | Tanssi vaan |
| SOGTUKN12AB017F4F1 | No One Could Ever |
| SOBNYVR12A8C13558C | Si Vos Quer\xc3\xa9s |
| SOEYRFT12AB018936C | 2 Da Beat Ch'yall |
| SOYGNWH12AB018191E | L'antarctique |
| SOLJTLX12AB01890ED | El hijo del pueblo |
| SOMPVQB12A8C1379BB | Pilots |
| SOSDCFG12AB0184647 | 006 |
| SOBARPM12A8C133DFF | (Looking For) The Heart Of ... |
+--------------------+--------------------------------+
+--------------------------------+------------------+------+------------+
| release | artist_name | year | love_count |
+--------------------------------+------------------+------+------------+
| Monster Ballads X-Mas | Faster Pussy cat | 2003 | 0 |
| Karkuteill\xc3\xa4 | Karkkiautomaatti | 1995 | 0 |
| Butter | Hudson Mohawke | 2006 | 0 |
| De Culo | Yerba Brava | 2003 | 0 |
| Da Bomb | Kris Kross | 1993 | 0 |
| Des cobras des tarentules | 3 Gars Su'l Sofa | 2007 | 0 |
| 32 Grandes \xc3\x89xitos CD 2 | Jorge Negrete | 1997 | 0 |
| The Loyal | Tiger Lou | 2005 | 0 |
| Lena 20 \xc3\x85r | Lena Philipsson | 1998 | 0 |
| Cover Girl | Shawn Colvin | 1994 | 0 |
+--------------------------------+------------------+------+------------+
[? rows x 6 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.
The output does a good job at explaining what is happening here. SFrames will not do any work if it isn't required right away. This way, if you decide this filter operation isn't for you after looking at the first few rows, GraphLab Create won't waste computation time doing it anyways. However, it is important to verify that the missing values were indeed removed, and that we indeed removed 484424 rows, so we'll force materialization of the new SFrame.
len(dated_songs)
515576
Why does this filtering syntax work? What we're actually doing is placing an
SArray in the square brackets. A comparison operator applied to an SArray
returns a new SArray of the same length as the original, but with values that
correspond to true
or false
based on the operator. Here's what it looks
like:
songs['year'] != None
dtype: int
Rows: 1000000
[1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, ... ]
We may want to use more than one comparison operator, for which you must place each comparison statement in parentheses and use the bitwise boolean logic operators to combine the statements, as Python does not allow the overloading of logical operators. Perhaps we are building a music recommender with this data, and we would like to only use "reasonable" play counts, for some definition of reasonable:
reasonable_usage = usage_data[(usage_data['listen_count'] >= 10) & (usage_data['listen_count'] <= 500)]
len(reasonable_usage)
114026
You can also write a lambda function to filter using the filter
function,
which you can read about in the API Reference.
Another important way to filter a dataset is to get rid of duplicate data. A nice way to search for duplicate data is to visualize the SFrame using GraphLab Canvas.
songs.show()
It appears our song_id
is not completely unique. This would make merging the
songs
and usage_data
SFrames error-prone if all we want to do is add song
title information to the existing usage data. In this particular dataset, these
repeat songs are included because they may have been released on several
different albums (movie soundtracks, radio singles, etc.). If we do not care
about which album release is included in the dataset, we can filter those
duplicates like this:
other_cols = songs.column_names()
other_cols.remove('song_id')
agg_list = [gl.aggregate.SELECT_ONE(i) for i in other_cols]
unique_songs = songs.groupby('song_id', dict(zip(other_cols, agg_list)))
This code block needs some further explanation. It is centered around calling
groupby
with the SELECT_ONE
aggregator. This selects a random
representative row from each group. You must explicitly denote which columns
will use this aggregator, so the list comprehension gathers all other columns
than the one we are grouping by and uses the SELECT_ONE
aggregator for each
one. When used like this, SELECT_ONE
will use the same random row for each
column. This is great if it is not important which of the duplicates you pick.
If it is, another aggregator like MIN
or MAX
may be in order.
Suppose we actually want to see the songs that have the highest play count? Now we can correctly group each song and aggregate its listen count, and then join the result to the song metadata to see the song titles.
tmp = usage_data.groupby(['song_id'], {'total_listens': gl.aggregate.SUM('listen_count'),
'num_unique_users': gl.aggregate.COUNT('user_id')})
tmp.join(songs, ['song_id']).topk('total_listens')
+--------------------+------------------+---------------+
| song_id | num_unique_users | total_listens |
+--------------------+------------------+---------------+
| SOBONKR12A58A7A7E0 | 6412 | 54136 |
| SOAUWYT12A81C206F1 | 7032 | 49253 |
| SOSXLTC12AF72A7F54 | 6145 | 41418 |
| SOEGIYH12A6D4FC0E3 | 5385 | 31153 |
| SOFRQTD12A81C233C0 | 8277 | 31036 |
| SOAXGDH12A8C13F8A1 | 6949 | 26663 |
| SONYKOW12AB01849C9 | 5841 | 22100 |
| SOPUCYA12A8C13A694 | 3526 | 21019 |
| SOUFTBI12AB0183F65 | 2887 | 19645 |
| SOVDSJC12A58A7A271 | 2866 | 18309 |
+--------------------+------------------+---------------+
+--------------------------------+--------------------------------+
| title | release |
+--------------------------------+--------------------------------+
| You're The One | If There Was A Way |
| Undo | Vespertine Live |
| Revelry | Only By The Night |
| Horn Concerto No. 4 in E f ... | Mozart - Eine kleine Nacht ... |
| Sehr kosmisch | Musik von Harmonia |
| Dog Days Are Over (Radio Edit) | Now That's What I Call Mus ... |
| Secrets | Waking Up |
| Canada | The End Is Here |
| Invalid | Fermi Paradox |
| Ain't Misbehavin | Summertime |
+--------------------------------+--------------------------------+
+--------------------------------+------+------------+
| artist_name | year | love_count |
+--------------------------------+------+------------+
| Dwight Yoakam | 1990 | 0 |
| Bj\xc3\xb6rk | 2001 | 0 |
| Kings Of Leon | 2008 | 0 |
| Barry Tuckwell/Academy of ... | None | 0 |
| Harmonia | None | 0 |
| Florence + The Machine | None | 0 |
| OneRepublic | 2009 | 0 |
| Five Iron Frenzy | None | 0 |
| Tub Ring | 2002 | 0 |
| Sam Cooke | None | 0 |
+--------------------------------+------+------------+
[10 rows x 8 columns]
The usage_data
table is already in a great format for feeding to a recommender
algorithm as it has user and song identifiers, and some form of a metric (number
of listens) to rate how much the user liked the song. The problem here is that
users that listened to a song a lot would skew the recommendations. At some
point, once a user plays a song enough times, you know they really like it.
Perhaps we could translate the listen_count
into a rating instead. Here is a simple way to do it:
s = usage_data['listen_count'].sketch_summary()
import numpy
buckets = numpy.linspace(s.quantile(.005), s.quantile(.995), 5)
def bucketize(x):
cur_bucket = 0
for i in range(0,5):
cur_bucket += 1
if x <= buckets[i]:
break
return cur_bucket
usage_data['rating'] = usage_data['listen_count'].apply(bucketize)
usage_data
Columns:
user_id str
song_id str
listen_count int
rating int
Rows: 2000000
Data:
+--------------------------------+--------------------+--------------+--------+
| user_id | song_id | listen_count | rating |
+--------------------------------+--------------------+--------------+--------+
| b80344d063b5ccb3212f76538f ... | SOAKIMP12A8C130995 | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SOBBMDR12A8C13253B | 2 | 2 |
| b80344d063b5ccb3212f76538f ... | SOBXHDL12A81C204C0 | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SOBYHAJ12A6701BF1D | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SODACBL12A8C13C273 | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SODDNQT12A6D4F5F7E | 5 | 2 |
| b80344d063b5ccb3212f76538f ... | SODXRTY12AB0180F3B | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SOFGUAY12AB017B0A8 | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SOFRQTD12A81C233C0 | 1 | 1 |
| b80344d063b5ccb3212f76538f ... | SOHQWYZ12A6D4FA701 | 1 | 1 |
| ... | ... | ... | ... |
+--------------------------------+--------------------+--------------+--------+
[2000000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
SArrays are strongly-typed and some operations only work on certain types. Two
types deserve some special consideration in this user guide: list
and dict
.
These types can hold values of any type supported by SArrays, including
themselves. So you can have an SFrame with a column of dictionaries that each
have values that are lists of lists of dicts with mixed strings and
integers... anyway, you get the idea!
The dataset we're working with right now does not have any of these types, so we can show how to convert one or several columns to one of these iterable types. For instance, suppose we want a list of all albums with a list of each song that was on the album. We would obtain this list from our metadata like this:
albums = songs.groupby(['release','artist_name'], {'tracks': gl.aggregate.CONCAT('title'),
'years': gl.aggregate.CONCAT('year')})
albums
+--------------------------------+--------------------------------+
| artist_name | release |
+--------------------------------+--------------------------------+
| Veruca Salt | Eight Arms To Hold You |
| Les Compagnons De La Chanson | Heritage - Le Chant De Mal ... |
| Nelly / Fat Joe / Young Tr ... | Sweat |
| The Grouch | My Baddest B*tches |
| Ozzy Osbourne | Diary of a madman / Bark a ... |
| Peter Hunnigale | Reggae Hits Vol. 32 |
| Burning Spear | Studio One Classics |
| Bond | New Classix 2008 |
| Lee Coombs feat. Katherine ... | Control |
| Stevie Wonder | Songs In The Key Of Life |
+--------------------------------+--------------------------------+
+--------------------------------+--------------------------------+
| tracks | years |
+--------------------------------+--------------------------------+
| [\'One Last Time\', \'With ... | array('d', [1997.0, 1997.0 ... |
| ['I Commedianti', 'Il Est ... | None |
| ['Grand Hang Out'] | array('d', [2004.0]) |
| ['Silly Putty (Zion I Feat ... | array('d', [1999.0]) |
| ["You Can\'t Kill Rock And ... | array('d', [1981.0, 1983.0 ... |
| ['Weeks Go By'] | None |
| ['Rocking Time'] | array('d', [1973.0]) |
| ['Allegretto', 'Kashmir'] | None |
| ['Control (10Rapid Remix)' ... | None |
| ['Ngiculela-Es Una Histori ... | array('d', [1976.0]) |
| ... | ... |
+--------------------------------+--------------------------------+
[230558 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
The CONCAT aggregator simply creates a list of all values in the given column for each group. I included the year for debugging purposes, since I didn't know if the invariant of "every song on the same release has the same year" was true in this dataset. Even looking at the first row, this is clearly not true:
albums[0]
{'artist_name': 'Veruca Salt',
'release': 'Eight Arms To Hold You',
'tracks': ['With David Bowie',
'One Last Time',
'The Morning Sad',
'Awesome',
'Stoneface',
"Don't Make Me Prove It",
'Straight',
'Earthcrosser',
'Benjamin',
'Volcano Girls',
'Shutterbug',
'Sound Of The Bell',
'Loneliness Is Worse',
'Venus Man Trap'],
'years': array('d', [1997.0, 1997.0, 1997.0, 1997.0, 1997.0, 1997.0, 1994.0, 1997.0, 1997.0, 1997.0, 1997.0, 1997.0, 1997.0, 1997.0])}
In light of this complication, and for demonstration purposes, let's have our 'tracks' column contain dictionaries instead, where the key is the year and the value is a list of tracks.
albums = songs.groupby(['release','artist_name','year'], {'tracks':gl.aggregate.CONCAT('title')})
albums = albums.unstack(column=['year','tracks'], new_column_name='track_dict')
albums
+--------------------------------+--------------------------------+
| artist_name | release |
+--------------------------------+--------------------------------+
| Veruca Salt | Eight Arms To Hold You |
| Les Compagnons De La Chanson | Heritage - Le Chant De Mal ... |
| Nelly / Fat Joe / Young Tr ... | Sweat |
| The Grouch | My Baddest B*tches |
| Ozzy Osbourne | Diary of a madman / Bark a ... |
| Peter Hunnigale | Reggae Hits Vol. 32 |
| Burning Spear | Studio One Classics |
| Bond | New Classix 2008 |
| Lee Coombs feat. Katherine ... | Control |
| Stevie Wonder | Songs In The Key Of Life |
+--------------------------------+--------------------------------+
+--------------------------------+
| track_dict |
+--------------------------------+
| {1994: [\'Straight\'], 199 ... |
| None |
| {2004: ['Grand Hang Out']} |
| {1999: ['Simple Man']} |
| {1986: [\'Secret Loser\'], ... |
| None |
| {1973: ['Rocking Time']} |
| None |
| None |
| {1976: ['Have A Talk With ... |
| ... |
+--------------------------------+
[230558 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
The unstack
function is essentially a convenience function for calling
groupby
on all other columns with the CONCAT
aggregator. This operation
can be undone with stack
, which I won't take time to show in detail.
With our data organized this way, we can get some statistics on how many albums have this oddity and see that it's actually somewhat prevalent for this to occur:
albums['num_years'] = albums['track_dict'].item_length()
albums['num_years'].show()
We can still recover the full track listing with some dict operations, so
suppose we want both a list
and a dict
representation of the tracks.
albums['track_list'] = albums['track_dict'].dict_values()
albums
+--------------------------------+--------------------------------+
| artist_name | release |
+--------------------------------+--------------------------------+
| Veruca Salt | Eight Arms To Hold You |
| Les Compagnons De La Chanson | Heritage - Le Chant De Mal ... |
| Nelly / Fat Joe / Young Tr ... | Sweat |
| The Grouch | My Baddest B*tches |
| Ozzy Osbourne | Diary of a madman / Bark a ... |
| Peter Hunnigale | Reggae Hits Vol. 32 |
| Burning Spear | Studio One Classics |
| Bond | New Classix 2008 |
| Lee Coombs feat. Katherine ... | Control |
| Stevie Wonder | Songs In The Key Of Life |
+--------------------------------+--------------------------------+
+--------------------------------+-----------+--------------------------------+
| track_dict | num_years | track_list |
+--------------------------------+-----------+--------------------------------+
| {1994: [\'Straight\'], 199 ... | 2 | [[\'Straight\'], [\'With D ... |
| None | None | None |
| {2004: ['Grand Hang Out']} | 1 | [['Grand Hang Out']] |
| {1999: ['Simple Man']} | 1 | [['Simple Man']] |
| {1986: [\'Secret Loser\'], ... | 3 | [["You Can\'t Kill Rock An ... |
| None | None | None |
| {1973: ['Rocking Time']} | 1 | [['Rocking Time']] |
| None | None | None |
| None | None | None |
| {1976: ['Have A Talk With ... | 1 | [['Have A Talk With God']] |
| ... | ... | ... |
+--------------------------------+-----------+--------------------------------+
[230558 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
We have almost recovered our old column, but we have a list of lists instead of
just a single list. Fixing that is a simple apply
away:
import itertools
albums['track_list'] = albums['track_list'].apply(lambda x: list(itertools.chain(*x)))
albums
+--------------------------------+--------------------------------+
| artist_name | release |
+--------------------------------+--------------------------------+
| Veruca Salt | Eight Arms To Hold You |
| Les Compagnons De La Chanson | Heritage - Le Chant De Mal ... |
| Nelly / Fat Joe / Young Tr ... | Sweat |
| The Grouch | My Baddest B*tches |
| Ozzy Osbourne | Diary of a madman / Bark a ... |
| Peter Hunnigale | Reggae Hits Vol. 32 |
| Burning Spear | Studio One Classics |
| Bond | New Classix 2008 |
| Lee Coombs feat. Katherine ... | Control |
| Stevie Wonder | Songs In The Key Of Life |
+--------------------------------+--------------------------------+
+--------------------------------+-----------+--------------------------------+
| track_dict | num_years | track_list |
+--------------------------------+-----------+--------------------------------+
| {1994: [\'Straight\'], 199 ... | 2 | [\'Straight\', \'With Davi ... |
| None | None | None |
| {2004: ['Grand Hang Out']} | 1 | ['Grand Hang Out'] |
| {1999: ['Simple Man']} | 1 | ['Simple Man'] |
| {1986: [\'Secret Loser\'], ... | 3 | ["You Can\'t Kill Rock And ... |
| None | None | None |
| {1973: ['Rocking Time']} | 1 | ['Rocking Time'] |
| None | None | None |
| None | None | None |
| {1976: ['Have A Talk With ... | 1 | ['Have A Talk With God'] |
| ... | ... | ... |
+--------------------------------+-----------+--------------------------------+
[230558 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
We can also filter elements from within a list or dictionary. Here is how to remove all songs made in 1994 or 1999:
albums['track_dict'] = albums['track_dict'].dict_trim_by_keys([1994, 1999])
albums
+--------------------------------+--------------------------------+
| artist_name | release |
+--------------------------------+--------------------------------+
| Veruca Salt | Eight Arms To Hold You |
| Les Compagnons De La Chanson | Heritage - Le Chant De Mal ... |
| Nelly / Fat Joe / Young Tr ... | Sweat |
| The Grouch | My Baddest B*tches |
| Ozzy Osbourne | Diary of a madman / Bark a ... |
| Peter Hunnigale | Reggae Hits Vol. 32 |
| Burning Spear | Studio One Classics |
| Bond | New Classix 2008 |
| Lee Coombs feat. Katherine ... | Control |
| Stevie Wonder | Songs In The Key Of Life |
+--------------------------------+--------------------------------+
+--------------------------------+-----------+--------------------------------+
| track_dict | num_years | track_list |
+--------------------------------+-----------+--------------------------------+
| {1997: [\'With David Bowie ... | 2 | [\'Straight\', \'With Davi ... |
| None | None | None |
| {2004: ['Grand Hang Out']} | 1 | ['Grand Hang Out'] |
| {} | 1 | ['Simple Man'] |
| {1986: [\'Secret Loser\'], ... | 3 | ["You Can\'t Kill Rock And ... |
| None | None | None |
| {1973: ['Rocking Time']} | 1 | ['Rocking Time'] |
| None | None | None |
| None | None | None |
| {1976: ['Have A Talk With ... | 1 | ['Have A Talk With God'] |
| ... | ... | ... |
+--------------------------------+-----------+--------------------------------+
[230558 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
Since dictionaries are good for item lookups, we can filter by elements we find in dictionaries. Here is a query that filters albums by whether any of their songs came out in 1965:
albums[albums['track_dict'].dict_has_any_keys(1965)]
+----------------------------+--------------------------------+
| artist_name | release |
+----------------------------+--------------------------------+
| Jr. Walker & The All Stars | The Motown Story: The Sixties |
| The Seekers | The Best Sixties Album In ... |
| Jr. Walker & The All Stars | Sparks Present Motown Made ... |
| The Who | Who's Next |
| Bobby Vinton | The Best Of Bobby Vinton |
| Righteous Brothers | The Collection |
| Little Milton | Chess Blues |
| The Who | Singles Box |
| Dionne Warwick | Here I Am |
| Albert Ayler | Nuits De La Fondation Maeg ... |
+----------------------------+--------------------------------+
+--------------------------------+-----------+--------------------------------+
| track_dict | num_years | track_list |
+--------------------------------+-----------+--------------------------------+
| {1965: ['Shotgun']} | 1 | ['Shotgun'] |
| {1965: ['The Carnival Is O ... | 1 | ['The Carnival Is Over'] |
| {1965: ["Cleo\'s Back"]} | 1 | ["Cleo\'s Back"] |
| {1995: ["I Don\'t Even Kno ... | 4 | [\'My Generation\', \'Behi ... |
| {1991: [\'Please Love Me F ... | 6 | [\'Roses Are Red (My Love) ... |
| {2006: ['Let The Good Time ... | 4 | ['Justine', 'Stagger Lee', ... |
| {1965: ["We\'re Gonna Make ... | 1 | ["We\'re Gonna Make It"] |
| {1965: ['Shout And Shimmy']} | 1 | ['Shout And Shimmy'] |
| {1965: ['How Can I Hurt Yo ... | 1 | ['How Can I Hurt You? (LP ... |
| {1969: ['Music Is The Heal ... | 4 | ['Spirits', 'Spirits Rejoi ... |
+--------------------------------+-----------+--------------------------------+
[? rows x 5 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.
Converting to a list
or dict
doesn't need to group by the rest of the values in
the row. If for some reason you want to turn this into a table with one column
and all values packed into a list, you can!
big_list = albums.pack_columns(albums.column_names())
big_list
+--------------------------------+
| X1 |
+--------------------------------+
| [\'Veruca Salt\', \'Eight ... |
| ['Les Compagnons De La Cha ... |
| ['Nelly / Fat Joe / Young ... |
| ['The Grouch', 'My Baddest ... |
| [\'Ozzy Osbourne\', \'Diar ... |
| ['Peter Hunnigale', 'Regga ... |
| ['Burning Spear', 'Studio ... |
| ['Bond', 'New Classix 2008 ... |
| ['Lee Coombs feat. Katheri ... |
| ['Stevie Wonder', 'Songs I ... |
| ... |
+--------------------------------+
[230558 rows x 1 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
The unpack
function accomplishes the reverse task. These examples may be a
bit contrived, but these functions are very useful when working with
unstructured data like text, as you will be able to see in the Text Analysis
chapter of this guide.
To find out more, check out the API Reference for SFrames and the hands-on exercises at the end of the chapter.