Skip to content

Latest commit

 

History

History
200 lines (151 loc) · 5.08 KB

dataframes_select.md

File metadata and controls

200 lines (151 loc) · 5.08 KB

Select {#sec:select}

Whereas filter removes rows, select removes columns. However, select is much more versatile than just removing columns, as we will discuss in this section. First, let's create a dataset with multiple columns:

@sco responses()

Here, the data represents answers for five questions (q1, q2, ..., q5) in a given questionnaire. We will start by "selecting" a few columns from this dataset. As usual, we use symbols to specify columns:

s = "select(responses(), :id, :q1)"
sco(s, process=without_caption_label)

We can also use strings if we want:

s = """select(responses(), "id", "q1", "q2")"""
sco(s, process=without_caption_label)

To select everything except one or more columns, use Not with either a single column:

s = """select(responses(), Not(:q5))"""
sco(s, process=without_caption_label)

Or, with multiple columns:

s = """select(responses(), Not([:q4, :q5]))"""
sco(s, process=without_caption_label)

It's also fine to mix and match columns that we want to preserve with columns that we do Not want to select:

s = """select(responses(), :q5, Not(:q5))"""
sco(s, process=without_caption_label)

Note how q5 is now the first column in the DataFrame returned by select. There is a more clever way to achieve the same using :. The colon : can be thought of as "all the columns that we didn't include yet". For example:

s = """select(responses(), :q5, :)"""
sco(s, process=without_caption_label)

Or, to put q5 at the second position1:

s = "select(responses(), 1, :q5, :)"
sco(s, process=without_caption_label)

NOTE: As you might have observed there are several ways to select a column. These are known as column selectors.

We can use:

  • Symbol: select(df, :col)

  • String: select(df, "col")

  • Integer: select(df, 1)

Even renaming columns is possible via select using the source => target pair syntax:

s = """select(responses(), 1 => "participant", :q1 => "age", :q2 => "nationality")"""
sco(s, process=without_caption_label)

Additionally, thanks to the "splat" operator ... (see @sec:splat), we can also write:

s = """
    renames = (1 => "participant", :q1 => "age", :q2 => "nationality")
    select(responses(), renames...)
    """
sco(s, process=without_caption_label)

Types and Missing Data {#sec:missing_data}

Try to combine with transformations

categorical
allowmissing
disallowmissing

As discussed in @sec:load_save, CSV.jl will do its best to guess what kind of types your data have as columns. However, this won't always work perfectly. In this section, we show why suitable types are important and we fix wrong data types. To be more clear about the types, we show the text output for DataFrames instead of a pretty-formatted table. In this section, we work with the following dataset:

@sco process=string post=output_block wrong_types()

Because the date column has the wrong type, sorting won't work correctly:

Whoa! You haven't introduced the reader to sorting with `sort` yet.
s = "sort(wrong_types(), :date)"
scsob(s)

To fix the sorting, we can use the Date module from Julia's standard library as described in @sec:dates:

@sco process=string post=output_block fix_date_column(wrong_types())

Now, sorting will work as intended:

s = """
    df = fix_date_column(wrong_types())
    sort(df, :date)
    """
scsob(s)

For the age column, we have a similar problem:

s = "sort(wrong_types(), :age)"
scsob(s)

This isn't right, because an infant is younger than adults and adolescents. The solution for this issue and any sort of categorical data is to use CategoricalArrays.jl:

using CategoricalArrays

With the CategoricalArrays.jl package, we can add levels that represent the ordering of our categorical variable to our data:

@sco process=string post=output_block fix_age_column(wrong_types())

NOTE: Also note that we are passing the argument ordered=true which tells CategoricalArrays.jl's categorical function that our categorical data is "ordered". Without this any type of sorting or bigger/smaller comparisons would not be possible.

Now, we can sort the data correctly on the age column:

s = """
    df = fix_age_column(wrong_types())
    sort(df, :age)
    """
scsob(s)

Because we have defined convenient functions, we can now define our fixed data by just performing the function calls:

@sco process=string post=output_block correct_types()

Since age in our data is ordinal (ordered=true), we can properly compare categories of age:

s = """
    df = correct_types()
    a = df[1, :age]
    b = df[2, :age]
    a < b
    """
scob(s)

which would give wrong comparisons if the element type were strings:

s = "\"infant\" < \"adult\""
scob(s)

Footnotes

  1. thanks to Sudete on Discourse (https://discourse.julialang.org/t/pull-dataframes-columns-to-the-front/60327/4) for this suggestion.