-
Notifications
You must be signed in to change notification settings - Fork 7
Multitable API
This document is a draft proposal for supporting multitables in Scrunch. The Crunch API Documentation for multitables can be found at: http://docs.crunch.io/#multitables.
The multitable entities within a Dataset
object can be accessed through the dict-like multitables
property, using the name
of the multitable as the key:
ds = get_dataset('My dataset')
benchmark_mt = ds.multitables['Benchmark']
Each of the items returned by the multitables
property is a Multitable
object (described in more detail later in this document).
As is expected, using an invalid name on the multitables
property raises a KeyError
exception.
Dataset
objects provide a create_multitable
method that allows creating new multitable entities within a crunch dataset. Note that numeric variables can be included, either by their alias alone (in which case unique values from the data will be categorized into their own columns) or they can be automatically grouped into bins using the bin
function as is shown. Here are some examples:
new_mt = ds.create_multitable(
name='new multitable',
template=['bin(age)', 'race', 'gender'],
is_public=True
)
new_mt = ds.create_multitable(
name='new multitable',
template=[
'bin(age)',
'race',
{
'query': 'gender',
'transform': {
'categories': [
{'id': 3, 'name': 'Undisclosed'},
{'id': 2, 'name': 'Male'},
{'id': 1, 'name': 'Female'}
]
}
}
],
is_public=True
)
The arguments to the create_multitable
method are:
-
name
: the name for the new multitable. If the name is invalid, aValueError
exception is raised. -
template
: an iterable of one or more template queries, which define the conditioning variables for the new multitable (see the Template queries section in this document for more details). -
is_public
: a boolean that determines if the new multitable belongs to the dataset (True
) or if it is a personal multitable (False
). The default value isFalse
.
Upon success, the create_multitable
method returns a Multitable
object; otherwise a MultitableCreateError
is raised with further details on the failure.
The Multitable
class provides high-level, convenient helpers for accessing and manipulating a multitable entity from the Crunch API.
For starters, the class provides the following read-only attributes:
-
name
: the name of the multitable, which is unique within a givenDataset
. -
id
: the crunch-generated entity id. -
is_public
: whether the multitable is dataset-owned (True
) or personal (False
). -
owner
: an instance of theUser
object to who owns the multitable. -
template
: the current set of template queries that provide the multitable's conditioning variables.
>>> mt = ds.multitables['Benchmark']
>>> print(mt.id)
abcde1234
>>> print(mt.is_public)
True
>>> print(mt.template)
[
{'query': 'bin(age)'},
{'query': 'race'},
{
'query': 'gender',
'transform': {
'categories': [
{'id': 3, 'name': 'Undisclosed'},
{'id': 2, 'name': 'Male'},
{'id': 1, 'name': 'Female'}
]
}
}
]
The edit
method in the Multitable
class is the means of manipulating multitables in scrunch.
The arguments to the edit
method are:
-
name
: a new name for the multitable. If the name is invalid, aValueError
exception is thrown. -
template
: an iterable of one or more template queries, which will overwrite the current ones for the existing multitable entity (see the Template queries section in this document for more details). -
is_public
: a boolean that determines if the multitable belongs to the dataset (True
) or if it is a personal multitable (False
).
Upon successful edition, edit
returns the current Multitable
object. Else, a MultitableEditError
is thrown with further details of the edit failure.
An example of using the edit
method would be:
mt = ds.multitables['Benchmark']
mt.edit(name='My_Benchmark', template=['bin(age)'], is_public=True)
There are two ways of deleting a Multitable
object:
- Using its
delete
method or its alias remove():
mt = ds.multitables['Benchmark']
mt.delete()
Tab books contain a set of variables in the dataset crosstabbed with a given multitable.
In this regard, the Multitable
class provides the export
method, which receives a path in the filesystem where the exported tab book for the current multitable will be saved. The format
argument can be either 'xlsx'
(default) or 'json'
.
Parameters: http://docs.crunch.io/#post-body-parameters. :path: Local Filesystem path to save the file to :filter: Name of Filter instance of a Dataset's filter :where: list of variables to include; ['varA', 'varB'] :options: Display options as python dictionary :weight: Name of the weight_variable existing in the Dataset
Examples of usage:
mt = ds.multitables['Benchmark']
mt.export(path='/home/User/Benchmark.xlsx')
Or a more detailed export:
resp = mt.export(path='/home/User/Benchmark.xlsx', format='xlsx',
filter=ds.filters['männlich'], where=['languagepreference', 'religion'],
weight='sample_weight')
mt = ds.multitables['Benchmark']
mt.export(path='/home/User/Benchmark.json', format='json')
By default, the export
method selects all variables in the dataset to be crosstabbed with the current multitable. Nevertheless, the select
optional argument can be used to make an explicit selection of the variables to use. The select
argument is expected to be an iterable of the names of the variables to use in the tab book export:
mt = ds.multitables['Benchmark']
mt.export(path='/home/User/Benchmark.xlsx', where=['age', 'gender', 'race'])
The Multitable
class provides the dataframe
method, which returns the multitable as a Pandas DataFrame
object:
df = ds.multitables['Benchmark'].dataframe()
By default, the dataframe
method selects all variables in the dataset to be included in the pandas DataFrame object. The select
optional argument can be used to make an explicit selection of the variables to use. The select
argument is expected to be an iterable of the names of the variables to use in the construction of the DataFrame object:
df = ds.multitables['Benchmark'].dataframe(select=('age', 'gender', 'race'))
Both the Dataset.create_multitable
and the Multitable.edit
methods take a template
parameter which is expected to be an iterable of one or more template queries.
In scrunch, template queries are represented as dictionary objects with a query
element that defines a single conditioning variable expression for the multitable. The query
expression might be a simple variable reference (e.g. {'query': 'race'}
) or a function over a variable (e.g. {'query': 'bin(age)'}
).
Template queries may also contain an optional transform
element that provides a declarative definition of what the conditioning variable's dimension should look like after computation. For example:
new_mt = ds.create_multitable(
name='new multitable',
template=[
{'query': 'bin(age)'},
{'query': 'race'},
{
'query': 'gender',
'transform': {
'categories': [
{'id': 3, 'name': 'Undisclosed'},
{'id': 2, 'name': 'Male'},
{'id': 1, 'name': 'Female'}
]
}
}
]
)
The format of the transform
element is described in http://docs.crunch.io/#transforming-analyses-for-presentation.
For template queries that don't have a transform
element, scrunch supports using simple str
objects to define them, in order to simplify things. With this in mind, the last example can be rewritten as:
new_mt = ds.create_multitable(
name='new multitable',
template=[
'bin(age)',
'race',
{
'query': 'gender',
'transform': {
'categories': [
{'id': 3, 'name': 'Undisclosed'},
{'id': 2, 'name': 'Male'},
{'id': 1, 'name': 'Female'}
]
}
}
]
)
More information on the underlying usage and format of template queries in the crunch API can be found at http://docs.crunch.io/#template-query.