-
Notifications
You must be signed in to change notification settings - Fork 7
TreeSlicer: Example 1
In this example we set the sampling proportion (or sampling rate) to 0 during the period from the origin until the first sample is collected. Thus, the sampling proportion is 0 between t0 and t1 (slice x1) and constant between t1 and t2 (slice x2).
Usually no sampling effort was made until (or shortly before) the first sample was collected. For example, in infectious disease outbreaks, there is often a period of cryptic transmission before an outbreak is detected. By default, BDSKY assumes a constant (or at least non-zero) sampling proportion between the origin (start of the epidemic) and the present. During periods where no sampling effort was made, setting a non-zero sampling rate in the model results in biased estimates of other model parameters.
With a small modification the examples below can be adapted to set the sampling proportion to 0 during further periods when no sampling effort was made (e.g. seasonal influenza where samples are only collected during winter months).
Example XML: dengue4_bdsky_equidistant5.xml
It is possible to achieve this result without using TreeSlicer. Change the birth-death skyline tree-prior to:
<distribution id="BirthDeathSkySerial"
spec="beast.evolution.speciation.BirthDeathSkylineModel"
tree="@Tree"
reproductiveNumber="@reproductiveNumber"
becomeUninfectiousRate="@becomeUninfectiousRate"
samplingProportion="@samplingProportion"
origin="@origin">
<samplingRateChangeTimes spec="RealParameter" value="0 38.0000001"/>
<reverseTimeArrays spec="BooleanParameter" value="false false true false false false"/>
</distribution>
samplingRateChangeTimes
is a RealParameter
(vector of floating point numbers), which is used to set the change-point times for the sampling proportion. Here we add a single change-point time at 38.0000001, which is just a little bit more than the time between the most recent sample (1994) and the oldest sample (1956). We also need to add a 0 for a reason I don't quite understand (anything else either throws an error or causes BEAST2 to crash).
We also need to indicate to BDSKY that the sampling proportion change-point times should be reversed, meaning that change-point times are measured as time between the present and the change-point time (heights on the tree), instead of between the origin and the change-point time (default behaviour for BDSKY). We do this using the reverseTimeArrays
array. Sampling proportion change-point times are the third in the array (see BDSKY code/documentation), which is why the third element is set to true
. (Note: since we are also inferring the origin of the epidemic, the time between the origin and the oldest sample keeps changing as the MCMC chain samples the origin. Hence, it is not possible to fix the change-point time to the time of the oldest sample without reversing the time array).
To actually set the first element of the sampling proportion to 0, we set its initial value to 0, where the sampling proportion parameter is declared (since we only operate on the sampling proportion with scale operators, 0 elements will always remain at 0):
<parameter id="samplingProportion" name="stateNode" lower="0.0" upper="1.0">0 0.01</parameter>
There is one problem left. The prior probability is evaluated at the value of every element in the sampling proportion parameter vector. This is usually not an issue (although it is somewhat inelegant to add a prior density on a parameter that is not sampled, it doesn't affect the result of the MCMC analysis), but many distributions are not defined at 0. To get around this we can use beast.core.util.Slice
or ExcludablePrior
(can be found in multitypetree.distributions
or beast.math.distributions
). I prefer to use Slice
, since it is part of the BEAST2 core and it involves less typing when the parameter changes through time. First declare the slice somewhere in the XML file (not inside the MCMC object):
<function spec="beast.core.util.Slice" id="samplingProportionSlice" arg="@samplingProportion" index="1" count="1"/>
This tells BEAST2 to select one element from samplingProportion
, starting at element 1 (Slice
starts numbering at 0). Then, when defining the prior, simply make it point to the slice and not the complete sampling proportion:
<prior id="samplingProportionPrior" name="distribution" x="@samplingProportionSlice">
Example XML file: examples/treeslicer/comparison/dengue4_bdsky_equidistant5.xml
Example XML: dengue4_bdsky_equidistant5_treeslicer.xml
With TreeSlicer, we change samplingRateChangeTimes
to a TreeSlicer object, and explicitly tell it to add one change-point time at the time of the oldest sample:
<distribution id="BirthDeathSkySerial"
spec="beast.evolution.speciation.BirthDeathSkylineModel"
tree="@Tree"
reproductiveNumber="@reproductiveNumber"
becomeUninfectiousRate="@becomeUninfectiousRate"
samplingProportion="@samplingProportion"
origin="@origin">
<samplingRateChangeTimes spec="TreeSlicer" id="SamplingTreeSlice" tree="@Tree"
dimension="2" to="oldestsample" inclusive="true"/>
<reverseTimeArrays spec="BooleanParameter" value="false false true false false false"/>
</distribution>
Note: We still have to set reverseTimeArrays
, set the first element of the sampling proportion to 0 and use Slice
or ExcludablePrior
to ensure the prior probability is not evaluated at 0.
Note: The dimension of TreeSlicer is the number of slices the tree is divided into, which is always equal to the dimension of the parameter (in this case the sampling proportion), or one more than the number of shifts.
We can also log the TreeSlicer object,
<log idref="SamplingTreeSlice"/>
which will record the time between the present and the sampling proportion change-point times (heights on the tree) in the log file. Alternatively, we can directly log the dates of the change-point times with TreeSliceDateLogger
,
<log spec="TreeSliceDateLogger" treeSlice="@SamplingTreeSlice"/>
Unless tip dates are sampled this is not very useful in this scenario, because the sampling times are constant, and thus the change-point times are also constant. Thus, every sample in the log-file will be identical.
-
Security: With TreeSlicer, the time of the oldest sample is part of the data only, not the model. That means the sequencing data in the XML file can be changed, without needing to change the model specification, which makes the code more secure and general. Suppose we add a new sequence sampled at 1996. Without TreeSlicer, we would need to change the sampling proportion change-point time to 40.0000001. Now suppose we discover an older sample, from 1950. We have to change the change-point time again! With TreeSlicer, no changes need to be made, which means there are less opportunities to introduce errors.
-
Ease of use: With TreeSlicer there is no need to calculate the time between the most recent and the oldest sample. It's easy enough in this example (1994 - 1956 = 38). But suppose the most recent sample was from 24 March 2013, and the oldest sample from 29 February 2000. To calculate the time between them you first have to convert the dates to years (with a fraction) and then subtract. In doing so you also have to account for 2000 being a leap-year and ensure that you are doing the calculation in exactly the same way as BEAST2. This can easily become very tedious when it has to be done for several datasets and it is very easy to introduce errors. (In fact it is nearly impossible not to introduce some round-off errors).
-
Add more change-point times: With TreeSlicer it is a simple matter to add multiple change-point times between the time of the oldest sample and the most recent sample. Simply change the model to
<samplingRateChangeTimes spec="TreeSlicer" id="SamplingTreeSlice" tree="@Tree" dimension="5" stop="oldestsample" includeLast="true"/>
to add 3 equidistant shifts between the oldest sample and the present (shown below). Without TreeSlicer, you would need to calculate 38/4 = 9.5 and then change the model to:<samplingRateChangeTimes spec="RealParameter" value="0 9.5 19 28.5 38.0000001"/>
. Again, it is easy to see how this becomes tedious and makes it easy to introduce errors when the month and day of sampling times are known.
Louis du Plessis, 2018