Correct application of lifetime model #1
Comments
Two effects overlap here. First: in a discrete case the inflow happens instantaneously for each time interval, and the question is whether one allocates the outflow during the first year of residence of an inflow quantity to the current year, to the next year, or somewhere in between (a possible reconciliation of the two approaches). Second: both Excel and scipy.stats.norm seem to give not the continuous but the discrete pdf with time interval 1. That means they return the integral of the continuous pdf over one year [which would explain why it is so slow in Python!], and the question is then: for the last year, for the coming year, or for +/- 0.5 years? To make it more complicated, the answer differs depending on whether you take the discrete pdf or the cdf (t +/- 0.5 years for the pdf, -infinity to t for the cdf)! That difference is very likely also the reason why the current code uses the cdf for Weibull but the pdf for normal; it does not mean that both are consistent, which needs to be checked. So when one combines effects 1 and 2, one gets a bunch of different situations under- or overestimating the actual inflow of the continuous case, just as with the Riemann integral: https://en.wikipedia.org/wiki/Riemann_integral
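To spell out the alternatives (my own shorthand, not from the thread), with $F$ denoting the continuous cumulative lifetime distribution, the discrete outflow share booked to year $t$ can be taken as

$p_{\mathrm{last}}(t) = F(t) - F(t-1)$ (the integral over the past year),
$p_{\mathrm{next}}(t) = F(t+1) - F(t)$ (the integral over the coming year), or
$p_{\mathrm{mid}}(t) = F(t+0.5) - F(t-0.5)$ (the integral over $t \pm 0.5$ years);

each is a one-year slice of the continuous pdf, and they differ only in where along the time axis the mass is booked.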
Please check the attached code to see what I mean (note that, at least in Spyder, %timeit can't be run as part of a full script and has to be run as a line or a selection of lines). I added my speed-test results as comments; they were conducted on an i7 Windows 10 laptop with 16 GB of memory.
Second, the “big picture” context: if empirically measured outflow statistics were available, they would be per calendar year. Therefore the outflows we model should also be per calendar year, meaning that the outflows labeled 2016 should cover the interval 2016-2017, not the interval 2015-2016 or the mid-point +/-0.5 years. This also matters for comparison with other statistics: we won't be able to get GDP values for +/-0.5 years.

Third, I ran some simple tests comparing Python's normal and Weibull pdf to various deltas of the cdf: (cdf)-(cdf-1), (cdf+1)-(cdf), and (cdf+0.5)-(cdf-0.5). The first two have positive or negative offsets. The +/-0.5 delta approximates the pdf closely but is not exactly the same as the pdf output. This calls into question whether any of them is actually useful. The code for this is also included in the attached .py file.

Fourth, this may actually be a non-issue, because we are modeling approximate values, and their “accuracy” is virtually meaningless. Even if we have a handful of empirical statistics to compare and calibrate with, those empirical results suffer from uncertainties, averaging, simplifications, rounding, measurement errors... So what does accuracy even mean in this case? Which of the two is underestimated or overestimated? I have a feeling that the “error” of using the -1 or +1 delta instead of the +/-0.5 delta would still be smaller than the error of the -1 delta relative to the empirical measurements.

All this leads me to suggest the following recommendations:
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
# a single year
year = 14
# an array of years
yeararray = np.arange(0,100,1)
# Normal distribution
mean = 60 #feel free to change the parameters
sd = 20
# test the processing time of scipy's normal distribution family:
# for a single year input:
%timeit scipy.stats.norm.pdf(year, loc=mean, scale=sd) #61.3 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.cdf(year, loc=mean, scale=sd) #53 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.sf(year, loc=mean, scale=sd) #51.9 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# compare with:
%timeit scipy.stats.norm(loc=mean, scale=sd).pdf(year) #576 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.norm(loc=mean, scale=sd).cdf(year) #558 µs ± 5.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.norm(loc=mean, scale=sd).sf(year) #565 µs ± 5.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# clearly the first formulation is faster; freezing the distribution object first adds roughly 10x overhead
# for an array of years input:
%timeit scipy.stats.norm.pdf(yeararray, loc=mean, scale=sd) #66.8 µs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd) #56.6 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.norm.sf(yeararray, loc=mean, scale=sd) #57.5 µs ± 947 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# compare with a for loop:
def normal_cdf(yearseries):
    # loop-based version, for speed comparison with the vectorized call
    cdfarray = np.zeros(len(yearseries))
    for i in range(len(yearseries)):
        cdfarray[i] = scipy.stats.norm.cdf(yearseries[i], loc=mean, scale=sd)
    return cdfarray
# ensure that the for loop and the built-in treatment of numpy arrays provide the same results:
assert np.allclose(scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd), normal_cdf(yeararray))
%timeit normal_cdf(yeararray) #5.2 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# this is ~100 times slower than the built-in array treatment, which makes sense because the loop runs 100 times.
# Weibull distribution
k = 2.5 #feel free to change the parameters
l = 50
# test the processing time of scipy's Weibull distribution family:
# for a single year input:
%timeit scipy.stats.weibull_min.pdf(year, k, 0, l) #72.5 µs ± 597 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.cdf(year, k, 0, l) #60.5 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.sf(year, k, 0, l) #61 µs ± 908 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# compare with:
%timeit scipy.stats.weibull_min(k, 0, l).pdf(year) #549 µs ± 6.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.weibull_min(k, 0, l).cdf(year) #547 µs ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit scipy.stats.weibull_min(k, 0, l).sf(year) #542 µs ± 4.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# clearly the first formulation is faster; freezing the distribution object first adds roughly 10x overhead
# for an array of years input:
%timeit scipy.stats.weibull_min.pdf(yeararray, k, 0, l) #87.5 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.cdf(yeararray, k, 0, l) #72.7 µs ± 712 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit scipy.stats.weibull_min.sf(yeararray, k, 0, l) #75.6 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# compare with a for loop:
def weib_cdf(yearseries):
    # loop-based version, for speed comparison with the vectorized call
    cdfarray = np.zeros(len(yearseries))
    for i in range(len(yearseries)):
        cdfarray[i] = scipy.stats.weibull_min.cdf(yearseries[i], k, 0, l)
    return cdfarray
# ensure that the for loop and the built-in treatment of numpy arrays provide the same results:
assert np.allclose(scipy.stats.weibull_min.cdf(yeararray, k, 0, l), weib_cdf(yeararray))
%timeit weib_cdf(yeararray) #6.14 ms ± 56.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# the loop runs 100 times, so it is ~100 times slower than the array calculation.
# discretization comparisons: continuous pdf vs. three cdf deltas
plt.figure()
plt.plot(yeararray, scipy.stats.norm.pdf(yeararray, loc=mean, scale=sd), label="pdf")
plt.plot(yeararray, scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray-1, loc=mean, scale=sd), label="(cdf)-(cdf-1)")
plt.plot(yeararray, scipy.stats.norm.cdf(yeararray+1, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray, loc=mean, scale=sd), label="(cdf+1)-(cdf)")
plt.plot(yeararray, scipy.stats.norm.cdf(yeararray+0.5, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray-0.5, loc=mean, scale=sd), label="(cdf+0.5)-(cdf-0.5)")
plt.legend(loc='best', numpoints=1)
plt.show()
norm_pdf = scipy.stats.norm.pdf(yeararray, loc=mean, scale=sd)
norm_cdf05 = scipy.stats.norm.cdf(yeararray+0.5, loc=mean, scale=sd) - scipy.stats.norm.cdf(yeararray-0.5, loc=mean, scale=sd)
plt.figure()  # new figure, so the Weibull curves do not overplot the normal ones
plt.plot(yeararray, scipy.stats.weibull_min.pdf(yeararray, k, 0, l), label="pdf")
plt.plot(yeararray, scipy.stats.weibull_min.cdf(yeararray, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray-1, k, 0, l), label="(cdf)-(cdf-1)")
plt.plot(yeararray, scipy.stats.weibull_min.cdf(yeararray+1, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray, k, 0, l), label="(cdf+1)-(cdf)")
plt.plot(yeararray, scipy.stats.weibull_min.cdf(yeararray+0.5, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray-0.5, k, 0, l), label="(cdf+0.5)-(cdf-0.5)")
plt.legend(loc='best', numpoints=1)
plt.show()
weib_pdf = scipy.stats.weibull_min.pdf(yeararray, k, 0, l)
weib_cdf05 = scipy.stats.weibull_min.cdf(yeararray+0.5, k, 0, l) - scipy.stats.weibull_min.cdf(yeararray-0.5, k, 0, l)
This is what I ended up doing most of the time: pre-sampling. We could put the distributions or random numbers into an array or dictionary that is created once at the beginning.
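A minimal sketch of that pre-sampling idea (the cache and function names here are hypothetical, not from the repository): evaluate each lifetime distribution once, store the resulting array, and look it up afterwards instead of calling scipy again.

import numpy as np
import scipy.stats

# hypothetical pre-sampling cache: parameter tuple -> survival-function array
sf_cache = {}

def get_sf(mean, sd, n_years):
    # return the survival function at years 0..n_years-1, computing it
    # only on the first request for this parameter combination
    key = (mean, sd, n_years)
    if key not in sf_cache:
        sf_cache[key] = scipy.stats.norm.sf(np.arange(n_years), loc=mean, scale=sd)
    return sf_cache[key]

sf = get_sf(60, 20, 100)  # first call: one scipy evaluation
sf = get_sf(60, 20, 100)  # later calls: dictionary lookup only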
Thanks a lot for checking the different stats methods so carefully! I started revising the dynamic stock methods according to your suggestions. A first version is running on the master branch now, but some of the unit tests need to be rewritten, as we use sf instead of pdf now. I asked Steffi, who is a PhD student in the Freiburg group, to have a look and work on it. Using sf, the inflow-driven model can now be reformulated as s_c(t,c) = i(c) * sf(t,c), where s_c is the stock by cohort, i is the inflow, and sf is the survival function array. The model is quite fast now.
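For readers following along, a minimal self-contained sketch of such an sf-based formulation (array names and lifetime parameters are illustrative, not taken from the repository):

import numpy as np
import scipy.stats

n = 100                       # number of model years = number of cohorts
years = np.arange(n)
i = np.ones(n)                # inflow per cohort (dummy data)

# survival function array sf[t, c]: share of cohort c still present in year t
sf = np.zeros((n, n))
for c in range(n):
    sf[c:, c] = scipy.stats.norm.sf(years[:n - c], loc=60, scale=20)

# stock by cohort, s_c(t,c) = i(c) * sf(t,c):
# np.einsum('c,tc->tc', ...) multiplies the inflow of each cohort c into
# column c of the sf array; no index is summed over here, it is simply a
# broadcast multiplication written in Einstein notation.
s_c = np.einsum('c,tc->tc', i, sf)

s = s_c.sum(axis=1)                  # total stock per year
o_c = np.zeros_like(s_c)
o_c[1:, :] = -np.diff(s_c, axis=0)   # outflow by cohort: decline of each cohort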
In addition, the methods were changed to allow for a non-zero outflow from an age-cohort already in year 0. In practice, that means the methods can handle situations where sf[m,m] != 1. That is important for products with short lifetimes, and for modellers who do not want to use the current calculation method for the sf, which is evaluated at the end of the year and thus assumes that the inflow, which in reality is distributed across the year, happens instantaneously at the end of the year.
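A small illustration of the two conventions (a sketch with arbitrary Weibull parameters, not the repository's code): evaluating the sf at integer ages pins the first value of each cohort to 1, while evaluating at mid-year ages yields a first value below 1, i.e. sf[m,m] != 1.

import numpy as np
import scipy.stats

k, l = 2.5, 50
ages = np.arange(100)

# end-of-year convention: cohort enters at age 0, so its first sf value is
# sf(0) = 1 and no outflow occurs in the inflow year
sf_end = scipy.stats.weibull_min.sf(ages, k, 0, l)
print(sf_end[0])   # 1.0

# mid-year convention: inflow is spread over the year, so evaluate at
# age + 0.5; some outflow already occurs in the first year, sf[m,m] < 1
sf_mid = scipy.stats.weibull_min.sf(ages + 0.5, k, 0, l)
print(sf_mid[0])   # slightly below 1.0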
Excellent, thank you for the quick implementations! The Einstein-sum implementation is very elegant. I'd suggest that, for clarity, this line of code would probably benefit from a small comment explaining the function's syntax; not everyone in the IE and MFA community is familiar with this special syntax, or with the Einstein summation convention in general.
Comment added and pushed! |
For dynamic stock modelling we need to solve 3 problems: