general-01-probability-mass-versus-density.qmd

# Probability mass versus probability density

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
```

When you first encounter continuous distribution, one of the initially confusing things is the concept of probability density that _looks like_ probability of an outcome (but is not) and the fact that it can be any positive value, not just within 0 and 1. To make understanding easier, let us start with a simple concept of _probability mass_. Here, each outcome gets a probability that must be between 0 and 1 and probabilities for all outcomes must add up to 1 (makes these values probability rather than just plausibility). Imagine the simplest case when all events are equally likely. For example, I have four cubes and you must guess the height of the tower that I have built. There are five possible towers (zero-height tower is also a tower, just not a particularly good one) and without any prior knowledge you can assume that each tower height is equally likely: 1/N, where N=5, so 1/5 = 0.2 (or 20%, if you like percentages more). 

There is another way of thinking about this via _cumulative mass function_. It tells a cumulative (total) probability of observing a height of a tower that is equal or smaller than a chosen value. 

```{r echo=FALSE, message=FALSE, warning=FALSE}
cubes_df <- 
  tibble(h = seq(0, 4),
         `Probability mass` = 0.2) %>%
  mutate(`Cumulative probability` = cumsum(`Probability mass`)) %>%
  pivot_longer(cols=c(`Probability mass`, `Cumulative probability`), names_to = "Function", values_to = "Y")

ggplot(cubes_df, aes(x=h, y=Y, color=Function)) + 
  geom_line() +
  geom_point() + 
  scale_y_continuous("Cumulative / probability mass", limits=c(0, 1)) + 
  xlab("Tower height") 
```
As you can see, a cumulative probability of observing a tower of zero height (or lower) is 0.2. We if consider height of 1, it bumps cumulative probability to 0.4 (Pr(height=0) + Pr(height=1) = 0.2 + 0.2 = 0.4). Going up to 2 makes it 0.6, to 3 --- 0.8 and, finally, the cumulative probability of me building a tower of height of 4 _or lower_ is 1 (100%!). The latter includes _all_ possible tower heights, so the probability you observe one of them is 100%. Note that for each height the _probability mass_ tells you by how much the cumulative probability will increase. This probability mass as an change of cumulative probability will become important later.

Note that cumulative probability can only grow and _must_ be between 0 and 1, as all probabilities cannot be negative and must sum up to 1 (otherwise, we call them plausibilities). Below is the same plot but now it includes impossible heights of -1 and 5. Their probabilities are 0, so the cumulative probability does not grow.
```{r echo=FALSE, message=FALSE, warning=FALSE}
cubes_df <- 
  tibble(h = c(-1, seq(0, 4), 5),
         Probability = c(0, rep(0.2, 5), 0)) %>%
  mutate(`Cumulative probability` = cumsum(Probability)) %>%
  pivot_longer(cols=c(Probability, `Cumulative probability`), names_to = "Function", values_to = "Y")

ggplot(cubes_df, aes(x=h, y=Y, color=Function)) + 
  geom_line() +
  geom_point() + 
  scale_y_continuous("Cumulative / probability mass", limits=c(0, 1)) +
  scale_x_continuous("Tower height", breaks=-1:5)
```

But what if I build a "tower" out of  _very fine sand_? For simplicity, let us assume that its height is also within a  0 to 4 (cubes height) range. But now, there are way more heights that my tower can have. If my sand is super fine, as is individual grains are infinitesimally small, there are _infinite_ numbers of possible heights. Knowing that they are all equally likely sort of helps but you cannot compute a probability for each individual height: 1/N, where N=∞ mean 1/∞ ≈ 0. This is the annoying thing about infinities, they make computing things really hard^[People tend to be scared of calculus which is all about limits when things become either infinitesimally small or large.]. So, what do you do if you cannot compute a probability for an individual event/value (height)? You can still compute the _cumulative probability_ of a tower being smaller or equal to a particular height! This is the cumulative function that you saw earlier, but the steps are now so fine that you cannot talk about sums but only about integrals. This is why it is now called the cumulative _density_ function (CDF).
```{r echo=FALSE, message=FALSE, warning=FALSE}

gridN <- 100
height_df <- 
  tibble(h = seq(0, 4, length.out=gridN),
         `Cumulative density` = seq(0, 1, length.out=gridN),
         Function = "Cumulative density")

ggplot(height_df, aes(x=h, y=`Cumulative density`, color=Function)) + 
  geom_line() +
  xlab("Tower height") + 
  ylab("Cumulatie density") 
```

Note that it looks fairly similar to the previous plot with cumulative probability for discrete events. It still grows from 0 to 1 but the steps are much finer. Yet we can still tell the probability of observing a tower of a height that is equal to or less than this. We start of at zero, because you cannot observe towers of negative height, so _all_ tower heights from negative infinity up to zero have a total cumulative probability of 0^[Keep in mind that these values are based on our example of me building a tower. In other circumstances, negative values are possible.]. If we pick the height of 4, then the _cumulative_ probability of observing a tower that is equal-or-less is 1 because _all_ must be that or smaller (that is how we defined it, there cannot be a tower higher than 4!). If we pick the middle of the interval (height = 2), the _cumulative_ probability of observing a tower that is equal-or-smaller is 0.5. This is because we are using a uniform distribution, so half splitting height in half also gives us 50% chance. As you can see, cumulative probability density behaves the same way as the cumulative probability mass in the example above. Both are fairly straightforward, as long as you appreciate that they are about _all_  outcomes up to the point of interest, not just a single event.

What about the probability _density_ function (PDF)? Recall that the probability mass function tells you how much the cumulative probability  _changes_ when you "move to the right" to include the next outcome. Same thing here but for continuous cases a function that describes the rate of change is called a derivative. The formula for our continuous uniform distribution is
$$CDF = \frac{1}{4} \cdot height$$
(note that we are currently only thinking about a range of 0 to 4 to make things simpler). You can check that this is indeed the case by plugging in different heights. Its derivative with respect to height is 
$$\frac{\delta CDF}{\delta height} = \frac{1}{4}$$

Thus, we can now plot both CDF and PDF. 
```{r echo=FALSE}
height_df <-
  tibble(h = seq(0, 4, length.out=100),
         `Cumulative density` = seq(0, 1, length.out=gridN), 
         `Probability density` = 0.25) %>%
  pivot_longer(cols=c(`Probability density`, `Cumulative density`), names_to = "Function", values_to = "Y")

ggplot(height_df, aes(x=h, y=Y, color=Function)) + 
  geom_line() +
  scale_y_continuous("Cumulative / probability mass", limits=c(0, 1)) +
  scale_x_continuous("Tower height", breaks=-1:5)
```
You can reverse the logic and say that the cumulative density function is an _integral_ of PDF, i.e., area under the curve up to each point. So, the total area under PDF (blue line) should end up being 1 (final value of CDF). This is easy to check for a rectangular like that, just multiply its width (height range of 4) by its height (probability density value of 0.25) and get that 1. Note that if you restrict your height=2 then CDF values is 0.5 and the area under blue line (PDF) is also 0.5 (compute!). 

In the example above, probability density is constant at 0.25 and that makes it look "normal", at least not surprising. But what if we restrict my tower-building, so I cannot build anything taller than 0.5 cubes (or meters, units are of little relevance here). Now, the CDF formula is (check!):
$$CDF = 2 \cdot height$$

and for PDF:
$$\frac{\delta CDF}{\delta height} = 2$$

```{r echo=FALSE}
height_df <-
  tibble(h = seq(0, 0.5, length.out=100),
         `Cumulative density` = seq(0, 1, length.out=gridN), 
         `Probability density` = 2) %>%
  pivot_longer(cols=c(`Probability density`, `Cumulative density`), names_to = "Function", values_to = "Y")

ggplot(height_df, aes(x=h, y=Y, color=Function)) + 
  geom_line() +
  scale_y_continuous("Cumulative / probability mass") +
  scale_x_continuous("Tower height")
```

Our constant PDF value is 2! Way above one, how come? Think about the slope and think about the area of rectangle: if its 0.5 units wide it must be 2 units tall to make are equal to 1. The point of this is probability density _is not_ a probability and that you should not care about _absolute_ values of probability density, only about _relative_ values. If you feel that "adding up" values larger than 1 should not give you 1 at the end, yes, it is very counter-intuitive. Integrals, as other things that have to do with infinitely small or large numbers, are counter-intuitive. Thus, ignore the units, ignore the absolute values, just keep in mind that whenever it is higher, the more probable that specific value is. (But, again, there is no way to compute a meaningful probability for a _single_ value).