diff --git a/.github/workflows/generate-pdf.yml b/.github/workflows/generate-pdf.yml index 57dda80..77d0247 100644 --- a/.github/workflows/generate-pdf.yml +++ b/.github/workflows/generate-pdf.yml @@ -64,6 +64,7 @@ jobs: "rl/eae.md" "rl/summary.md" "bayes-nets/index.md" + "bayes-nets/probability.md" "bayes-nets/inference.md" "bayes-nets/representation.md" "bayes-nets/structure.md" @@ -150,6 +151,7 @@ jobs: "pdf_output/rl_eae.pdf" \ "pdf_output/rl_summary.pdf" \ "pdf_output/bayes-nets_index.pdf" \ + "pdf_output/bayes-nets_probability.pdf" \ "pdf_output/bayes-nets_inference.pdf" \ "pdf_output/bayes-nets_representation.pdf" \ "pdf_output/bayes-nets_structure.pdf" \ diff --git a/bayes-nets/approximate.md b/bayes-nets/approximate.md index b2f19b4..f1b3540 100644 --- a/bayes-nets/approximate.md +++ b/bayes-nets/approximate.md @@ -3,6 +3,8 @@ title: "6.7 Approximate Inference in Bayes Nets: Sampling" parent: 6. Bayes Nets nav_order: 7 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.7 Approximate Inference in Bayes Nets: Sampling diff --git a/bayes-nets/d-separation.md b/bayes-nets/d-separation.md index 6bd8785..30b49b7 100644 --- a/bayes-nets/d-separation.md +++ b/bayes-nets/d-separation.md @@ -3,6 +3,8 @@ title: "6.5 D-Separation" parent: 6. Bayes Nets nav_order: 5 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.4 D-Separation @@ -22,44 +24,38 @@ We will present all three canonical cases of connected three-node two-edge Bayes *Figure 2: Causal Chain with Y observed.* Figure 1 is a configuration of three nodes known as a **causal chain**. It expresses the following representation of the joint distribution over $$X$$, $$Y$$, and $$Z$$: -$$ -P(x, y, z) = P(z|y)P(y|x)P(x) -$$ + +$$P(x, y, z) = P(z|y)P(y|x)P(x)$$ + It's important to note that $$X$$ and $$Z$$ are not guaranteed to be independent, as shown by the following counterexample: -$$ -P(y|x) = +$$P(y|x) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{else } - \end{cases} -$$ + \end{cases}$$ -$$ -P(z|y) = +$$P(z|y) = \begin{cases} 1 & \text{if } z = y \\ 0 & \text{else } - \end{cases} -$$ + \end{cases}$$

In this case, $$P(z|x) = 1$$ if $$x = z$$ and $$0$$ otherwise, so $$X$$ and $$Z$$ are not independent. However, we can make the statement that $$X \perp\!\!\!\perp Z | Y$$, as in Figure 2. Recall that this conditional independence means: -$$ -P(X | Z, Y) = P(X | Y) -$$ + +$$P(X | Z, Y) = P(X | Y)$$ + We can prove this statement as follows: -$$ -P(X | Z, y) = \frac{P(X, Z, y)}{P(Z, y)} +$$P(X | Z, y) = \frac{P(X, Z, y)}{P(Z, y)} = \frac{P(Z|y) P(y|X) P(X)}{\sum_{x} P(x, y, Z)} = \frac{P(Z|y) P(y|X) P(X)}{P(Z|y) \sum_{x} P(y|x)P(x)} = \frac{P(y|X) P(X)}{\sum_{x} P(y|x)P(x)} -= P(X|y) -$$ += P(X|y)$$

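For a concrete check of the two claims above, here is a minimal Python sketch of the causal chain with the deterministic counterexample CPTs; the uniform prior on $$X$$ and the variable names are assumptions made for the illustration.

```python
import itertools

# Causal chain X -> Y -> Z with the deterministic counterexample CPTs above:
# P(y|x) = 1 iff y == x and P(z|y) = 1 iff z == y.  P(x) is assumed uniform here.
vals = [0, 1]
P_x = {x: 0.5 for x in vals}
P_y_given_x = {(y, x): 1.0 if y == x else 0.0 for y in vals for x in vals}
P_z_given_y = {(z, y): 1.0 if z == y else 0.0 for z in vals for y in vals}

# Joint from the chain factorization P(x, y, z) = P(z|y) P(y|x) P(x).
joint = {(x, y, z): P_z_given_y[(z, y)] * P_y_given_x[(y, x)] * P_x[x]
         for x, y, z in itertools.product(vals, repeat=3)}

def marginal(keep):
    """Sum the joint over every variable whose index (0=x, 1=y, 2=z) is not kept."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

P_xz, P_xm, P_zm = marginal([0, 2]), marginal([0]), marginal([2])

# Marginally, X and Z are NOT independent: P(x, z) != P(x) P(z) somewhere.
print(any(abs(P_xz[(x, z)] - P_xm[(x,)] * P_zm[(z,)]) > 1e-9
          for x in vals for z in vals))                       # True

# Conditioned on Y = 1, the joint factors: P(x, z | y) = P(x | y) P(z | y).
P_y1 = sum(joint[(x, 1, z)] for x in vals for z in vals)
P_xz_y1 = {(x, z): joint[(x, 1, z)] / P_y1 for x in vals for z in vals}
P_x_y1 = {x: sum(P_xz_y1[(x, z)] for z in vals) for x in vals}
P_z_y1 = {z: sum(P_xz_y1[(x, z)] for x in vals) for z in vals}
print(all(abs(P_xz_y1[(x, z)] - P_x_y1[x] * P_z_y1[z]) < 1e-9
          for x in vals for z in vals))                       # True
```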
@@ -77,26 +73,21 @@ An analogous proof can be used to show the same thing for the case where $$X$$ h Another possible configuration for a triple is the **common cause**. It expresses the following representation: -$$ -P(x, y, z) = P(x|y)P(z|y)P(y) -$$ +$$P(x, y, z) = P(x|y)P(z|y)P(y)$$ Just like with the causal chain, we can show that $$X$$ is not guaranteed to be independent of $$Z$$ with the following counterexample distribution: -$$ -P(x|y) = +$$P(x|y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{else } - \end{cases} -$$ -$$ -P(z|y) = + \end{cases}$$ + +$$P(z|y) = \begin{cases} 1 & \text{if } z = y \\ 0 & \text{else } - \end{cases} -$$ + \end{cases}$$

@@ -105,9 +96,7 @@ Then $$P(x|z) = 1$$ if $$x = z$$ and $$0$$ otherwise, so $$X$$ and $$Z$$ are not

But it is true that $$X \perp\!\!\!\perp Z | Y$$. That is, $$X$$ and $$Z$$ are independent if $$Y$$ is observed as in Figure 4. We can show this as follows: -$$ -P(X | Z, y) = \frac{P(X, Z, y)}{P(Z, y)} = \frac{P(X|y) P(Z|y) P(y)}{P(Z|y) P(y)} = P(X|y) -$$ +$$P(X | Z, y) = \frac{P(X, Z, y)}{P(Z, y)} = \frac{P(X|y) P(Z|y) P(y)}{P(Z|y) P(y)} = P(X|y)$$ ## 6.4.3 Common Effect @@ -121,30 +110,22 @@ $$ It expresses the representation: -$$ -P(x, y, z) = P(y|x,z)P(x)P(z) -$$ +$$P(x, y, z) = P(y|x,z)P(x)P(z)$$ In the configuration shown in Figure 5, $$X$$ and $$Z$$ are independent: $$X \perp\!\!\!\perp Z$$. However, they are not necessarily independent when conditioned on $$Y$$ (Figure 6). As an example, suppose all three are binary variables. $$X$$ and $$Z$$ are true and false with equal probability: -$$ -P(X=true) = P(X=false) = 0.5 -$$ +$$P(X=true) = P(X=false) = 0.5$$ -$$ -P(Z=true) = P(Z=false) = 0.5 -$$ +$$P(Z=true) = P(Z=false) = 0.5$$ and $$Y$$ is determined by whether $$X$$ and $$Z$$ have the same value: -$$ -P(Y | X, Z) = +$$P(Y | X, Z) = \begin{cases} 1 & \text{if } X = Z \text{ and } Y = true \\ 1 & \text{if } X \ne Z \text{ and } Y = false \\ 0 & \text{else} -\end{cases} -$$ +\end{cases}$$ Then $$X$$ and $$Z$$ are independent if $$Y$$ is unobserved. But if $$Y$$ is observed, then knowing $$X$$ will tell us the value of $$Z$$, and vice-versa. So $$X$$ and $$Z$$ are *not* conditionally independent given $$Y$$.

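The common effect case is the one where observing $$Y$$ *creates* a dependence, and it is easy to check numerically. Below is a small Python sketch of the example above (fair and independent $$X$$ and $$Z$$, with $$Y$$ recording whether they agree); the variable names are chosen for the illustration.

```python
import itertools

# Common effect ("v-structure") X -> Y <- Z from the example above:
# X and Z are independent fair coin flips, and Y is true exactly when X == Z.
vals = [True, False]
P_x = {x: 0.5 for x in vals}
P_z = {z: 0.5 for z in vals}
P_y_given_xz = {(y, x, z): 1.0 if (x == z) == y else 0.0
                for y in vals for x in vals for z in vals}

# Joint from the factorization P(x, y, z) = P(y|x,z) P(x) P(z).
joint = {(x, y, z): P_y_given_xz[(y, x, z)] * P_x[x] * P_z[z]
         for x, y, z in itertools.product(vals, repeat=3)}

# With Y unobserved, X and Z are independent: P(x, z) = P(x) P(z) = 0.25 everywhere.
P_xz = {(x, z): sum(joint[(x, y, z)] for y in vals) for x in vals for z in vals}
print(all(abs(P_xz[(x, z)] - 0.25) < 1e-9 for x in vals for z in vals))   # True

# Observing Y = true couples them: given X = x and Y = true, Z must equal x.
for x in vals:
    P_x_and_ytrue = sum(joint[(x, True, z)] for z in vals)
    print(x, joint[(x, True, x)] / P_x_and_ytrue)                         # 1.0
```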
diff --git a/bayes-nets/elimination.md b/bayes-nets/elimination.md index e2c482a..cac521b 100644 --- a/bayes-nets/elimination.md +++ b/bayes-nets/elimination.md @@ -3,6 +3,8 @@ title: '6.6 Exact Inference in Bayes Nets' parent: 6. Bayes Nets nav_order: 6 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.6 Exact Inference in Bayes Nets @@ -54,15 +56,11 @@ Alternatively, we can write $$P(C, +e | T, S)$$, even if this is not guaranteed This approach to writing factors is grounded in repeated applications of the chain rule. In the example above, we know that we can't have a variable on both sides of the conditional bar. Also, we know: -$$ -P(T, C, S, +e) = P(T) P(S | T) P(C | T) P(+e | C, S) = P(S, T) P(C | T) P(+e | C, S) -$$ +$$P(T, C, S, +e) = P(T) P(S | T) P(C | T) P(+e | C, S) = P(S, T) P(C | T) P(+e | C, S)$$ and so: -$$ -P(C | T) P(+e | C, S) = \frac{P(T, C, S, +e)}{P(S, T)} = P(C, +e | T, S) -$$ +$$P(C | T) P(+e | C, S) = \frac{P(T, C, S, +e)}{P(S, T)} = P(C, +e | T, S)$$ While the variable elimination process is more involved conceptually, the maximum size of any factor generated is only 8 rows instead of 16, as it would be if we formed the entire joint PDF. @@ -70,15 +68,11 @@ While the variable elimination process is more involved conceptually, the maximu

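To make the size comparison concrete, here is a rough Python sketch for the network above with invented CPT numbers (only the graph structure, $$T \to S$$, $$T \to C$$, and $$S, C \to E$$, is taken from the text). It computes the unnormalized $$P(T, +e)$$ once by summing the full product over $$S$$ and $$C$$ together, and once by pushing the sum over $$C$$ inside past the factors that do not mention it, which is the rearrangement variable elimination performs.

```python
import itertools

# Invented CPT numbers for the network used above (T -> S, T -> C, and S, C -> E);
# only the graph structure is taken from the text, the probabilities are made up.
vals = [True, False]
P_T = {True: 0.6, False: 0.4}
P_S_given_T = {(True, True): 0.7, (False, True): 0.3,
               (True, False): 0.2, (False, False): 0.8}       # key: (s, t)
P_C_given_T = {(True, True): 0.1, (False, True): 0.9,
               (True, False): 0.5, (False, False): 0.5}       # key: (c, t)
P_e_given_CS = {(True, True): 0.9, (True, False): 0.6,
                (False, True): 0.4, (False, False): 0.1}      # key: (c, s), evidence +e

# Inference by enumeration: for each T, sum the full product over both S and C.
enum = {t: sum(P_T[t] * P_S_given_T[(s, t)] * P_C_given_T[(c, t)] * P_e_given_CS[(c, s)]
               for s, c in itertools.product(vals, repeat=2))
        for t in vals}

# Variable elimination: push each sum inside, past the factors that do not mention it.
f1 = {(t, s): sum(P_C_given_T[(c, t)] * P_e_given_CS[(c, s)] for c in vals)
      for t, s in itertools.product(vals, repeat=2)}          # sum_c P(c|T) P(+e|c,s)
f2 = {t: sum(P_S_given_T[(s, t)] * f1[(t, s)] for s in vals) for t in vals}
ve = {t: P_T[t] * f2[t] for t in vals}

# Both are proportional to P(T | +e); after normalizing they agree.
print({t: enum[t] / sum(enum.values()) for t in vals})
print({t: ve[t] / sum(ve.values()) for t in vals})
```

The intermediate factors `f1` and `f2` range only over $$(T, S)$$ and $$T$$, which is what keeps the factors generated during elimination small relative to the full joint.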
An alternate way of looking at the problem is to observe that the calculation of $$P(T|+e)$$ can either be done through inference by enumeration as follows: -$$ -\alpha \sum_s{\sum_c{P(T)P(s|T)P(c|T)P(+e|c,s)}} -$$ +$$\alpha \sum_s{\sum_c{P(T)P(s|T)P(c|T)P(+e|c,s)}}$$ or by variable elimination as follows: -$$ -\alpha P(T)\sum_s{P(s|T)\sum_c{P(c|T)P(+e|c,s)}} -$$ +$$\alpha P(T)\sum_s{P(s|T)\sum_c{P(c|T)P(+e|c,s)}}$$ We can see that the equations are equivalent, except that in variable elimination we have moved terms that are irrelevant to the summations outside of each summation! diff --git a/bayes-nets/inference.md b/bayes-nets/inference.md index c60febf..e932570 100644 --- a/bayes-nets/inference.md +++ b/bayes-nets/inference.md @@ -3,6 +3,8 @@ title: '6.2 Probability Inference' parent: 6. Bayes Nets nav_order: 2 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.2 Probabilistic Inference diff --git a/bayes-nets/probability.md b/bayes-nets/probability.md index 0d98c52..fc0cf35 100644 --- a/bayes-nets/probability.md +++ b/bayes-nets/probability.md @@ -3,6 +3,8 @@ title: "6.1 Probability Rundown" parent: 6. Bayes Nets nav_order: 1 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.1 Probability Rundown @@ -11,57 +13,47 @@ We're assuming that you've learned the foundations of probability in CS70, so th A **random variable** represents an event whose outcome is unknown. A **probability distribution** is an assignment of weights to outcomes. Probability distributions must satisfy the following conditions: -$$ -0 \leq P(\omega) \leq 1 -$$ -$$ -\sum_{\omega}P(\omega) = 1 -$$ +$$0 \leq P(\omega) \leq 1$$ -For instance, if $$ A $$ is a binary variable (can only take on two values), then $$ P(A = 0) = p $$ and $$ P(A = 1) = 1 - p $$ for some $$ p \in [0,1] $$. +$$\sum_{\omega}P(\omega) = 1$$ + +For instance, if $$A$$ is a binary variable (can only take on two values), then $$P(A = 0) = p$$ and $$P(A = 1) = 1 - p$$ for some $$p \in [0,1]$$. We will use the convention that capital letters refer to random variables and lowercase letters refer to some specific outcome of that random variable. -We use the notation $$ P(A, B, C) $$ to denote the **joint distribution** of the variables $$ A, B, C $$. In joint distributions, ordering does not matter, i.e., $$ P(A, B, C) = P(C, B, A) $$. +We use the notation $$P(A, B, C)$$ to denote the **joint distribution** of the variables $$A, B, C$$. In joint distributions, ordering does not matter, i.e., $$P(A, B, C) = P(C, B, A)$$. We can expand a joint distribution using the **chain rule**, also sometimes referred to as the product rule. -$$ -P(A, B) = P(A | B) P(B) = P(B | A) P(A) -$$ -$$ -P(A_1, A_2, \dots, A_k) = P(A_1) P(A_2 | A_1) \dots P(A_k | A_1, \dots, A_{k-1}) -$$ +$$P(A, B) = P(A | B) P(B) = P(B | A) P(A)$$ + +$$P(A_1, A_2, \dots, A_k) = P(A_1) P(A_2 | A_1) \dots P(A_k | A_1, \dots, A_{k-1})$$ -The **marginal distribution** of $$ A, B $$ can be obtained by summing out all possible values that variable $$ C $$ can take as $$ P(A, B) = \sum_{c}P(A, B, C = c) $$. The marginal distribution of $$ A $$ can also be obtained as $$ P(A) = \sum_{b} \sum_{c}P(A, B = b, C = c) $$. We will also sometimes refer to the process of marginalization as "summing out." +The **marginal distribution** of $$A, B$$ can be obtained by summing out all possible values that variable $$C$$ can take as $$P(A, B) = \sum_{c}P(A, B, C = c)$$. The marginal distribution of $$A$$ can also be obtained as $$P(A) = \sum_{b} \sum_{c}P(A, B = b, C = c)$$. 
We will also sometimes refer to the process of marginalization as "summing out." When we do operations on probability distributions, we sometimes get distributions that do not sum to 1. To fix this, we **normalize**: take the sum of all entries in the distribution and divide each entry by that sum.

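Here is a tiny Python sketch of both operations on a made-up joint table over two binary variables $$A$$ and $$B$$ (the numbers are invented for the illustration): marginalizing $$B$$ out, and normalizing the entries consistent with an observation.

```python
# A made-up joint distribution P(A, B) over two binary variables, for illustration only.
P_AB = {(0, 0): 0.10, (0, 1): 0.30,
        (1, 0): 0.25, (1, 1): 0.35}

# Marginalization ("summing out" B): P(A = a) = sum_b P(A = a, B = b).
P_A = {a: sum(p for (a2, _), p in P_AB.items() if a2 == a) for a in (0, 1)}
print(P_A)                                    # P(A=0) = 0.40, P(A=1) = 0.60

# Selecting the entries consistent with B = 1 gives numbers that sum to 0.65, not 1 ...
unnormalized = {a: P_AB[(a, 1)] for a in (0, 1)}

# ... so we normalize: divide each entry by the sum to obtain P(A | B = 1).
total = sum(unnormalized.values())
P_A_given_B1 = {a: p / total for a, p in unnormalized.items()}
print(P_A_given_B1)                           # roughly {0: 0.462, 1: 0.538}
```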
-**Conditional probabilities** assign probabilities to events conditioned on some known facts. For instance, $$ P(A|B = b) $$ gives the probability distribution of $$ A $$ given that we know the value of $$ B $$ equals $$ b $$. Conditional probabilities are defined as: +**Conditional probabilities** assign probabilities to events conditioned on some known facts. For instance, $$P(A|B = b)$$ gives the probability distribution of $$A$$ given that we know the value of $$B$$ equals $$b$$. Conditional probabilities are defined as: -$$ -P(A|B) = \frac{P(A, B)}{P(B)}. -$$ +$$P(A|B) = \frac{P(A, B)}{P(B)}.$$ Combining the above definition of conditional probability and the chain rule, we get **Bayes' Rule**: -$$ -P(A | B) = \frac{P(B | A) P(A)}{P(B)} -$$ +$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$$ -To write that random variables $$ A $$ and $$ B $$ are **mutually independent**, we write $$ A \perp\!\!\!\perp B $$. This is equivalent to $$ B \perp\!\!\!\perp A $$. +To write that random variables $$A$$ and $$B$$ are **mutually independent**, we write $$A \perp\!\!\!\perp B$$. This is equivalent to $$B \perp\!\!\!\perp A$$.

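As a small numerical sanity check (the numbers below are invented), the following Python snippet computes $$P(A|B)$$ once directly from the definition of conditional probability and once via Bayes' Rule; the two agree, as the derivation above says they must.

```python
# Invented numbers for two Boolean events A and B, purely to illustrate the identities.
P_A = 0.3                        # P(A)
P_B_given_A = 0.8                # P(B | A)
P_B_given_notA = 0.2             # P(B | not A)

# The chain rule and summing out A give the quantities Bayes' Rule needs.
P_AB = P_B_given_A * P_A                                  # P(A, B)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)      # P(B)

# Definition of conditional probability vs. Bayes' Rule: the same number (~0.632).
print(P_AB / P_B)                         # P(A | B) = P(A, B) / P(B)
print(P_B_given_A * P_A / P_B)            # P(A | B) = P(B | A) P(A) / P(B)
```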
-When $$ A $$ and $$ B $$ are mutually independent, $$ P(A, B) = P(A) P(B) $$. An example you can think of is two independent coin flips. You may be familiar with mutual independence as just "independence" in other courses. We can derive from the above equation and the chain rule that $$ P(A | B) = P(A) $$ and $$ P(B | A) = P(B) $$. +When $$A$$ and $$B$$ are mutually independent, $$P(A, B) = P(A) P(B)$$. An example you can think of is two independent coin flips. You may be familiar with mutual independence as just "independence" in other courses. We can derive from the above equation and the chain rule that $$P(A | B) = P(A)$$ and $$P(B | A) = P(B)$$.

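A quick Python check of this, using the two independent fair coin flips mentioned above:

```python
import itertools

# Two independent fair coin flips, as in the example above.
P_A = {'H': 0.5, 'T': 0.5}
P_B = {'H': 0.5, 'T': 0.5}

# Under mutual independence the joint is just the product of the marginals.
P_AB = {(a, b): P_A[a] * P_B[b] for a, b in itertools.product('HT', repeat=2)}

# Hence conditioning on B changes nothing: P(A | B = b) = P(A) for every b.
for b in 'HT':
    P_B_b = sum(P_AB[(a, b)] for a in 'HT')
    print(b, {a: P_AB[(a, b)] / P_B_b for a in 'HT'})    # always {'H': 0.5, 'T': 0.5}
```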
-To write that random variables $$ A $$ and $$ B $$ are **conditionally independent** given another random variable $$ C $$, we write $$ A \perp\!\!\!\perp B | C $$. This is also equivalent to $$ B \perp\!\!\!\perp A | C $$. +To write that random variables $$A$$ and $$B$$ are **conditionally independent** given another random variable $$C$$, we write $$A \perp\!\!\!\perp B | C$$. This is also equivalent to $$B \perp\!\!\!\perp A | C$$.

-If $$ A $$ and $$ B $$ are conditionally independent given $$ C $$, then $$ P(A, B | C) = P(A | C) P(B | C) $$. This means that if we have knowledge about the value of $$ C $$, then $$ B $$ and $$ A $$ do not affect each other. Equivalent to the above definition of conditional independence are the relations $$ P(A | B, C) = P(A | C) $$ and $$ P(B | A, C) = P(B | C) $$. Notice how these three equations are equivalent to the three equations for mutual independence, just with an added conditional on $$ C $$! +If $$A$$ and $$B$$ are conditionally independent given $$C$$, then $$P(A, B | C) = P(A | C) P(B | C)$$. This means that if we have knowledge about the value of $$C$$, then $$B$$ and $$A$$ do not affect each other. Equivalent to the above definition of conditional independence are the relations $$P(A | B, C) = P(A | C)$$ and $$P(B | A, C) = P(B | C)$$. Notice how these three equations are equivalent to the three equations for mutual independence, just with an added conditional on $$C$$! diff --git a/bayes-nets/representation.md b/bayes-nets/representation.md index a4d49c5..a047754 100644 --- a/bayes-nets/representation.md +++ b/bayes-nets/representation.md @@ -3,6 +3,8 @@ title: '6.3 Bayesian Network Representation' parent: 6. Bayes Nets nav_order: 3 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.3 Bayesian Network Representation @@ -40,15 +42,11 @@ In this Bayes Net, we would store probability tables $$P(B)$$, $$P(E)$$, $$P(A | Given all of the CPTs for a graph, we can calculate the probability of a given assignment using the following rule: -$$ -P(X1, X2, ..., Xn) = \prod_{i=1}^n{P(X_i | parents(X_i))} -$$ +$$P(X1, X2, ..., Xn) = \prod_{i=1}^n{P(X_i | parents(X_i))}$$ For the alarm model above, we can actually calculate the probability of a joint probability as follows: -$$ -P(-b, -e, +a, +j, -m) = P(-b) \cdot P(-e) \cdot P(+a | -b, -e) \cdot P(+j | +a) \cdot P(-m | +a) -$$ +$$P(-b, -e, +a, +j, -m) = P(-b) \cdot P(-e) \cdot P(+a | -b, -e) \cdot P(+j | +a) \cdot P(-m | +a)$$ We will see how this relation holds in the next section. diff --git a/bayes-nets/structure.md b/bayes-nets/structure.md index 6f06419..3962564 100644 --- a/bayes-nets/structure.md +++ b/bayes-nets/structure.md @@ -3,6 +3,8 @@ title: '6.4 Structure of Bayes Nets' parent: 6. Bayes Nets nav_order: 4 layout: page +header-includes: + \pagenumbering{gobble} --- # 6.4 Structure of Bayes Nets @@ -19,9 +21,7 @@ In this class, we will refer to two rules for Bayes Net independences that can b Using these tools, we can return to the assertion in the previous section: that we can get the joint distribution of all variables by joining the CPTs of the Bayes Net. -$$ -P(X_1, X_2, \dots, X_n) = \prod_{i=1}^n P(X_i | \text{parents}(X_i)) -$$ +$$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^n P(X_i | \text{parents}(X_i))$$ This relation between the joint distribution and the CPTs of the Bayes net works because of the conditional independence relationships given by the graph. We will prove this using an example. @@ -32,15 +32,11 @@ Let's revisit the previous example. We have the CPTs $$P(B)$$ , $$P(E)$$ , $$P(A For this Bayes net, we are trying to prove the following relation: -$$ -P(B, E, A, J, M) = P(B)P(E)P(A | B, E)P(J | A)P(M | A) -$$ +$$P(B, E, A, J, M) = P(B)P(E)P(A | B, E)P(J | A)P(M | A)$$ We can expand the joint distribution another way: using the chain rule. 
If we expand the joint distribution using a topological ordering (parents before children), we get the following equation: -$$ -P(B, E, A, J, M) = P(B)P(E | B)P(A | B, E)P(J | B, E, A)P(M | B, E, A, J) -$$ +$$P(B, E, A, J, M) = P(B)P(E | B)P(A | B, E)P(J | B, E, A)P(M | B, E, A, J)$$

Notice that in the first equation every variable is represented in a CPT $$P(var | Parents(var))$$, while in the second equation, every variable is represented in a CPT $$P(var | Parents(var), Ancestors(var))$$.
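To see this collapse numerically, here is a short Python sketch; the graph structure ($$B$$ and $$E$$ causing $$A$$, which causes $$J$$ and $$M$$) follows the notes, but the CPT values are invented for the illustration. It builds the joint as the product of the CPTs and then checks that the chain-rule term $$P(J | B, E, A)$$ reduces to the CPT $$P(J | A)$$, which is the conditional-independence step that makes the two expansions equal.

```python
import itertools

# Hypothetical CPT numbers for the alarm network (B -> A <- E, A -> J, A -> M);
# the graph structure follows the notes, the probabilities are invented for illustration.
p_b = 0.1                                              # P(+b)
p_e = 0.2                                              # P(+e)
p_a = {(True, True): 0.95, (True, False): 0.9,
       (False, True): 0.3, (False, False): 0.01}       # P(+a | b, e)
p_j = {True: 0.9, False: 0.05}                         # P(+j | a)
p_m = {True: 0.7, False: 0.01}                         # P(+m | a)

def bern(p_true, value):
    """P(X = value) for a Boolean X with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

# The joint distribution defined by the product of CPTs, as asserted in the notes.
joint = {(b, e, a, j, m):
             bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a)
             * bern(p_j[a], j) * bern(p_m[a], m)
         for b, e, a, j, m in itertools.product([True, False], repeat=5)}

def cond(query_index, given):
    """P(variable at query_index = true | assignments in `given`), read off the joint."""
    den = sum(p for assign, p in joint.items()
              if all(assign[i] == v for i, v in given.items()))
    num = sum(p for assign, p in joint.items()
              if assign[query_index] and all(assign[i] == v for i, v in given.items()))
    return num / den

# In this joint, the chain-rule term P(J | B, E, A) collapses to the CPT P(J | A),
# which is exactly the conditional independence used to equate the two expansions.
print(cond(3, {0: True, 1: False, 2: True}))           # P(+j | +b, -e, +a) -> 0.9
print(cond(3, {2: True}))                              # P(+j | +a)         -> 0.9
```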