Commit

Deploy gregates/gregates.github.io to gregates/gregates.github.io:gh-pages
GitHub Actions committed Dec 2, 2024
0 parents commit f46128d
Showing 28 changed files with 958 additions and 0 deletions.
3 changes: 3 additions & 0 deletions 404.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
<!doctype html>
<title>404 Not Found</title>
<h1>404 Not Found</h1>
1 change: 1 addition & 0 deletions CNAME
@@ -0,0 +1 @@
gregat.es
56 changes: 56 additions & 0 deletions baby-names-rise-of-n/index.html
@@ -0,0 +1,56 @@
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1"><meta property="author" content="Greg Gates"/>
<title>gregat.es</title>
<link href="/g.css" rel="stylesheet" />
</head>

<body>

<h1 class="title">
The rise of -n
</h1>
<p class="subtitle"><strong>2024-04-04</strong></p>
<main class="blog">
<p>One of my favorite public datasets is the baby name data published every year by
the United States Social Security Administration (SSA). When I joined
<a href="https://rowzero.io">Row Zero</a>, one of the first things I did was gather this data
into a single csv (it's published as one csv per year) and upload it to S3 for us
to use as a test dataset. It's nice because it's not too big (less than 8 MiB
gzipped), but it's still too big for Excel since it has more than 2 million rows.</p>
<p>We've done a lot of fun ad hoc analyses of this data set. One of my colleagues produced
a graph of baby name popularity over time by final letter, which prompted me to do
a deeper dive on the popularity of baby boy names ending with the letter n, and
specifically with the suffix -ayden, -aiden, or -aden. I published that on Row Zero's blog.
You can read it here: <a href="https://rowzero.io/blog/baby-names-rise-of-n">The Rise of -n</a>.
That blog post also has a link to a Row Zero workbook containing all the SSA baby name
data from 1880 to 2022, which you can copy to do your own analysis.</p>
<p>Some fun takeaways:</p>
<ul>
<li>The most popular final letter for baby boys in the U.S. has been n since 1963.</li>
<li>The most popular final letter for baby girls in the U.S. has been a since 1935.</li>
<li>Peak popularity of boy names ending in n was in 2011, at 36.4% of all baby boy names
in the SSA data.</li>
<li>Although -n names are broadly popular, the peak in 2011 was driven specifically by
-aiden, -ayden, and -aden names, without which the peak would have been earlier and lower.</li>
<li>Among Gen Z boys in the U.S., 2.6% have names ending in -ayden, -aiden, or -aden. This is a little
bit higher than the popularity of Matthew for Millennials.</li>
</ul>
<p>Again, you can read the full analysis (including graphs!) over at <a href="https://rowzero.io/blog/baby-names-rise-of-n">Row Zero's
blog</a>.</p>

<a href="/">&#171; archive</a>
</main>

<footer>
<a href="https://rowzero.io/home">Row Zero</a>
<a href="https://github.com/gregates">GitHub</a>
<a href="https://www.linkedin.com/in/greg-gates-87482420">LinkedIn</a>
<a href="https://bsky.app/profile/gregat.es">Bluesky</a>
</footer>
</body>

</html>
Binary file added baby_names_pivot_chart.webp
Binary file not shown.
Binary file added baby_names_pivot_config.webp
Binary file not shown.
Binary file added baby_names_sum_of_count_by_year.webp
Binary file not shown.
Binary file added baby_names_sum_of_count_by_year_and_sex.webp
Binary file not shown.
Binary file added baby_names_table.webp
Binary file not shown.
Binary file added baby_names_unique_year_sex.webp
Binary file not shown.
10 changes: 10 additions & 0 deletions elasticlunr.min.js

Large diffs are not rendered by default.

Binary file added excel-loss-of-precision.gif
141 changes: 141 additions & 0 deletions float-like-excel/index.html
@@ -0,0 +1,141 @@
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1"><meta property="og:image" content="&#x2F;excel-loss-of-precision.gif"/><meta property="author" content="Greg Gates"/>
<title>gregat.es</title>
<link href="/g.css" rel="stylesheet" />
</head>

<body>

<h1 class="title">
Float like Excel
</h1>
<p class="subtitle"><strong>2024-11-27</strong></p>
<main class="blog">
<p>Microsoft Excel stores numbers in a binary floating-point format. Specifically, <a href="https://learn.microsoft.com/en-us/office/troubleshoot/excel/floating-point-arithmetic-inaccurate-result">the
documentation</a>
tells us that Excel &quot;was designed around&quot; <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE 754</a> and
uses a version of the binary64 type specified in that document. In the course of building <a href="https://rowzero.io/home">the
world's fastest spreadsheet</a>, I've had occasion to look into some of the
nuances of how Excel handles numbers. In this post, I will explain one way in which Excel's behavior
surprised me. Excel discards more numeric precision than is strictly necessary for the binary64
format (more commonly known as double-precision floating point numbers, or &quot;doubles&quot; for short).</p>
<p>(If you're already an expert on the binary64 format, feel free to skip the next two sections.)</p>
<h2 id="quick-primer-on-binary64">Quick primer on binary64</h2>
<p>The binary64 format encodes numbers in 64 bits. It uses 1 bit for the <em>sign</em>, like any signed number
format. 11 bits are used to encode the <em>exponent</em>, and 52 bits for the <em>significand</em>, sometimes also
called the <em>mantissa</em>.</p>
<p>As an 11-bit unsigned integer, the exponent can range from 0 to 2047. But we want to
support negative exponents, so the intended value is recovered by subtracting 1023, also known as
the <em>bias</em>.</p>
<p>The 52 bits of the significand are used as the fractional part of a number with a leading 1 (except
in the subnormal case, which we'll ignore — Excel explicitly doesn't support it, anyway). We
can get away with using 52 bits for a 53-bit number because the most significant digit of any number
is guaranteed to be non-zero, which in binary means it must be 1. So let's disambiguate: we'll call
the (unsigned) integer represented by the 52 bits the <em>fraction</em>, and the <em>significand</em> is always a
53-bit number, recovered by adding the leading 1, i.e., the significand is 1.<em>fraction</em>. Note that this is a binary
fraction, so multiplying by 2<sup><em>n</em></sup> just shifts the binary point <em>n</em> places to the right, like
multiplying by 10<sup><em>n</em></sup> shifts the decimal point in decimal.</p>
<p>Then we can recover the number encoded in the bits as:</p>
<blockquote>
<p>(-1)<sup><em>sign</em></sup> × <em>significand</em> × 2<sup><em>exponent</em>-1023</sup></p>
</blockquote>
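<p>As a minimal sketch of the formula above, the following Rust program pulls the three fields out of an
<code>f64</code> and rebuilds the value from them. The helper names <code>decode</code> and <code>recover</code>
are made up for this illustration, and <code>recover</code> assumes a normal number (no zeros, subnormals,
infinities, or NaNs).</p>

```rust
/// Splits an f64 into its binary64 fields: (sign, biased exponent, fraction).
fn decode(x: f64) -> (u64, u64, u64) {
    let bits = x.to_bits();
    let sign = bits >> 63; // top bit
    let exponent = (bits >> 52) & 0x7ff; // next 11 bits, still biased
    let fraction = bits & ((1u64 << 52) - 1); // low 52 bits
    (sign, exponent, fraction)
}

/// Rebuilds the value as (-1)^sign * significand * 2^(exponent - 1023).
/// Assumes a normal number, i.e. a biased exponent between 1 and 2046.
fn recover(sign: u64, exponent: u64, fraction: u64) -> f64 {
    // The significand is 1.fraction, i.e. 1 + fraction / 2^52.
    let significand = 1.0 + (fraction as f64) / 2f64.powi(52);
    let sign_factor = if sign == 0 { 1.0 } else { -1.0 };
    sign_factor * significand * 2f64.powi(exponent as i32 - 1023)
}

fn main() {
    // -6.5 is -1.625 * 2^2, so the biased exponent is 2 + 1023 = 1025.
    let (sign, exponent, fraction) = decode(-6.5);
    println!("sign={sign} exponent={exponent} fraction={fraction:#x}");
    assert_eq!(recover(sign, exponent, fraction), -6.5);
}
```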
<p>Let's also define a <em>representable</em> number as one which can be recovered, using this formula, from
a binary64 number.</p>
<p>A few facts that follow from these definitions:</p>
<ol>
<li>Not every number is representable. For example, 1234567890.123456789 — it's not possible to
encode this many significant decimal digits in 53 binary digits.</li>
<li>The largest representable number is 1.7976931348623157e308 (where &quot;e308&quot; is scientific notation
meaning &quot;×10<sup>308</sup>&quot;).</li>
<li>The least representable number is -1.7976931348623157e308.</li>
<li>The smallest positive representable number (again ignoring subnormal numbers) is 2.2250738585072014e-308.</li>
<li>In general, decimal numbers with 15 significant digits or fewer which are neither too large nor
too small <em>are</em> representable, and some numbers with as many as 17 significant decimal digits are representable too.</li>
<li>An example of the last is 2<sup>53</sup>, which is equivalent to 9,007,199,254,740,992 (16
significant digits). This number is just barely too big to fit in the 53 bits of the
significand. But since it's exactly 2<sup>52</sup> × 2<sup>1</sup>, it can be represented that
way.</li>
</ol>
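<p>A few of these facts can be checked directly in Rust. This is only a sketch: <code>round_trips</code> is a
hypothetical helper, and it only tests integers, using the fact that an integer is representable exactly
when converting it to <code>f64</code> and back leaves it unchanged.</p>

```rust
/// Returns true if the integer survives a round trip through f64 unchanged,
/// which is the case exactly when it is representable.
fn round_trips(n: u64) -> bool {
    (n as f64) as u64 == n
}

fn main() {
    // The extremes are exposed as constants on f64.
    assert_eq!(f64::MAX, 1.7976931348623157e308);
    assert_eq!(f64::MIN, -1.7976931348623157e308);
    assert_eq!(f64::MIN_POSITIVE, 2.2250738585072014e-308);

    // Any integer with 15 significant digits fits in 53 bits.
    assert!(round_trips(999_999_999_999_999));

    // 2^53 has 16 significant digits and is representable...
    assert!(round_trips(9_007_199_254_740_992));
    // ...but its odd neighbor 2^53 + 1 is not.
    assert!(!round_trips(9_007_199_254_740_993));
}
```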
<p>So what happens when you have a number that isn't representable, and you need to store it in this
format?</p>
<h2 id="rounding-rules">Rounding rules</h2>
<p>Well, you have no choice but to round. But round to what? The nearest representable number is the
option that introduces the least rounding error — the smallest delta between the number you
want and the number you get. So that's what the IEEE 754 specification says to do. And then you also
need a rule for breaking ties, when there are two equidistant representable numbers. The typical,
but not necessarily required, rule is &quot;round ties to even&quot;, which is more colloquially known as
&quot;banker's rounding&quot;. In binary64, since there's only one even digit, that means if you're
equidistant to two representable numbers, you pick the one with a 0 in the least digit in the
significand.</p>
<p>Here's an example. As noted above, 2<sup>53</sup> is representable. We saw that this was because it
can be represented as a 53-bit number multiplied by 2. This is true for <em>every</em> even integer in the
range 2<sup>53</sup> to 2<sup>54</sup> - 1. But the odd integers are not representable.</p>
<p>So if we try to parse 2<sup>53</sup> + 1, we have to round it either to 2<sup>53</sup> or
2<sup>53</sup> + 2. The former is representable as 2<sup>52</sup> × 2; the latter is
representable as (2<sup>52</sup> + 1) × 2. For the former, the binary significand is a one
followed by 52 zeros. The latter has a one followed by 51 zeros and a trailing one. So the rule says
to choose the former. And this is what we see in, for example, Rust's <code>f64</code> type. The following
program executes successfully:</p>
<pre data-lang="rust" style="background-color:#2b303b;color:#c0c5ce;" class="language-rust "><code class="language-rust" data-lang="rust"><span style="color:#b48ead;">fn </span><span style="color:#8fa1b3;">main</span><span>() {
</span><span> </span><span style="color:#b48ead;">let</span><span> a = </span><span style="color:#d08770;">2</span><span style="color:#b48ead;">f64</span><span>.</span><span style="color:#96b5b4;">powi</span><span>(</span><span style="color:#d08770;">53</span><span>);
</span><span> </span><span style="color:#b48ead;">let</span><span> b = a + </span><span style="color:#d08770;">1.0</span><span>;
</span><span> </span><span style="color:#b48ead;">let</span><span> c = a + </span><span style="color:#d08770;">2.0</span><span>;
</span><span>
</span><span> assert_eq!(a, b);
</span><span> assert_ne!(a, c);
</span><span>}
</span></code></pre>
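<p>The tie-breaking rule does not always round down. As a further sketch in the same spirit,
2<sup>53</sup> + 3 is midway between 2<sup>53</sup> + 2 and 2<sup>53</sup> + 4, and this time the
candidate with a 0 in the last bit of its significand is the larger one. Parsing the decimal string
behaves the same way as the addition.</p>

```rust
fn main() {
    let a = 2f64.powi(53);

    // 2^53 + 2 is (2^52 + 1) * 2: its significand ends in 1.
    // 2^53 + 4 is (2^52 + 2) * 2: its significand ends in 0.
    // So the tie at 2^53 + 3 goes up to 2^53 + 4.
    assert_eq!(a + 3.0, a + 4.0);
    assert_ne!(a + 3.0, a + 2.0);

    // The same rounding applies when parsing 2^53 + 3 from text.
    let parsed: f64 = "9007199254740995".parse().unwrap();
    assert_eq!(parsed, 9007199254740996.0);
}
```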
<h2 id="what-does-excel-do">What does Excel do?</h2>
<p>So we're now in a position to state how Excel diverges from what I'd expect from an implementation
of IEEE 754 binary64 numbers. Excel does not round to the nearest representable number. Instead,
it truncates to 15 significant decimal digits, which as we noted above is guaranteed to be
representable. And it does this even if a number with more than 15 significant decimal digits is
representable without rounding!</p>
<p>For example if you type 9,007,199,254,740,992 (2<sup>53</sup>) into a cell in Excel, what you get
back is 9,007,199,254,740,990. Note the final digit. You get the same result if you enter any number
in the range 9,007,199,254,740,990 to 9,007,199,254,740,999.</p>
<p>This is true even though 9,007,199,254,741,000 is accepted as-is by Excel, and is closer to
9,007,199,254,740,999 than the value it actually rounds to. So this is not rounding — it's
truncation to 15 significant digits.</p>
<p><img src="/excel-loss-of-precision.gif" alt="Animated gif showing Excel parsing the numbers 9,007,199,254,740,989 to 9,007,199,254,741,000, demonstrating the loss of precision." /></p>
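<p>To make the difference concrete, here is a rough model of the behavior described above, written in Rust.
It is a sketch based only on the observations in this post, not Excel's actual implementation. The helpers
<code>truncate_to_15_digits</code> and <code>excel_like_parse</code> are hypothetical, and they only handle
non-negative integers written in plain digits with optional thousands separators.</p>

```rust
/// Hypothetical helper: keeps the first 15 significant digits of a
/// non-negative decimal integer and replaces every later digit with 0.
fn truncate_to_15_digits(input: &str) -> String {
    // Keep only the digits, so thousands separators like "," are ignored.
    let digits: String = input.chars().filter(|c| c.is_ascii_digit()).collect();
    // Leading zeros are not significant.
    let significant = digits.trim_start_matches('0');
    if significant.is_empty() {
        return "0".to_string();
    }
    significant
        .chars()
        .enumerate()
        .map(|(i, c)| if i < 15 { c } else { '0' })
        .collect()
}

/// Truncates first, then parses. The truncated integer is exactly representable.
fn excel_like_parse(input: &str) -> f64 {
    truncate_to_15_digits(input)
        .parse()
        .expect("a string of ASCII digits parses as f64")
}

fn main() {
    for input in ["9,007,199,254,740,992", "9,007,199,254,740,999", "9,007,199,254,741,000"] {
        println!("{input} -> {}", truncate_to_15_digits(input));
    }

    // Round-to-nearest sends ...999 up to ...1000; truncation sends it down to ...990.
    let nearest: f64 = "9007199254740999".parse().unwrap();
    assert_eq!(nearest, 9007199254741000.0);
    assert_eq!(excel_like_parse("9,007,199,254,740,999"), 9007199254740990.0);
}
```

<p>Working on the digit string rather than on an already parsed <code>f64</code> is a deliberate choice in
this sketch: once the input has been rounded to the nearest representable number, the digits that were
originally typed can no longer be recovered.</p>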
<h2 id="pros-and-cons-of-excel-s-behavior">Pros and cons of Excel's behavior</h2>
<p>A consequence of what Excel does is that it introduces a larger overall rounding error. In Rust, if
you subtract 2<sup>53</sup> from 2<sup>53</sup> + 2, you get 2, which is the precise, correct
result. In Excel, you get 0. If you introduce many such errors, and then do, say, a sum over a bunch
of numbers with individual small rounding errors, the total error can add up to be quite large.
This is the reason for rounding to the nearest representable number — to reduce error
introduced by rounding. It's also the <a href="https://stackoverflow.com/questions/45223778/is-bankers-rounding-really-more-numerically-stable">reason to use banker's rounding</a> rather than the more familiar rule we learn in school (round ties away from zero).</p>
<p>So why do what Excel does? Excel's behavior gives you the following property: no number it displays
will contain a significant digit that's different from one you typed. By limiting to 15 significant
decimal digits, it can
guarantee that the truncated number is precisely representable. And by truncating instead of
rounding, it can guarantee that the significant digits that remain are exactly the same as the
input.</p>
<p>I imagine this is the property that they wanted. An Excel user might feel, if they
typed 9,007,199,254,740,993 and got back 9,007,199,254,740,992, that this was a bug. Or, potentially
worse, wouldn't even realize it had changed, and would later wrongly infer that the number they
entered was 9,007,199,254,740,992.</p>
<p>Of course, the actual behavior might also appear to be a bug, but I imagine it is easier to explain
&quot;we only support 15 significant digits of precision&quot; than it is to explain binary64 in all its
complex glory. Is it less surprising for a 3 to become a 0 than a 2? I guess, maybe.</p>
<p>It's worth noting that Google Sheets emulates Excel exactly here. Is that just extreme dedication to
Excel-compatibility? Or is it because they agree that this behavior is desirable?</p>
<p>The cost of this is that Excel sacrifices more precision than is strictly necessary. Personally, I'm
not sure that trade-off is worth it.</p>

<a href="/">&#171; archive</a>
</main>

<footer>
<a href="https://rowzero.io/home">Row Zero</a>
<a href="https://github.com/gregates">GitHub</a>
<a href="https://www.linkedin.com/in/greg-gates-87482420">LinkedIn</a>
<a href="https://bsky.app/profile/gregat.es">Bluesky</a>
</footer>
</body>

</html>
1 change: 1 addition & 0 deletions g.css

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions googled81170c38ea5c4ca.html
@@ -0,0 +1 @@
google-site-verification: googled81170c38ea5c4ca.html
52 changes: 52 additions & 0 deletions index.html
@@ -0,0 +1,52 @@
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1"><meta property="author" content="Greg Gates"/>
<title>gregat.es</title>
<link href="/g.css" rel="stylesheet" />
</head>

<body>

<main>
<div id="index">
<!-- If you are using pagination, section.pages will be empty. You need to use the paginator object -->

<div class="row">
<span>2024-11-27</span><a href="https://gregat.es/float-like-excel/">Float like Excel</a>
</div>

<div class="row">
<span>2024-11-15</span><a href="https://gregat.es/traffic-poem/">Traffic poem</a>
</div>

<div class="row">
<span>2024-07-31</span><a href="https://gregat.es/pivot-shorter-fatter-groupby/">Pivot is just a shorter, fatter group by</a>
</div>

<div class="row">
<span>2024-06-10</span><a href="https://gregat.es/pmtud/">PMTUD: an AWS debugging story</a>
</div>

<div class="row">
<span>2024-05-07</span><a href="https://gregat.es/language-barrier/">Oddities and difficulties in cross-cultural communication</a>
</div>

<div class="row">
<span>2024-04-04</span><a href="https://gregat.es/baby-names-rise-of-n/">The rise of -n</a>
</div>

</div>
</main>

<footer>
<a href="https://rowzero.io/home">Row Zero</a>
<a href="https://github.com/gregates">GitHub</a>
<a href="https://www.linkedin.com/in/greg-gates-87482420">LinkedIn</a>
<a href="https://bsky.app/profile/gregat.es">Bluesky</a>
</footer>
</body>

</html>
54 changes: 54 additions & 0 deletions keybase.txt
@@ -0,0 +1,54 @@
==================================================================
https://keybase.io/gregates
--------------------------------------------------------------------

I hereby claim:

* I am an admin of http://gregat.es
* I am gregates (https://keybase.io/gregates) on keybase.
* I have a public key ASDxGDAaWy6TTn2S10OySwcBg__f-2v9odee2LsNDdHtrwo

To do so, I am signing this object:

{
"body": {
"key": {
"eldest_kid": "0120f118301a5b2e934e7d92d743b24b070183ffdffb6bfda1d79ed8bb0d0dd1edaf0a",
"host": "keybase.io",
"kid": "0120f118301a5b2e934e7d92d743b24b070183ffdffb6bfda1d79ed8bb0d0dd1edaf0a",
"uid": "baa5eee63dc343bdfc303082d9ab2119",
"username": "gregates"
},
"service": {
"hostname": "gregat.es",
"protocol": "http:"
},
"type": "web_service_binding",
"version": 1
},
"client": {
"name": "keybase.io go client",
"version": "1.0.17"
},
"ctime": 1472272673,
"expire_in": 504576000,
"merkle_root": {
"ctime": 1472272527,
"hash": "29ee88f6aa6df77f22e56ad907e352854a1953b8857cc9667fc4538dd55f15d25ffe0004a598950f15f3990ce14a18e780722b5a5fd5d961b973f5ffc3bdc96b",
"seqno": 604678
},
"prev": "26dd3a80b6bd46f1ff82eaed0b91a964936b362bb1d16badd5ade9aa6faead82",
"seqno": 9,
"tag": "signature"
}

which yields the signature:

hKRib2R5hqhkZXRhY2hlZMOpaGFzaF90eXBlCqNrZXnEIwEg8RgwGlsuk059ktdDsksHAYP/3/tr/aHXnti7DQ3R7a8Kp3BheWxvYWTFAvB7ImJvZHkiOnsia2V5Ijp7ImVsZGVzdF9raWQiOiIwMTIwZjExODMwMWE1YjJlOTM0ZTdkOTJkNzQzYjI0YjA3MDE4M2ZmZGZmYjZiZmRhMWQ3OWVkOGJiMGQwZGQxZWRhZjBhIiwiaG9zdCI6ImtleWJhc2UuaW8iLCJraWQiOiIwMTIwZjExODMwMWE1YjJlOTM0ZTdkOTJkNzQzYjI0YjA3MDE4M2ZmZGZmYjZiZmRhMWQ3OWVkOGJiMGQwZGQxZWRhZjBhIiwidWlkIjoiYmFhNWVlZTYzZGMzNDNiZGZjMzAzMDgyZDlhYjIxMTkiLCJ1c2VybmFtZSI6ImdyZWdhdGVzIn0sInNlcnZpY2UiOnsiaG9zdG5hbWUiOiJncmVnYXQuZXMiLCJwcm90b2NvbCI6Imh0dHA6In0sInR5cGUiOiJ3ZWJfc2VydmljZV9iaW5kaW5nIiwidmVyc2lvbiI6MX0sImNsaWVudCI6eyJuYW1lIjoia2V5YmFzZS5pbyBnbyBjbGllbnQiLCJ2ZXJzaW9uIjoiMS4wLjE3In0sImN0aW1lIjoxNDcyMjcyNjczLCJleHBpcmVfaW4iOjUwNDU3NjAwMCwibWVya2xlX3Jvb3QiOnsiY3RpbWUiOjE0NzIyNzI1MjcsImhhc2giOiIyOWVlODhmNmFhNmRmNzdmMjJlNTZhZDkwN2UzNTI4NTRhMTk1M2I4ODU3Y2M5NjY3ZmM0NTM4ZGQ1NWYxNWQyNWZmZTAwMDRhNTk4OTUwZjE1ZjM5OTBjZTE0YTE4ZTc4MDcyMmI1YTVmZDVkOTYxYjk3M2Y1ZmZjM2JkYzk2YiIsInNlcW5vIjo2MDQ2Nzh9LCJwcmV2IjoiMjZkZDNhODBiNmJkNDZmMWZmODJlYWVkMGI5MWE5NjQ5MzZiMzYyYmIxZDE2YmFkZDVhZGU5YWE2ZmFlYWQ4MiIsInNlcW5vIjo5LCJ0YWciOiJzaWduYXR1cmUifaNzaWfEQPeZIhHuiTaRpA5kkLwrkNKIA9QNNAjF09/SbZXYycwCv/UZdN1ygiyahRFsNcdT0ZP+U130AT5yfk+5RtXMLwOoc2lnX3R5cGUgpGhhc2iCpHR5cGUIpXZhbHVlxCDnADSNE1Ma1X32odiBlYig4myZxwH4PAjPLXrqgZcnRKN0YWfNAgKndmVyc2lvbgE=

And finally, I am proving ownership of this host by posting or
appending to this document.

View my publicly-auditable identity here: https://keybase.io/gregates

==================================================================

0 comments on commit f46128d