json, performance, subqueries, and other fixes
btholt committed Aug 14, 2024
1 parent b9d0dff commit 7e14d72
Showing 10 changed files with 312 additions and 1 deletion.
3 changes: 2 additions & 1 deletion lessons/01-welcome/A-intro.md
@@ -74,4 +74,5 @@ Growing SQLite
- Litestream / Backups
- LiteFS / Network Access
- SQLite vs libSQL
- SQLite Cloud

TODO – subqueries
33 changes: 33 additions & 0 deletions lessons/04-intermediate-sql/E-subqueries.md
@@ -0,0 +1,33 @@
Okay, what if you want to find all customer invoices from a certain email address? Using joins, we could accomplish this:

```sql
SELECT
*
FROM
Invoice i

JOIN
Customer c
ON
c.CustomerId = i.CustomerId

WHERE
c.Email = '[email protected]';
```

This totally works, and if you like this approach, roll with it. No issues here. However, I'm going to show you a second way (mostly to demonstrate subqueries!):

```sql
SELECT
*
FROM
Invoice
WHERE
CustomerId = (
SELECT CustomerId FROM Customer WHERE Email='[email protected]'
);
```

If you wrap a query in parens, you can use it as a subquery. The inner query is run first and its result is fed into the parent query. In this case, we use a subquery to find the CustomerId for the email address. For a one-off query like this, the performance difference isn't important. If this were something run constantly in production and it was slow, I'd analyze the performance of both versions and pick the better one given our data and indexes (stuff we'll talk about later).

In this case you could use a JOIN instead of a subquery, but that's not always possible. It's a good little tool to have in your SQL tool belt.
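For example, a subquery that compares each row against an aggregate isn't something a JOIN alone can express. Here's a quick sketch (assuming the same Chinook database) that finds every invoice larger than the average invoice total:

```sql
SELECT
  *
FROM
  Invoice
WHERE
  Total > (
    SELECT AVG(Total) FROM Invoice
  );
```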
Empty file removed lessons/07-extensions/B-json.md
Empty file.
70 changes: 70 additions & 0 deletions lessons/07-performance/A-explain.md
@@ -0,0 +1,70 @@
---
title: "EXPLAIN"
---

```sql
SELECT * FROM Track WHERE Name = 'Black Dog';
PRAGMA index_list('Track');
EXPLAIN SELECT * FROM Track WHERE Name = 'Black Dog';
EXPLAIN QUERY PLAN SELECT * FROM Track WHERE Name = 'Black Dog';
CREATE INDEX idx_track_name ON Track (Name);
EXPLAIN QUERY PLAN SELECT * FROM Track WHERE Name = 'Black Dog';
PRAGMA index_list('Track');
```

Generally speaking, SQLite is extremely fast, even on large datasets, and it does a great job even with complicated queries over large amounts of data. However, occasionally it can use some help when you have heavy queries that you run frequently. Let's first understand how to look at queries using some fun features built into SQLite.

```sql
SELECT * FROM Track WHERE Name = 'Black Dog';
```

Let's say we are building an interface that frequently needs to look up tracks by their names, so you'd be running queries like this constantly. Right now our Track table has some 3,000 rows in it, but imagine if you had Spotify's database of music. Spotify says it has over 100,000,000 tracks, so that query would get very slow. Let's see a few of the ways that SQLite gives you to inspect your queries.

```sql
EXPLAIN SELECT * FROM Track WHERE Name = 'Black Dog';
```

```
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 17 0 0 Start at 17
1 OpenRead 0 409 0 9 0 root=409 iDb=0; Track
2 Rewind 0 16 0 0
3 Column 0 1 1 0 r[1]= cursor 0 column 1
4 Ne 2 15 1 BINARY-8 82 if r[1]!=r[2] goto 15
5 Rowid 0 3 0 0 r[3]=Track.rowid
6 Column 0 1 4 0 r[4]= cursor 0 column 1
7 Column 0 2 5 0 r[5]= cursor 0 column 2
8 Column 0 3 6 0 r[6]= cursor 0 column 3
9 Column 0 4 7 0 r[7]= cursor 0 column 4
10 Column 0 5 8 0 r[8]= cursor 0 column 5
11 Column 0 6 9 0 r[9]= cursor 0 column 6
12 Column 0 7 10 0 r[10]= cursor 0 column 7
13 Column 0 8 11 0 r[11]= cursor 0 column 8
14 ResultRow 3 9 0 0 output=r[3..11]
15 Next 0 3 0 1
16 Halt 0 0 0 0
17 Transaction 0 0 66 0 1 usesStmtJournal=0
18 String8 0 2 0 Black Dog 0 r[2]='Black Dog'
19 Goto 0 1 0 0
```

I'll be honest, I understand like zero of this. This is what SQLite is doing under the hood. I find the output from Postgres's EXPLAIN to be much more readable and actionable. In any case, it's there and you can analyze it if you want to. I never look at this, so I just wanted to show you that it exists.

Instead, I use this:

```sql
EXPLAIN QUERY PLAN SELECT * FROM Track WHERE Name = 'Black Dog';
-- `--SCAN Track
```

Critically, the word `SCAN` here lets you know that this query is going to look at _every_ row in the table. Now, if you only have 3,000 rows or you only run this query infrequently, who cares, a SCAN is fine. However, if you're Spotify and you're scanning 100,000,000 rows on every search, then yes, you need to do something about this.
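If you want to see the cost in wall-clock terms too, the SQLite CLI has a `.timer` dot-command; with it on, the CLI prints run times after each statement. A small sketch:

```sql
.timer on
SELECT * FROM Track WHERE Name = 'Black Dog';
-- the CLI now prints real/user/sys run times after each query
```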

One more fun trick if you're playing around in the CLI:

```sql
.eqp on
SELECT * FROM Track WHERE Name = 'Black Dog';
```

If you run `.eqp on`, for the rest of your session SQLite will always show you the query plan when it runs a query. It can be kinda cool to see how SQLite chooses to plan queries.
52 changes: 52 additions & 0 deletions lessons/07-performance/B-indexes.md
@@ -0,0 +1,52 @@
Okay, so now we know we have a query we need to optimize. How do we do it? With an index!

An index builds a B-tree (the "B" is commonly said to stand for "balanced") to make lookups much faster. Instead of taking O(n) to look up an item, it takes O(log n), since it can walk a tree to find the item instead of scanning every row in the table. For Spotify's 100,000,000 tracks, that's the difference between examining every row and roughly 27 comparisons, since log₂(100,000,000) ≈ 27.

```sql
CREATE INDEX idx_track_name ON Track (Name);

-- see it's there now
PRAGMA index_list('Track');

EXPLAIN QUERY PLAN SELECT * FROM Track WHERE Name = 'Black Dog';
-- `--SEARCH Track USING INDEX idx_track_name (Name=?)
```

Notice it's a SEARCH now instead of a SCAN. This means it was able to use an index and only look at a subset of the table instead of every row. Hooray!

```sql
EXPLAIN QUERY PLAN SELECT * FROM Track WHERE Name = 'Black Dog';
```

```
-- w/ index
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 19 0 0 Start at 19
1 OpenRead 0 409 0 9 0 root=409 iDb=0; Track
2 OpenRead 1 2 0 k(2,,) 2 root=2 iDb=0; idx_track_name
3 String8 0 1 0 Black Dog 0 r[1]='Black Dog'
4 SeekGE 1 18 1 1 0 key=r[1]
5 IdxGT 1 18 1 1 0 key=r[1]
6 DeferredSeek 1 0 0 0 Move 0 to 1.rowid if needed
7 IdxRowid 1 2 0 0 r[2]=rowid; Track.rowid
8 Column 1 0 3 0 r[3]= cursor 1 column 0
9 Column 0 2 4 0 r[4]= cursor 0 column 2
10 Column 0 3 5 0 r[5]= cursor 0 column 3
11 Column 0 4 6 0 r[6]= cursor 0 column 4
12 Column 0 5 7 0 r[7]= cursor 0 column 5
13 Column 0 6 8 0 r[8]= cursor 0 column 6
14 Column 0 7 9 0 r[9]= cursor 0 column 7
15 Column 0 8 10 0 r[10]= cursor 0 column 8
16 ResultRow 2 9 0 0 output=r[2..10]
17 Next 1 5 1 0
18 Halt 0 0 0 0
19 Transaction 0 0 65 0 1 usesStmtJournal=0
20 Goto 0 1 0 0
```

Again, I have a hard time reading this. You can see that it refers to the index in the comments, so that's a positive sign.

Okay, so let's talk a little bit more about why you may not want to index everything. I've heard the saying that indexes are like aspirin: they're a great help when you have a problem, but if you use too many they become a problem themselves.

Every time you insert into a table that has indexes, SQLite has to do some rebuilding of the indexes to accommodate the new information. Likewise, if you delete, it has to move nodes around the B-tree to keep it balanced. B-trees also take up space, and on large tables that can be a non-trivial amount of space. The trade-off is that indexes help with reads but slow down inserts, updates, and deletes, as well as take up space. In general I wait for a query to become a problem before I try to index it, and even then I try to index only what I need to solve my problem. Premature optimization is generally a bad idea because, as developers, we're pretty bad at guessing what's going to go wrong.
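If an index you created turns out to cost more on writes than it saves on reads, dropping it is easy. A quick sketch using the index from above:

```sql
-- get rid of an index that isn't pulling its weight
DROP INDEX idx_track_name;

-- confirm it's gone
PRAGMA index_list('Track');
```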
File renamed without changes.
69 changes: 69 additions & 0 deletions lessons/08-json/B-querying.md
@@ -0,0 +1,69 @@
---
title: "JSON"
---

The JSON extension ([json1][json1]) for SQLite is the most useful one to me, and one that I load nearly any time I do a project with SQLite. It is extremely useful for application development: it lets you read and write arbitrary data, treat SQLite essentially like a document-based database such as MongoDB where you don't have to define your schema up front and can do so on the fly, and store arrays inside of rows instead of having to do one-to-many relations across tables. There are still reasons you'd want to define a schema and have relations across tables, but this makes things easy for simple use cases where you don't need all the SQL rigor.

Let's install it.

```bash
sqlpkg install sqlite/json1
sqlpkg which sqlite/json1 # copy the output of this
```

Now load your database and run this:

```sql
.load /Users/my-user/.sqlpkg/sqlite/json1/json1.dylib

SELECT json('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}');
```

This should return a JSON object of the string we passed in. As you can see, we can now operate on JSON inside of SQLite! Pretty cool, right? Let's try a few more.

```sql
-- create an array
SELECT json_array(1, 2, 3);

-- get the length of an array
SELECT json_array_length('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}', '$.favorites');

-- get the type of a field in an object
SELECT json_type('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}', '$.username');

-- construct a new object using pairs
SELECT json_object('username', 'btholt', 'favorites', json_array('Daft Punk', 'Radiohead'));
```

These are a bunch of helper methods to help you interact with JSON objects in SQLite. Let's see how to manipulate it.

```sql
-- add a new field
SELECT json_insert('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}', '$.city', 'Sacramento');

-- remove a field
SELECT json_remove('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}', '$.favorites');

-- update a field
SELECT json_replace('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}', '$.username', 'holtbt');
```

## -> and ->>

SQLite provides two convenience operators that it copied from MySQL and Postgres to keep the syntaxes compatible. They allow you to extract specific values from JSON.

```sql
SELECT json('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}') -> 'username';
SELECT json('{"username": "btholt", "name": { "first": "Brian" }, "favorites":["Daft Punk", "Radiohead"]}') -> 'name';
SELECT json('{"username": "btholt", "name": { "first": "Brian" }, "favorites":["Daft Punk", "Radiohead"]}') -> 'name' -> 'first';
```

Notice you can do multiple levels of extraction. Also notice that anything coming back from the -> operator has double quotes. That's because -> still treats the result as JSON so that we can keep using -> to dig further into objects. If we want the value back as text or an integer, we use ->>.

```sql
SELECT json('{"username": "btholt", "name": { "first": "Brian" }, "favorites":["Daft Punk", "Radiohead"]}') -> 'name' ->> 'first';
```

->> lets you get the actual value out rather than JSON. Use it for your last extraction to actually get the data out.
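The operators work with array indexes too. A small sketch reusing the same object, pulling the first favorite out as plain text:

```sql
-- 0 is the array index of the element we want
SELECT json('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}') -> 'favorites' ->> 0;
-- Daft Punk
```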

[json1]: https://sqlite.org/json1.html
45 changes: 45 additions & 0 deletions lessons/08-json/C-jsonb.md
@@ -0,0 +1,45 @@
So far we've been doing everything with the JSON functions, but in reality we want to use JSONB. It's a more compact way to represent JSON and it's faster to work with. It makes everything a little harder to read but is otherwise just better.

```sql
-- notice the b in jsonb
SELECT jsonb('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}');
-- ?1?usernamegbtholt?favorites??Daft Punk?Radiohead
```

It comes out a little hard to read, but ultimately we don't want to read it until it's out of the database, so I'd suggest always using JSONB.
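If you ever do need to eyeball a JSONB value, you can round-trip it through `json()` to get the readable text form back. A small sketch:

```sql
-- json() renders a JSONB blob back out as text
SELECT json(jsonb('{"username": "btholt", "favorites":["Daft Punk", "Radiohead"]}'));
```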

Given that, let's make a table, insert some stuff into it, and write some queries.

```sql
CREATE TABLE users (email, data);

INSERT INTO
users
(email, data)
VALUES
('[email protected]', jsonb('{"favorites":["Daft Punk", "Radiohead"], "name": {"first": "Brian", "last": "Holt"}}')),
('[email protected]', jsonb('{"favorites":["Daft Punk"], "name": {"first": "Bob", "last": "Smith"}}')),
('[email protected]', jsonb('{"admin": true, "favorites":["The Beatles", "Queen"], "name": {"first": "Alice", "last": "Johnson"}}')),
('[email protected]', jsonb('{"favorites":["Nirvana", "Pearl Jam"], "name": {"first": "Charlie", "last": "Brown"}}')),
('[email protected]', jsonb('{"favorites":["Pink Floyd", "Led Zeppelin"], "name": {"first": "Dave", "last": "Wilson"}}')),
('[email protected]', jsonb('{"favorites":["Madonna", "Michael Jackson"], "name": {"first": "Eve", "last": "Davis"}}')),
('[email protected]', jsonb('{"favorites":["Queen", "David Bowie"], "name": {"first": "Frank", "last": "Miller"}}')),
('[email protected]', jsonb('{"favorites":["Radiohead", "Led Zeppelin"], "name": {"first": "Grace", "last": "Lee"}}')),
('[email protected]', jsonb('{"favorites":["U2", "Radiohead"], "name": {"first": "Hank", "last": "Taylor"}}')),
('[email protected]', jsonb('{"favorites":["Adele", "Beyoncé"], "name": {"first": "Ivy", "last": "Anderson"}}')),
('[email protected]', jsonb('{"favorites":["Radiohead", "Muse"], "name": {"first": "Jack", "last": "Thomas"}}')),
('[email protected]', jsonb('{"favorites":["Taylor Swift", "Madonna"], "name": {"first": "Kate", "last": "Martinez"}}')),
('[email protected]', jsonb('{"favorites":["Nirvana", "Daft Punk"], "name": {"first": "Leo", "last": "Garcia"}}'));

-- it's semi-readable but hard to work with directly. **never** modify this by hand, always let SQLite do it
SELECT data from users;

-- get nested data
SELECT data -> 'name' ->> 'first', data -> 'name' ->> 'last' FROM users;

SELECT data -> 'name' ->> 'first', data -> 'name' ->> 'last' FROM users WHERE json_array_length(data, '$.favorites') < 2;
```

Pretty straightforward here. Getting data out of JSON is very similar to reading normal fields.

For the second query, we are asking for all users that have fewer than two favorites. You can use these JSON functions anywhere in a query.
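As an aside, these extractions even play nicely with the indexing we covered earlier: SQLite supports indexes on expressions, so you can index a JSON field like a normal column. A sketch (the index name here is my own invention):

```sql
-- index the extracted first name like a regular column
CREATE INDEX idx_users_first_name ON users (data -> 'name' ->> 'first');

-- a query using the exact same expression can now SEARCH instead of SCAN
EXPLAIN QUERY PLAN SELECT * FROM users WHERE data -> 'name' ->> 'first' = 'Brian';
```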
37 changes: 37 additions & 0 deletions lessons/08-json/D-more-advance-queries.md
@@ -0,0 +1,37 @@
```sql
-- this is counting the most favorited bands
SELECT
COUNT(f.value) AS count, f.value
FROM
users, json_each(data ->> 'favorites') f
GROUP BY
f.value
ORDER BY
count DESC;
```

This query finds the most favorited bands. We are using aggregation and something called a [table-valued function][table-valued-function]. We're using it to make a virtual table of all of the values from the JSON arrays and then counting those to rank the most commonly favorited bands.

Table-valued functions aren't something you'll reach for often, but here one is genuinely useful. Essentially, a table-valued function takes input and generates a virtual table of rows out of it (usually with more or fewer rows than the source table).
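To see what `json_each` produces before the aggregation, you can select from it directly. A small sketch against the same users table:

```sql
-- one row per (user, favorite) pair
SELECT
  users.email, f.value
FROM
  users, json_each(data ->> 'favorites') f;
```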

## Updating JSON

```sql
-- this is how you update json
UPDATE
users
SET
data = json_insert(
    (SELECT data FROM users WHERE email = '[email protected]'),
'$.favorites[#]',
'The xx'
)
WHERE
  email = '[email protected]';
```

Updating JSON can be a bit more difficult. We need the existing JSON to update (here grabbed with a subquery), we use a JSON function to modify it, and then we set the whole thing as the new value. Because there's no real JSON column type and it's just a string (or blob) at the end of the day, we have to set the value holistically each time.

The `[#]` at the end of `'$.favorites[#]'` is special syntax that means "append to the end of the array". You can put a number in there instead if you want to update a specific position.
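As an aside, SQLite lets an UPDATE read the column it's setting, so the subquery above is one way but not the only way, and a numeric index targets a specific element. A sketch against the same table:

```sql
-- same append, referencing the column directly
UPDATE users
SET data = json_insert(data, '$.favorites[#]', 'The xx')
WHERE email = '[email protected]';

-- overwrite the first favorite instead of appending
UPDATE users
SET data = json_replace(data, '$.favorites[0]', 'The xx')
WHERE email = '[email protected]';
```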

[table-valued-function]: https://www.sqlite.org/vtab.html#tabfunc2
4 changes: 4 additions & 0 deletions lessons/08-json/meta.json
@@ -0,0 +1,4 @@
{
"title": "JSON",
"icon": "package"
}
