🎉 Add related research and writing via content graph to data pages #2739

danyx23 · 2023-10-11T13:14:32Z

This PR implements #2379. It adds the missing link in our DB from WordPress posts to the charts that are used in them. It then uses this new posts_links table together with the existing posts_gdocs_links table to find the related writing for a data page by going from indicator id -> charts using this indicator -> articles using these charts.

The posts_links table was modelled on the posts_gdocs_links table, as I thought that uniformity is more important than the optimal layout here. Extracting the links is a bit crude at the moment in that it just runs regexes over the raw HTML instead of parsing the HTML and querying for `<a>` tags. The latter would give us the text content of the element that establishes the link, which is probably often useful, but it would complicate and slow down the script. I'd like to hear your opinions on whether this should switch to proper parsing and fill richer information into the DB.
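
To make the trade-off concrete, here is a minimal sketch of regex-based extraction over raw HTML. The `ExtractedLink` shape, the two patterns and the `extractLinks` helper are assumptions for illustration only; they are not the regexes or the PostLink model used in this PR.

```typescript
// Hypothetical shape for illustration; the real PostLink entity may differ.
interface ExtractedLink {
    target: string
    componentType: "href" | "src"
}

// Illustrative patterns only (assumption): double-quoted href/src attributes.
const hrefPattern = 'href="([^"]+)"'
const srcPattern = 'src="([^"]+)"'

function collectMatches(
    html: string,
    pattern: string,
    componentType: "href" | "src"
): ExtractedLink[] {
    // A fresh global regex per call avoids sharing lastIndex state.
    const regex = new RegExp(pattern, "g")
    const links: ExtractedLink[] = []
    let match: RegExpExecArray | null
    while ((match = regex.exec(html)) !== null) {
        links.push({ target: match[1], componentType })
    }
    return links
}

function extractLinks(html: string): ExtractedLink[] {
    return collectMatches(html, hrefPattern, "href").concat(
        collectMatches(html, srcPattern, "src")
    )
}

const sampleHtml =
    '<p>See <a href="https://ourworldindata.org/grapher/life-expectancy">this chart</a>.</p>' +
    '<iframe src="https://ourworldindata.org/grapher/life-expectancy"></iframe>'
const extracted = extractLinks(sampleHtml)
```

Note that the link text ("this chart") never reaches the result, which is exactly the information a proper HTML parse would recover.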

The thumbnail rendering is also a bit ad-hoc. We have an Image component but that one is built for use in gdocs and we need to show thumbnails for both WP posts and Gdocs articles.

To rank related research and writing we use the pageviews table. This is empty by default in dev environments, so this PR adds a make command to refresh pageviews (fetched from datasette-private).

❗ after merging this to production, run the db/syncPostsToGrapher.js script to fill the new relationship table!

danyx23 · 2023-10-31T08:15:23Z

Current dependencies on/for this PR:

master
- PR Data pages: show chart as full width #2879
  - PR ✨ Data pages: enable rendering of data pages on all eligible data pages, remove old datapage #2848
    - PR 🤹 Data pages: merge Sophia's changes to the about this data section #2880
    - PR 🎉 Add related research and writing via content graph to data pages #2739 👈
      - PR Data page: About this data (& refactor shared components) #2853
        
        PR Build sources modal from shared components #2877
        
        PR Sources modal: Multiple indicators #2886
        
        PR Data page: About this data (updated design) #2914
        
        PR Data page / Sources modal: Bug fixes #2918
        
        PR 🔨 restrict datapages to only charts with vars that have any description* set #2920
        PR 🔨 show default thumbnail if none is available #2922
        PR improve-citation-block #2929
        PR enhance: TextWrap opens and closes every HTML tag #2927
        
        PR fix(datapages): don't bind charts in the all charts block to the window #2921
        PR Data pages: final tweaks #2931

This stack of pull requests is managed by Graphite.

db/model/PostLink.ts

db/syncPostsToGrapher.ts

marcelgerber · 2023-11-06T18:10:43Z

db/syncPostsToGrapher.ts

+    const linksToAdd: PostLink[] = []
+    const linksToDelete: PostLink[] = []
+
+    // This is doing a set difference, but we want to do the set operation on a subset


There's the slight technicality in here that the same link can appear multiple times in the same document, which is something we don't detect here.
This is fine for our current use case. And is probably also fine overall (especially seeing that WP is gonna be dead soon anyways).

If you agree, we can also change out the groupBy calls to keyBy.
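
To illustrate the point about duplicates: keying links by identity (roughly what keyBy does) collapses repeated links before the set difference is taken. This is a sketch under assumptions, with a hypothetical `LinkRow` shape and a plain `Map` standing in for lodash; it is not the code in db/syncPostsToGrapher.ts.

```typescript
// Hypothetical row shape for illustration; field names are assumptions.
interface LinkRow {
    sourceId: number
    target: string
    componentType: string
}

// Composite key; sourceId and componentType contain no "|", so the key is unambiguous.
const linkKey = (link: LinkRow): string =>
    `${link.sourceId}|${link.componentType}|${link.target}`

function indexByKey(links: LinkRow[]): Map<string, LinkRow> {
    const byKey = new Map<string, LinkRow>()
    // Later duplicates overwrite earlier ones, so each key appears once.
    links.forEach((link) => byKey.set(linkKey(link), link))
    return byKey
}

function diffLinks(
    existing: LinkRow[],
    current: LinkRow[]
): { linksToAdd: LinkRow[]; linksToDelete: LinkRow[] } {
    const existingByKey = indexByKey(existing)
    const currentByKey = indexByKey(current)
    const linksToAdd: LinkRow[] = []
    const linksToDelete: LinkRow[] = []
    // In the document now but not yet in the DB: add.
    currentByKey.forEach((link, key) => {
        if (!existingByKey.has(key)) linksToAdd.push(link)
    })
    // In the DB but no longer in the document: delete.
    existingByKey.forEach((link, key) => {
        if (!currentByKey.has(key)) linksToDelete.push(link)
    })
    return { linksToAdd, linksToDelete }
}

const rowA: LinkRow = { sourceId: 1, target: "/grapher/a", componentType: "src" }
const rowB: LinkRow = { sourceId: 1, target: "/grapher/b", componentType: "src" }
const rowC: LinkRow = { sourceId: 1, target: "/grapher/c", componentType: "src" }
// rowC appears twice in the document but is only added once.
const linkDiff = diffLinks([rowA, rowB], [rowB, rowC, rowC])
```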

marcelgerber · 2023-11-06T18:22:23Z

db/wpdb.ts

+            left join charts c on
+                pl.target = c.slug


If I see this right this will only work if there's not a redirect?
And otherwise, chartSlug above will be null, and the join of chart_dimensions will also not work?

danyx23 · Nov 16, 2023

Great catch, thanks!

marcelgerber · 2023-11-06T18:23:06Z

db/wpdb.ts

+                pt.post_id = p.id
+            where
+                pl.linkType = 'grapher'
+                and componentType = 'src' -- this filters out links in tags and keeps only embedded charts


I don't know what that comment means? Can we reword it somehow?

marcelgerber · 2023-11-06T18:24:15Z

db/wpdb.ts

+                and componentType = 'src' -- this filters out links in tags and keeps only embedded charts
+                and cd.variableId = ?
+                and cd.property in ('x', 'y') -- ignore cases where the indicator is size, color etc
+                and p.status = 'publish' -- only use published wp charts


Suggested change

- and p.status = 'publish' -- only use published wp charts
+ and p.status = 'publish' -- only use published wp posts

ikesau

Looks mostly good to me! I tested out the DB queries very gently, but haven't done an E2E test yet.

I'll get to that once comments are addressed 👍

db/model/PostLink.ts

ikesau · 2023-11-06T20:45:05Z

db/refreshPageviewsFromDatasette.ts

+import Papa from "papaparse"
+import * as db from "./db.js"
+
+async function downloadAndInsertCSV(): Promise<void> {


ikesau · 2023-11-06T20:46:04Z

db/refreshPageviewsFromDatasette.ts

+    const response = await fetch(csvUrl)
+
+    if (!response.ok) {
+        throw new Error(`Failed to fetch CSV: ${response.statusText}`)


Maybe handle the case here where the caller isn't on Tailscale (and explain that it's available to team members only?)
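
One way this could look, as a sketch only. It assumes that being off Tailscale makes the request reject with a network error rather than return a non-OK response, which is not confirmed anywhere in this PR. `Fetcher`, `explainFetchFailure` and `fetchCsv` are hypothetical names, and the fetch implementation is passed in so the sketch does not depend on a particular runtime.

```typescript
// Minimal structural types so the sketch does not depend on DOM or Node typings.
type MinimalResponse = {
    ok: boolean
    statusText: string
    text(): Promise<string>
}
type Fetcher = (url: string) => Promise<MinimalResponse>

// Builds an error that tells the caller why the host may be unreachable.
function explainFetchFailure(csvUrl: string, cause: unknown): Error {
    const detail = cause instanceof Error ? cause.message : String(cause)
    return new Error(
        `Could not reach ${csvUrl} (${detail}). ` +
            `This data source is only available to team members connected via Tailscale.`
    )
}

async function fetchCsv(csvUrl: string, fetcher: Fetcher): Promise<string> {
    let response: MinimalResponse
    try {
        response = await fetcher(csvUrl)
    } catch (cause) {
        // Assumption: an unreachable host surfaces as a rejected promise.
        throw explainFetchFailure(csvUrl, cause)
    }
    if (!response.ok) {
        throw new Error(`Failed to fetch CSV: ${response.statusText}`)
    }
    return response.text()
}

const sampleFailure = explainFetchFailure(
    "https://example.invalid/pageviews.csv",
    new Error("ENOTFOUND")
)
```

In a runtime with a global `fetch`, the call would be something like `await fetchCsv(csvUrl, fetch)`.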

ikesau · 2023-11-06T21:07:47Z

db/wpdb.ts

+                pt.post_id = p.id
+            where
+                pl.linkType = 'grapher'
+                and componentType = 'src' -- this filters out links in tags and keeps only embedded charts


(explaining for my own understanding as well 😊)

linkType = 'grapher' is for URLs with paths that match on /^\/grapher\/[\w]+/

componentType = 'src' is for the links that match the anySrcRegex (and not the anyHrefRegex or prominentLinkRegex)

So this clause filters out any links that come from an inline link to a grapher, because we only want to count grapher embeds (i.e. links from iframes: <iframe src="https://ourworldindata.org/grapher/life-expectancy">)
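
The same filter, written out as a small predicate for illustration. The path regex is the one quoted above; the `isGrapherEmbed` helper and the `{ url, componentType }` shape are assumptions, not code from this PR.

```typescript
// Regex quoted in the comment above; it is not anchored at the end,
// so slugs containing hyphens still match on their first word characters.
const grapherPathRegex = /^\/grapher\/[\w]+/

// Hypothetical helper: true only for embedded charts (src), not inline links (href).
function isGrapherEmbed(link: { url: string; componentType: string }): boolean {
    // Strip scheme and host to get the path; good enough for a sketch.
    const path = link.url.replace(/^https?:\/\/[^\/]+/, "")
    return grapherPathRegex.test(path) && link.componentType === "src"
}

const embedResult = isGrapherEmbed({
    url: "https://ourworldindata.org/grapher/life-expectancy",
    componentType: "src",
})
const inlineLinkResult = isGrapherEmbed({
    url: "https://ourworldindata.org/grapher/life-expectancy",
    componentType: "href",
})
const nonGrapherResult = isGrapherEmbed({
    url: "https://ourworldindata.org/blog/grapher/x",
    componentType: "src",
})
```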

ikesau · 2023-11-06T21:18:15Z

db/migration/1692042923850-AddPostsLinks.ts

@@ -0,0 +1,26 @@
+import { MigrationInterface, QueryRunner } from "typeorm"
+
+export class AddPostsLinks1692042923850 implements MigrationInterface {


A tragedy of alphabetization 🥲

ikesau · 2023-11-06T21:48:54Z

db/wpdb.ts

+                and cd.variableId = ?
+                and cd.property in ('x', 'y') -- ignore cases where the indicator is size, color etc
+                and p.status = 'publish' -- only use published wp charts
+                and coalesce(pg.published, 0) = 0 -- if the wp post has a published gdoc successor then ignore it


There are cases where we have a published WP post that is succeeded by a published Gdoc, that isn't linked via gdocSuccessorId (e.g. Topic Pages do this)

So you might want to handle that case too and filter out WP posts that share a slug with a published Gdoc post.
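
As a sketch of that extra filter, done in memory rather than in SQL: drop any WP post whose slug matches a published Gdoc. `WpPostRow` and `excludeSupersededPosts` are hypothetical names, and the PR may well handle this inside the query instead.

```typescript
// Hypothetical row shape for illustration.
interface WpPostRow {
    slug: string
    title: string
}

// Removes WP posts that share a slug with a published Gdoc,
// covering successors that are not linked via gdocSuccessorId.
function excludeSupersededPosts(
    wpPosts: WpPostRow[],
    publishedGdocSlugs: string[]
): WpPostRow[] {
    const gdocSlugs = new Set(publishedGdocSlugs)
    return wpPosts.filter((post) => !gdocSlugs.has(post.slug))
}

const remainingPosts = excludeSupersededPosts(
    [
        { slug: "life-expectancy", title: "Life Expectancy" },
        { slug: "child-mortality", title: "Child Mortality" },
    ],
    ["life-expectancy"]
)
```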

danyx23 · 2023-11-27T11:24:07Z

Merge activity

Nov 27, 6:24 AM: @danyx23 started a stack merge that includes this pull request via Graphite.
Nov 27, 6:25 AM: @danyx23 merged this pull request with Graphite.

danyx23 linked an issue Oct 11, 2023 that may be closed by this pull request

Extend content graph and enable Research and Writing block in data pages #2736

Closed

8 tasks

danyx23 self-assigned this Oct 11, 2023

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 35db5ff to bc82b10 Compare October 13, 2023 11:55

danyx23 force-pushed the data-pages-add-related-research-and-writing branch 2 times, most recently from 2f1c2b0 to f13860d Compare October 25, 2023 10:46

danyx23 changed the base branch from master to data-page-bake-on-all-eligible-charts-from-etl October 31, 2023 08:15

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from f13860d to 72183e8 Compare October 31, 2023 08:15

danyx23 mentioned this pull request Oct 31, 2023

✨ Data pages: enable rendering of data pages on all eligible data pages, remove old datapage #2848

Merged

danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from 2b18987 to 0ff197c Compare October 31, 2023 15:56

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 72183e8 to a429808 Compare October 31, 2023 15:56

danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from 0ff197c to b8fea76 Compare October 31, 2023 17:48

danyx23 mentioned this pull request Oct 31, 2023

Data pages: show chart as full width #2879

Merged

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from a429808 to 9e3bf1e Compare October 31, 2023 17:49

danyx23 changed the base branch from data-page-bake-on-all-eligible-charts-from-etl to data-page-merge-about-this-data October 31, 2023 18:01

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 9e3bf1e to 43a47c2 Compare October 31, 2023 18:01

danyx23 mentioned this pull request Oct 31, 2023

🤹 Data pages: merge Sophia's changes to the about this data section #2880

Closed

danyx23 changed the base branch from data-page-merge-about-this-data to data-page-bake-on-all-eligible-charts-from-etl November 3, 2023 10:41

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 3cf9b1d to 02cea62 Compare November 3, 2023 10:41

This was referenced Nov 3, 2023

Data page: About this data (& refactor shared components) #2853

Merged

Build sources modal from shared components #2877

Merged

Sources modal: Multiple indicators #2886

Merged

sophiamersmann force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from b8fea76 to 69a0883 Compare November 3, 2023 11:15

sophiamersmann force-pushed the data-pages-add-related-research-and-writing branch from 02cea62 to cc073f0 Compare November 3, 2023 11:16

danyx23 marked this pull request as ready for review November 6, 2023 09:33

danyx23 requested a review from ikesau November 6, 2023 09:33

marcelgerber reviewed Nov 6, 2023

ikesau reviewed Nov 6, 2023

danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from 69a0883 to b35fc80 Compare November 8, 2023 09:39

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 20567e7 to e69b9bc Compare November 8, 2023 09:39

danyx23 and others added 24 commits November 24, 2023 17:45

:constrution: tweak full posts link generation, almost complete

365a41d

✨ add updating of PostLink to wp update hook

d681df2

🚧 WIP - query for related research and writing

a80c490

🐛 fix group by

8d4b75e

🎉 start showing related research and writing

63a9737

🔨 add temporary thumbnail rendering

b8f28fd

🐛 fix wordpress authors display

5204cf5

🔨 tweak related research query

6480d6f

This deduplicates by url, sorts authors and only uses full chart embeds and ignores plain links to charts

🤖 style: prettify code

6cd3a2d

🐝 fix lint issues

e9738a9

🔨 add tooling to get pageview data into local mysql

50c0f71

🔨 make sure pageviews as 0 and not null

f7a695c

✨ use thumbnails for wp posts

7ab6aaf

🔨 add tags to content that is retrieved

b726d7f

: hammer: incorporate tags when matching related research

156daeb

🐝 fix accidental commits in launch.json

a8e8f74

🔨 fix filter query

796f8c4

🔨 fix page title fallback to chart tile

f640483

🐛 fix url not showing up in citation

99dc5ba

🔨 hide charts thumbnails in all charts block for single charts

be2a07f

🔨 hard code link redirects from country templates to selector

a07aa34

Simplify find postlink

78c20b5

Co-authored-by: Marcel Gerber <[email protected]>

🔨incorporate feedback

b617889

💄 (lint) remove unused variable

2943fbe

danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from dd9778f to ad1ddab Compare November 24, 2023 16:50

danyx23 force-pushed the data-pages-add-related-research-and-writing branch from f3666b5 to 2943fbe Compare November 24, 2023 16:50

Base automatically changed from data-page-bake-on-all-eligible-charts-from-etl to master November 27, 2023 11:25

danyx23 merged commit cad20ea into master Nov 27, 2023
10 checks passed

danyx23 deleted the data-pages-add-related-research-and-writing branch November 27, 2023 11:25
