Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Add related research and writing via content graph to data pages #2739

Merged
merged 25 commits into from
Nov 27, 2023

Conversation

danyx23
Copy link
Contributor

@danyx23 danyx23 commented Oct 11, 2023

This PR implements #2379. It adds the missing link in our db from wordpress posts to charts that are used there. It then uses this new posts_links table together with the existing posts_gdocs_links table to find the related writing for a data page by going from indciator id -> charts using this indicator -> articles using this indicator.

The posts_links table was modelled on the posts_gdocs_links table as I thought that uniformity is more important than the optimal layout here. Extracting the links is a bit crudely done ATM in that it just uses regex's on the raw html tag instead of parsing the html and querying for a tags. The latter would give us the text content of the content that establishes the links which is probably often useful, but it would complicate and slow down the script. I'd like to hear your opinions on whether this should switch to proper parsing and filling richer information into the DB.

The thumbnail rendering is also a bit ad-hoc. We have an Image component but that one is built for use in gdocs and we need to show thumbnails for both WP posts and Gdocs articles.

To rank related research and writing we use the pageviews table. This is empty by default in dev environments and so this PR adds a make command to refresh pageviews (fetched from datasette-private)

  • ❗ after merging this to production, run the db/syncPostsToGrapher.js script to fill the new relationship table!

@danyx23 danyx23 linked an issue Oct 11, 2023 that may be closed by this pull request
8 tasks
@danyx23 danyx23 self-assigned this Oct 11, 2023
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 35db5ff to bc82b10 Compare October 13, 2023 11:55
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch 2 times, most recently from 2f1c2b0 to f13860d Compare October 25, 2023 10:46
@danyx23 danyx23 changed the base branch from master to data-page-bake-on-all-eligible-charts-from-etl October 31, 2023 08:15
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from f13860d to 72183e8 Compare October 31, 2023 08:15
@danyx23 danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from 2b18987 to 0ff197c Compare October 31, 2023 15:56
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 72183e8 to a429808 Compare October 31, 2023 15:56
@danyx23 danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from 0ff197c to b8fea76 Compare October 31, 2023 17:48
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from a429808 to 9e3bf1e Compare October 31, 2023 17:49
@danyx23 danyx23 changed the base branch from data-page-bake-on-all-eligible-charts-from-etl to data-page-merge-about-this-data October 31, 2023 18:01
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 9e3bf1e to 43a47c2 Compare October 31, 2023 18:01
@danyx23 danyx23 changed the base branch from data-page-merge-about-this-data to data-page-bake-on-all-eligible-charts-from-etl November 3, 2023 10:41
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 3cf9b1d to 02cea62 Compare November 3, 2023 10:41
@sophiamersmann sophiamersmann force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from b8fea76 to 69a0883 Compare November 3, 2023 11:15
@sophiamersmann sophiamersmann force-pushed the data-pages-add-related-research-and-writing branch from 02cea62 to cc073f0 Compare November 3, 2023 11:16
@danyx23 danyx23 marked this pull request as ready for review November 6, 2023 09:33
@danyx23 danyx23 requested a review from ikesau November 6, 2023 09:33
db/model/PostLink.ts Show resolved Hide resolved
db/syncPostsToGrapher.ts Outdated Show resolved Hide resolved
const linksToAdd: PostLink[] = []
const linksToDelete: PostLink[] = []

// This is doing a set difference, but we want to do the set operation on a subset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's the slight technicality in here that the same link can appear multiple times in the same document, which is something we don't detect here.
This is fine for our current use case. And is probably also fine overall (especially seeing that WP is gonna be dead soon anyways).

If you agree, we can also change out the groupBy calls to keyBy.

Comment on lines +705 to +711
left join charts c on
pl.target = c.slug
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I see this right this will only work if there's not a redirect?
And otherwise, chartSlug above will be null, and the join of chart_dimensions will also not work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, thanks!

db/wpdb.ts Outdated
pt.post_id = p.id
where
pl.linkType = 'grapher'
and componentType = 'src' -- this filters out links in tags and keeps only embedded charts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what that comment means? Can we reword it somehow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(explaining for my own understanding as well 😊)

linkType = 'grapher' is for URLs with paths that match on /^\/grapher\/[\w]+/

componentType = 'src' is for the links that match the anySrcRegex (and not the anyHrefRegex or prominentLinkRegex

So this clause is filtering out any links that may be from an inline link to a grapher, because we only want to count grapher embeds (i.e. links from iframes - <iframe src="https://ourworldindata.org/grapher/life-expectancy")

db/wpdb.ts Outdated
and componentType = 'src' -- this filters out links in tags and keeps only embedded charts
and cd.variableId = ?
and cd.property in ('x', 'y') -- ignore cases where the indicator is size, color etc
and p.status = 'publish' -- only use published wp charts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and p.status = 'publish' -- only use published wp charts
and p.status = 'publish' -- only use published wp posts

Copy link
Member

@ikesau ikesau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me! I tested out the DB queries very gently, but have done a E2E test yet.

I'll get to that once comments are addressed 👍

db/model/PostLink.ts Show resolved Hide resolved
import Papa from "papaparse"
import * as db from "./db.js"

async function downloadAndInsertCSV(): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 🙌

const response = await fetch(csvUrl)

if (!response.ok) {
throw new Error(`Failed to fetch CSV: ${response.statusText}`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe handle the case here where the caller isn't on Tailscale (and explain that it's available to team members only?)

db/wpdb.ts Outdated
pt.post_id = p.id
where
pl.linkType = 'grapher'
and componentType = 'src' -- this filters out links in tags and keeps only embedded charts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(explaining for my own understanding as well 😊)

linkType = 'grapher' is for URLs with paths that match on /^\/grapher\/[\w]+/

componentType = 'src' is for the links that match the anySrcRegex (and not the anyHrefRegex or prominentLinkRegex

So this clause is filtering out any links that may be from an inline link to a grapher, because we only want to count grapher embeds (i.e. links from iframes - <iframe src="https://ourworldindata.org/grapher/life-expectancy")

@@ -0,0 +1,26 @@
import { MigrationInterface, QueryRunner } from "typeorm"

export class AddPostsLinks1692042923850 implements MigrationInterface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

A tragedy of alphabetization 🥲

db/wpdb.ts Outdated
and cd.variableId = ?
and cd.property in ('x', 'y') -- ignore cases where the indicator is size, color etc
and p.status = 'publish' -- only use published wp charts
and coalesce(pg.published, 0) = 0 -- if the wp post has a published gdoc successor then ignore it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases where we have a published WP post that is succeeded by a published Gdoc, that isn't linked via gdocSuccessorId (e.g. Topic Pages do this)

So you might want to handle that case too and filter out WP posts that share a slug with a published Gdoc post.

@danyx23 danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from 69a0883 to b35fc80 Compare November 8, 2023 09:39
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from 20567e7 to e69b9bc Compare November 8, 2023 09:39
@danyx23 danyx23 force-pushed the data-page-bake-on-all-eligible-charts-from-etl branch from dd9778f to ad1ddab Compare November 24, 2023 16:50
@danyx23 danyx23 force-pushed the data-pages-add-related-research-and-writing branch from f3666b5 to 2943fbe Compare November 24, 2023 16:50
Copy link
Contributor Author

danyx23 commented Nov 27, 2023

Merge activity

  • Nov 27, 6:24 AM: @danyx23 started a stack merge that includes this pull request via Graphite.
  • Nov 27, 6:25 AM: @danyx23 merged this pull request with Graphite.

Base automatically changed from data-page-bake-on-all-eligible-charts-from-etl to master November 27, 2023 11:25
@danyx23 danyx23 merged commit cad20ea into master Nov 27, 2023
10 checks passed
@danyx23 danyx23 deleted the data-pages-add-related-research-and-writing branch November 27, 2023 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend content graph and enable Research and Writing block in data pages
4 participants