-
-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🎉 Add related research and writing via content graph to data pages #2739
Conversation
35db5ff
to
bc82b10
Compare
2f1c2b0
to
f13860d
Compare
f13860d
to
72183e8
Compare
2b18987
to
0ff197c
Compare
72183e8
to
a429808
Compare
0ff197c
to
b8fea76
Compare
a429808
to
9e3bf1e
Compare
9e3bf1e
to
43a47c2
Compare
3cf9b1d
to
02cea62
Compare
b8fea76
to
69a0883
Compare
02cea62
to
cc073f0
Compare
const linksToAdd: PostLink[] = [] | ||
const linksToDelete: PostLink[] = [] | ||
|
||
// This is doing a set difference, but we want to do the set operation on a subset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's the slight technicality in here that the same link can appear multiple times in the same document, which is something we don't detect here.
This is fine for our current use case. And is probably also fine overall (especially seeing that WP is gonna be dead soon anyways).
If you agree, we can also change out the groupBy
calls to keyBy
.
left join charts c on | ||
pl.target = c.slug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I see this right this will only work if there's not a redirect?
And otherwise, chartSlug
above will be null, and the join of chart_dimensions
will also not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, thanks!
db/wpdb.ts
Outdated
pt.post_id = p.id | ||
where | ||
pl.linkType = 'grapher' | ||
and componentType = 'src' -- this filters out links in tags and keeps only embedded charts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what that comment means? Can we reword it somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(explaining for my own understanding as well 😊)
linkType = 'grapher'
is for URLs with paths that match on /^\/grapher\/[\w]+/
componentType = 'src'
is for the links that match the anySrcRegex
(and not the anyHrefRegex
or prominentLinkRegex
So this clause is filtering out any links that may be from an inline link to a grapher, because we only want to count grapher embeds (i.e. links from iframes - <iframe src="https://ourworldindata.org/grapher/life-expectancy"
)
db/wpdb.ts
Outdated
and componentType = 'src' -- this filters out links in tags and keeps only embedded charts | ||
and cd.variableId = ? | ||
and cd.property in ('x', 'y') -- ignore cases where the indicator is size, color etc | ||
and p.status = 'publish' -- only use published wp charts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and p.status = 'publish' -- only use published wp charts | |
and p.status = 'publish' -- only use published wp posts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good to me! I tested out the DB queries very gently, but have done a E2E test yet.
I'll get to that once comments are addressed 👍
import Papa from "papaparse" | ||
import * as db from "./db.js" | ||
|
||
async function downloadAndInsertCSV(): Promise<void> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! 🙌
db/refreshPageviewsFromDatasette.ts
Outdated
const response = await fetch(csvUrl) | ||
|
||
if (!response.ok) { | ||
throw new Error(`Failed to fetch CSV: ${response.statusText}`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe handle the case here where the caller isn't on Tailscale (and explain that it's available to team members only?)
db/wpdb.ts
Outdated
pt.post_id = p.id | ||
where | ||
pl.linkType = 'grapher' | ||
and componentType = 'src' -- this filters out links in tags and keeps only embedded charts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(explaining for my own understanding as well 😊)
linkType = 'grapher'
is for URLs with paths that match on /^\/grapher\/[\w]+/
componentType = 'src'
is for the links that match the anySrcRegex
(and not the anyHrefRegex
or prominentLinkRegex
So this clause is filtering out any links that may be from an inline link to a grapher, because we only want to count grapher embeds (i.e. links from iframes - <iframe src="https://ourworldindata.org/grapher/life-expectancy"
)
@@ -0,0 +1,26 @@ | |||
import { MigrationInterface, QueryRunner } from "typeorm" | |||
|
|||
export class AddPostsLinks1692042923850 implements MigrationInterface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
db/wpdb.ts
Outdated
and cd.variableId = ? | ||
and cd.property in ('x', 'y') -- ignore cases where the indicator is size, color etc | ||
and p.status = 'publish' -- only use published wp charts | ||
and coalesce(pg.published, 0) = 0 -- if the wp post has a published gdoc successor then ignore it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are cases where we have a published WP post that is succeeded by a published Gdoc, that isn't linked via gdocSuccessorId
(e.g. Topic Pages do this)
So you might want to handle that case too and filter out WP posts that share a slug with a published Gdoc post.
69a0883
to
b35fc80
Compare
20567e7
to
e69b9bc
Compare
This deduplicates by url, sorts authors and only uses full chart embeds and ignores plain links to charts
Co-authored-by: Marcel Gerber <[email protected]>
dd9778f
to
ad1ddab
Compare
f3666b5
to
2943fbe
Compare
This PR implements #2379. It adds the missing link in our db from wordpress posts to charts that are used there. It then uses this new posts_links table together with the existing posts_gdocs_links table to find the related writing for a data page by going from indciator id -> charts using this indicator -> articles using this indicator.
The posts_links table was modelled on the posts_gdocs_links table as I thought that uniformity is more important than the optimal layout here. Extracting the links is a bit crudely done ATM in that it just uses regex's on the raw html tag instead of parsing the html and querying for a tags. The latter would give us the text content of the content that establishes the links which is probably often useful, but it would complicate and slow down the script. I'd like to hear your opinions on whether this should switch to proper parsing and filling richer information into the DB.
The thumbnail rendering is also a bit ad-hoc. We have an Image component but that one is built for use in gdocs and we need to show thumbnails for both WP posts and Gdocs articles.
To rank related research and writing we use the pageviews table. This is empty by default in dev environments and so this PR adds a make command to refresh pageviews (fetched from datasette-private)