🎉 Add related research and writing via content graph to data pages (#2739)

This PR implements #2379. It adds the missing link in our DB from WordPress posts to the charts used in them. It then uses this new posts_links table together with the existing posts_gdocs_links table to find the related writing for a data page by going from indicator id -> charts using this indicator -> articles embedding those charts.
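
Conceptually the lookup is a two-hop join. A minimal sketch of the idea in TypeScript (not the exact query added in this PR; the join on the chart slug and the returned columns are assumptions):

```ts
import * as db from "./db.js"

// Sketch only: indicator id -> charts using that indicator -> posts linking to
// those charts. The table names exist in our schema, but the join on the chart
// slug (assuming grapher links store the slug in `target`, like posts_gdocs_links)
// and the selected columns are illustrative, not the query this PR adds.
export async function relatedWritingForVariableSketch(variableId: number) {
    return db.knexInstance().raw(
        `SELECT DISTINCT p.id, p.title, p.slug
         FROM chart_dimensions cd
         JOIN charts c ON c.id = cd.chartId
         JOIN posts_links pl
           ON pl.linkType = 'grapher'
          AND pl.target = c.config ->> '$.slug'
         JOIN posts p ON p.id = pl.sourceId
         WHERE cd.variableId = ?`,
        [variableId]
    )
}
```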

The posts_links table was modelled on the posts_gdocs_links table, as I thought uniformity is more important here than an optimal layout. Extracting the links is done a bit crudely at the moment: it just runs regexes over the raw HTML instead of parsing the HTML and querying for <a> tags. The latter would give us the text content of the elements that establish the links, which would probably often be useful, but it would complicate and slow down the script. I'd like to hear your opinions on whether this should switch to proper parsing and fill richer information into the DB.
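
For illustration, the crude approach boils down to something like this (a sketch only; the real extraction lives in db/syncPostsToGrapher.ts and its regexes differ in detail):

```ts
// Sketch of the regex-based extraction over the raw post HTML. It only captures
// the href value, not the anchor text, which is the trade-off discussed above.
const HREF_REGEX = /<a\s[^>]*href="([^"]+)"/g

export function extractLinkedUrls(html: string): string[] {
    return [...html.matchAll(HREF_REGEX)].map((match) => match[1])
}

// Each URL would then become a posts_links row, e.g. via
// PostLink.createFromUrl({ url, sourceId: post.id })
```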

The thumbnail rendering is also a bit ad hoc. We have an Image component, but it is built for use in gdocs, and here we need to show thumbnails for both WP posts and Gdocs articles.

To rank related research and writing we use the pageviews table. This table is empty by default in dev environments, so this PR adds a make command to refresh pageviews (fetched from datasette-private).
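
Once the table is populated (via the new `make refresh.pageviews` target below), ranking is a matter of joining candidates against pageviews and sorting. A rough sketch, where the `views_365d` column and the URL-based join are assumptions about the pageviews schema rather than code from this PR:

```ts
import * as db from "./db.js"

// Sketch only: rank posts by recorded pageviews once the table is populated.
// views_365d and the CONCAT-based join are assumed, not taken from this PR.
export async function topPostsByPageviews(limit: number) {
    return db.knexInstance().raw(
        `SELECT p.slug, p.title, pv.views_365d
         FROM posts p
         JOIN pageviews pv
           ON pv.url = CONCAT('https://ourworldindata.org/', p.slug)
         ORDER BY pv.views_365d DESC
         LIMIT ?`,
        [limit]
    )
}
```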

- [ ] ❗ after merging this to production, run the db/syncPostsToGrapher.js script to fill the new relationship table!
danyx23 authored Nov 27, 2023
2 parents c9619c6 + 2943fbe commit cad20ea
Showing 15 changed files with 586 additions and 58 deletions.
1 change: 1 addition & 0 deletions .eslintignore
@@ -15,3 +15,4 @@ wordpress/web/wp/wp-content/**
wordpress/vendor/**
packages/@ourworldindata/*/dist/
dist/
.vscode/
20 changes: 17 additions & 3 deletions .vscode/launch.json
@@ -13,6 +13,9 @@
"skipFiles": [
"<node_internals>/**"
],
"skipFiles": [
"<node_internals>/**"
],
"type": "node"
},
{
@@ -25,6 +28,10 @@
"${fileBasenameNoExtension}.js",
"--watch"
],
"args": [
"${fileBasenameNoExtension}.js",
"--watch"
],
"console": "integratedTerminal"
// "internalConsoleOptions": "neverOpen"
},
@@ -70,7 +77,7 @@
"skipFiles": [
"<node_internals>/**"
],
"type": "node"
"type": "node",
},
{
"name": "Run SVGTester",
@@ -79,17 +86,24 @@
"skipFiles": [
"<node_internals>/**"
],
"skipFiles": [
"<node_internals>/**"
],
"type": "node",
"args": [
"-g",
"367"
]
"args": [
"-g",
"367"
]
},
{
"name": "Launch admin server",
"program": "${workspaceFolder}/itsJustJavascript/adminSiteServer/app.js",
"request": "launch",
"type": "node"
"type": "node",
},
{
"name": "Attach to node",
@@ -115,4 +129,4 @@
"port": 9000
}
]
}
}
31 changes: 18 additions & 13 deletions Makefile
@@ -24,23 +24,24 @@ help:
@echo 'Available commands:'
@echo
@echo ' GRAPHER ONLY'
@echo ' make up start dev environment via docker-compose and tmux'
@echo ' make down stop any services still running'
@echo ' make refresh (while up) download a new grapher snapshot and update MySQL'
@echo ' make migrate (while up) run any outstanding db migrations'
@echo ' make test run full suite (except db tests) of CI checks including unit tests'
@echo ' make dbtest run db test suite that needs a running mysql db'
@echo ' make svgtest compare current rendering against reference SVGs'
@echo ' make up start dev environment via docker-compose and tmux'
@echo ' make down stop any services still running'
@echo ' make refresh (while up) download a new grapher snapshot and update MySQL'
@echo ' make refresh.pageviews (while up) download and load pageviews from the private datasette instance'
@echo ' make migrate (while up) run any outstanding db migrations'
@echo ' make test run full suite (except db tests) of CI checks including unit tests'
@echo ' make dbtest run db test suite that needs a running mysql db'
@echo ' make svgtest compare current rendering against reference SVGs'
@echo
@echo ' GRAPHER + WORDPRESS (staff-only)'
@echo ' make up.full start dev environment via docker-compose and tmux'
@echo ' make down.full stop any services still running'
@echo ' make refresh.wp download a new wordpress snapshot and update MySQL'
@echo ' make refresh.full do a full MySQL update of both wordpress and grapher'
@echo ' make up.full start dev environment via docker-compose and tmux'
@echo ' make down.full stop any services still running'
@echo ' make refresh.wp download a new wordpress snapshot and update MySQL'
@echo ' make refresh.full do a full MySQL update of both wordpress and grapher'
@echo
@echo ' OPS (staff-only)'
@echo ' make deploy Deploy your local site to production'
@echo ' make stage Deploy your local site to staging'
@echo ' make deploy Deploy your local site to production'
@echo ' make stage Deploy your local site to staging'
@echo

up: export DEBUG = 'knex:query'
@@ -136,6 +137,10 @@ refresh:
@echo '==> Updating grapher database'
@. ./.env && DATA_FOLDER=tmp-downloads ./devTools/docker/refresh-grapher-data.sh

refresh.pageviews:
@echo '==> Refreshing pageviews'
yarn && yarn buildTsc && yarn refreshPageviews

refresh.wp:
@echo '==> Downloading wordpress data'
./devTools/docker/download-wordpress-mysql.sh
7 changes: 6 additions & 1 deletion baker/GrapherBaker.tsx
@@ -28,6 +28,7 @@ import {
getRelatedArticles,
getRelatedCharts,
getRelatedChartsForVariable,
getRelatedResearchAndWritingForVariable,
isWordpressAPIEnabled,
isWordpressDBEnabled,
} from "../db/wpdb.js"
@@ -227,7 +228,7 @@
}
const datapageData = await getDatapageDataV2(
variableMetadata,
grapherConfigForVariable ?? {}
grapher ?? {}
)

const firstTopicTag = datapageData.topicTagsLinks?.[0]
@@ -272,6 +273,10 @@
variableId,
grapher && "id" in grapher ? [grapher.id as number] : []
)

datapageData.relatedResearch =
await getRelatedResearchAndWritingForVariable(variableId)

return renderToHtmlPage(
<DataPageV2
grapher={grapher}
37 changes: 36 additions & 1 deletion baker/postUpdatedHook.ts
@@ -7,8 +7,12 @@ import { exit } from "../db/cleanup.js"
import { PostRow } from "@ourworldindata/utils"
import * as wpdb from "../db/wpdb.js"
import * as db from "../db/db.js"
import { buildReusableBlocksResolver } from "../db/syncPostsToGrapher.js"
import {
buildReusableBlocksResolver,
getLinksToAddAndRemoveForPost,
} from "../db/syncPostsToGrapher.js"
import { postsTable, select } from "../db/model/Post.js"
import { PostLink } from "../db/model/PostLink.js"
const argv = parseArgs(process.argv.slice(2))

const zeroDateString = "0000-00-00 00:00:00"
@@ -141,6 +145,37 @@ const syncPostToGrapher = async (
db.knexTable(postsTable).where({ id: postId })
)
)[0]

if (postRow) {
const existingLinksForPost = await PostLink.findBy({
sourceId: wpPost.ID,
})

const { linksToAdd, linksToDelete } = getLinksToAddAndRemoveForPost(
postRow,
existingLinksForPost,
postRow!.content,
wpPost.ID
)

// TODO: unify our DB access and then do everything in one transaction
if (linksToAdd.length) {
console.log("linksToAdd", linksToAdd.length)
await PostLink.createQueryBuilder()
.insert()
.into(PostLink)
.values(linksToAdd)
.execute()
}

if (linksToDelete.length) {
console.log("linksToDelete", linksToDelete.length)
await PostLink.createQueryBuilder()
.where("id in (:ids)", { ids: linksToDelete.map((x) => x.id) })
.delete()
.execute()
}
}
return newPost ? newPost.slug : undefined
}

26 changes: 26 additions & 0 deletions db/migration/1692042923850-AddPostsLinks.ts
@@ -0,0 +1,26 @@
import { MigrationInterface, QueryRunner } from "typeorm"

export class AddPostsLinks1692042923850 implements MigrationInterface {
public async up(queryRunner: QueryRunner): Promise<void> {
queryRunner.query(`-- sql
CREATE TABLE posts_links (
id int NOT NULL AUTO_INCREMENT,
sourceId int NOT NULL,
target varchar(2047) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
linkType enum('url','grapher','explorer', 'gdoc') CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs DEFAULT NULL,
componentType varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
text varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
queryString varchar(2047) COLLATE utf8mb4_0900_as_cs NOT NULL,
hash varchar(2047) COLLATE utf8mb4_0900_as_cs NOT NULL,
PRIMARY KEY (id),
KEY sourceId (sourceId),
CONSTRAINT posts_links_ibfk_1 FOREIGN KEY (sourceId) REFERENCES posts (id)
) ENGINE=InnoDB;`)
}

public async down(queryRunner: QueryRunner): Promise<void> {
queryRunner.query(`-- sql
DROP TABLE IF EXISTS posts_links;
`)
}
}
47 changes: 47 additions & 0 deletions db/model/PostLink.ts
@@ -0,0 +1,47 @@
import { Entity, PrimaryGeneratedColumn, Column, BaseEntity } from "typeorm"
import { formatUrls } from "../../site/formatting.js"
import { Url } from "@ourworldindata/utils"
import { getLinkType, getUrlTarget } from "@ourworldindata/components"

@Entity("posts_links")
export class PostLink extends BaseEntity {
@PrimaryGeneratedColumn() id!: number
// TODO: posts is not a TypeORM but a Knex class so we can't use a TypeORM relationship here yet

@Column({ type: "int", nullable: false }) sourceId!: number

@Column() linkType!: "gdoc" | "url" | "grapher" | "explorer"
@Column() target!: string
@Column() queryString!: string
@Column() hash!: string
@Column() componentType!: string
@Column() text!: string

static createFromUrl({
url,
sourceId,
text = "",
componentType = "",
}: {
url: string
sourceId: number
text?: string
componentType?: string
}): PostLink {
const formattedUrl = formatUrls(url)
const urlObject = Url.fromURL(formattedUrl)
const linkType = getLinkType(formattedUrl)
const target = getUrlTarget(formattedUrl)
const queryString = urlObject.queryStr
const hash = urlObject.hash
return PostLink.create({
target,
linkType,
queryString,
hash,
sourceId,
text,
componentType,
})
}
}
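
For reference, a minimal usage sketch of this helper (hypothetical URL and post id), roughly how the sync script turns an extracted URL into a row:

```ts
// Hypothetical example: build a posts_links row from a URL found in a post's
// HTML and persist it (save() comes from TypeORM's BaseEntity).
const link = PostLink.createFromUrl({
    url: "https://ourworldindata.org/grapher/life-expectancy",
    sourceId: 123, // id of the wordpress post containing the link
})
await link.save()
```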
49 changes: 49 additions & 0 deletions db/refreshPageviewsFromDatasette.ts
@@ -0,0 +1,49 @@
// index.ts
import fetch from "node-fetch"
import Papa from "papaparse"
import * as db from "./db.js"

async function downloadAndInsertCSV(): Promise<void> {
const csvUrl = "http://datasette-private/owid/pageviews.csv?_size=max"
const response = await fetch(csvUrl)

if (!response.ok) {
throw new Error(
`Failed to fetch CSV: ${response.statusText} from ${csvUrl}`
)
}

const csvText = await response.text()
const parsedData = Papa.parse(csvText, {
header: true,
})

if (parsedData.errors.length > 1) {
console.error("Errors while parsing CSV:", parsedData.errors)
return
}

const onlyValidRows = [...parsedData.data].filter(
(row) => Object.keys(row as any).length === 5
) as any[]

console.log("Parsed CSV data:", onlyValidRows.length, "rows")
console.log("Columns:", parsedData.meta.fields)

await db.knexRaw("TRUNCATE TABLE pageviews")

await db.knexInstance().batchInsert("pageviews", onlyValidRows)
console.log("CSV data inserted successfully!")
}

const main = async (): Promise<void> => {
try {
await downloadAndInsertCSV()
} catch (e) {
console.error(e)
} finally {
await db.closeTypeOrmAndKnexConnections()
}
}

main()