Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Github Sync updates github docs parents Field #1296

Merged
merged 36 commits into from
Sep 8, 2023
Merged
Show file tree
Hide file tree
Changes from 33 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
0a42e8d
doc update of parents field
philipperolet Aug 30, 2023
605fbe3
core api to update parents (+ refactor of update tags)
philipperolet Aug 30, 2023
8004a79
spolu review
philipperolet Aug 31, 2023
876225a
logging
philipperolet Aug 31, 2023
ca0e3f3
review 2
philipperolet Sep 1, 2023
406ca6f
fixed qdrant payload update via document hash
philipperolet Sep 1, 2023
7303190
WIP
philipperolet Sep 2, 2023
cf71c42
wip2
philipperolet Sep 4, 2023
f2eff93
wip 3
philipperolet Sep 5, 2023
9a30d02
wip 4
philipperolet Sep 5, 2023
2c7efc6
add parents endpoint to front
philipperolet Sep 5, 2023
5628ee9
cleaning
philipperolet Sep 5, 2023
1bdc7c8
Merge branch 'main' into ds-parents-field-notion
philipperolet Sep 5, 2023
120e2de
removed js Set in parents (semantics don't work)
philipperolet Sep 5, 2023
b8a3091
cleaning
philipperolet Sep 5, 2023
31a853d
Merge branch 'main' into ds-parents-field-notion
philipperolet Sep 5, 2023
2462e98
remove SyncWorkflowResult
philipperolet Sep 6, 2023
a350434
controlled memoization of getParents
philipperolet Sep 6, 2023
458bc7c
cleaning
philipperolet Sep 6, 2023
aaec779
handle void returns from pqueue execution in notion workflow
philipperolet Sep 6, 2023
0acb000
fix: document is of class object at runtime, not notionpage or notiondb
philipperolet Sep 6, 2023
ed66a6a
cleaning
philipperolet Sep 6, 2023
51ffbb1
bump notion workflow version
philipperolet Sep 6, 2023
7c000fe
pass timestamp to activity execution for better memoization
philipperolet Sep 6, 2023
83bdaa4
henry's suggestions
philipperolet Sep 6, 2023
16f74e1
migration script
philipperolet Sep 6, 2023
853d7ea
cleaning
philipperolet Sep 6, 2023
af02ecb
leaner getParents + fixes on potential inconsistencies
philipperolet Sep 6, 2023
9ae414a
renaming
philipperolet Sep 6, 2023
4b73bb8
Update parents during sync
philipperolet Sep 6, 2023
319f946
migration script
philipperolet Sep 6, 2023
de327b6
comments from pr + adaptations from discussions
philipperolet Sep 7, 2023
74a96b9
last PR comments
philipperolet Sep 8, 2023
8e8c090
doc on convention on connector resource
philipperolet Sep 8, 2023
572bf8d
Merge branch 'main' into ds-parents-github
philipperolet Sep 8, 2023
140175a
Merge branch 'main' into ds-parents-github
philipperolet Sep 8, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 104 additions & 0 deletions connectors/migrations/20230906_3_github_fill_parents_field.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
import {
getDiscussionDocumentId,
getIssueDocumentId,
} from "@connectors/connectors/github/temporal/activities";
import { updateDocumentParentsField } from "@connectors/lib/data_sources";
import {
Connector,
GithubDiscussion,
GithubIssue,
} from "@connectors/lib/models";

async function main() {
if (!process.argv[2]) {
console.error("Missing workspace id or 'all' as first argument");
process.exit(1);
}
// if first arg is "all", update all connectors, else update only the
// connector for the corresponding workspace id
const connectors =
process.argv[2] === "all"
? await Connector.findAll({
where: {
type: "github",
},
})
: await Connector.findAll({
where: {
type: "github",
workspaceId: process.argv[2],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it doesn't matter much since it a one off, but curious what happens if I forget argv[2] ? Does it basically does the same as "all" because workspaceId: undefined doesn't filter ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Let's validate that

},
});

for (const connector of connectors) {
console.log(`Updating parents field for connector ${connector.id}`);
await updateDiscussionsParentsFieldForConnector(connector);
await updateIssuesParentsFieldForConnector(connector);
}
}

async function updateDiscussionsParentsFieldForConnector(connector: Connector) {
// get all distinct documentIds and their channel ids from slack messages in
// this connector
const documentData = await GithubDiscussion.findAll({
where: {
connectorId: connector.id,
},
attributes: ["repoId", "discussionNumber"],
});
// update all parents fields for discussions by chunks of 128
const chunkSize = 32;
for (let i = 0; i < documentData.length; i += chunkSize) {
const chunk = documentData.slice(i, i + chunkSize);
console.log(`Updating ${chunk.length} documents`);
// update parents field for each document of the chunk, in parallel
await Promise.all(
chunk.map(async (document) => {
const docId = getDiscussionDocumentId(
document.repoId,
document.discussionNumber
);
await updateDocumentParentsField(connector, docId, [
getDiscussionDocumentId(document.repoId, document.discussionNumber),
document.repoId,
]);
})
);
}
}

async function updateIssuesParentsFieldForConnector(connector: Connector) {
// get all distinct issues and their repo ids fro
const documentData = await GithubIssue.findAll({
where: {
connectorId: connector.id,
},
attributes: ["repoId", "issueNumber"],
});
// update all parents fields for all issues by chunks of 128
const chunkSize = 32;
for (let i = 0; i < documentData.length; i += chunkSize) {
const chunk = documentData.slice(i, i + chunkSize);
console.log(`Updating ${chunk.length} documents`);
// update parents field for each document of the chunk, in parallel
await Promise.all(
chunk.map(async (document) => {
const docId = getIssueDocumentId(document.repoId, document.issueNumber);
await updateDocumentParentsField(connector, docId, [
getIssueDocumentId(document.repoId, document.issueNumber),
document.repoId,
]);
})
);
}
}

main()
.then(() => {
console.log("Done");
process.exit(0);
})
.catch((err) => {
console.error(err);
process.exit(1);
});
52 changes: 52 additions & 0 deletions connectors/migrations/20230906_notion_fill_parents_field.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
import { updateAllParentsFields } from "@connectors/connectors/notion/lib/parents";
import { Connector, NotionDatabase, NotionPage } from "@connectors/lib/models";

async function main() {
// if first arg is "all", update all connectors, else update only the
// connector for the corresponding workspace id
const connectors =
process.argv[2] === "all"
? await Connector.findAll({
where: {
type: "notion",
},
})
: await Connector.findAll({
where: {
type: "notion",
workspaceId: process.argv[2],
},
});

for (const connector of connectors) {
console.log(`Updating parents field for connector ${connector.id}`);
await updateParentsFieldForConnector(connector);
}
}

async function updateParentsFieldForConnector(connector: Connector) {
// get all pages and databases for this connector
const pages = await NotionPage.findAll({
where: {
connectorId: connector.id,
},
});
const databases = await NotionDatabase.findAll({
where: {
connectorId: connector.id,
},
});

// update all parents fields for all pages and databases
await updateAllParentsFields(connector, [...pages, ...databases]);
}

main()
.then(() => {
console.log("Done");
process.exit(0);
})
.catch((err) => {
console.error(err);
process.exit(1);
});
29 changes: 25 additions & 4 deletions connectors/src/connectors/github/temporal/activities.ts
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,16 @@ export async function githubUpsertIssueActivity(
documentUrl: issue.url,
timestampMs: lastUpdateTimestamp,
tags: tags,
parents: [],
// The convention for parents is to use the external id string; it is ok for
// repos, but not practical for issues since the external id is the
// issue number, which is not guaranteed unique in the workspace.
// Therefore as a special case we use getIssueDocumentId() to get a parent string
// The repo id from github is globally unique so used as-is, as per
// convention to use the external id string.
parents: [
getIssueDocumentId(repoId.toString(), issue.number),
repoId.toString(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity what does this Id look like? Is that a number?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have 2 github managed data sources I worry we'll be clashing, which means we should probably update the connectors resource API to use something more specific?

We want to make sure it's globally unique

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clear doc about this, but I have created two repos at ~ 30mn interval and their repo ids are 688834560 and 688870568
Pretty sure it's an autoincremented primary key
=> I am willing to decide it is globally unique, and be responsible for the consequences if i'm wrong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look long enough to me 👍 :)

],
retries: 3,
delayBetweenRetriesMs: 500,
loggerArgs: { ...loggerArgs, provider: "github" },
Expand Down Expand Up @@ -281,7 +290,16 @@ export async function githubUpsertDiscussionActivity(
documentUrl: discussion.url,
timestampMs: new Date(discussion.createdAt).getTime(),
tags,
parents: [],
// The convention for parents is to use the external id string; it is ok for
// repos, but not practical for discussions since the external id is the
// issue number, which is not guaranteed unique in the workspace. Therefore
// as a special case we use getDiscussionDocumentId() to get a parent string
// The repo id from github is globally unique so used as-is, as per
// convention to use the external id string.
parents: [
getDiscussionDocumentId(repoId.toString(), discussionNumber),
repoId.toString(),
],
retries: 3,
delayBetweenRetriesMs: 500,
loggerArgs: { ...loggerArgs, provider: "github" },
Expand Down Expand Up @@ -585,11 +603,14 @@ function renderGithubUser(user: GithubUser | null): string {
return `@${user.id}`;
}

function getIssueDocumentId(repoId: string, issueNumber: number): string {
export function getIssueDocumentId(
repoId: string,
issueNumber: number
): string {
return `github-issue-${repoId}-${issueNumber}`;
}

function getDiscussionDocumentId(
export function getDiscussionDocumentId(
repoId: string,
discussionNumber: number
): string {
Expand Down
56 changes: 56 additions & 0 deletions connectors/src/connectors/notion/lib/connectors_db_helpers.ts
Original file line number Diff line number Diff line change
Expand Up @@ -218,3 +218,59 @@ export async function getNotionDatabaseFromConnectorsDb(

return NotionDatabase.findOne({ where });
}

/**
* Get children *that are pages* of a given notion page or database
*
* !! Not children *of a page*
*/
export async function getPageChildrenOf(
dataSourceInfo: DataSourceInfo,
notionId: string
): Promise<NotionPage[]> {
const connector = await Connector.findOne({
where: {
type: "notion",
workspaceId: dataSourceInfo.workspaceId,
dataSourceName: dataSourceInfo.dataSourceName,
},
});
if (!connector) {
throw new Error("Could not find connector");
}

return NotionPage.findAll({
where: {
parentId: notionId,
connectorId: connector.id,
},
});
}

/**
* Get children *that are databases* of a given notion page or database
*
* !! Not children *of a database*
*/
export async function getDatabaseChildrenOf(
dataSourceInfo: DataSourceInfo,
notionId: string
): Promise<NotionDatabase[]> {
const connector = await Connector.findOne({
where: {
type: "notion",
workspaceId: dataSourceInfo.workspaceId,
dataSourceName: dataSourceInfo.dataSourceName,
},
});
if (!connector) {
throw new Error("Could not find connector");
}

return NotionDatabase.findAll({
where: {
parentId: notionId,
connectorId: connector.id,
},
});
}
Loading