Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[connectors] check and fix gdrive parents, backfill folders #9344

Merged
merged 16 commits into from
Dec 13, 2024

Conversation

tdraier
Copy link
Contributor

@tdraier tdraier commented Dec 13, 2024

fixes: #8527
fixes: #9250

Description

Script goes over all google files and tables to check the parents array , and fix if necessary.
This does not cover #9256, but is a prerequisite to have a consistent parents values.
It also backfill all google drive folders - if they were not present, they will be upserted.

The changes in data_sources.ts allow to call get folder and document (get table was already there).
I've also extended the /documents endpoint in front so that it can propagate document_ids. Calling /documents with document_ids is much more efficient as calling /documents/{documentId}, as it does not retrieve the content.

Risk

Deploy Plan

  • deploy front
  • run script on connectors

@tdraier tdraier marked this pull request as ready for review December 13, 2024 13:59
Copy link

github-actions bot commented Dec 13, 2024

Warnings
⚠️

Files in **/api/v1/ have been modified and the PR has the documentation-ack label.
Don't forget to run npm run docs and use the Deploy OpenAPI Docs Github action to update https://docs.dust.tt/reference.

Generated by 🚫 dangerJS against 9a8a5b8

@tdraier tdraier added the documentation-ack Used to acknowledge that documentation is up-to-date with this PR label Dec 13, 2024
Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable 😉 Glad for the 🧹 👍
Left a few coms

@@ -139,10 +146,16 @@ async function handler(
? parseInt(req.query.offset as string)
: 0;

let documentIds = req.query.document_ids;
if (typeof documentIds === "string") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we fail if it's a string and pass only if it's an array? (of 1 if we want a single doc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't pass an array of 1 - if we pass a single value, it will be a string, otherwise an array (that how req.query is parsed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! I never remember that kind of things :)

} catch (e) {
const axiosError = e as AxiosError;
if (axiosError?.response?.status === 400) {
localLogger.info("Folder doesn't exist on Dust. Ignoring.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 400 would be bad request? did you mean 404?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function should keep the throw Though; it's the caller that should decide what to do on error, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it's a 400 .. but should be a 404, updating.
The getTable was logging and returning null, I kept the same logic here

@@ -48,6 +48,7 @@ export type CoreAPIDocument = {
data_source_id: string;
created: number;
document_id: string;
parents: string[];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 wasn't that field there already? surprising

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was returned by the api but not present in types

throw e;
}
if (dustRequestResult.data.documents.length === 0) {
localLogger.info("Document doesn't exist on Dust. Ignoring.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the function should throw there or Result
the caller can decide what to do if doc is not found
wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the logic of the existing getTable. For me both are fine

limit: QUERY_BATCH_SIZE,
});

const last = googleDriveFiles[googleDriveFiles.length - 1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as above

break;
}

googleDriveFiles.forEach((folder) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would not name the var folder, as those can also be files right? I was looking for why it was a folder when reading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes should be file, forgot a renaming


if (execute) {
// upsert repository as folder
await upsertFolderNode({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the PR does the folders backfill too? Nice 👍
But I'd update the PR title & description

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it close #8527 too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait it doesn't backfill
but could it? 😇

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it's supposed to backfill, i fixed it one hour ago :-)
and description fixes #8527 :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great ! I'd still say it in PR title and description if you don't mind :) IMO good to see clearly that it's a backfilling PR

"Found document"
);
if (document.parents.join("/") !== parents.join("/")) {
logger.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we'll likely have ~10M lines of log for the run; maybe output in file? (that will I guess weigh in the ballpark of gigabyte(s))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, log is already pretty long, will be better on one line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some logs are generated by data_source and this does not allow to use a custom logger . it may be easier to redirect the script output to a file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's what I meant sorry for being unclear

@tdraier tdraier changed the title [connectors] check and fix gdrive parents [connectors] check and fix gdrive parents, backfill folders Dec 13, 2024
Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 🔥

@tdraier tdraier merged commit a9173eb into main Dec 13, 2024
6 checks passed
@tdraier tdraier deleted the thomas/check-gdrive-parents branch December 13, 2024 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation-ack Used to acknowledge that documentation is up-to-date with this PR
Projects
None yet
2 participants