-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[connectors] check and fix gdrive parents, backfill folders #9344
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable 😉 Glad for the 🧹 👍
Left a few coms
@@ -139,10 +146,16 @@ async function handler( | |||
? parseInt(req.query.offset as string) | |||
: 0; | |||
|
|||
let documentIds = req.query.document_ids; | |||
if (typeof documentIds === "string") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we fail if it's a string and pass only if it's an array? (of 1 if we want a single doc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't pass an array of 1 - if we pass a single value, it will be a string, otherwise an array (that how req.query is parsed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! I never remember that kind of things :)
} catch (e) { | ||
const axiosError = e as AxiosError; | ||
if (axiosError?.response?.status === 400) { | ||
localLogger.info("Folder doesn't exist on Dust. Ignoring."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 400 would be bad request? did you mean 404?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the function should keep the throw Though; it's the caller that should decide what to do on error, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now it's a 400 .. but should be a 404, updating.
The getTable was logging and returning null, I kept the same logic here
@@ -48,6 +48,7 @@ export type CoreAPIDocument = { | |||
data_source_id: string; | |||
created: number; | |||
document_id: string; | |||
parents: string[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮 wasn't that field there already? surprising
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was returned by the api but not present in types
throw e; | ||
} | ||
if (dustRequestResult.data.documents.length === 0) { | ||
localLogger.info("Document doesn't exist on Dust. Ignoring."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the function should throw there or Result
the caller can decide what to do if doc is not found
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the logic of the existing getTable. For me both are fine
limit: QUERY_BATCH_SIZE, | ||
}); | ||
|
||
const last = googleDriveFiles[googleDriveFiles.length - 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: as above
break; | ||
} | ||
|
||
googleDriveFiles.forEach((folder) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would not name the var folder, as those can also be files right? I was looking for why it was a folder when reading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes should be file, forgot a renaming
|
||
if (execute) { | ||
// upsert repository as folder | ||
await upsertFolderNode({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the PR does the folders backfill too? Nice 👍
But I'd update the PR title & description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it close #8527 too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait it doesn't backfill
but could it? 😇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it's supposed to backfill, i fixed it one hour ago :-)
and description fixes #8527 :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great ! I'd still say it in PR title and description if you don't mind :) IMO good to see clearly that it's a backfilling PR
"Found document" | ||
); | ||
if (document.parents.join("/") !== parents.join("/")) { | ||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we'll likely have ~10M lines of log for the run; maybe output in file? (that will I guess weigh in the ballpark of gigabyte(s))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, log is already pretty long, will be better on one line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some logs are generated by data_source and this does not allow to use a custom logger . it may be easier to redirect the script output to a file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's what I meant sorry for being unclear
Co-authored-by: Philippe Rolet <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 🔥
fixes: #8527
fixes: #9250
Description
Script goes over all google files and tables to check the parents array , and fix if necessary.
This does not cover #9256, but is a prerequisite to have a consistent parents values.
It also backfill all google drive folders - if they were not present, they will be upserted.
The changes in data_sources.ts allow to call get folder and document (get table was already there).
I've also extended the /documents endpoint in front so that it can propagate document_ids. Calling /documents with document_ids is much more efficient as calling /documents/{documentId}, as it does not retrieve the content.
Risk
Deploy Plan