Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CoreNodes] Add cleanup script for Google Drive parents #10442

Merged
merged 2 commits into from
Feb 3, 2025

Conversation

aubin-tchoi
Copy link
Contributor

Description

Tests

Risk

  • Medium.

Deploy Plan

  • No deploy.

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, sorry for not being able to review yesterday
Left a few coms

}) {
// retrieve all data sources for the provider
const dataSources = await DataSourceModel.findAll({
where: { connectorProvider: "google_drive" },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
where: { connectorProvider: "google_drive" },
where: { connectorProvider: "google_drive" },
id: {
[Op.gt]: nextDataSourceId,
},

});

for (const dataSource of dataSources) {
if (dataSource.id >= nextDataSourceId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do that in the where clause

Copy link
Contributor Author

@aubin-tchoi aubin-tchoi Feb 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually meant to log when skipping a datasource but forgot about it, adding it
Edit: actually I did not forget about it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I get that. You use this condition to make the nextDataSourceId param useful
So if you start using it and call your script with nextDataSourceId > 1000, you want 1. to grab all datasources anyways, and 2. to log "SKIP" for all the datasources with an id < nextDataSourceId?
What would be the purpose?

it's not critical to the PR so still going to approve though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the purpose is that it doesn't cost anything and if anything fishy happens on a datasource, I'd grep on the datasource ID and it's more readable to see that it was skipped vs not seeing anything and having to check in the script what it means. In general I'd rather not optimize queries that are ran once and are fast anyway if it allows me to have more easily parsable logs

// For all nodes in the data source (can be big).
let nextId = "";

for (;;) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: those are hard to read, i suggest you always use a control flow statement expliciting the condition here, or if no condition use forEach / map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can turn this into a do while but since I have to keep the break the condition isn't doing anything so didn't bother with an unused condition

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i'd have gone with while(nextId) and having nextId = rows[rows.length - 1]?.node_id;
but it's ok not to change here 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that works too!


logger.info({ nextId, rowCount: rows.length }, "BATCH");

if (rows.length === 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand it might be a little faster to write, but it's better to optimize for easier to understand :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to keep this break for rows[rows.length - 1] to not be undefined on the next line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily, but fine as is 👍

let newParentId: string | null = null;
try {
const uniqueIds = [
new Set(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no guarantees on resulting order when using set
unless you're really sure, we'd need to have guarantees on parents sorting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought order was preserved when writing this, gonna check again out of curiosity, changing if needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike most languages, js Sets actually preserve order: https://dust.tt/w/0ec9852c2f/assistant/FycTesggB2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for looking
so here there is an assumption that botched parents are still symmetric in the right order--which indeed seems correct to me from afar
IMO for unconventional language features such as this (in the future, not here), might warrant a comment (showing you know or you checked)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if the order of the parents was lost at any point we won't recover it here (we would need to run a backfill relying on connectors db)

Copy link
Contributor

@philipperolet philipperolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 👍 left coms, but nothing preventing approval

let newParentId: string | null = null;
try {
const uniqueIds = [
new Set(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for looking
so here there is an assumption that botched parents are still symmetric in the right order--which indeed seems correct to me from afar
IMO for unconventional language features such as this (in the future, not here), might warrant a comment (showing you know or you checked)

// For all nodes in the data source (can be big).
let nextId = "";

for (;;) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i'd have gone with while(nextId) and having nextId = rows[rows.length - 1]?.node_id;
but it's ok not to change here 👍


logger.info({ nextId, rowCount: rows.length }, "BATCH");

if (rows.length === 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily, but fine as is 👍

});

for (const dataSource of dataSources) {
if (dataSource.id >= nextDataSourceId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I get that. You use this condition to make the nextDataSourceId param useful
So if you start using it and call your script with nextDataSourceId > 1000, you want 1. to grab all datasources anyways, and 2. to log "SKIP" for all the datasources with an id < nextDataSourceId?
What would be the purpose?

it's not critical to the PR so still going to approve though

@aubin-tchoi aubin-tchoi merged commit 517fe72 into main Feb 3, 2025
6 checks passed
@aubin-tchoi aubin-tchoi deleted the fix-gdrive-migration branch February 3, 2025 10:23
philipperolet pushed a commit that referenced this pull request Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean up following Google Drive botched migration
2 participants