Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage Squid integration with Argus/Colossus #5001

Merged
merged 35 commits into from
Jan 23, 2024

Conversation

zeeshanakram3
Copy link
Contributor

@zeeshanakram3 zeeshanakram3 commented Dec 13, 2023

closes #4957

This PR:

@kdembler kdembler self-assigned this Dec 21, 2023
@kdembler kdembler self-requested a review December 21, 2023 19:14
@mnaamani mnaamani added colossus argus Argus distributor node labels Dec 23, 2023
Copy link
Collaborator

@kdembler kdembler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing to see this realize, great work @zeeshanakram3

I'll try to get to proper code review later, for now tried Argus out and seeing some problems:

  1. Getting {"type":"exception","message":"Unexpected end of JSON input"} from /status when squid is not available
  2. On start (serving bucket 0:0) I'm seeing
2023-12-28 22:59:51:5951 ContentService info: ContentService initializing...
{
    "supportedObjects": 0,
    "filesCountOnStartup": 0,
    "cacheItemsCountOnStartup": 0
}
2023-12-28 22:59:51:5951 ContentService info: ContentService initialized
{
    "filesDropped": 0,
    "cacheItemsDropped": 0,
    "contentSizeSum": 0
}

I expected the supportedObjects to give proper number. I can get objects fine

@@ -1,6 +1,6 @@
id: test-node
endpoints:
queryNode: http://localhost:8081/graphql
queryNode: http://localhost:4352/graphql
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make it clear that this is not the query node now. I suggest to rename to storageSquid or just squid

Copy link
Member

@mnaamani mnaamani Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?

@mnaamani yes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the name so much.

It's not purely cosmetic, I'm just thinking about operators upgrading and possibly forgetting to update this config value. If we leave the current name and somebody keeps the old value, then they will try to query old QN, which will most likely lead to some weird GraphQL errors that may not be as readable and clear.

If we update the config key, any operator that forgot to update their config will get a clear error about missing config key.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we update the config key, any operator that forgot to update their config will get a clear error about missing config key.

I think that is a good argument actually.

Copy link
Member

@mnaamani mnaamani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. I left a few comments.

I see the annoying test curator moderation actions: Failed failing often. I think this started happening more frequently when we introduced the accept pending service because of delay between object being uploaded and the accepted status being updated?

.env Outdated
SQUID_DB_PORT=23332

# Processor service prometheus port
SQUID_PROCESSOR_PROMETHEUS_PORT=3337
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PROCESSOR_PROMETHEUS_PORT earlier in the file is also assigned 3337
But this is only a problem if orion processor is also running on same host.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 36212f0

@@ -1,6 +1,6 @@
id: test-node
endpoints:
queryNode: http://localhost:8081/graphql
queryNode: http://localhost:4352/graphql
Copy link
Member

@mnaamani mnaamani Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?

import { DataObjectDetailsFragment } from './query-node/generated/queries'
import { DistributionBucketOperatorStatus } from './query-node/generated/schema'
import { RuntimeApi } from './runtime/api'
import { StorageNodeApi } from './storage-node/api'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed this once, but it is hard to see what imports were added or removed when they are also being re-ordered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the re-ordering change

@@ -138,6 +138,7 @@
"prepack": "rm -rf lib && tsc -b && oclif-dev manifest && generate:all",
"test": "nyc --extension .ts mocha --forbid-only \"test/**/*.test.ts\"",
"version": "generate:docs:cli && git add docs/cli/*",
"generate:schema:graphql": "docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"generate:schema:graphql": "docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql",
"generate:schema:graphql": "docker inspect joystream/storage-squid:latest && docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get-graphql-schema is existing with value 0 when graphql endpoint is unreachable:

FetchError: request to http://localhost:4352/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4352

In cases docker run fails or scripts doesn't connect, the resulting src/services/networking/query-node/schema.graphql is overwritten with no content.

This step is done manually and not part of automated build step, but it would be better for it to fail in a better way.

I'm guessing on my machine 5 seconds was not long enough for graphql-server to startup.

Interactively it works fine..

docker run -it joystream/storage-squid:latest sh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a new shell script that will only write to output file if the returned schema response is not empty. Let me know if it's fine.

Added in 3a6b0e4

- DB_PORT=${SQUID_DB_PORT}
- DB_HOST=${SQUID_DB_HOST}
- GQL_PORT=${SQUID_GQL_PORT}
- ARCHIVE_GATEWAY_URL=${SQUID_ARCHIVE_GATEWAY_URL}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQUID_ARCHIVE_GATEWAY_URL is empty in .env, what is the default if ARCHIVE_GATEWAY_URL is empty? Does it sync directly from chain using rcp endpoint?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answering myself: Yes :)

tests/network-tests/start-storage.sh Show resolved Hide resolved

echo "Staring storage infrastructure"

HOST_IP=`$THIS_DIR/get-host-ip.sh`
# Start Storage-Squid
docker-compose -f $THIS_DIR/../../docker-compose.storage-squid.yml up -d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to also start the storage-squid in project root
start.sh and start-multi-storage.sh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 8c03814

@@ -65,6 +67,12 @@ export class NetworkingService {
this.downloadQueue.on('error', (err) => {
this.logger.error('Data object download failed', { err })
})

this.initRuntimeApi().catch((err) => this.logger.error('Runtime API initialization failed:', { err }))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it is not critical for runtimeApi to be functioning, as only place i see it used is in the public api endpoint to get the status. But if the api is not initialized and the endpoint is called and getQueryNodeStatus() thorws, we should ensure the request handler catches it and return a HTTP 503 status rather than cause the express server to cause the process to exit with an unhandled expection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should ensure the request handler catches it and return a HTTP 503 status rather than cause the express server to cause the process to exit with an unhandled expection.

I don't think this is currently happening for distributor-node (yes, storage-node is crashing), if getQueryNodeStatus() throws, then it is being caught by the express error handler middleware.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if getQueryNodeStatus() throws, then it is being caught by the express error handler middleware.

Was this added recently? Referencing my previous comment:

Getting {"type":"exception","message":"Unexpected end of JSON input"} from /status when squid is not available

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually it is handled and Argus doesn't crash, so I guess we just need better error message in that case

this.apolloClient = new ApolloClient({
link: splitLink,
link: from([errorLink, new HttpLink({ uri: this.config.endpoints.queryNode, fetch })]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where this from() function is imported. Is it a typescript/es6 syntax?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks I guess it was hard to spot because of so many occurrences of the from keyword.

@@ -158,7 +159,7 @@ export async function getStatus(req: express.Request, res: express.Response<Stat
downloadBuckets,
sync,
cleanup,
queryNodeStatus: await getQueryNodeStatus(qnApi),
queryNodeStatus: await getQueryNodeStatus(api, qnApi),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does express handle the response if we get an uncaught exception here? Is it handled by

app.use((err: Error, req: express.Request, res: express.Response, next: express.NextFunction) => {
?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a similar point made in #5001 (review)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, error handler middleware is not really used/called in the storage-node (even in all the previous versions), so whenever the unhandled exception occurred in the route handlers, it caused the app to crash.

If async route handlers throw then its the responsibility of the app to pass this error to the next() function which then will be handled by the default error handler

I have fixed this for storage-node, for distributor-node this is not a problem as error thrown from the route handlers are already passed to the next() function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e2ce216

@kdembler
Copy link
Collaborator

I'd suggest to do it on a server so that it indexes all the time. Locally (only syncing when my laptop was on) it didn't happen to me as well

After your comment I have dropped my squid, started fresh sync and it crashed again

@kdembler
Copy link
Collaborator

@zeeshanakram3 one more question: do you think it would be a lot of effort to include storage squid version in argus&colossus status endpoints? #4732

@zeeshanakram3
Copy link
Contributor Author

@zeeshanakram3 one more question: do you think it would be a lot of effort to include storage squid version in argus&colossus status endpoints? #4732

I will implement it.

Copy link
Member

@mnaamani mnaamani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more code to drop (upload authentication related)
A couple suggestions on docker-compose config for squid.

networks:
- joystream
ports:
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}'
- '127.0.0.1:${SQUID_DB_PORT}:${SQUID_DB_PORT}'

To prevent the db being exposed on public interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

ports:
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}'
command: ['postgres', '-c', 'config_file=/etc/postgresql/postgresql.conf', '-p', '${SQUID_DB_PORT}']
shm_size: 1g
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a bit more than what postgres tries to allocate as configured in postgresql.conf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

squid_db:
container_name: squid_db
hostname: squid-db
image: postgres:14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does squid work with postgres:16 ? Is there any downside of using latest if its compatible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should be safe upgrading to postgres:16, so changed it.

// Delete the stale objects from the pending folder
await Promise.allSettled(
deletedObjectIds.map(({ data: { dataObjectId } }) =>
fsPromises.unlink(path.join(this.pendingDataObjectsDir, dataObjectId))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also remove the id from pendingDataObjects array so we don't attempt to process it in next step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an object has been deleted from the runtime then, pendingDataObjects wont contain it in the first place since QN wont return it. Wouldn't that be the case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, but that original array i'm refering to includes files picked up from the pending folder. So here we are deleting when we find that it has been deleted by runtime. Now it is odd for that file to get placed there but non-the less if it was placed, or somehow uploaded again, then we will delete it but later still try to actually moving to uploads folder because no matching bucket will be found for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might actually try to compute the ipfs hash of it first.. but still the general point applies. correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my bad, I confused it with the pendingIds array.

@@ -82,7 +79,8 @@ export async function createApp(config: AppConfig): Promise<Express> {
validateRequests: true,
operationHandlers: {
basePath: path.join(__dirname, './controllers'),
resolver: OpenApiValidator.resolvers.modulePathResolver,
resolver: (basePath: string, route: RouteMetadata, apiDoc: OpenAPIV3.Document) =>
asyncHandler(OpenApiValidator.resolvers.modulePathResolver(basePath, route, apiDoc)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this change fixing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It catches the errors thrown by the async route handlers and passes it to the default error handler. Without this change if any of the route handlers (e.g. handler for /status) threw an error then it would crash the application since it would not be caught anywhere.

It's the equivalent of following piece of code in distributor node
https://github.com/zeeshanakram3/joystream/blob/0b8c3f8bdd6539a408bfcf6d9d4539a53cbb7649/distributor-node/src/services/httpApi/HttpApiBase.ts#L27-L31

storage-node/src/services/webApi/app.ts Show resolved Hide resolved
})

return app
}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are getting rid of these.. we can also drop

    // when enabling upload auth ensure the keyring has the operator role key and set it here.
    const enableUploadingAuth = false
    const operatorRoleKey = undefined

in server.ts and the corresponding fields in the AppConfig type

Does it make sense then to also remove the relevant parts in the api spec openapi.yaml?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in fd44d0d. Also removed relevant parts from the API spec file.

@@ -237,7 +237,7 @@ services:
"

indexer:
image: joystream/hydra-indexer:v5.0.0-alpha.1
image: hydra-indexer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think like me you were using local build of hydra-indexer ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ;), fixed that in 190daa0

@mnaamani mnaamani self-requested a review January 16, 2024 13:40
Copy link
Member

@mnaamani mnaamani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good, final changes

  • fix the prometheus port conflict in docker compose file.
  • Bump storage node version to v3.11.0 and the storage cli package to v3.3.0

- RPC_ENDPOINT=${JOYSTREAM_NODE_WS}

ports:
- '127.0.0.1:${PROCESSOR_PROMETHEUS_PORT}:${PROCESSOR_PROMETHEUS_PORT}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to use SQUID_PROCESSOR_PROMETHEUS_PORT instead. Otherwise services are failing to start with:

Error response from daemon: driver failed programming external connectivity on endpoint squid_processor (e1a9bb40428326391a37414aa61f52dfd4817f99dc1662c278013f4780d7c979): Bind for 127.0.0.1:3337 failed: port is already allocated

When we start network with start.sh or start-multistorage.sh that starts orion services first which is listening on PROCESSOR_PROMETHEUS_PORT port.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 897c7c0

@zeeshanakram3
Copy link
Contributor Author

  • Bump storage node version to v3.11.0 and the storage cli package to v3.3.0

@mnaamani regarding the package versions, shouldn't we bump the major versions in both Argus & Colossus, since we have a breaking change in configuration options (changed --queryNodeEndpoint to --storageSquidEndpoint)?

@mnaamani
Copy link
Member

  • Bump storage node version to v3.11.0 and the storage cli package to v3.3.0

@mnaamani regarding the package versions, shouldn't we bump the major versions in both Argus & Colossus, since we have a breaking change in configuration options (changed --queryNodeEndpoint to --storageSquidEndpoint)?

That might actually make more sense yes.

@mnaamani
Copy link
Member

Should we also consider forking current master to keep the old version on a separate branch so we can slowly transition all operators just in-case we need to create some important patch while operators are still on older version?

@zeeshanakram3
Copy link
Contributor Author

Should we also consider forking current master to keep the old version on a separate branch so we can slowly transition all operators just in-case we need to create some important patch while operators are still on older version?

Yeah, makes sense

@zeeshanakram3 zeeshanakram3 force-pushed the storage-squid-integration branch from 67247bf to b625322 Compare January 17, 2024 08:33
@zeeshanakram3
Copy link
Contributor Author

Failing integration tests should be fixed by Joystream/storage-squid#13

@mnaamani mnaamani self-requested a review January 18, 2024 05:16
@kdembler
Copy link
Collaborator

@zeeshanakram3 could you reply to this? #5001 (comment) I think Argus (and Colossus) should still provide valid response for /status request even when the squid is down. Currently we're just forwarding the raw error to the user.

As a counterargument, we need that kind of handling for asset requests as well, not only for status and that currently isn't handled

Copy link
Collaborator

@kdembler kdembler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created followup for the thing I mentioned: #5056

Great work on this @zeeshanakram3, LGTM!

@mnaamani mnaamani merged commit 3b0965e into Joystream:master Jan 23, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
argus Argus distributor node colossus
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

colossus: command docs Storage squid support for Argus and Colossus
3 participants