-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Storage Squid integration with Argus/Colossus #5001
Storage Squid integration with Argus/Colossus #5001
Conversation
…s when running Argus/Colossus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing to see this realize, great work @zeeshanakram3
I'll try to get to proper code review later, for now tried Argus out and seeing some problems:
- Getting
{"type":"exception","message":"Unexpected end of JSON input"}
from/status
when squid is not available - On start (serving bucket
0:0
) I'm seeing
2023-12-28 22:59:51:5951 ContentService info: ContentService initializing...
{
"supportedObjects": 0,
"filesCountOnStartup": 0,
"cacheItemsCountOnStartup": 0
}
2023-12-28 22:59:51:5951 ContentService info: ContentService initialized
{
"filesDropped": 0,
"cacheItemsDropped": 0,
"contentSizeSum": 0
}
I expected the supportedObjects
to give proper number. I can get objects fine
distributor-node/config.yml
Outdated
@@ -1,6 +1,6 @@ | |||
id: test-node | |||
endpoints: | |||
queryNode: http://localhost:8081/graphql | |||
queryNode: http://localhost:4352/graphql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make it clear that this is not the query node now. I suggest to rename to storageSquid
or just squid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?
@mnaamani yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the name so much.
It's not purely cosmetic, I'm just thinking about operators upgrading and possibly forgetting to update this config value. If we leave the current name and somebody keeps the old value, then they will try to query old QN, which will most likely lead to some weird GraphQL errors that may not be as readable and clear.
If we update the config key, any operator that forgot to update their config will get a clear error about missing config key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we update the config key, any operator that forgot to update their config will get a clear error about missing config key.
I think that is a good argument actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. I left a few comments.
I see the annoying test curator moderation actions: Failed
failing often. I think this started happening more frequently when we introduced the accept pending service because of delay between object being uploaded and the accepted status being updated?
.env
Outdated
SQUID_DB_PORT=23332 | ||
|
||
# Processor service prometheus port | ||
SQUID_PROCESSOR_PROMETHEUS_PORT=3337 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PROCESSOR_PROMETHEUS_PORT
earlier in the file is also assigned 3337
But this is only a problem if orion processor is also running on same host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 36212f0
distributor-node/config.yml
Outdated
@@ -1,6 +1,6 @@ | |||
id: test-node | |||
endpoints: | |||
queryNode: http://localhost:8081/graphql | |||
queryNode: http://localhost:4352/graphql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind the name so much. Just wondering where the port number selection came from, is it just 4250 (orion graphql port) + 2 ?
import { DataObjectDetailsFragment } from './query-node/generated/queries' | ||
import { DistributionBucketOperatorStatus } from './query-node/generated/schema' | ||
import { RuntimeApi } from './runtime/api' | ||
import { StorageNodeApi } from './storage-node/api' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed this once, but it is hard to see what imports were added or removed when they are also being re-ordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted the re-ordering change
distributor-node/package.json
Outdated
@@ -138,6 +138,7 @@ | |||
"prepack": "rm -rf lib && tsc -b && oclif-dev manifest && generate:all", | |||
"test": "nyc --extension .ts mocha --forbid-only \"test/**/*.test.ts\"", | |||
"version": "generate:docs:cli && git add docs/cli/*", | |||
"generate:schema:graphql": "docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"generate:schema:graphql": "docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql", | |
"generate:schema:graphql": "docker inspect joystream/storage-squid:latest && docker run --rm joystream/storage-squid:latest npm run get-graphql-schema > src/services/networking/query-node/schema.graphql", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get-graphql-schema
is existing with value 0
when graphql endpoint is unreachable:
FetchError: request to http://localhost:4352/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4352
In cases docker run fails or scripts doesn't connect, the resulting src/services/networking/query-node/schema.graphql
is overwritten with no content.
This step is done manually and not part of automated build step, but it would be better for it to fail in a better way.
I'm guessing on my machine 5 seconds was not long enough for graphql-server to startup.
Interactively it works fine..
docker run -it joystream/storage-squid:latest sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a new shell script that will only write to output file if the returned schema response is not empty. Let me know if it's fine.
Added in 3a6b0e4
- DB_PORT=${SQUID_DB_PORT} | ||
- DB_HOST=${SQUID_DB_HOST} | ||
- GQL_PORT=${SQUID_GQL_PORT} | ||
- ARCHIVE_GATEWAY_URL=${SQUID_ARCHIVE_GATEWAY_URL} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQUID_ARCHIVE_GATEWAY_URL is empty in .env, what is the default if ARCHIVE_GATEWAY_URL
is empty? Does it sync directly from chain using rcp endpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Answering myself: Yes :)
|
||
echo "Staring storage infrastructure" | ||
|
||
HOST_IP=`$THIS_DIR/get-host-ip.sh` | ||
# Start Storage-Squid | ||
docker-compose -f $THIS_DIR/../../docker-compose.storage-squid.yml up -d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to also start the storage-squid in project root
start.sh
and start-multi-storage.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 8c03814
@@ -65,6 +67,12 @@ export class NetworkingService { | |||
this.downloadQueue.on('error', (err) => { | |||
this.logger.error('Data object download failed', { err }) | |||
}) | |||
|
|||
this.initRuntimeApi().catch((err) => this.logger.error('Runtime API initialization failed:', { err })) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it is not critical for runtimeApi to be functioning, as only place i see it used is in the public api endpoint to get the status. But if the api is not initialized and the endpoint is called and getQueryNodeStatus()
thorws, we should ensure the request handler catches it and return a HTTP 503 status rather than cause the express server to cause the process to exit with an unhandled expection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should ensure the request handler catches it and return a HTTP 503 status rather than cause the express server to cause the process to exit with an unhandled expection.
I don't think this is currently happening for distributor-node
(yes, storage-node
is crashing), if getQueryNodeStatus()
throws, then it is being caught by the express error handler middleware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if getQueryNodeStatus() throws, then it is being caught by the express error handler middleware.
Was this added recently? Referencing my previous comment:
Getting {"type":"exception","message":"Unexpected end of JSON input"} from /status when squid is not available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm actually it is handled and Argus doesn't crash, so I guess we just need better error message in that case
this.apolloClient = new ApolloClient({ | ||
link: splitLink, | ||
link: from([errorLink, new HttpLink({ uri: this.config.endpoints.queryNode, fetch })]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where this from()
function is imported. Is it a typescript/es6 syntax?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks I guess it was hard to spot because of so many occurrences of the from
keyword.
@@ -158,7 +159,7 @@ export async function getStatus(req: express.Request, res: express.Response<Stat | |||
downloadBuckets, | |||
sync, | |||
cleanup, | |||
queryNodeStatus: await getQueryNodeStatus(qnApi), | |||
queryNodeStatus: await getQueryNodeStatus(api, qnApi), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does express handle the response if we get an uncaught exception here? Is it handled by
app.use((err: Error, req: express.Request, res: express.Response, next: express.NextFunction) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is a similar point made in #5001 (review)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, error handler middleware is not really used/called in the storage-node
(even in all the previous versions), so whenever the unhandled exception occurred in the route handlers, it caused the app to crash.
If async route handlers throw then its the responsibility of the app to pass this error to the next()
function which then will be handled by the default error handler
I have fixed this for storage-node, for distributor-node
this is not a problem as error thrown from the route handlers are already passed to the next() function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in e2ce216
…e default error handler middleware
After your comment I have dropped my squid, started fresh sync and it crashed again |
@zeeshanakram3 one more question: do you think it would be a lot of effort to include storage squid version in argus&colossus status endpoints? #4732 |
I will implement it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more code to drop (upload authentication related)
A couple suggestions on docker-compose config for squid.
docker-compose.storage-squid.yml
Outdated
networks: | ||
- joystream | ||
ports: | ||
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}' | |
- '127.0.0.1:${SQUID_DB_PORT}:${SQUID_DB_PORT}' |
To prevent the db being exposed on public interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
docker-compose.storage-squid.yml
Outdated
ports: | ||
- '${SQUID_DB_PORT}:${SQUID_DB_PORT}' | ||
command: ['postgres', '-c', 'config_file=/etc/postgresql/postgresql.conf', '-p', '${SQUID_DB_PORT}'] | ||
shm_size: 1g |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a bit more than what postgres tries to allocate as configured in postgresql.conf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed.
docker-compose.storage-squid.yml
Outdated
squid_db: | ||
container_name: squid_db | ||
hostname: squid-db | ||
image: postgres:14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does squid work with postgres:16
? Is there any downside of using latest if its compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, should be safe upgrading to postgres:16
, so changed it.
// Delete the stale objects from the pending folder | ||
await Promise.allSettled( | ||
deletedObjectIds.map(({ data: { dataObjectId } }) => | ||
fsPromises.unlink(path.join(this.pendingDataObjectsDir, dataObjectId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should also remove the id from pendingDataObjects
array so we don't attempt to process it in next step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an object has been deleted from the runtime then, pendingDataObjects
wont contain it in the first place since QN wont return it. Wouldn't that be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, but that original array i'm refering to includes files picked up from the pending folder. So here we are deleting when we find that it has been deleted by runtime. Now it is odd for that file to get placed there but non-the less if it was placed, or somehow uploaded again, then we will delete it but later still try to actually moving to uploads folder because no matching bucket will be found for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might actually try to compute the ipfs hash of it first.. but still the general point applies. correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry my bad, I confused it with the pendingIds
array.
@@ -82,7 +79,8 @@ export async function createApp(config: AppConfig): Promise<Express> { | |||
validateRequests: true, | |||
operationHandlers: { | |||
basePath: path.join(__dirname, './controllers'), | |||
resolver: OpenApiValidator.resolvers.modulePathResolver, | |||
resolver: (basePath: string, route: RouteMetadata, apiDoc: OpenAPIV3.Document) => | |||
asyncHandler(OpenApiValidator.resolvers.modulePathResolver(basePath, route, apiDoc)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this change fixing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It catches the errors thrown by the async route handlers and passes it to the default error handler. Without this change if any of the route handlers (e.g. handler for /status
) threw an error then it would crash the application since it would not be caught anywhere.
It's the equivalent of following piece of code in distributor node
https://github.com/zeeshanakram3/joystream/blob/0b8c3f8bdd6539a408bfcf6d9d4539a53cbb7649/distributor-node/src/services/httpApi/HttpApiBase.ts#L27-L31
}) | ||
|
||
return app | ||
} | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are getting rid of these.. we can also drop
// when enabling upload auth ensure the keyring has the operator role key and set it here.
const enableUploadingAuth = false
const operatorRoleKey = undefined
in server.ts
and the corresponding fields in the AppConfig
type
Does it make sense then to also remove the relevant parts in the api spec openapi.yaml
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in fd44d0d. Also removed relevant parts from the API spec file.
docker-compose.yml
Outdated
@@ -237,7 +237,7 @@ services: | |||
" | |||
|
|||
indexer: | |||
image: joystream/hydra-indexer:v5.0.0-alpha.1 | |||
image: hydra-indexer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think like me you were using local build of hydra-indexer ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah ;), fixed that in 190daa0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good, final changes
- fix the prometheus port conflict in docker compose file.
- Bump storage node version to v3.11.0 and the storage cli package to v3.3.0
docker-compose.storage-squid.yml
Outdated
- RPC_ENDPOINT=${JOYSTREAM_NODE_WS} | ||
|
||
ports: | ||
- '127.0.0.1:${PROCESSOR_PROMETHEUS_PORT}:${PROCESSOR_PROMETHEUS_PORT}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to use SQUID_PROCESSOR_PROMETHEUS_PORT
instead. Otherwise services are failing to start with:
Error response from daemon: driver failed programming external connectivity on endpoint squid_processor (e1a9bb40428326391a37414aa61f52dfd4817f99dc1662c278013f4780d7c979): Bind for 127.0.0.1:3337 failed: port is already allocated
When we start network with start.sh
or start-multistorage.sh
that starts orion services first which is listening on PROCESSOR_PROMETHEUS_PORT
port.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 897c7c0
@mnaamani regarding the package versions, shouldn't we bump the major versions in both Argus & Colossus, since we have a breaking change in configuration options (changed |
That might actually make more sense yes. |
Should we also consider forking current master to keep the old version on a separate branch so we can slowly transition all operators just in-case we need to create some important patch while operators are still on older version? |
Yeah, makes sense |
67247bf
to
b625322
Compare
Failing integration tests should be fixed by Joystream/storage-squid#13 |
@zeeshanakram3 could you reply to this? #5001 (comment) I think Argus (and Colossus) should still provide valid response for /status request even when the squid is down. Currently we're just forwarding the raw error to the user. As a counterargument, we need that kind of handling for asset requests as well, not only for status and that currently isn't handled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created followup for the thing I mentioned: #5056
Great work on this @zeeshanakram3, LGTM!
closes #4957
This PR:
docker-compose.storage-squid.yml
file to run storage-squid servicesstorage-squid
services when running Argus/Colossusgraphql-server