DBS3Upload fails to inject a large block in DBS #10237
I've got a dump of the block we are POSTing to the DBS server; a few facts:
Looking at the cmsweb-prod frontend logs in vocms0750, here is some extra information, from error_log_frontend-7794845bc9-5hcbx_20210122.txt. Given that the error reports the vocms0763 backend, I have the impression that the HTTP request is actually reset on the DBS backend.
With the block dump on disk, I hit the same error when trying this POST call via a curl command. Something like:
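A rough Python equivalent of that replay, purely as a sketch: the DBS URL, certificate/key paths and dump path below are assumptions for illustration, not values taken from this issue.

```python
# Hypothetical sketch of replaying the failing POST from the on-disk block
# dump. URL and cert/key/dump paths are assumptions, not from this issue.
import json

DBS_URL = "https://cmsweb.cern.ch/dbs/prod/global/DBSWriter/bulkblocks"

def payload_size_mb(dump_path):
    """Return the on-the-wire size of the serialized JSON payload, in MB."""
    with open(dump_path) as fd:
        block = json.load(fd)
    return len(json.dumps(block).encode("utf-8")) / 1024 ** 2

def post_block(dump_path, cert, key, timeout=300):
    """Replay the block insertion with a grid client certificate."""
    import requests  # external dependency
    with open(dump_path) as fd:
        block = json.load(fd)
    return requests.post(DBS_URL, json=block, cert=(cert, key), timeout=timeout)
```

`payload_size_mb` is handy here simply to confirm how big the request body actually is before blaming the server side.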
@vkuznet @yuyiguo would you have any other suggestion on what can be done to better understand this problem? Another option would be, if possible, to get this block inserted into DBS with Yuyi's help, and then start working on limiting blocks to no more than 1M lumi sections (which should be about half the size of this request).
I made this gitlab ticket to keep discussing it and providing further details: As of now, I don't think there is any problem in WMCore itself that needs to be resolved, besides setting a limit on the number of lumi sections allowed in each block, such that we can avoid these large JSONs being posted to the web services.
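A minimal sketch of that guard, assuming a configurable cap and illustrative names (this is not WMCore's actual API):

```python
# Illustrative sketch only: close a block once adding the next file would
# push its lumi count over a configurable cap. Names are made up.
MAX_LUMIS_PER_BLOCK = 1_000_000

def chunk_files_into_blocks(files, max_lumis=MAX_LUMIS_PER_BLOCK):
    """Group files into blocks so each block stays under the lumi cap.

    `files` is an iterable of (file_name, n_lumis) pairs; returns a list
    of blocks, each a list of file names.
    """
    blocks, current, lumis = [], [], 0
    for name, n_lumis in files:
        if current and lumis + n_lumis > max_lumis:
            blocks.append(current)
            current, lumis = [], 0
        current.append(name)
        lumis += n_lumis
    if current:
        blocks.append(current)
    return blocks
```

Note a single file with more lumis than the cap still gets its own block; the cap bounds block growth, it cannot split a file.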
As agreed with Hasan, we have given up on trying to insert that block into DBS. He's going to reject/clone and assign a new workflow (hopefully this time we won't pile everything into the same block). I have marked that block as injected. I'm going to create another issue later today, which should prevent this from happening again.
The real fix should be addressed with this GH issue: #10264. Closing this one, as there is nothing else to be done with the problematic block.
Reopening this issue because we have been hit hard by such large blocks, and a few agents have had these blocks failing to be inserted into DBS for around 10 days now. Looking into one of the blocks that Todor provided via private email (on submit4), the component log has this:
which fails with a 502 error. I managed to get a dump of this block and moved it under (the 2022 json file): from that, we can see it's ~200MB of JSON data (it will be much more in Python memory), and it has 1.85M lumis. From the original list of blocks that Todor provided in this JIRA ticket, I counted 4 unique datasets affected.
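The kind of quick inspection behind those numbers can be sketched like this; the `files`/`file_lumi_list` keys follow the usual DBS bulkblocks layout, but the helper itself is illustrative, not a WMCore utility:

```python
# Illustrative helper to summarize a DBS block dump on disk.
import json

def block_stats(dump_path):
    """Return file count, total file-lumi entries, and serialized size in MB.

    Assumes the standard DBS bulkblocks layout: a top-level 'files' list
    whose entries carry a 'file_lumi_list'.
    """
    with open(dump_path) as fd:
        block = json.load(fd)
    files = block.get("files", [])
    return {
        "files": len(files),
        "lumis": sum(len(f.get("file_lumi_list", [])) for f in files),
        "size_mb": len(json.dumps(block).encode("utf-8")) / 1024 ** 2,
    }
```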
The above log showed 13 seconds between the DBSUploadPoller uploading the block to DBS and receiving a 502 error, which means to me that it failed before the data was completely transferred to the DBS server. Something interrupted the transfer, cherrypy or something else? I don't know what it was, but skipping the FE does not seem likely to help us. If we had only one block and a few files, we could try to remove all the lumis from the files, then upload the block without file-lumis, and after that manually insert the file-lumis via SQL. However, we have about 15 blocks and hundreds of files, so it is very dangerous to do so. We could mess up prod DBS.
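The lumi-stripping idea, had it been safe to try, could be sketched as follows; the function name is made up, the keys follow the usual bulkblocks layout, and nothing here touches DBS itself:

```python
# Illustrative sketch of splitting a block dump into a slim copy (no
# per-file lumis) plus the lumi lists kept aside for a later manual insert.
import copy

def strip_file_lumis(block):
    """Return (slim_block, saved_lumis) where saved_lumis maps each LFN to
    its removed 'file_lumi_list'. The input block is left untouched."""
    slim = copy.deepcopy(block)
    saved = {}
    for f in slim.get("files", []):
        saved[f["logical_file_name"]] = f.pop("file_lumi_list", [])
    return slim, saved
```

As the comment above notes, doing this at scale against prod DBS was judged too risky, so this remains a sketch of the idea only.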
@yuyiguo we had to patch one of these agents to make sure it was indeed failing in the initial stage of the HTTP request, instead of hitting the usual 5min timeout. Here is a new log from submit4:
So yes, our request apparently doesn't even reach the backend (but that should be confirmed by matching the frontend logs against the DBSWriter ones). For the P&R team, my suggestion would be to actually abort those workflows, then clone and assign the new ones with one special parameter; during assignment time, we could set this
@amaltaro Thanks for patching the code to get the debugging info.
The original workflows have been rejected, the output blocks were marked as "injected" in WMAgent, such that DBS3Upload stops trying to inject those blocks/files into the DBS server, and new cloned workflows with a tweak of the files-per-block setting are on their way. The real and permanent fix still needs to be addressed in #10264, but this specific issue can now be closed IMO. Please reopen it if I missed anything else important.
Impact of the bug
WMAgent (DBS3Upload)
Describe the bug
As reported in this JIRA ticket:
https://its.cern.ch/jira/browse/CMSCOMPPR-17840
the workflow
pdmvserv_task_BPH-RunIISummer19UL17MiniAODv2-00001__v1_T_201217_181741_6496
completed many days ago, but its output dataset is still not available in DBS (the data is already in Rucio).
This workflow was acquired by vocms0254, and a look at the component logs shows that DBS3Upload has been failing to inject that data/block since December.
The request seems to fail between the agent and the CMSWEB frontends, so it doesn't even reach the DBS backends.
How to reproduce it
Inject a very large amount of data in a single call (what counts as "very large", though?)
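One way to probe "very large" is to generate a synthetic block-shaped payload of tunable size and observe where the POST starts failing. Everything below (names, shape, sizes) is an assumption for illustration, not the real block schema validation:

```python
# Illustrative generator of a dummy block-like payload, only useful for
# measuring serialized size; field names mimic the bulkblocks layout.
import json

def synthetic_block(n_files, lumis_per_file):
    """Build a fake block dict with n_files files of lumis_per_file lumis."""
    files = []
    for i in range(n_files):
        files.append({
            "logical_file_name": "/store/fake/file_%06d.root" % i,
            "file_lumi_list": [
                {"run_num": 1, "lumi_section_num": j}
                for j in range(lumis_per_file)
            ],
        })
    return {"block": {"block_name": "/Fake/Dataset/TIER#block"}, "files": files}

def payload_mb(block):
    """Serialized JSON size of the payload, in MB."""
    return len(json.dumps(block).encode("utf-8")) / 1024 ** 2
```

Sweeping `n_files` and `lumis_per_file` upward while POSTing such payloads (to a test instance, not prod) would bracket the failure threshold.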
Expected behavior
Data injection into DBS should not fail. We need to find a way to get that data injected into the server, and perhaps implement some protection to ensure it does not happen in the future.
Additional context and error message
Error log from the component in vocms0254: