DBS3Upload fails to inject a large block in DBS #10237

Closed
amaltaro opened this issue Jan 22, 2021 · 11 comments

@amaltaro
Contributor

Impact of the bug
WMAgent (DBS3Upload)

Describe the bug
As reported in this JIRA ticket:
https://its.cern.ch/jira/browse/CMSCOMPPR-17840

the workflow
pdmvserv_task_BPH-RunIISummer19UL17MiniAODv2-00001__v1_T_201217_181741_6496

completed many days ago, but its output dataset is still not available in DBS (the data is already in Rucio).

This workflow was acquired by vocms0254, and a look at the component logs shows that DBS3Upload has been failing to inject that data/block since December.
The request seems to be failing between the agent and the CMSWEB frontends, so it doesn't even reach the DBS backends.

How to reproduce it
Inject a very large amount of data in a single call (though what counts as "very large" still needs to be quantified).
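
For reference, a minimal sketch of how this can be reproduced with the dbs3-client outside of the agent, assuming a JSON dump of the block is available on disk (the file path below is a placeholder; the URL is the production DBSWriter instance the agent talks to):

import json
from dbs.apis.dbsClient import DbsApi  # same dbs3-client used by DBSUploadPoller

# placeholder path to a block dump collected from the agent
with open("/data/srv/wmagent/current/dbsuploader_block.json") as fd:
    block = json.load(fd)

# DBSWriter instance behind the CMSWEB frontends
dbsApi = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSWriter")

# for very large blocks this call fails with "HTTP Error 502: Bad Gateway"
dbsApi.insertBulkBlock(blockDump=block)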

Expected behavior
Data injection into DBS should not fail. We need to find a way to get that data injected into the server, and perhaps implement some protection to ensure this does not happen again in the future.

Additional context and error message
Error log from the component in vocms0254:

2021-01-22 11:12:53,107:140105861187328:INFO:DBSUploadPoller:About to insert block /BcToJPsiTauNu_TuneCP5_13TeV-bcvegpy2-pythia8-evtgen/RunIISummer19UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b55f9368-257c-44e0-b289-95702602c5cd
2021-01-22 11:13:14,912:140105861187328:ERROR:DBSUploadPoller:Error trying to process block /BcToJPsiTauNu_TuneCP5_13TeV-bcvegpy2-pythia8-evtgen/RunIISummer19UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b55f9368-257c-44e0-b289-95702602c5cd through DBS. Error: HTTP Error 502: Bad Gateway
Traceback (most recent call last):
  File "/data/srv/wmagent/v1.4.3.patch2/sw/slc7_amd64_gcc630/cms/wmagent/1.4.3.patch2/lib/python2.7/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 115, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/srv/wmagent/v1.4.3.patch2/sw/slc7_amd64_gcc630/cms/dbs3-client/3.16.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 346, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/srv/wmagent/v1.4.3.patch2/sw/slc7_amd64_gcc630/cms/dbs3-client/3.16.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 203, in __callServer
    self.__parseForException(http_error)
  File "/data/srv/wmagent/v1.4.3.patch2/sw/slc7_amd64_gcc630/cms/dbs3-client/3.16.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 227, in __parseForException
    raise http_error
HTTPError: HTTP Error 502: Bad Gateway
2021-01-22 11:13:14,918:140105861187328:ERROR:DBSUploadPoller:Error found in multiprocess during process of block /BcToJPsiTauNu_TuneCP5_13TeV-bcvegpy2-pythia8-evtgen/RunIISummer19UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b55f9368-257c-44e0-b289-95702602c5cd
2021-01-22 11:13:14,919:140105861187328:ERROR:DBSUploadPoller:Error trying to process block /BcToJPsiTauNu_TuneCP5_13TeV-bcvegpy2-pythia8-evtgen/RunIISummer19UL17MiniAODv2-106X_mc2017_realistic_v9-v1/MINIAODSIM#b55f9368-257c-44e0-b289-95702602c5cd through DBS. Error: HTTP Error 502: Bad Gateway
@amaltaro
Contributor Author

I've got a dump of the block we are POSTing to the DBS bulkblocks API, here it is:
http://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_10237/

a few facts:

  • volume on disk is ~200MB
  • volume transferred through TCP is ~120MB
  • this block contains 1893205 lumi sections (see the quick check sketched right after this list)
  • switching DBS access to a different frontend (cmsweb-prod) did not help; the error persists.
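
For the record, a small sketch of how these numbers can be double-checked from the dump; the path and the JSON key names (files, file_lumi_list) are assumptions based on the bulkblocks payload above:

import json
import os

dump = "/data/srv/wmagent/current/dbsuploader_block.json"  # placeholder path to the dump
print("size on disk: %.1f MB" % (os.path.getsize(dump) / 1024.0 / 1024.0))

with open(dump) as fd:
    block = json.load(fd)

# assumes lumis are listed per file under 'file_lumi_list'
nLumis = sum(len(f.get("file_lumi_list", [])) for f in block.get("files", []))
print("files: %d, lumi sections: %d" % (len(block.get("files", [])), nLumis))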

Looking at the cmsweb-prod frontend logs in vocms0750, here is some extra information.
FROM: access_log_frontend-7794845bc9-5hcbx_20210122.txt
[22/Jan/2021:12:09:44 +0100] 186a1f8b2ba2 IP-hidden "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 502 [data: 121228219 in 20986 out 232 body 639822 us ] [auth: TLSv1.2 ECDHE "/DC=BLAH.cern.ch" "-" ] [ref: "-" "DBSClient/3.16.0/" ]

and FROM: error_log_frontend-7794845bc9-5hcbx_20210122.txt
[Fri Jan 22 12:09:44.827449 2021] [proxy:error] [pid 1799] (104)Connection reset by peer: [client IP-hidden] AH01084: pass request body failed to IP-hidden (vocms0763.cern.ch)
[Fri Jan 22 12:09:44.827549 2021] [proxy_http:error] [pid 1799] [client IP-hidden] AH01097: pass request body failed to IP-hidden (vocms0763.cern.ch) from IP-hidden ()

Given that the error reports the vocms0763 backend, I have the impression that the HTTP request is actually being reset on the DBS backend...

@amaltaro
Contributor Author

With the block dump on disk, I hit the same error when trying the POST call via a curl command, something like:

cmst1@vocms0254:/data/srv/wmagent/current $ curl -v --trace-time --cert $X509_USER_CERT --key $X509_USER_KEY -X POST -H "Content-Type: application/json" -H "Accept: application/json" -d @/data/srv/wmagent/current/dbsuploader_block.json "https://cmsweb.cern.ch/dbs/prod/global/DBSWriter/bulkblocks"

...
17:55:52.845262 > POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1
17:55:52.845262 > User-Agent: curl/7.29.0
17:55:52.845262 > Host: cmsweb.cern.ch
17:55:52.845262 > Content-Type: application/json
17:55:52.845262 > Accept: application/json
17:55:52.845262 > Content-Length: 208180559
17:55:52.845262 > Expect: 100-continue
17:55:52.845262 > 
17:55:52.847403 < HTTP/1.1 100 Continue
17:55:53.323221 < HTTP/1.1 502 Bad Gateway
17:55:53.323273 < Date: Fri, 22 Jan 2021 16:55:52 GMT
17:55:53.323301 < Server: Apache
17:55:53.323328 < Set-Cookie: cms-auth=BLAH;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT
17:55:53.323355 < Content-Length: 232
17:55:53.323383 < Content-Type: text/html; charset=iso-8859-1
17:55:53.323411 * HTTP error before end of send, stop sending
17:55:53.323438 < 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Bad Gateway</title>
</head><body>
<h1>Bad Gateway</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
</p>
</body></html>
17:55:53.323528 * Closing connection 0

@vkuznet @yuyiguo would you have any other suggestions on what can be done to better understand this problem?

Another option would be - if possible - to get this block inserted into DBS with Yuyi's help, and then to start working on limiting blocks to no more than 1M lumi sections (which should be about half the size of this request).
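
This is not a concrete WMCore patch, just a rough sketch of the kind of guard meant by "limiting blocks to no more than 1M lumi sections"; the function and its inputs are hypothetical:

MAX_LUMIS_PER_BLOCK = 1000000  # proposed cap, roughly half the size of this problematic block

def wouldExceedLumiCap(openBlock, newFileLumiCount):
    """
    Hypothetical check to run before adding a file to an open block:
    return True if adding the file would push the block over the lumi cap,
    in which case the block should be closed and a new one opened.
    """
    currentLumis = sum(len(f.get("file_lumi_list", []))
                       for f in openBlock.get("files", []))
    return currentLumis + newFileLumiCount > MAX_LUMIS_PER_BLOCK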

@amaltaro
Contributor Author

I created this GitLab ticket to keep discussing it and providing further details:
https://gitlab.cern.ch/cms-http-group/doc/-/issues/251

As of now, I don't think there is any problem in WMCore itself that needs to be resolved, besides setting a limit on the number of lumi sections allowed in each block, such that we can avoid posting these large JSON documents to the web services.

@amaltaro
Contributor Author

amaltaro commented Feb 4, 2021

As agreed with Hasan, we have given up on trying to insert that block into DBS. He's going to reject/clone and assign a new workflow (hopefully this time we won't pile everything into the same block).

I have marked that block as Closed in dbsbuffer_block; and all its 283 files as InDBS in dbsbuffer_file.
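
For documentation purposes, a rough sketch of what such a manual intervention could look like on the agent's Oracle database; the column names, the join and the connection string are assumptions, not the exact statements that were run:

import cx_Oracle  # assuming an Oracle-backed agent database

blockName = ("/BcToJPsiTauNu_TuneCP5_13TeV-bcvegpy2-pythia8-evtgen/"
             "RunIISummer19UL17MiniAODv2-106X_mc2017_realistic_v9-v1/"
             "MINIAODSIM#b55f9368-257c-44e0-b289-95702602c5cd")

conn = cx_Oracle.connect("user/password@agent_db")  # placeholder credentials/DSN
cur = conn.cursor()
# mark the block as Closed (assumed 'status' and 'blockname' columns)
cur.execute("UPDATE dbsbuffer_block SET status = 'Closed' WHERE blockname = :blk",
            blk=blockName)
# mark its files as InDBS (assumed 'block_id' foreign key)
cur.execute("""UPDATE dbsbuffer_file SET status = 'InDBS'
               WHERE block_id = (SELECT id FROM dbsbuffer_block WHERE blockname = :blk)""",
            blk=blockName)
conn.commit()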

I'm going to create another issue later today, which should prevent this from happening again.

@amaltaro
Contributor Author

amaltaro commented Feb 4, 2021

Real fix should be addressed with this GH issue: #10264

Closing this one, as there is nothing else to be done with the problematic block.

@amaltaro amaltaro closed this as completed Feb 4, 2021
@amaltaro amaltaro assigned todor-ivanov and unassigned amaltaro Mar 28, 2022
@amaltaro amaltaro reopened this Mar 28, 2022
@amaltaro
Contributor Author

Reopening this issue because we have been hit hard by such large blocks, and a few agents have had these blocks failing to be inserted into DBS for around 10 days now.

Looking into one of the blocks that Todor provided via private email (on submit4), the component log has this


2022-03-28 15:25:02,138:140043075479296:INFO:DBSUploadPoller:About to insert block /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713-a5b3-5cb46a3fb807
2022-03-28 15:25:15,265:140043075479296:ERROR:DBSUploadPoller:Error trying to process block /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713

which fails with a 502 Bad Gateway in 13 seconds, so apparently it doesn't even try the insert into Oracle.

I managed to get a dump of this block and placed it under (it's the 2022 JSON file):
https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_10237/

From that, we can see it's ~200MB of JSON data (it will be much larger in Python memory), and it has 1.85M lumis.

From the original list of blocks that Todor provided in this JIRA ticket:
https://its.cern.ch/jira/browse/CMSCOMPPR-24324

I counted 4 unique datasets affected.
We need to check with the DBS team whether an ad-hoc block injection (with their help) is feasible - probably bypassing the frontends - and either inject these blocks (after a dump of every single one of them is collected), or abort those workflows and no longer try to inject this data into DBS.

@yuyiguo
Member

yuyiguo commented Mar 29, 2022

@amaltaro

2022-03-28 15:25:02,138:140043075479296:INFO:DBSUploadPoller:About to insert block /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713-a5b3-5cb46a3fb807
2022-03-28 15:25:15,265:140043075479296:ERROR:DBSUploadPoller:Error trying to process block /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713

The above log shows 13 seconds between DBSUploadPoller starting the block upload to DBS and receiving the 502 error, which to me means it failed before the data was completely transferred to the DBS server. Something interrupted the transfer - cherrypy or something else? I don't know what it was, but skipping the FE does not seem likely to help us.

If we had only one block and a few files, we could try to remove all the lumis from the files, upload the block without file-lumis, and afterwards manually insert the file-lumis via SQL. However, we have about 15 blocks and hundreds of files, and doing that is very dangerous: we could mess up prod DBS.
Sorry that I don't have a solution here.
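
Just to illustrate the "upload without file-lumis" idea (not something to actually run against prod DBS), a sketch that strips the per-file lumis from a block dump and keeps them aside for a later manual insertion; it assumes the bulkblocks JSON stores them as files[*]['file_lumi_list']:

import copy
import json

with open("dbsuploader_block.json") as fd:  # placeholder dump file
    block = json.load(fd)

slimBlock = copy.deepcopy(block)
strippedLumis = {}  # kept aside for the later manual/SQL insertion of file-lumis
for fileInfo in slimBlock.get("files", []):
    strippedLumis[fileInfo["logical_file_name"]] = fileInfo.get("file_lumi_list", [])
    fileInfo["file_lumi_list"] = []

print("payload without file-lumis: %.1f MB" % (len(json.dumps(slimBlock)) / 1024.0 / 1024.0))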

@amaltaro
Contributor Author

@yuyiguo we had to patch one of these agents to confirm it was indeed failing in the initial stage of this HTTP request, rather than at the usual 5min timeout. Here is a new log from submit4:

2022-03-29 08:13:50,926:139667935037184:INFO:DBSUploadPoller:Queueing block for insertion: /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713-a5b3-5cb46a3fb807
2022-03-29 08:13:53,473:139668113659712:INFO:DBSUploadPoller:About to call insert block for: /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713-a5b3-5cb46a3fb807
2022-03-29 08:14:03,073:139668113659712:ERROR:DBSUploadPoller:Error trying to process block /BcToNMuX_NToEMuPi_SoftQCD_b_mN5p5_ctau1p0mm_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18pLHE-102X_upgrade2018_realistic_v11-v2/LHE#eb5fcbc5-e8c8-4713-a5b3-5cb46a3fb807 through DBS. Error: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Bad Gateway</title>

so yes, our request apparently doesn't even reach the backend (though this should be confirmed by matching the frontend logs against the DBSWriter ones).

For the P&R team, my suggestion would be to actually abort those workflows, clone them, and assign the new ones with one special parameter:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1168

So, at assignment time, we could set BlockCloseMaxFiles to only 2 files (instead of the default 500). Two files should be good enough to stay under 0.5M lumis (thus a payload < 50MB).
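
As an illustration, the assignment could carry something along these lines; only BlockCloseMaxFiles matters for this issue, the other keys and values are placeholders for whatever the clones are actually assigned with:

# assignment-time arguments for the cloned workflow (values are placeholders,
# except BlockCloseMaxFiles which is the knob discussed above)
assignArgs = {
    "Team": "production",
    "SiteWhitelist": ["T1_US_FNAL"],
    "AcquisitionEra": "RunIIFall18pLHE",
    "BlockCloseMaxFiles": 2,  # default is 500; 2 files keeps these blocks under ~0.5M lumis
}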

@yuyiguo
Member

yuyiguo commented Mar 29, 2022

@amaltaro Thanks for patching the code to get the debugging info.
I am wondering if we can set a maximum number of lumis per block, besides the maximum number of files per block?

@amaltaro
Contributor Author

@yuyiguo Yuyi, we have this issue to track a proper development and fix: #10264

Unless we decide to remove it from our Q2 plans, it should get fixed in the coming months.

@amaltaro
Contributor Author

The original workflows have been rejected, the output blocks were marked as "injected" in WMAgent - such that DBS3Upload stops trying to inject those blocks/files into the DBS server - and new clone workflows with a tweaked number of files per block are on their way.

The real and permanent fix still needs to be addressed in #10264, but this specific issue can now be closed IMO. Please reopen it if I missed anything else important.
