Skip to content
This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Widespread timeouts in Edge 17 #563

Open
foolip opened this issue May 17, 2018 · 19 comments
Open

Widespread timeouts in Edge 17 #563

foolip opened this issue May 17, 2018 · 19 comments

Comments

@foolip
Copy link
Member

foolip commented May 17, 2018

Edge results have always seemed to have a lot of TIMEOUT in the "Test file" row. Browsing around at random it's easy find cases for Edge 17 as well:
https://wpt.fyi/results/accelerometer/Accelerometer-enabled-by-feature-policy-attribute-redirect-on-load.https.html?sha=ad84337c03
https://wpt.fyi/results/compat/interfaces.any.html?sha=ad84337c03
https://wpt.fyi/results/wake-lock/wakelock-cancel-twice.https.html?sha=ad84337c03

@jugglinmike, as a start, could you count how many tests are like this, where only Edge has a harness-level timeout?

@gsnedders
Copy link
Member

Do we know if this is Sauce specific or happens with native runs?

@foolip
Copy link
Member Author

foolip commented Jul 19, 2018

@thejohnjansen have you been able to compare the results from Sauce with your own runs yet?

@foolip
Copy link
Member Author

foolip commented Aug 8, 2018

@foolip
Copy link
Member Author

foolip commented Aug 17, 2018

@foolip
Copy link
Member Author

foolip commented Aug 17, 2018

As a point of comparison, the same tests from the experimental Edge 18 run:
https://wpt.fyi/results/accelerometer/Accelerometer-enabled-by-feature-policy-attribute-redirect-on-load.https.html?sha=548797a187
https://wpt.fyi/results/compat/interfaces.any.html?sha=548797a187 (missing, not sure why)
https://wpt.fyi/results/wake-lock/wakelock-cancel-twice.https.html?sha=548797a187

The tests also run and have results when testing Edge 17 manually, so I'm quite sure this isn't a problem with Edge, but with the running infrastructure.

@foolip
Copy link
Member Author

foolip commented Aug 17, 2018

https://wpt.fyi/results/?product=edge@085d47ef10&product=edge@548797a187&diff compares two runs that are close in their wpt revisions. It's a bit hard to navigate (web-platform-tests/wpt.fyi#411) but here's one directory with a lot of Ede 17 timeouts which are probably spurious:
https://wpt.fyi/results/dom/events?product=edge@085d47ef10&product=edge@548797a187&diff=true

@foolip
Copy link
Member Author

foolip commented Oct 16, 2018

This issue became apparent again today on a web-platform-tests that fail only in Chrome (from wpt.fyi data) thread on blink-dev.

@foolip
Copy link
Member Author

foolip commented Oct 19, 2018

This may be the most extreme bad run I've spotted so far: https://wpt.fyi/results/?sha=5acd3bcf66

0 / N almost down the line. @jugglinmike, perhaps the setup has become totally broken somehow?

@lukebjerring, do you have an issue for "validate results"? Seems like with that in place we would reject a run like this, or label it bad at least.

@foolip
Copy link
Member Author

foolip commented Oct 19, 2018

This appears to be the most complete of recent runs:
https://wpt.fyi/results/?sha=937ed3c7b1

https://wpt.fyi/results/?sha=99e577f57c is another really bad one, and before that I can't find any that are this busted going back as far as 434ca47448. So this appears to be a recent regression.

@jugglinmike
Copy link
Collaborator

Well, the builds certainly aren't completing any faster. They're just slow and wrong now.

We're seeing three different kinds of errors. The first two:

▶ ERROR [expected OK] /css/css-scoping/slot-non-html-display-value.html
└   → Message: Unable to get property 'addEventListener' of undefined or null reference (WARNING: The server did not provide any stacktrace information)

▶ ERROR [expected OK] /websockets/interfaces/WebSocket/bufferedAmount/bufferedAmount-large.html
└   → Message: Unable to get property 'addEventListener' of undefined or null reference (WARNING: The server did not provide any stacktrace information)

...are always followed by a timeout, e.g.

Command duration or timeout: 390 milliseconds
Build info: version: '2.52.0', revision: '4c2593c', time: '2016-02-11 19:06:42'
System info: host: 'sauce-win10', ip: '10.100.19.211', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_131'
Driver info: org.openqa.selenium.edge.EdgeDriver
Capabilities [{applicationCacheEnabled=true, InPrivate=false, pageLoadStrategy=normal, platform=ANY, acceptSslCerts=true, browserVersion=42.17134.1.0, platformVersion=10, locationContextEnabled=true, webStorageEnabled=true, browserName=MicrosoftEdge, takesScreenshot=true, takesElementScreenshot=true, platformName=windows}]
Session ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXE993DD85
Stacktrace:
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0 (None:-2)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance (None:-1)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance (None:-1)
    at java.lang.reflect.Constructor.newInstance (None:-1)
    at org.openqa.selenium.remote.ErrorHandler.createThrowable (ErrorHandler.java:206)
    at org.openqa.selenium.remote.ErrorHandler.throwIfResponseFailed (ErrorHandler.java:158)
    at org.openqa.selenium.remote.RemoteWebDriver.execute (RemoteWebDriver.java:678)
    at org.openqa.selenium.remote.RemoteWebDriver.executeAsyncScript (RemoteWebDriver.java:598)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (None:-2)
    at sun.reflect.NativeMethodAccessorImpl.invoke (None:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (None:-1)
    at java.lang.reflect.Method.invoke (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver$2.invoke (EventFiringWebDriver.java:103)
    at com.sun.proxy.$Proxy1.executeAsyncScript (None:-1)
    at org.openqa.selenium.support.events.EventFiringWebDriver.executeAsyncScript (EventFiringWebDriver.java:229)
    at org.openqa.selenium.remote.server.handler.ExecuteAsyncScript.call (ExecuteAsyncScript.java:58)
    at java.util.concurrent.FutureTask.run (None:-1)
    at org.openqa.selenium.remote.server.DefaultSession$1.run (DefaultSession.java:176)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (None:-1)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (None:-1)
    at java.lang.Thread.run (None:-1)

The third kind of error is far more common but far less helpful:

▶ ERROR [expected OK] /websockets/interfaces/WebSocket/bufferedAmount/bufferedAmount-readonly.html
└   → 1

It is never followed by a timeout.

As you can see, the logs are now formatted with the "GroupingFormatter" because that was recently made the default behavior of the WPT CLI. As mentioned there, this format is unideal for continuous integration because it limits the amount of information available in situations like these.

I have reinstated the previous default logging format for this project. This won't fix anything, but it should help us in future debugging attempts. I've updated WPT so that I'm notified of changes to WPT's tools directory, so hopefully I can stay on top of things like this in the future.

In the mean time, I have also updated to the latest release of the "Sauce Connect Binary" in the off chance that "Fixed file descriptor leak." has some bearing here.

@foolip
Copy link
Member Author

foolip commented Oct 19, 2018

Thanks for investigating Mike! Do you think it'd be worth trying more aggressive restarting, and perhaps more chunks, to mitigate breakage like this?

@jugglinmike
Copy link
Collaborator

Do you think it'd be worth trying more aggressive restarting

The fact that this immediately interferes with almost every "chunk" makes me think that restarting will not resolve it, but it couldn't hurt to try (especially now that we're disabling Safari on Sauce Labs). We'll see what effect this has on time-to-results (it may get ugly).

(I was ready to point out that these are "ERROR"s and not "FAILURE"s, but then I remembered that doesn't actually matter.)

perhaps more chunks

For Edge, we're currently running WPT in 100 "chunks" which already feels a bit excessive. I can give it a shot if we run out of ideas, though.

@foolip
Copy link
Member Author

foolip commented Oct 19, 2018

Oh, if it's already 100 chunks, then scratch that idea. Seems like something is broken with Sauce or the sauce support in ./wpt run then?

@jugglinmike
Copy link
Collaborator

We've completed a collection from Edge that involved restarting after every test failure. This increased build time from 15 hours to 24 hours, but it had no appreciable affect on the veracity of the results.

Strangely, I haven't been able to connect to Sauce Labs locally. I'm not going to prioritize fixing that yet for a few reasons:

  • This is distinct from the behavior we're seeing on the workers
  • We've seen our share of roving DNS issues, so this may resolve itself in time

In the interim, I can do ad-hoc testing from any of the worker machines; that's probably more authentic, anyway. ...just not while they're actively testing. Although Edge is now running 24/7, 10 of our machines do not run Sauce Labs tests, so this is still a viable option.

@jugglinmike
Copy link
Collaborator

The problem comes from the way cross-realm objects are interpreted by Edge/Edgedriver. The easiest way to see this is by comparing the results of the WebDriver "execute script" command which collects test results from the browser.

Excerpt from logs of a test conducted on 2018-10-14
{
  "screenshot": null,
  "between_commands": 0.1099998950958252,
  "suggestion_values": [],
  "request": {
    "args": [],
    "script": "var callback = arguments[arguments.length - 1];\nwindow.testdriver_callback = callback;\nwindow.process_next_event();\n"
  },
  "HTTPStatus": 200,
  "result": [
    "/css/css-typed-om/the-stylepropertymap/properties/white-space.html",
    "complete",
    [
      0,
      null,
      null,
      [
        [
          "'white-space' property",
          1,
          "'CSSKeywordValue' is not defined",
          "ReferenceError: 'CSSKeywordValue' is not defined\n   at Global code (http://web-platform.test:8000/css/css-typed-om/the-stylepropertymap/properties/resources/testsuite.js:23:1)"
        ]
      ]
    ]
  ],
  "suggestion": null,
  "duration": 0.015000104904174805,
  "path": "execute_async",
  "method": "POST",
  "statusCode": 0,
  "timelineSeconds": 9.593000173568726
}
Excerpt from logs of a test conducted on 2018-10-17
{
  "screenshot": null,
  "between_commands": 0.2969999313354492,
  "suggestion_values": [],
  "request": {
    "args": [],
    "script": "var callback = arguments[arguments.length - 1];\nwindow.opener.testdriver_callback = callback;\nwindow.opener.process_next_event();\n"
  },
  "HTTPStatus": 200,
  "result": {
    "0": "/speech-api/historical.html",
    "1": "complete",
    "2": {
      "0": 0,
      "1": null,
      "2": null,
      "3": {
        "0": {
          "0": "SpeechRecognitionError interface should not exist",
          "1": 0,
          "2": null,
          "3": null
        },
        "1": {
          "0": "webkitSpeechGrammar interface should not exist",
          "1": 0,
          "2": null,
          "3": null
        },
        "2": {
          "0": "webkitSpeechGrammarList interface should not exist",
          "1": 0,
          "2": null,
          "3": null
        },
        "3": {
          "0": "webkitSpeechRecognition interface should not exist",
          "1": 0,
          "2": null,
          "3": null
        },
        "4": {
          "0": "webkitSpeechRecognitionError interface should not exist",
          "1": 0,
          "2": null,
          "3": null
        },
        "5": {
          "0": "webkitSpeechRecognitionEvent interface should not exist",
          "1": 0,
          "2": null,
          "3": null
        }
      }
    }
  },
  "suggestion": null,
  "duration": 0.25,
  "path": "execute_async",
  "method": "POST",
  "statusCode": 0,
  "timelineSeconds": 53.45300006866455
}

This has interfered with results collection because the wptrunner expects to receive a Python list, but the structure that comes from Edge is deserialized into a dict. (We were seeing an error message of simply, "1" because of an oversight in wptrunner's error handling code.)

All of the logging data provided by Sauce Labs indicates that the same build of Edge and Edgedriver were used in both cases:

And I couldn't find any relevant notes in the Sauce Labs product announcement blog or on Edgedriver's website.

It turns out that this was actually caused by a change to WPT: web-platform-tests/wpt#13419 . I've submitted a fix that will accomodate Edge's non-standard behavior and a separate patch with WebDriver specification tests.

@jugglinmike
Copy link
Collaborator

We landed a short-term fix for this in WPT:

web-platform-tests/wpt#13938

Since that was merged today, the change will not influence collection until tomorrow. I'll leave this open until that can be verified (likely not until Thursday).

@foolip
Copy link
Member Author

foolip commented Nov 8, 2018

Still no results on wpt.fyi, what is the status of the Edge run that started after the fix?

@jugglinmike
Copy link
Collaborator

Good news first: we successfully collected from Edge, the data appears to be valid, and the restart-enabled attempt took less than 24 hours.

The reason you don't see this on the frontpage of https://wpt.fyi is that the corresponding collection for Chrome failed. If it's not one browser...

I've manually triggered a retry, and filed gh-625 to track this latest instability.

@foolip
Copy link
Member Author

foolip commented Nov 8, 2018

71fe0a6c3d was the daily commit, and Taskcluster got started for it:
https://tools.taskcluster.net/groups/FJTQgT1GQ1qVCsAwYXVKzg

But it failed, I've filed web-platform-tests/wpt#13989.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants