Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better connection lost handling #33

Merged
merged 2 commits into from
Nov 27, 2024
Merged

Better connection lost handling #33

merged 2 commits into from
Nov 27, 2024

Conversation

Adesin-fr
Copy link
Contributor

There are some issues with the current version :

  • Before this, if there were no internet connection on first connect, it would fail and did not retry. That use-case is now handled.
  • After a connection failure, channels were not re-subscribed. That is now fixed.
  • The "Ping" interval was triggered everytime we did tried to connect, even without success. So if we tried to connect 40x, we would have 40 "setInterval" triggered, and 40x ping/minute , instead of one !

That's a bit more production-ready now :)

ludovic added 2 commits November 27, 2024 08:55
- Before this, if there were no internet connection on first connect, it would fail and did not retry. That use-case is now handled.
- After a connection failure, channels were not re-subscribed. That is now fixed.
- The "Ping" interval was triggered everytime we did tried to connect, even without success. So if we tried to connect 40x, we would have 40 "setInterval" triggered, and 40x ping/minute , instead of one !

That's a bit more production-ready now :)
@Adesin-fr
Copy link
Contributor Author

@leob I think you should have a look at this ;)
Since I'm using this in a 24/24 connected machine, I had to check it would survive rude conditions.
And it seems some of theme weren't ready enough for prod.
So I fixed them !

My WIFI ON/OFF switch had a hard time, being switched back-forth for a while ;)

@leob
Copy link
Collaborator

leob commented Nov 27, 2024

@leob I think you should have a look at this ;) Since I'm using this in a 24/24 connected machine, I had to check it would survive rude conditions. And it seems some of theme weren't ready enough for prod. So I fixed them !

My WIFI ON/OFF switch had a hard time, being switched back-forth for a while ;)

@Adesin-fr Thanks, will check it out !

@leob leob merged commit 5474b52 into georgeboot:master Nov 27, 2024
2 of 4 checks passed
@leob
Copy link
Collaborator

leob commented Nov 27, 2024

@Adesin-fr Okay, I've merged this PR and published a new version to npmjs.

Then I went on to create a new build of our app, and deploy and test it - but unfortunately it looks like the web sockets functionality in our app is more or less broken now, I'm not seeing any "notifications" (messages) being delivered over our private channels anymore - nothing is coming through ...

Does it work for you?

What I noticed is that I seem to be seeing fewer log messages than I would expect - here's a partial screenshot of my JS console:

image

I do see:

"just set socketId ..."
"Subscribing to ..."
"Received event ..."
"Sending ping"

but I'm not seeing:

"Trying to connect..."
"Connected !"
"... onmessage ..."

Not saying that I should be seeing all of these, but shouldn't I be seeing at least some of them?

Can you debug this a bit? From the testing that I did with our app this seems to have more or less fallen apart now - maybe the solution turns out to be something very simple, or even something which we need to change in our app, but right now I don't really have a clue (but I admit that I didn't really debug it yet either).

(by the way - the only way that I know of to debug this stuff is to go through a full AWS/lamba/app deployment, which makes iterating/debugging this really slow and painful ... not sure if there's a "better" way, I guess not)

I also wonder if our "Websocket" implementation needs this much "specific" code, while all we really do is piggyback on top of the standard Laravel Echo protocol - we might be reinventing the wheel a bit here, although I also don't know if there's a better way.

P.S. if you want to work on this, please pull master first - I've cleaned up the code a little bit by adding semicolons where they were missing - the JS code in this project was consistently using semicolons, but we now ended up with a mix of "semicolon" and "no semicolon" - I've added the semicolons where they were missing, please let's keep it consistent ...

Another thing I did was to use a "logging prefix" for the console.log statements - I'm now putting the string [AG-WS] (short for "Api Gateway Web Sockets") in front of the log messages, so that I can see which log messages originate from this lib, rather than from our app ...

@Adesin-fr
Copy link
Contributor Author

Adesin-fr commented Nov 27, 2024

First, thanks for having merged this.

That's strange since I don't think it should not break private channels ...

To see "trying to connect" messages and others , you must enable debug mode on the Echo instruction (remember the last dockbloc of the readme ?)

I'll have a look at this with private channels later today :)

PS : have you checked the auth requests on your laravel app ? They must succeed and give back an auth key, that is then used for the websocket things...
If your auth requests are failing (and not giving back auth key), the private features will fail silently.
I remember having a hard time with this , I'll dig a bit more later, I have another project that also use private channels and this lib !

@leob
Copy link
Collaborator

leob commented Nov 27, 2024

First, thanks for having merged this.

That's strange since I don't think it should not break private channels ...

To see "trying to connect" messages and others , you must enable debug mode on the Echo instruction (remember the last dockbloc of the readme ?)

I'll have a look at this with private channels later today :)

@Adesin-fr Yes, I did enable the debug option, otherwise I wouldn't have seen the messages which I saw ... so, I do see some messages, but I'm not seeing all of the messages that I would expect to see :)

Thanks for looking at it! If you can't find anything then let me know, and I'll put a bit more effort into debugging it ...

@leob
Copy link
Collaborator

leob commented Nov 27, 2024

@Adesin-fr P.S. please don't forget to pull master before you start working on it - I've pushed a commit 1 minute ago ...

By the way, I still see some Github jobs failing, e.g. one of the Jest tests - don't know whether that's relevant or not ... but for now I'd say the failing Github jobs are low priority :)

@Adesin-fr
Copy link
Contributor Author

I've tried to have a look.
I'd like to push my project to a staging env with the 0.5.2 version, but it missed axios (seems you removed it some commits ago !!)

In my project, I'm seeing the "trying to connect messages and every others.
Are you sure you correctly imported Echo in your project ?

@leob
Copy link
Collaborator

leob commented Nov 28, 2024

I've tried to have a look. I'd like to push my project to a staging env with the 0.5.2 version, but it missed axios (seems you removed it some commits ago !!)

In my project, I'm seeing the "trying to connect messages and every others. Are you sure you correctly imported Echo in your project ?

@Adesin-fr I've merged your new PR to bring Axios back - I saw that orginally the Axios import wasn't there, then you added it in your previous PR, and I then deleted it again - but I don't know why I deleted it ... no, actually I do know why now - it was because of a build error which I was getting - see comment in the other PR ...

But okay, I don't know why you do see those log messages, while I don't see them!

I'm going to debug the problem, I think I'm going to add more log statements, including in the PHP code, so that I can see exactly what's going on ... probably not today, but maybe tomorrow - work in progress!

P.S. we do have "echo" in our app's package.json:

"laravel-echo": "^1.14.0"

but it's a different (slightly newer) version than the one required by laravel-echo-api-gateway (which has "laravel-echo": "^1.10.0") ...

@leob
Copy link
Collaborator

leob commented Dec 7, 2024

@Adesin-fr I was trying to debug my websocket issues - what was just baffling was that when I opened our app in two browsers (Chrome and Firefox) and logged in in both browser windows, I could send a message (via websockets) from "Chrome" to "Firefox", but not the other way around (from "Firefox" to "Chrome") !

Debugging it was an ordeal (I had a hard time to even log anything to CloudWatch), but what I did see was this - a "410 Gone" exception from API Gateway:

[2024-12-07 10:57:01] DEBUG: Preparing to send to connections for channel .........
[2024-12-07 10:57:01] DEBUG: Sending message to connection 'CascQeh7oAMCJLw='
[2024-12-07 10:57:01] DEBUG: Sending message to connection 'K8gVIdmDIAMCLIA='
[2024-12-07 10:57:02] ERROR:  MESSAGE ###### MessageService@sendNotification - exception - error: Error executing "PostToConnection" on "https://..............."; AWS HTTP error: Client error: `POST https://...........` resulted in a `410 Gone` response Unable to parse error information from response - Error parsing JSON: Syntax error[#0 /var/task/vendor/aws/aws-sdk-php/src/WrappedHttpHandler.php(98): Aws\WrappedHttpHandler->parseError()

So, sending the message to the first connection worked, but sending it to the second connection failed with a "410 Gone" ...

And that gave me an idea - "let's look in the Connections table in DynamoDB" - and bingo! When I cleared that table, things started to work ...

Pretty clear that there are sometimes "stale" connections left behind in that table - which is strange, because we DO have this piece of code in the catch block in ConnectionRepository::sendMessage:

if ($e->getAwsErrorCode() === 'GoneException') {
  $this->subscriptionRepository->clearConnection($connectionId);

  return;
}

So the idea is that in case of an HTTP 410 Gone" it SHOULD remove that connection from the DynamoDB table ... strange that that doesn't always seem to work.

This still looks like a potential weakness - the fact that the Connections table can contain "stale" connections ... I mean, in theory "it should just work", but it feels somewhat fragile (potentially).

P.S. I was also struggling with getting any logging to work - I cloned the repo (laravel-api-gateway) and then I made changes locally (I added log statements in the PHP code), and I linked to the local directory in my app's composer.json, like this:

    "repositories": [
      {
          "type": "path",
          "url": "../laravel-echo-api-gateway"
      }
    ],

and then with:

"georgeboot/laravel-echo-api-gateway": "dev-master",

but it seemed to just refuse to pick up the dependency from my local folder - I did NOT get to see any of the logging that I added in the PHP code - somehow it did not seem to use the "local" version, but the version from Github ... ?

And to top it all off I had issues with the AWS S3 file adapter because of the upgraded aws-sdk-php version - it threw this exception when trying to fetch a file from S3:

Invalid configuration value provided for "token". Expected Aws\Token\TokenInterface|Aws\CacheInterface|array|bool|callable, but got string

Apparently that requires a PHP fix/upgrade:

aws/aws-sdk-php#2567 (comment)
laravel/framework#44979

@Adesin-fr
Copy link
Contributor Author

Well I did not yet experienced the 410 errors !

We'll surely need to address it soon !

@leob
Copy link
Collaborator

leob commented Dec 7, 2024

Well I did not yet experienced the 410 errors !

We'll surely need to address it soon !

@Adesin-fr I've seen these "410" errors more than once in our app, but today I finally found out what causes it: "stale" connections in the DynamoDB "Connections" table! That should never happen, but apparently it did ... clearing the table solved the issue, but if it happens again then we should indeed address it.

@leob
Copy link
Collaborator

leob commented Dec 7, 2024

Well I did not yet experienced the 410 errors !

We'll surely need to address it soon !

@Adesin-fr Look at this change from another PR (#25):

image

Maybe this change makes it more robust - checking for

e->getStatusCode() === Response::HTTP_GONE || $e->getAwsErrorCode() === 'GoneException'

instead of ONLY for e->getAwsErrorCode() === 'GoneException' ...

This could be a reason why in some cases the "stale" connection isn't getting cleared ... I've added this to the latest release.

@leob
Copy link
Collaborator

leob commented Dec 7, 2024

@Adesin-fr So, after upgrading my app to the latest version of laravel-echo-api-gateway (with the import axios statement), I was getting this:

image

But, it was easy to fix - I had to remove this line from my frontend app's initialization code:

window.axios.defaults.headers.common['Authorization'] = 'Bearer ' + token;

and instead I had to pass the bearerToken option to Echo:

window.Echo = new Echo({
  authEndpoint: ....................,
  bearerToken: token,
  broadcaster,
  host,
  debug: isStaging === true
});

Works now! And we're no longer using yarn, only npm ... productive day today :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants