VoLTE vs NATPING #53

Open
dchard opened this issue Dec 5, 2024 · 8 comments

@dchard

dchard commented Dec 5, 2024

Dear @herlesupreeth

I finished setting up VoLTE with Open5GS. Calls can be made successfully within the network, and both AMR-WB and EVS work just fine.

However, I noticed that a SIP OPTIONS request is sent every 5 seconds to all VoLTE clients. This is bad for multiple reasons:

  1. VoLTE is always routed, so you don't need NAT keepalive for it to work.
  2. This constant signalling traffic stops the UEs from going idle; no live network does this.
  3. It also triggers weird loss-of-service situations on dual-SIM phones, where the second SIM is used for the internal VoLTE network while the primary SIM is used for a commercial one. After a couple of hours of idling, one of my phones constantly loses service (only on VoLTE; the default PDN is fully functional).

So I commented out these lines in pcscf.cfg:

##!define WITH_NAT
##!define WITH_NATPING

Now, for more than a day, even the previously "lost" phone has been working properly, and all the VoLTE clients can go to idle mode because there is no unnecessary constant signalling traffic.

Of course, if this P-CSCF is to be used for VoWiFi, that is a different situation, as VoWiFi does need frequent connection keepalives to be able to pass through NAT. I wonder whether there is a way to disable keepalive for VoLTE and enable it for VoWiFi connections on the same P-CSCF, or whether the solution is to run a separate P-CSCF for VoWiFi...

Using Kamailio 5.8.4.

MOD: I have to correct myself: even on VoWiFi this NATPING is not needed on the P-CSCF side. The UE's IPsec traffic terminates on the ePDG, which can keep the link up with DPD, and the IPsec link itself handles NAT traversal, so this is transparent to the IMS traffic inside the tunnel.

@herlesupreeth
Owner

> However, I noticed that a SIP OPTIONS request is sent every 5 seconds to all VoLTE clients

I realize it's bad in terms of draining the UE's battery, but I had it in place to ensure that SIP signaling messages are exchanged reliably and do not get lost in the transition between IDLE and CONNECTED. Also, it can easily be disabled in pcscf.cfg, so one can tailor it to their deployment needs.

> VoLTE is always routed, so you don't need NAT keepalive for it to work.

I am not sure whether I understood your point here, but in any case VoLTE SIP signaling is not NATed in mobile networks.

> It also triggers weird loss-of-service situations on dual-SIM phones, where the second SIM is used for the internal VoLTE network while the primary SIM is used for a commercial one. After a couple of hours of idling, one of my phones constantly loses service (only on VoLTE; the default PDN is fully functional).

I haven't tested a dual-SIM phone, so I am not aware of this issue.

> ##!define WITH_NAT
> ##!define WITH_NATPING

I agree that the naming is weird and misleading: it suggests a ping for NATed clients only, which is not the case. In general it is SIP OPTIONS based pinging, which is usually needed when calling between SIP clients behind a NAT.

> Of course, if this P-CSCF is to be used for VoWiFi, that is a different situation, as VoWiFi does need frequent connection keepalives to be able to pass through NAT. I wonder whether there is a way to disable keepalive for VoLTE and enable it for VoWiFi connections on the same P-CSCF, or whether the solution is to run a separate P-CSCF for VoWiFi...

To do that, you can enable frequent keepalives for VoWiFi clients only, by checking their sip.P-Access-Network-Info SIP header and adding them to the to-be-pinged list maintained in the script only if the access is WLAN.
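The check described above could look roughly like the following. This is only a Python sketch of the decision logic, not Kamailio script: the function name and the sample header values are made up for illustration, and it assumes that WLAN access types in P-Access-Network-Info start with "IEEE-802.11", while LTE shows up as something like "3GPP-E-UTRAN-FDD".

```python
def is_wlan_access(pani_value: str) -> bool:
    """Return True if a P-Access-Network-Info value names a WLAN access type.

    Hypothetical helper: the access type is taken to be the first token,
    before any ';'-separated parameters.
    """
    access_type = pani_value.split(";", 1)[0].strip().upper()
    return access_type.startswith("IEEE-802.11")


# Illustrative header values only (not taken from a real trace).
wifi_client = "IEEE-802.11;i-wlan-node-id=aabbccddeeff"
lte_client = "3GPP-E-UTRAN-FDD;utran-cell-id-3gpp=0010100010000101"

# Only the WLAN client would be added to the to-be-pinged list.
to_be_pinged = [c for c in (wifi_client, lte_client) if is_wlan_access(c)]
```

In the real P-CSCF the same test would have to be expressed in the Kamailio routing script at registration time.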

@dchard
Author

dchard commented Dec 7, 2024

> I realize it's bad in terms of draining the UE's battery, but I had it in place to ensure that SIP signaling messages are exchanged reliably and do not get lost in the transition between IDLE and CONNECTED. Also, it can easily be disabled in pcscf.cfg, so one can tailor it to their deployment needs.

Short unstable periods caused by state transitions (or anything else radio-induced) should be handled through some sort of retry mechanism, not via constant pinging :-) Something simple might work: try the first time with a short timeout, then a bit longer, then one last time with an even longer one. Something like 200 ms, 500 ms, 1000 ms. I am not expecting you to incorporate some elaborate adaptive retransmission scheme here :-)
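A minimal sketch of that idea, in Python rather than Kamailio script: three attempts with the 200 ms, 500 ms and 1000 ms timeouts mentioned above. The `send` callback and its "return None on timeout" convention are assumptions made for this illustration; they are not an existing Kamailio or SIP stack API.

```python
from typing import Callable, Optional, Sequence


def send_with_retries(
    send: Callable[[float], Optional[str]],
    timeouts_s: Sequence[float] = (0.2, 0.5, 1.0),
) -> Optional[str]:
    """Try a request up to len(timeouts_s) times, waiting longer each time.

    `send` is a hypothetical transport callback: it transmits the request,
    waits up to the given timeout and returns the reply, or None on timeout.
    """
    for timeout in timeouts_s:
        reply = send(timeout)
        if reply is not None:
            return reply
    # Every attempt timed out; the caller decides what to do next.
    return None
```

A UE that needs a moment to leave idle would simply answer one of the later attempts, without any periodic OPTIONS traffic in between.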

> I am not sure whether I understood your point here, but in any case VoLTE SIP signaling is not NATed in mobile networks.

That is what I meant: the signalling traffic between the UE IMS client and the P/S/I-CSCF. When there is a call, there is constant traffic anyhow :-)

> > It also triggers weird loss-of-service situations on dual-SIM phones, where the second SIM is used for the internal VoLTE network while the primary SIM is used for a commercial one. After a couple of hours of idling, one of my phones constantly loses service (only on VoLTE; the default PDN is fully functional).
>
> I haven't tested a dual-SIM phone, so I am not aware of this issue.

This may be due to the nature of DSDS operation. If the VoLTE SIM for Open5GS is in the second slot, and the primary slot is occupied and actively used, this can happen. Maybe we should check whether Open5GS does hourly TAUs, as it can happen (although it shouldn't) that a long-idling GTP-U tunnel gets lost, especially if there is heavy mobility within the tracking area. It could have been a simple UE fault as well, although since I turned the pinging off it works perfectly well. Even with very long idling sessions I have not seen any issues. Last time I intentionally waited 24 hours between two calls; in between there were only the hourly SIP session renewals.

> I agree that the naming is weird and misleading: it suggests a ping for NATed clients only, which is not the case. In general it is SIP OPTIONS based pinging, which is usually needed when calling between SIP clients behind a NAT.

That is clear.

> To do that, you can enable frequent keepalives for VoWiFi clients only, by checking their sip.P-Access-Network-Info SIP header and adding them to the to-be-pinged list maintained in the script only if the access is WLAN.

As I explained, I don't think this is needed for VoWiFi either. The main reason is that the IPsec link between UE <--> ePDG can be "kept alive" (from a NAT point of view) with DPD, unless NAT is also expected between the ePDG and the P/I/S-CSCFs. Good to know about the sip.P-Access-Network-Info option, though.

@herlesupreeth
Owner

> As I explained, I don't think this is needed for VoWiFi either. The main reason is that the IPsec link between UE <--> ePDG can be "kept alive" (from a NAT point of view) with DPD, unless NAT is also expected between the ePDG and the P/I/S-CSCFs. Good to know about the sip.P-Access-Network-Info option, though.

Thanks for confirming this. Then I can safely disable the SIP OPTIONS ping. One question, though: could the keepalive be ePDG implementation specific?

@dchard
Author

dchard commented Dec 10, 2024

> Thanks for confirming this. Then I can safely disable the SIP OPTIONS ping. One question, though: could the keepalive be ePDG implementation specific?

If the IPsec traffic of VoWiFi terminates on the ePDG (which is not part of the IMS), then yes, we can disable it. And no, DPD is part of the IPsec base standard, so every ePDG should support it. Based on my experience, it should be set to 10 seconds or so, as most routers have a rather short NAT timeout for UDP sessions. I am not sure whether Kamailio can act as an ePDG (I mean as a separate node, not part of the P/S/I-CSCF), but if it can, DPD should be configured there.

I think we need to revise other parts as well in terms of timers. Let me elaborate:

I traced a VoLTE session of a large commercial provider. There is no SIP keepalive in there at all (as expected), the validity time is 7200 seconds, and the SIP session renewal happens 12 minutes before that. I think we should aim for that, or half of that: 3600 seconds and renewal 6 minutes before expiry. In a test lab, 3600 seconds is maybe the better choice. But this affects a lot of other parameters:
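To make the arithmetic explicit, here is a small Python sketch using only the numbers quoted above. The function name is made up for illustration; it is not a Kamailio parameter.

```python
def renewal_point_s(validity_s: int, lead_s: int) -> int:
    """Seconds after registration at which the renewal is expected."""
    if not 0 < lead_s < validity_s:
        raise ValueError("lead time must be positive and shorter than the validity")
    return validity_s - lead_s


# Observed commercial network: 7200 s validity, renewal 12 minutes before expiry.
commercial = renewal_point_s(7200, 12 * 60)

# Proposed lab setting: half of both values.
lab = renewal_point_s(3600, 6 * 60)

print(commercial, lab)  # 6480 3240
```

In both cases the lead time is 10% of the validity, so halving the validity halves the renewal point as well.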

What I did (besides disabling the NATPING and NAT parts) is this:

tcp_connection_lifetime is changed to 3630 seconds on all 3 nodes (a bit longer than the session validity timer).

  • P-CSCF:

modparam("ims_registrar_pcscf", "subscription_expires", 3600)
modparam("htable", "htable", "ipsec_clients=>size=8;autoexpire=3600;")
modparam("ims_qos", "rx_auth_expiry", 3600)

  • S-CSCF:

modparam("ims_registrar_scscf", "subscription_default_expires", 3600)
modparam("ims_registrar_scscf", "subscription_min_expires", 3600)
modparam("ims_registrar_scscf", "subscription_max_expires", 3600)

What is unclear here is the "ims_auth" section:

modparam("ims_auth", "max_nonce_reuse", 20)
modparam("ims_auth", "auth_vector_timeout", 60)
modparam("ims_auth", "auth_data_timeout", 60) # was 600000
modparam("ims_auth", "auth_used_vector_timeout", 300) # was 600000

Is max_nonce_reuse = 20 correct? I can't find anything specific about nonce reuse in the RFC. I will try to find out whether the commercial operators use this and, if so, with what configuration.

  • I-CSCF:

Nothing changed here; everything was already at the Kamailio default.
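Since so many values have to move together, a quick consistency check can help. The Python sketch below encodes one reading of the list above: the TCP connection lifetime should exceed the session validity, and every other listed timer should equal it. That rule is an assumption drawn from the values in this comment, not a documented Kamailio requirement, and the dictionary keys are just labels.

```python
SESSION_VALIDITY_S = 3600

# Values as listed in this comment; keys are descriptive labels only.
TIMERS_S = {
    "tcp_connection_lifetime": 3630,
    "ims_registrar_pcscf.subscription_expires": 3600,
    "htable.ipsec_clients.autoexpire": 3600,
    "ims_qos.rx_auth_expiry": 3600,
    "ims_registrar_scscf.subscription_default_expires": 3600,
    "ims_registrar_scscf.subscription_min_expires": 3600,
    "ims_registrar_scscf.subscription_max_expires": 3600,
}


def find_inconsistencies(timers: dict, validity_s: int) -> list:
    """Return a human-readable problem for each timer that breaks the assumed rule."""
    problems = []
    for name, value in timers.items():
        if name == "tcp_connection_lifetime":
            if value <= validity_s:
                problems.append(f"{name}={value} should exceed {validity_s}")
        elif value != validity_s:
            problems.append(f"{name}={value} should equal {validity_s}")
    return problems


print(find_inconsistencies(TIMERS_S, SESSION_VALIDITY_S))  # []
```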

With this configuration, my UE is from time to time unable to make any calls (after hours of idling). I am trying to catch the moment when it happens and compare it with the logs.

Once this is working, it would be nice to connect calls to an actual PBX (I have experience with Asterisk) to have PBX features such as echo test, MOH, etc. My general problem with Kamailio is that tutorials and meaningful (e.g. in-context) documentation are non-existent...

One more question: I am looking at the IMS example files from the Kamailio source, and they are all very old. I can also see that in your branch quite a few parts are heavily modified. Would you care to elaborate on the differences? And which ones should we use: the ones in the Kamailio_IMS_Config branch, the ones in the kamailio branch, or the ones in docker_open5gs?

@herlesupreeth
Owner

> With this configuration, my UE is from time to time unable to make any calls (after hours of idling). I am trying to catch the moment when it happens and compare it with the logs.

I think SIP session renewal is UE dependent (or maybe we are missing a mechanism in IMS to renew the session). The reason I set subscription_expires to that high value is that I was trying to fix a bug raised long ago, where a user said calls were automatically dropped after 3600 seconds (i.e. the UE was not issuing a re-INVITE to renew the session).

> Once this is working, it would be nice to connect calls to an actual PBX (I have experience with Asterisk) to have PBX features such as echo test, MOH, etc. My general problem with Kamailio is that tutorials and meaningful (e.g. in-context) documentation are non-existent...

I can understand. In Kamailio, the focus is less on IMS.

> One more question: I am looking at the IMS example files from the Kamailio source, and they are all very old. I can also see that in your branch quite a few parts are heavily modified. Would you care to elaborate on the differences? And which ones should we use: the ones in the Kamailio_IMS_Config branch, the ones in the kamailio branch, or the ones in docker_open5gs?

I have a branch in my forked kamailio repo that updates the example files in the Kamailio source repo, but I never managed to get it merged. I would recommend using the ones in this repo if you want a working setup. I don't think calling works if you use the examples from the Kamailio source.

Regarding the differences, I can't recollect right now, since I worked on it 4 years ago or so. But I vaguely remember that it had to do with removing IPsec connections, OPTIONS pinging of the UE, and routing SIP requests/replies when IPsec connections are involved.

@dchard
Author

dchard commented Dec 14, 2024

@herlesupreeth

> I think SIP session renewal is UE dependent (or maybe we are missing a mechanism in IMS to renew the session).

Not really. The UE always starts the initial REGISTER with 600000 seconds, but that is not the actual session timeout, just what the UE initially requests. If the network responds with a lower value (which is the case for large commercial operators), that value takes precedence.
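As a rough illustration of that precedence, the Python sketch below picks the expiry a UE would end up using. The function name and the sample Contact value are made up, and falling back to the requested value when the reply carries no "expires" parameter is an assumption of this sketch, not something observed in the traces.

```python
import re


def granted_expiry_s(requested_s: int, reply_contact: str) -> int:
    """Return the expiry granted in the registrar's reply, if it states one.

    Hypothetical helper: it only looks at the "expires" parameter of the
    Contact header value in the 200 OK.
    """
    match = re.search(r";\s*expires\s*=\s*(\d+)", reply_contact, re.IGNORECASE)
    return int(match.group(1)) if match else requested_s


# The UE asks for 600000 s; the network answers with 7200 s, which wins.
print(granted_expiry_s(600000, "<sip:user@192.0.2.10:5060>;expires=7200"))  # 7200
```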

> The reason I set subscription_expires to that high value is that I was trying to fix a bug raised long ago, where a user said calls were automatically dropped after 3600 seconds (i.e. the UE was not issuing a re-INVITE to renew the session).

Yeah, we can't really keep doing that. I checked another commercial operator: they also use 7200 seconds and they also renew 12 minutes before that, exactly like the previous one I checked (both are large international players, not small local-market operators). If you have time, you could also check one in your country with NSG. It would be nice to know how widespread this is.

As you have seen, I am also dealing with this EBI issue in Open5GS, so I can't reliably check long-lasting sessions until that one is fixed.

> I would recommend using the ones in this repo if you want a working setup.

To be clear, do you mean the Kamailio_IMS_Config repo's master branch?

@herlesupreeth
Owner

> To be clear, do you mean the Kamailio_IMS_Config repo's master branch?

Yes, if you are using the kamailio master branch from source. If you are using the 5.3 tag, then use the 5.3 branch in this repo.

> Yeah, we can't really keep doing that. I checked another commercial operator: they also use 7200 seconds and they also renew 12 minutes

Is it possible to upload a pcap here showing how the session is refreshed?

@dchard
Author

dchard commented Dec 18, 2024

> Yes, if you are using the kamailio master branch from source. If you are using the 5.3 tag, then use the 5.3 branch in this repo.

I am using the latest stable release from the official branch (5.8.4), so I will use master here. Thanks!

> Is it possible to upload a pcap here showing how the session is refreshed?

I can't provide a PCAP, but this is how it looks:

[image]

The phone just performs the same signalling procedure as at initial registration. The expiry timer sent by the network is 7200 seconds. The only question is: how does the phone know that it needs to re-register exactly 12 minutes before the 7200 seconds pass, and why 12 minutes exactly? Nothing in the signalling decoder indicates these 12 minutes anywhere...
The phone always sends 600000 seconds, but that is not really relevant.

MOD:

There are some differences:

The commercial provider does not send any "expires" part in the XML message body of the NOTIFY message, while Kamailio does.

This is the commercial signalling; the expiry appears only in the message header, not in the XML part:

[image]

In the commercial signalling there is also no "Subscription to REG saved" message sent to the phone, whereas Kamailio does send one.
