Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Implement pluggable transport interface #340

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

certaintls
Copy link

@certaintls certaintls commented May 23, 2022

This PR implements Pluggable Transport Specification v3.0 - Go Transport API so other applications can use hysteria as a pluggable transport.

Status update (Mar.12.2023):

  1. Hysteria had feature updates, the PT work in this PR is no longer compatible with Hysteria latest version (hence merge conflicts). A re-work is needed. For application developers, the version in this PR still works, but not recommended for production.
  2. The project owner expressed interest in merging (supporting) the PT interface [WIP] Implement pluggable transport interface #340 (comment)
  3. The PT client and server interfaces are implemented (with approach to minimize the change to the existing project, refactoring can be considered in the future)
  4. An end to end test in this PR has a bug: there is a racing condition (the test can pass if run manually in the debug mode), the investigation was led to [WIP] Implement pluggable transport interface #340 (comment)
  5. Todo: The end to end test needs to use https://github.com/lucas-clemente/quic-go/blob/b5ef99a32c250fc63f89cc686c13a008c5419d01/example/echo/echo.go#L92-L113 to dynamically generate credentials, so the test can be easily automated in the CI
  6. To simple test this new PT with [PT IPC](Dispatcher IPC), please see the proposal and its PR in Support Socks5 proxy server mode  OperatorFoundation/shapeshifter-dispatcher#44

Other notes:

Threat Model

Background

Exposing the Great Firewall's Dynamic Blocking of Fully Encrypted Traffic

Adversary capabilities

As the research above shows, we assume a censor can probabilistically drop connection packets. For simplicity, an arbitrary connection which is fighting the censor can result in 10% packet loss, 30% packet loss and 60% packet loss. The more percentage packet loss is, the more head-of-line blocking will be, and the worse the performance it becomes. Practically speaking, with 30% packet loss, applications like live streaming, real-time broadcasting, online meeting, and live gaming are much disrupted. With 60% packet loss, even reading censored news become extremely slow and require users manual retries over and over.

The adversary's goals

  • The censor wants to scale the effort of fighting circumvention. No matter what new PT or protocol are invented, or random circumventing servers are deployed. Probabilistically dropping the packets that the censor cannot understand in certain conditions can be an effective way for blocking to this date.

Goal of Hysteria PT

Taking advantage of QUIC protocol, hysteria PT should be able:

  • offer 10x more faster connection than all TCP based PT when the packet drop rate is up to 30%. End users should experience no noticeable delay in live streaming, real-time broadcasting, online meeting, or live gaming in a reasonable low bandwidth and simple client-server set up.
  • survive 60% packet drop rate, and can still serve the application with tolerable delays.
  • offer algorithms "udp", "wechat-video", "faketcp" to marginally lower the probability of the drop rate.

Non-goals of Hysteria PT

The proxy application shipped in Hysteria repository does not actually use Hysteria PT interface, therefore, its usage and DevOps are not the concerns to Hysteria PT.

@tobyxdd
Copy link
Collaborator

tobyxdd commented May 25, 2022

Hi certaintls,

Yes I love the idea of using hysteria as a pluggable transport for other apps. If all we need is an adapter for code in pkg/core, perhaps it'd be better to put them in a new package?

Also, what's a good way to test it in real applications?

@tobyxdd tobyxdd added the enhancement New feature or request label May 25, 2022
@tobyxdd tobyxdd self-assigned this May 25, 2022
@certaintls
Copy link
Author

Also, what's a good way to test it in real applications?

I will write an e2e test in this PR. But, for testing with real applications, I don't know an easy way yet, I will get back to you.

@tobyxdd
Copy link
Collaborator

tobyxdd commented May 31, 2022

Hi certaintls,

I think there are some problems on how this transport is currently implemented.

QUIC is a "multiplexed" protocol, which means there are actually 2 layers of "Accept" - one for the connection, one for the streams inside the connection. The connection itself cannot be directly used for data transmission.

Hysteria creates a "control stream" for negotiation & authentication etc. after the QUIC connection is established. Then each new "connection" from this client is just a new stream inside the QUIC connection.

Basically, as a server, you need listen for both new connections, and new streams coming from every existing connections (semantically this is the real "Accept").

I'm not sure how to adapt this to Pluggable Transports API 🤔

@tobyxdd
Copy link
Collaborator

tobyxdd commented May 31, 2022

The reason your current code doesn't work is that in func (s *Server) Accept(), it accepts a QUIC connection, accepts the first stream (which is the control stream), then just returns this control stream as the accepted connection. But control streams are basically inactive after being used for exchanging rate/auth information.

Whenever you call Dial on a client, it creates a new stream over the QUIC connection (established during NewClient). It's these new streams the server should be accepting.

@certaintls
Copy link
Author

certaintls commented May 31, 2022

Thanks for digging into this. Your explanation makes sense. I pushed a new commit fbfe9ab In this commit, I close the control stream and listen/accept new ones. However, in the test, below is never returned from a new stream

// Start accepting data streams
stream, err = cs.AcceptStream(context.Background())

I am keeping digging.

@certaintls
Copy link
Author

The test is able to pass in my debugging session, this makes me think the problem is with deadlock between client-server wait.

@certaintls
Copy link
Author

Here is a little bit more debugging info.

If I step through debugging mode, the test passes, and below is the debugging info

Server up and running
time="2022-05-31T18:08:03Z" level=info msg="Client connected" src="127.0.0.1:40878"
Client up and running
Client sent the data to the server
Client starts reading from connection
Server starts reading from connection
Server received the expected data from the client
Server sent the response to the client
Client received the expected response from the server
PASS

But, if I just run the test, I got

Server up and running
time="2022-05-31T18:13:49Z" level=info msg="Client connected" src="127.0.0.1:58557"
Client up and running
Client sent the data to the server
Client starts reading from connection
panic: test timed out after 30s

@tobyxdd
Copy link
Collaborator

tobyxdd commented Jun 1, 2022

The new code still has problems. You are only accepting the first stream of each connection, so it won't work when the same client dials again. And needless to say, you also shouldn't accept streams of only the first connection, otherwise it wouldn't be able to handle multiple clients.

One way I can think of to make this work, is to have an internal goroutine for accepting connections. Then for each accepted connection, start a goroutine for handling the control stream & accepting streams. Put those streams into a channel or something for Listener.Accept()

@certaintls
Copy link
Author

The new code still has problems. You are only accepting the first stream of each connection, so it won't work when the same client dials again. And needless to say, you also shouldn't accept streams of only the first connection, otherwise it wouldn't be able to handle multiple clients.

Yes, I am aware of the limitation. Currently, I am hoping to get a proof of concept working first. On the other hand, I have a scheduled meeting tonight to discuss the multiplexing handling with other PT implementers.

One way I can think of to make this work, is to have an internal goroutine for accepting connections. Then for each accepted connection, start a goroutine for handling the control stream & accepting streams. Put those streams into a channel or something for Listener.Accept()

I was also thinking something similar, thanks for the suggestion!

@certaintls
Copy link
Author

certaintls commented Jun 20, 2022

way I can think of to make this work, is to have an internal goroutine for accepting connections. Then for each accepted connection, start a goroutine for handling the control stream & accepting streams. Put those streams into a channel or something for Listener.Accept()

I re-wrote the server side code using this suggestion. The latest change should complete the feature. However, there are still things to do:

  1. The current test can succeed if I step through, but it fails when running directly. It's probably some racing condition. I will possibly change to use channel for signaling in the test.
  2. Test needs to use https://github.com/lucas-clemente/quic-go/blob/b5ef99a32c250fc63f89cc686c13a008c5419d01/example/echo/echo.go#L92-L113 to dynamically generate credentials
  3. Provide both client side and server side example in README.md

@tobyxdd
Copy link
Collaborator

tobyxdd commented Jun 21, 2022

The code look pretty good to me now 👍

Maybe you should close allStreams to unblock Accept after Close is called - I think it's how standard listeners work.

@certaintls
Copy link
Author

I have located the exact line that is causing the flickering in test: https://github.com/HyNetwork/hysteria/blob/master/pkg/core/client.go#L80

If I place a single break point on this line ctxCancel(), and manually hit Continue (F5) in the debugger, the test always succeeds, if I don't, then the test always fails. I haven't understood this line yet. Would you have a guess why this line could make a difference?

I will look into this more.

@certaintls
Copy link
Author

certaintls commented Jun 25, 2022

I have some to debug the test a little bit more. The exact breakpoint to make the test succeed isn't L80, but L85 https://github.com/HyNetwork/hysteria/blob/master/pkg/core/client.go#L85 . In other words, if I place a single breakpoint on L85 ok, msg, err := c.handleControlStream(qs, stream), and manually click Continue(F5) in the debugger, the test succeeds, if I place the breakpoint on L86, and do the same, the test fails.

Will look into this more.

@certaintls
Copy link
Author

I put breakpoints inside handleControlStream(qs, stream), if I put one on L120 https://github.com/HyNetwork/hysteria/blob/master/pkg/core/client.go#L120, and manually continue, the test can succeed, if I put one on L121, the test would fail.

@tobyxdd
Copy link
Collaborator

tobyxdd commented Jun 26, 2022

Have you tried debugging on the server side instead?

Looks like the server was blocked on Accept. Did s.listener.Accept in acceptConn never return to begin with, or it was blocked somewhere later in acceptStream?

@certaintls
Copy link
Author

Have you tried debugging on the server side instead?

Yes, I started the debugging from the server side.

Looks like the server was blocked on Accept

Right, s.listener.Accept works as expected. But, it is the acceptStream (this line https://github.com/HyNetwork/hysteria/pull/340/files#diff-a12d34beb8640efbbedad28f8610fd58afeea457572d82e4fd295ab2a6bb8ee6R296) that is blocking if I don't have the breakpoint set up described in #340 (comment)

I will come back to this again once I got more time.

@certaintls
Copy link
Author

Just to provide an update: I definitely have not forgot this PR. I switched gear a little bit to improve PT testing (manual and automated) in general in OperatorFoundation/shapeshifter-dispatcher#42 and OperatorFoundation/shapeshifter-dispatcher#43

I also have a security audit lined up for this implementation as well as the major code of this project that needed for the PT implementation in the week of Sept.26th

@certaintls certaintls changed the title Implement pluggable transport interface [WIP] Implement pluggable transport interface Sep 11, 2022
@certaintls
Copy link
Author

A security audit was conducted by Cure53.de for the PT related work. I copy and paste the relevant parts from the report below:

Identified Vulnerabilities

The following section lists all vulnerabilities and implementation issues identified throughout the testing period. Please note that findings are listed in chronological order rather than by their degree of severity and impact. The aforementioned severity rank is
given in brackets following the title heading for each vulnerability. Furthermore, each

vulnerability is given a unique identifier (e.g., RPT-03-001) to facilitate any future follow-up correspondence.

RPT-03-003 WP1: Lack of socket dial timeouts (Low)

During a source code review of the hysteria-master repository, the observation was made that the DialTCP and ListenUDP functions utilize the Golang functions net.DialTCP and net.DialUDP for the purpose of establishing new connections.
However, these functions do not impose a timeout when establishing new connections and are therefore susceptible to DoS attacks in the eventuality the remote end of the connection does not accept a connection, which can either occur intentionally or when the packet is dropped between the client and server.

A sufficiently well-positioned attacker could leverage this behavior and halt the thread executing the DialTCP and ListenUDP functions, thereby preventing thread execution.

Affected file:
hysteria-master/pkg/transport/socks5.go

Affected code:

func (c *SOCKS5Client) DialTCP(raddr *AddrEx) (*net.TCPConn, error) {
        conn, err := net.DialTCP("tcp", nil, c.ServerTCPAddr)
        if err != nil {
            return nil, err
        [...]
}
func (c *SOCKS5Client) ListenUDP() (*socks5UDPConn, error) {
        conn, err := net.DialTCP("tcp", nil, c.ServerTCPAddr)
        if err != nil {
        [...]
        udpConn, err := net.DialUDP("udp", nil, udpRelayAddr)
        if err != nil {
                _ = conn.Close()
               return nil, err [...]

To mitigate this issue, Cure53 advises utilizing net.DialTimeout rather than net.DialTCP and net.DialUDP when establishing a new connection.

RPT-03-004 WP1: UDP session hijacking via race condition (Medium)

Whilst deep-dive reviewing the hysteria-master repository, the discovery was made that the Hysteria client application integrates a feature to initiate a SOCKS5 proxy. Depending on the configuration, this proxy enforces an authentication step comprising a username and password before serving any requests. Upon successful authentication, the SOCKS5 proxy implementation receives the connection request from the client. Valid connection requests constitute either TCP or UDP. For UDP connection requests, the SOCKS5 proxy consequently opens a UDP listener for incoming client connections, and
then creates a connection to the upstream server. However, the SOCKS5 proxy in the Hysteria application running as client considers the first incoming UDP connection on the listener as the legitimate and previously-authenticated client, which is inherently racy and may allow a malicious actor to hijack the connection.

Assume an attacker is able to reach the Hysteria application running as a client. The attacker bides their time until a legitimate client successfully authenticates to the Hysteria application. Subsequently, the attacker attempts to connect to the listener
before the legitimate client. In the eventuality the attacker wins this minimal race window, the attacker will essentially hijack the legitimate client’s UDP connection.

Affected file:
hysteria-master/pkg/socks5/server.go

Affected code:

func (s *Server) udpServer(clientConn *net.UDPConn, localRelayConn *net.UDPConn, hyUDP core.UDPConn) {
        var clientAddr *net.UDPAddr
        buf := make([]byte, udpBufferSize)
        // Local to remote
        for {
                n, cAddr, err := clientConn.ReadFromUDP(buf)
                if err != nil {
                        break
                }
               d, err := socks5.NewDatagramFromBytes(buf[:n])
               if err != nil || d.Frag != 0 {
                       // Ignore bad packets
                       continue
               }
               if clientAddr == nil {
                       // Whoever sends the first valid packet is our client
                       clientAddr = cAddr
                       [...]
               }
               else if cAddr.String() != clientAddr.String() {
                       // Not our client, bye
                       continue
               }
               [...]
        }
}

To mitigate this issue, Cure53 advises sending a token to the legitimate client upon successful authentication. In this way, following the connection to the open listener via UDP, the client must provide this token in order to complete the connection
establishment.

@tobyxdd
Copy link
Collaborator

tobyxdd commented Nov 30, 2022

Wow, I never expected this project to get a security audit from Cure53. Thanks for your effort - I will definitely take a look and fix the issues.

@tobyxdd
Copy link
Collaborator

tobyxdd commented Dec 1, 2022

RPT-03-003 is fixed in fc28c01.

However, I don't think fixing RPT-03-004 is possible, at least not in the way they suggested. We are implementing a standard SOCKS5 server, and unless I missed something, the SOCKS5 specification never explained how a UDP relay could determine and limit the client address. This token authentication certainly isn't part of the spec - so there's no way to maintain compatibility with standard SOCKS5 clients if we implement it.

@certaintls
Copy link
Author

@tobyxdd I noticed in this commit e3c3088#diff-643b6b726bd0f79dadeca7c4173d92d4534ecce85edc6a53760cbfb93c7d0e56 , you removed QUICListen() in the transport layer, which leaves ListenUDP() the only way to start the transport server. But, ListenUDP() is coupled with SOCKS5Client, this change will make it harder for other apps to use hysteria as a PT. I think this PR/branch would still need QUICListen(), do you think we can add it back? Or do you have other suggestions?

Sorry for having disappeared for quite long.

@tobyxdd
Copy link
Collaborator

tobyxdd commented Jan 25, 2023

Hi @certaintls , the purpose of this commit was to separate QUIC's PacketConn initialization process from the server/client transports. These "transports" now only handle outbound connections (how the server & client connect to remote servers).

which leaves ListenUDP() the only way to start the transport server. But, ListenUDP() is coupled with SOCKS5Client,

Can you elaborate? Which ListenUDP and how is it coupled with SOCKS5Client? 🤔

@Mohsen7s
Copy link

Any estimated time to merge and release this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants