
remove unnecessary routes matching and restore old fallback next handler behavior #208

Merged
mholt merged 6 commits into mholt:master from routes-fixes on Aug 12, 2024

Conversation

WeidiDeng
Contributor


The new non-blocking matcher breaks several things, some of which can be avoided.

  1. No fallback handler: caused by the breaking change that the handler is only called for matched routes, and potentially several times due to the loop.
  2. Endless loop: caused by routes not realizing that a handler has already matched and been called, so they keep trying to read more data and loop.

This behavior causes unnecessary reads and looping, since some of the matchers have already reached a conclusion about the status of the match. Those matchers are now skipped to avoid the loop.

The fallback handler is called correctly as before.
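In pseudocode, the intended behavior is roughly as follows. This is a minimal sketch, not the actual diff: all type and function names here are invented, and errNeedMoreData stands in for the "consumed all prefetched bytes" error that shows up in the logs later in this thread.

package main

import "errors"

// errNeedMoreData stands in for layer4's "consumed all prefetched bytes"
// condition; this whole file is illustrative only.
var errNeedMoreData = errors.New("need more data")

type conn struct{}

type route struct {
	match  func(*conn) (bool, error)
	handle func(*conn) error
}

// matchRoutes sketches the fixed behavior: a matcher that has definitely
// failed is skipped on later passes, a matched route's handler runs exactly
// once, and the fallback runs only when every matcher has reached a final "no".
func matchRoutes(cx *conn, routes []route, fallback func(*conn) error, prefetch func(*conn) error) error {
	noMatch := make([]bool, len(routes)) // true once a matcher said a definite "no"
	for {
		if err := prefetch(cx); err != nil { // pull more bytes into the buffer
			return err
		}
		undecided := 0
		for i := range routes {
			if noMatch[i] {
				continue // already concluded: skipping avoids the endless loop
			}
			matched, err := routes[i].match(cx)
			if errors.Is(err, errNeedMoreData) {
				undecided++ // retry this matcher after the next prefetch
				continue
			}
			if err != nil {
				return err
			}
			if matched {
				return routes[i].handle(cx) // handler invoked once, for a matched route only
			}
			noMatch[i] = true // definite non-match: never retried
		}
		if undecided == 0 {
			return fallback(cx) // no route can ever match: old fallback behavior
		}
	}
}

func main() {}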

Contributor Author

@WeidiDeng left a comment


fix embedded route list stuck

@mholt
Owner

mholt commented Jun 29, 2024

This is indeed a difficult problem to solve well, it seems.

@ydylla You have a lot of experience with this issue by now. What do you think of this change? Maybe this could be a good patch.

@ydylla
Collaborator

ydylla commented Jun 30, 2024

I am not a huge fan of all the matcher state tracking, since it makes things even more complex, but there is probably no other way. Also, I don't think skipping matchers is compatible with the current version/behavior of the prefetch function, because it is theoretically possible that invocations of prefetch return just 1 byte each time, so we have to check matchers multiple times. Otherwise we must take much more care with how matchers are written, so that they return false only if a match is definitely impossible even when more data becomes available. That is currently not the case for the http matcher (I think it's the only one right now). But I like that matcher authors currently don't have to worry much about it.
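To make that contract concrete, a matcher written for it would look roughly like the sketch below. This is a hedged illustration: the 5-byte "MAGIC" protocol is invented, cx is assumed to be a prefetch-backed io.Reader like caddy-l4's connection type, and the usual io and bytes imports are assumed.

// Sketch: a matcher that returns false only when a match is definitely
// impossible, no matter how much more data arrives later.
func matchMagic(cx io.Reader) (bool, error) {
	hdr := make([]byte, 5)
	if _, err := io.ReadFull(cx, hdr); err != nil {
		// A short read may just mean the prefetch buffer is not full yet.
		// Return the error (undecided) instead of a definitive false, so
		// the matching loop can prefetch more data and call us again.
		return false, err
	}
	// With 5 bytes in hand, false is now safe: more data cannot change
	// the first 5 bytes of the stream.
	return bytes.Equal(hdr, []byte("MAGIC")), nil
}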

@mholt
Owner

mholt commented Jul 15, 2024

Thanks @ydylla.

Sorry @WeidiDeng, I'm finally getting around to this, and while I'd need more time to fully digest what's going on, I am open to merging this and trying it out. Think it's ready to go?

Thank you both for working on solving this in different ways 🙏

@ydylla
Collaborator

ydylla commented Jul 17, 2024

To give some more reasons why I prefer my version: this PR could cause problems when clients produce short prefetch reads (not sending all the data for a matcher in one go), because the correct next route could already have been marked unmatchable based on old/consumed data that was intended for the previous route.
My version mitigates this here; the comment there also explains the situation a bit.
The issue can also be demonstrated with this config:

fool-tls.json
{
  "admin": {
    "disabled": true
  },
  "logging": {
    "logs": {
      "default": {"level":"DEBUG", "encoder": {"format":"console"}}
    }
  },
  "apps": {
    "tls": {
      "certificates": {
        "automate": ["localhost"]
      },
      "automation": {
        "policies": [{
          "subjects": ["localhost"],
          "issuers": [{
            "module": "internal"
          }]
        }]
      }
    },
    "layer4": {
      "servers": {
        "https": {
          "listen": ["0.0.0.0:10443"],
          "routes": [
            {
              "match": [
                {"proxy_protocol": {}}
              ],
              "handle": [
                {
                  "handler": "proxy_protocol",
                  "allow": ["0.0.0.0/0"]
                }
              ]
            },
            {
              "match": [
                {"tls": {"sni": ["localhost"]}}
              ],
              "handle": [
                {"handler": "tls"},
                {
                  "handler": "proxy",
                  "upstreams": [
                    {"dial": ["127.0.0.1:10080"]}
                  ]
                }
              ]
            }
          ]
        }
      }
    },
    "http": {
      "servers": {
        "backend": {
          "protocols": ["h1","h2","h2c"],
          "listen": ["127.0.0.1:10080"],
          "routes": [
            {
              "handle": [{
                "handler": "static_response",
                "status_code": "200",
                "body": "Hello World\n",
                "headers": {
                  "Content-Type": ["text/plain"]
                }
              }]
            }
          ]
        }
      }
    }
  }
}

and this test program

fool-tls.go
package main

import (
	"crypto/tls"
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:10443")
	if err != nil {
		log.Fatal(err)
	}

	// trigger short prefetch so proxy proto matcher reports it needs more data
	// but send >= 5 bytes to fool the tls matcher
	_, err = conn.Write([]byte("PROXY "))
	if err != nil {
		log.Fatal(err)
	}
	// send rest of proxy proto after sleep
	time.Sleep(500 * time.Millisecond)
	_, err = conn.Write([]byte("TCP4 127.0.0.1 127.0.0.1 55498 10443\r\n"))
	if err != nil {
		log.Fatal(err)
	}

	tlsConn := tls.Client(conn, &tls.Config{
		ServerName:         "localhost",
		InsecureSkipVerify: true,
	})

	_, err = tlsConn.Write([]byte("GET / HTTP/1.0\r\nHost: localhost\r\n\r\n"))
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 1024)
	n, err := tlsConn.Read(buf)
	if err != nil {
		log.Fatal(err)
	}
	println(string(buf[:n]))
}

This should print an HTTP response with a "Hello World" body, but with this PR the connection is closed (EOF error).
Although there are likely no real-world clients that send a proxy protocol header in that way, I think bad network conditions could cause similar situations.
In general, I would prefer to keep the route state tracking minimal. In that regard I think my PR is easier to understand, but since I wrote it, others should judge that 😅

@WeidiDeng
Contributor Author

This should print an HTTP response with a "Hello World" body, but with this PR the connection is closed (EOF error).
Although there are likely no real-world clients that send a proxy protocol header in that way, I think bad network conditions could cause similar situations.

This error no longer happens with the latest commits; you were using an old build. Keep in mind to test using the latest commits 😅. And the reason the problem happened is not slow clients, but how the matcher handled the matching status: when a previous matcher couldn't handle the connection because of a lack of data, the next handler in the chain should be given a second chance to retry.

In general, I would prefer to keep the route state tracking minimal. In that regard I think my PR is easier to understand, but since I wrote it, others should judge that 😅

Your logic of making matcher order less relevant is causing more problems than it solves. For example, if we swap the two handlers in your example config, the config will still work. I'll let you work out what problems this behavior implies before this patch is merged and I post the infinite loop exploit I mentioned 😅.

In general, the code is cleaner if it's doing the right thing; being cleaner and easier to understand is no excuse for buggy behavior. And that's why I think order matters. Async matching is a nice thing because users can worry less about matcher order, since different protocols require different amounts of data and matchers with a larger data requirement will block matchers with a smaller one, not because matchers can run in a loop and cause problems for them.

@ydylla
Collaborator

ydylla commented Jul 18, 2024

This error no longer happens with the latest commits

I rechecked, and you are right. Apparently I was using an old build yesterday; sorry for the confusion. I think I now also understand how your PR handles these situations: after a match it tries the remaining routes without any data, so all matchers return ErrConsumedAllPrefetchedBytes, which resets the state I thought I had set up with my test program.

Logs from your version:

layer4  prefetched      {"remote": "127.0.0.1:11321", "bytes": 6}
layer4  matching        {"remote": "127.0.0.1:11321", "error": "consumed all prefetched bytes", "matcher": "layer4.matchers.proxy_protocol", "matched": false}
layer4  matching        {"remote": "127.0.0.1:11321", "matcher": "layer4.matchers.tls", "matched": false}
layer4  prefetched      {"remote": "127.0.0.1:11321", "bytes": 44}
layer4  matching        {"remote": "127.0.0.1:11321", "matcher": "layer4.matchers.proxy_protocol", "matched": true}
layer4.handlers.proxy_protocol  received the PROXY header       {"remote": "127.0.0.1:55498", "local": "127.0.0.1:10443"}
layer4  matching        {"remote": "127.0.0.1:55498", "error": "consumed all prefetched bytes", "matcher": "layer4.matchers.tls", "matched": false}
layer4  prefetched      {"remote": "127.0.0.1:55498", "bytes": 261}
layer4.matchers.tls     matched {"remote": "127.0.0.1:55498", "server_name": "localhost"}
layer4  matching        {"remote": "127.0.0.1:55498", "matcher": "layer4.matchers.tls", "matched": true}
layer4.handlers.tls     terminated TLS  {"remote": "127.0.0.1:55498", "server_name": "localhost"}
layer4.handlers.proxy   dial upstream   {"remote": "127.0.0.1:55498", "upstream": "127.0.0.1:10080"}
connection stats        {"remote": "127.0.0.1:11321", "read": 426, "written": 1445, "duration": 0.5160241}

Logs from mine:

layer4  prefetched      {"remote": "127.0.0.1:11478", "bytes": 6}
layer4  matching        {"remote": "127.0.0.1:11478", "error": "consumed all prefetched bytes", "matcher": "layer4.matchers.proxy_protocol", "matched": false}
layer4  matching        {"remote": "127.0.0.1:11478", "matcher": "layer4.matchers.tls", "matched": false}
layer4  prefetched      {"remote": "127.0.0.1:11478", "bytes": 44}
layer4  matching        {"remote": "127.0.0.1:11478", "matcher": "layer4.matchers.proxy_protocol", "matched": true}
layer4.handlers.proxy_protocol  received the PROXY header       {"remote": "127.0.0.1:55498", "local": "127.0.0.1:10443"}
layer4  prefetched      {"remote": "127.0.0.1:55498", "bytes": 261}
layer4.matchers.tls     matched {"remote": "127.0.0.1:55498", "server_name": "localhost"}
layer4  matching        {"remote": "127.0.0.1:55498", "matcher": "layer4.matchers.tls", "matched": true}
layer4.handlers.tls     terminated TLS  {"remote": "127.0.0.1:55498", "server_name": "localhost"}
layer4.handlers.proxy   dial upstream   {"remote": "127.0.0.1:55498", "upstream": "127.0.0.1:10080"}
layer4  connection stats        {"remote": "127.0.0.1:11478", "read": 426, "written": 1444, "duration": 0.5150603}

Your logic of making matcher order less relevant is causing more problems than it solves. For example, if we swap the two handlers in your example config, the config will still work.

Yes I know. I still don't see the problem with this and think it is a neat feature. If there is a route that matches, why not use it?

and I post the infinite loop exploit I mentioned

Yes please do. If it's the same issue I already mentioned in the initial post of #210 (comment), then it has been fixed since yesterday by 768c9f7.

@WeidiDeng
Contributor Author

Yes I know. I still don't see the problem with this and think it is a neat feature. If there is a route that matches, why not use it?

There are some pieces of code that can very easily cause problems. See this config and this code snippet (it's destroyed after viewing to prevent abuse).

@ydylla
Collaborator

ydylla commented Jul 19, 2024

See this config and this code snippet (it's destroyed after viewing to prevent abuse).

The pastebin is already unavailable; can you please send it to [email protected]?

@mholt
Owner

mholt commented Aug 5, 2024

I'm trying to decide between this and #210 😵

@ydylla Would you like to join our developer Slack? I'd love to have you there so we can chat more in real-time and get the fixes discussed and merged 👍 Just let me know which email to invite.

@ydylla
Collaborator

ydylla commented Aug 5, 2024

@mholt I don't use Slack for anything else and prefer async, which means I will never be online anyway. But if you feel it is necessary for the decision, you can invite [email protected]; the alias is still valid.

@mholt
Owner

mholt commented Aug 5, 2024

@ydylla Ah no worries. Thanks for being willing. But I totally respect it if you don't care for Slack/sync chat. 👍 We can keep it async here if you prefer.

I went ahead and sent the code snippet that Weidi shared with me, to your email, so if you haven't received it already you can take a look. 👍 I have not looked closely myself yet... still been so busy.

@ydylla
Collaborator

ydylla commented Aug 6, 2024

I went ahead and sent the code snippet that Weidi shared with me

Thanks, I have looked at the code. I do not really see a problem with it. It's not an endless loop, since the nesting is bounded by the for loop counter at

for i := 0; i < 10000; i++ { // Limit number of tries to mitigate endless matching bugs.

In practice the client will also run into the matching timeout far before the upper bound is reached.

@WeidiDeng
Contributor Author

It's not the loop that's endless, it's resource exhaustion. Imagine several thousand clients doing this to your server. Also, the data it sends makes no sense and should be rejected right away. Imagine what users will say when they find that their Caddy uses excessive resources due to this attack. In their config, they never specified that they want this looping behavior.

In practice the client will also run into the matching timeout far before the upper bound is reached.

With a modest timeout of 10 seconds I was able to spike the CPU usage from the normal 1% to 20% using only one client. A timeout only works when clients are well behaved. In practice, if others want to attack your server, a timeout alone won't help you; a successful attack combines several attack surfaces, and you don't want this to be one of them.

As you said in the comments, it's a mitigation, not a solution.

@WeidiDeng force-pushed the routes-fixes branch 2 times, most recently from 887eb90 to 3670820 on August 7, 2024 02:06
@ydylla
Collaborator

ydylla commented Aug 7, 2024

It's not the loop that's endless, it's resource exhaustion.

It would have been helpful if you had actually said what you meant, instead of vaguely hinting at some infinite loop problem multiple times (#210 (comment), #208 (comment)).
I also noticed the CPU load and agree with you that it's not helpful from a defense-in-depth perspective. The attack code fully loads a thread with the work of doing TLS handshakes. The same high CPU load can be achieved by just opening many TLS connections as fast as possible (arguably this could be detected more easily by TCP monitoring tools), or by sending many HTTP requests in a handful of TLS connections. At least these all require symmetric work from client and server. But I think you made it sound far worse than it actually is.

On the other hand, I can accept that we just don't want to support routes being matched multiple times, because it's confusing and often probably not what the user wants. I don't have a real use case for it myself; it was just an artifact of the new implementation. I will experiment with preventing this on my latest version again. But it will probably look very similar to your solution with lastNeedsMoreIdx and lastMatchedRouteIdx, so we might as well just use your branch.
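For readers following along, the rough idea behind those two indices is sketched below. Only the two variable names come from this PR; the function around them is invented and may well differ from the actual implementation (it reuses the toy conn and route types from the sketch in the PR description).

// Illustrative only: lastMatchedRouteIdx and lastNeedsMoreIdx are the
// names from this PR, but this function is a guess at their intent.
func tryRoutes(cx *conn, routes []route, lastMatchedRouteIdx, lastNeedsMoreIdx int) error {
	// Only routes after the last matched one are candidates, so a route
	// can never match (and run its handlers) a second time.
	for i := lastMatchedRouteIdx + 1; i < len(routes); i++ {
		matched, err := routes[i].match(cx)
		if err != nil {
			lastNeedsMoreIdx = i // undecided: revisit after more data arrives
			continue
		}
		if matched {
			lastMatchedRouteIdx = i // in the real code these indices persist across passes
			return routes[i].handle(cx)
		}
	}
	return nil
}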

@mholt
Owner

mholt commented Aug 9, 2024

This is indeed a tough problem. I've learned a lot from both of these approaches, each of which has its pros and cons. I think both are valuable to have for comparison and discussion, so thank you for working on them and pouring so much thought and energy into them! 🙏

I guess, based on the discussion, this is the PR that we'll merge. We can continue to iterate and improve on things.

I think this was only possible thanks to both of you tackling this problem!

@WeidiDeng Want to rebase and then we'll merge it in?

@ydylla
Collaborator

ydylla commented Aug 9, 2024

I tried some different approaches and ended up here: 905fa81

Both versions should now behave similarly. But we can go ahead with this PR, since I will not have much time in the next days/weeks and don't want to delay this any longer.

@WeidiDeng
Contributor Author

@mholt I did a rebase, but it completely messed up the changed files (from 10 to 45), some of which I didn't change. For now, this branch has no conflicts with the base branch.

@mholt
Owner

mholt commented Aug 12, 2024

That happens to me sometimes too. Let's give it a shot. Thank you both so much for helping and maintaining this 🙏 !

@mholt mholt merged commit afa78d7 into mholt:master Aug 12, 2024
6 checks passed
@axelmierczuk

Hey guys, just a heads up: I believe this PR is causing issues, specifically with the prefetch function. I haven't looked too deep into the root cause from the commits in this PR, but when reverting to e491c44895fe3f11e24ad4d8c4f6a668144c0ef9 everything works fine. I was working on a custom module, and every time my matcher tried to read from a connection I would get ErrMatchingBufferFull. Maybe we should set up some actions to help catch these things earlier :) When testing I used the following to read from the connection:

	// Stolen directly from https://github.com/mholt/caddy-l4/blob/afa78d72257b949486b24fa6f0351381c786a4b3/modules/l4dns/matcher.go#L66
	var (
		msgBuf   []byte
		msgBytes uint16
	)
	// Detect the connection protocol: TCP or UDP.
	// Note: all non-TCP connections are treated as UDP, so no TCP packets could be matched while testing
	// with net.Pipe() unless a valid cx.LocalAddr() response is provided using a fakeTCPConn wrapper.
	if _, ok := cx.LocalAddr().(*net.TCPAddr); ok {
		// Read the first 2 bytes, validate them and adjust the DNS message length
		// Note: these 2 bytes represent the length of the remaining part of the packet
		// as a big endian uint16 number.
		err := binary.Read(cx, binary.BigEndian, &msgBytes)
		if err != nil || msgBytes < dnsHeaderBytes || msgBytes > dns.MaxMsgSize {
			return false, err
		}

		// Read the remaining bytes
		msgBuf = make([]byte, msgBytes)
		_, err = io.ReadFull(cx, msgBuf)
		if err != nil {
			return false, nil
		}

		// Validate the remaining connection buffer
		// Note: if at least 1 byte remains, we can technically be sure, the protocol isn't DNS.
		// This behaviour may be changed in the future if there are many false negative matches.
		extraBuf := make([]byte, 1)
		_, err = io.ReadFull(cx, extraBuf)
		if err == nil {
			return false, nil
		}
	} else {
		// Read a minimum number of bytes
		msgBuf = make([]byte, dnsHeaderBytes)
		n, err := io.ReadAtLeast(cx, msgBuf, int(dnsHeaderBytes))
		if err != nil {
			return false, nil 
		}

		// Read the remaining bytes and validate their length
		var nn int
		tmpBuf := make([]byte, dns.MinMsgSize)
		for err == nil {
			nn, err = io.ReadAtLeast(cx, tmpBuf, 1)
			msgBuf = append(msgBuf, tmpBuf[:nn]...)
			n += nn
		}
		if n > dns.MaxMsgSize {
			return false, nil
		}
		msgBytes = uint16(n)
	}

Appreciate the hard work you guys are putting in!

@WeidiDeng
Contributor Author

This issue was discovered in Slack. You should just return the error whenever it's encountered, instead of returning false, nil. The problem is that, now that the matchers run in async mode, sometimes the available data isn't enough to make a decision; returning the error helps the loop make a decision about whether the matcher matches.
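Concretely, for the snippet quoted above, that means changing the read-error branches from swallowing the error to propagating it, roughly like this (an illustrative fragment, not a verbatim diff):

	// Before: a read error is reported as a definitive non-match, even
	// when the prefetch buffer was merely too short.
	if err != nil {
		return false, nil
	}

	// After: returning the error marks the result as undecided, so the
	// matching loop can prefetch more data and run this matcher again.
	if err != nil {
		return false, err
	}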

mholt pushed a commit that referenced this pull request Aug 23, 2024
* Adjust matchers after #208: return reading errors

* Adjust matchers after #208: modify error handling in tests