Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: DNS-01 ACME validation failures do not mark challenge as invalid #1365

Open
aglasgall opened this issue Apr 26, 2023 · 14 comments
Open
Assignees
Labels
bug needs triage Waiting for discussion / prioritization by team
Milestone

Comments

@aglasgall
Copy link

aglasgall commented Apr 26, 2023

Steps to Reproduce

  1. Open an ACME order against a step-ca server
  2. Answer the DNS-01 challenge for the order without properly populating DNS (or populating it with the wrong value)
  3. Wait for the server to try validation

Your Environment

  • OS - Ubuntu Jammy
  • step-ca Version - 0.24.1

Expected Behavior

The challenge should eventually transition state to INVALID

Actual Behavior

The challenge stays in PENDING state indefinitely

Additional Context

Looking at acme/challenge.go, I see that both of the failure paths in dns01Validate() pass markInvalid=false to storeError():

txtRecords, err := vc.LookupTxt("_acme-challenge." + domain)
	if err != nil {
		return storeError(ctx, db, ch, false, WrapError(ErrorDNSType, err,
			"error looking up TXT records for domain %s", domain))
	}

and

	if !found {
		return storeError(ctx, db, ch, false, NewError(ErrorRejectedIdentifierType,
			"keyAuthorization does not match; expected %s, but got %s", expectedKeyAuth, txtRecords))
	}

Either these should fail on the first go or there should be some capped (configurable?) number of retries or retry period after which validation should fail and the challenge marked as invalid.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

I'm willing to work to fix this issue, ideally in a way that doesn't involve indefinite bikeshedding about what the retry policy should be :)

@aglasgall aglasgall added bug needs triage Waiting for discussion / prioritization by team labels Apr 26, 2023
@aglasgall
Copy link
Author

aglasgall commented May 2, 2023

I've done a quick prototype of a fix for this which I can share if desired. It's not very forgiving, though - it fails if dns01validate() fails once, and given that validation checks are performed synchronously when clients answer challenges, it's not entirely clear how to make this retriable. Ideally validation would happen asynchronously, but that'd be a much larger architectural change :/

Possibly the fix is to also perform validation (if the authz is not already positively valid/invalid) when a client polls the status of an authorization and add a retry counter (or a "give up after" deadline) to either Challenge or Authorization. It's a little weird to have a state-changing thing happen as the result of a POST-as-GET, but given that the spec requires that clients poll for the state transition, we might be able to get away with it.

@tashian
Copy link
Contributor

tashian commented May 2, 2023

Thanks Anna! We'll take a look at this in our open source triage meeting tomorrow morning and get back to you.

@aglasgall
Copy link
Author

Thanks so much for agreeing to take a look!

I should add that this is critical to our use case: if DNS validation fails, we really do need the challenge and authorization to become invalid so that clients can notice and act accordingly.

@dopey
Copy link
Contributor

dopey commented May 3, 2023

@tashian
Copy link
Contributor

tashian commented May 3, 2023

Hi @mholt,

We were just discussing this in our open source meeting, and your name came up.

Our philosophy with the ACME server is to align to the spec as well as we can, and when the spec is inconclusive we try to align with what the most popular ACME clients actually do in the field.

Right now, we handle ACME orders using a simple strategy where DNS challenges never become invalid.
We rely on the client to do some sort of exponential back off and eventually give up.
This approach was up to spec when we built it, and we think it still is.
But things change and we're open to reconsidering.

How does acmez treat this situation?
As a client maintainer, do you have a strong opinion on how a CA should respond?

Thanks,
Carl

@mholt
Copy link

mholt commented May 3, 2023

ACMEz polls until the authorization is finalized:

https://github.com/mholt/acmez/blob/d2a3d7e78082af60d929a514085d00678c67a73a/acme/authorization.go#L241-L283

Right now, we handle ACME orders using a simple strategy where DNS challenges never become invalid

Why's that? What becomes of the DNS challenge if it's not validated?

Maybe I'm missing something obvious from a server perspective (which I have almost no experience with), but IMO a server should never leave things hanging. It sounds like in this situation the client has initiated the DNS-01 challenge, indicating that the proper record should be in place, so when the server looks it up it should either be validated or it is not. RFC 8555:

To validate a DNS challenge, the server performs the following steps:

  1. Compute the SHA-256 digest [FIPS180-4] of the stored key
    authorization

  2. Query for TXT records for the validation domain name

  3. Verify that the contents of one of the TXT records match the
    digest value

If all of the above verifications succeed, then the validation is
successful. If no DNS record is found, or DNS record and response
payload do not pass these checks, then the validation fails.

The client SHOULD de-provision the resource record(s) provisioned for
this challenge once the challenge is complete, i.e., once the
"status" field of the challenge has the value "valid" or "invalid".

IMO this is pretty clear, it either fails or succeeds. I don't see why leaving it hanging would be useful. I think ACMEz would keep polling if no status change occurs.

@aglasgall
Copy link
Author

aglasgall commented May 3, 2023

Entirely agreed. The spec is pretty clear that validation failure is supposed to make a challenge transition from processing -> invalid when validation of that challenge fails (RFC 8555 7.1.6 p31).

Which does bring up another curious issue, which is that step-ca doesn't use the 'processing' status for challenges at all...

I have an extremely hacky proof of concept implementation of this working now that stores a validation deadline and a flag indicating whether the challenge has been answered as part of the challenge in the data store. The flag should be replaced by making step-ca use 'processing' and 'pending' per spec, but that's going to be a bigger effort

@aglasgall
Copy link
Author

IMO this is pretty clear, it either fails or succeeds. I don't see why leaving it hanging would be useful. I think ACMEz would keep polling if no status change occurs.

FWIW this is what our application (which is using the Certbot Python acme library) does, and how I noticed this problem in the first place. I can confirm that both Boulder and Pebble do make DNS-01 challenges positively succeed or fail.

@aglasgall
Copy link
Author

I updated my proof of concept to make challenges start in state processing per spec (RFC 8555 7.1.6 again); it was less work than I expected. I still need to update tests and clean up some other details, though.

The approach I'm taking is to

  1. Make challenges start in status processing
  2. Change status to pending when GetChallenge (i.e. answering the challenge) is called. Also, for DNS-01 challenges, set a deadline in the Challenge object to now() + 2 minutes (I intend to make this configurable if I can figure out how)
  3. make Authorization.UpdateStatus loop through the authz's challenges and call Validate for each one that is in status pending. This simulates us polling the status of the challenge asynchronously by causing the check to happen when the client polls for authorization status. If a DNS-01 challenge has not succeeded by the deadline, its status gets set to invalid. If any challenges have status invalid, set the authorization's status to invalid as well to signal failure.

Currently I'm only doing the additional check at GetAuthorization time for DNS-01 challenges, but this should work for the trinity of upstream challenge types (DNS-01, HTTP-01, ALPN-01). DEVICEATTEST-01 will be more work, since it has an actual payload for the answer challenge call; we'll need to store that with the challenge at answer time if we want this strategy to cover that as well, but I'm not super familiar with the challenge type.

I have confirmed that this strategy both covers the case of "we need validation failures to be positive and invalidating" as well as "sometimes our local DNS setup is flaky and the first query fails", both of which are needed for our use case

@aglasgall
Copy link
Author

aglasgall commented May 4, 2023

As a style note, it's more than a little confusing that GetAuthorization and GetChallenge can change the state of authorizations and challenges! It's convenient in this case, but was definitely not behavior I expected from functions named GetSomething :/ I'm relatively new to Go, though, so this might be a norm I'm just not familiar with.

@tashian
Copy link
Contributor

tashian commented May 5, 2023

Thank you both.
It's helpful to know that both Certbot and ACMEz libraries are impacted by this.
Definitely something we need to address and, in fact, I'm surprised it hasn't come up sooner!
And, Anna, I appreciate you diving in to fix it.
I'm not the best person to give feedback on your approach.
Maybe someone else from the team can chime in on that.

@aglasgall
Copy link
Author

aglasgall commented May 5, 2023

I'm quite aware that my implementation is a hack, incidentally; obviously actual asynchronous validation would be ideal, but it's not currently implemented for anything else in step-ca and it would be a heavy lift to add it to this.

I don't have unit tests written specifically for it yet, but I do have the rest of the test suite passing. I can try to get a PR up for review sometime next week if people would like to see what I have so far and think this approach is worth pursuing at least in the short term

@dopey dopey self-assigned this May 23, 2023
@dopey
Copy link
Contributor

dopey commented Aug 3, 2023

Hey @aglasgall 👋 . First off, thanks for opening this issue and I apologize for the long silence in getting back to you.

I briefly looked into how the ACME spec and Boulder handle this case. Boulder, as far as I can tell, does indeed mark the challenge as invalid on the first validation failure. The ACME spec, however, seems to be a bit more nuanced (at least in my interpretation). I don't believe there are any MUST style requirements in the ACME spec about server retries. Rather I'm seeing an entire section on how to handle Challenge Retries - https://datatracker.ietf.org/doc/html/rfc8555#section-8.2. I think it was based on this section that I originally implemented the validation this way to start with.

We're open to changing this - so that we invalidate challenges as soon as they fail a validation attempt - however, 1) the ACME spec in this matter is only a recommendation, and 2) I'm slightly worried about breaking implementations that have come to rely on infinite retry.

Would you be open to sharing a PR for the changes you detailed above?

To your point about validating challenges out of band - I left a short comment about this upstream in this issue thread. Basically we had an implementation a handful of years ago and it was our opinion at the time that it would overcomplicate CA logic and startup configuration for very little gain.

To your point about the function names being confusing. Yes. It's definitely bad style. The names should be updated to reflect the fact that they can modify the state.

@hslatman hslatman added this to the v0.25.1 milestone Oct 17, 2023
@hslatman hslatman modified the milestones: v0.25.1, v0.25.2 Nov 29, 2023
@jacobgreenleaf
Copy link

One thing that some tools like lego do that is somewhat neat is that they opt to or have an option to use the authoritative nameserver for the domain in question when validating DNS-01. It can prevent issues like this or #168 because the ACME client is technically allowed to respond to the challenge without awareness of the caches between them merely if it "believe[s]" that the challenge will succeed, which is a true statement if, unbeknownst to the client, the server is using a caching resolver. Otherwise the client would need to wait the entire duration of the TTL to be sure that no cache could prevent the challenge from succeeding.

Perhaps if validation fails you could optimistically retry using the authoritative nameserver?

@hslatman hslatman modified the milestones: v0.26.0, v0.26.1 Mar 29, 2024
@hslatman hslatman modified the milestones: v0.26.1, v0.26.2 Apr 25, 2024
@hslatman hslatman modified the milestones: v0.26.2, v0.26.3 Jun 17, 2024
@hslatman hslatman modified the milestones: v0.27.0, v0.27.2 Jul 15, 2024
@hslatman hslatman modified the milestones: v0.27.2, v0.27.3 Jul 23, 2024
@hslatman hslatman modified the milestones: v0.27.3, v0.27.5 Sep 16, 2024
@hslatman hslatman modified the milestones: v0.27.5, v0.27.6 Oct 22, 2024
@hslatman hslatman modified the milestones: v0.28.0, v0.28.1 Oct 30, 2024
@hslatman hslatman modified the milestones: v0.28.1, v0.28.2 Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs triage Waiting for discussion / prioritization by team
Projects
None yet
Development

No branches or pull requests

6 participants