Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --format-email to perform full validation on "email" and "idn-email" formats #460

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

trzejos
Copy link

@trzejos trzejos commented Jul 10, 2024

Setting --format-email=full will perform a stronger regex matching on strings with format set to email or idn-email, rather than doing the simple check from jsonschema. The default option of --format-email=default will only check that there is a @ symbol in the string.

idn-email regex based on: https://gist.github.com/baker-ling/3b4b014ee809aa9732f9873fe060c098
email regex based on: https://stackoverflow.com/questions/13992403/regex-validation-of-email-addresses-according-to-rfc5321-rfc5322

@trzejos trzejos changed the title wip: email format options Add --format-email to perform full validation on "email" and "idn-email" formats Jul 10, 2024
@trzejos trzejos marked this pull request as ready for review July 10, 2024 17:39
@trzejos
Copy link
Author

trzejos commented Jul 10, 2024

I was able to try this out locally and confirmed that strings like "user@[email protected]" will fail the email and idn-email format with --format-email=full set.

Copy link
Member

@sirosen sirosen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for coming to the table with working code and putting forth so much initial effort! This is one where I need to think about what exactly to do.

There are three issues, as I see it:

  1. a featureset this size needs unit tests
  2. as a maintainer, I need to understand it
  3. what is the right CLI interface?

Regarding (1), adding tests is something that we can address, but having good tests involves some understanding of the regexes.
Which leads to (2), understanding the expressions. I left an inline comment on one of them -- I think it needs a little bit of work to clarify but not too much. The RFC6531 regex is, TBH, really intimidating. I need to take some time to figure out how to even approach learning what it does.

My third item, (3), is that I don't know that if we're adding these regexes they shouldn't be the only behavior. The presence of a new CLI switch for this gives me some pause. There's more nuance around the regex format, and I don't think it's necessarily a good example to follow. Too many CLI flags and options can overwhelm a user, so we should only expose a control if there's a strong reason to do so. Do these expressions have false negatives? If not, I think it would be better not to expose any kind of flag and simply make these the behavior.


I'm not sure what the exact next step is here. Adding tests and tinkering on the interface are relatively straightforward TODO items. Understanding that sizable regex has a less obvious path. I'll need to find some time to really sit down and read through it carefully, and probably sit with RFC 6531 as well. (I did this with the datetime formats and it wasn't too bad, but the email format is significantly more complex.)

\[[\t -Z^-~]*]
)
$
""",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding this regex a bit difficult to read. In particular, I'm seeing some unconventional range expressions in the character classes, like /-9, -Z, and ^-~.
These are valid, but they aren't the way character classes are typically written. Perhaps other people have an expert and intuitive knowledge of what chr(ord(" ")+1) will be, but I definitely don't.

Can these be rewritten such that the suite of characters matched is more obvious to a reader? For example, rather than [/-9], I would much rather see [/0-9]. The fact that the lowercase letters are captured with ^-~ caught me particularly off-guard.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I'm not the original author of these regexes either (links in PR description). I'll see what I can do about cleaning up those character classes though.

Copy link
Author

@trzejos trzejos Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sirosen

I had some time this week to revisit this, and reworked the simpler RFC5321 validation. I also added in some length checks as well and tried validating it against the examples in this wikipedia page: https://en.wikipedia.org/wiki/Email_address#Examples

It validated/invalidated as expected except for a couple cases:

  • I❤️[email protected] was incorrectly found invalid, likely because of UTF-8
    • This is actually the correct behavior. UTF-8 email addresses should only be allowed in the idn-email format
  • i.like.underscores@but_they_are_not_allowed_in_this_part was incorrectly found invalid, we allow underscores in the domain part of the regex.

Let me know if the regex is easier to understand and if you think we should need to handle non-ascii strings like that utf-8 one.

I have not revisited the idn-email validator yet

src/check_jsonschema/formats/implementations/rfc5321.py Outdated Show resolved Hide resolved
src/check_jsonschema/formats/implementations/rfc6531.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants