Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Java Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart() character classes via \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart}. #21

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cmf
Copy link

@cmf cmf commented Jul 28, 2016

This change takes a brute-force approach and generates the ranges accepted by these methods by testing all the code points. MakeJavaCategories.java generates the source file containing the category ranges. I've added simple tests for the parsing and matching, let me know if you'd like to see more about any particular aspect of this. Since it reuses the existing technique used by the current Unicode character ranges and shouldn't affect any existing code paths, hopefully the change is pretty safe.

I only generated these two ranges since they're the ones I need, but it would be trivial to add the rest of the Character.is* ranges if required. I suspect they're probably covered reasonably well by the existing unicode ranges.

One thing I learned from this change - Character.isJavaIdentifierPart(0) == true - who knew?

I haven't signed a CLA - let me know if this looks ok and I'll do so.

…isJavaIdentifierPart() character classes via \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart}.
@googlebot
Copy link
Collaborator

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@cursive-ide
Copy link

Oh, one other thing - MakeJavaCategories is currently under a source root, so it will be included in the distributed artefact. If you'd prefer, I can move that under the test hierarchy somewhere.


package com.google.re2j;

// AUTOGENERATED by MakeJavaCategories.java - do not modify
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't check in autogenerated files. They should be generated as part of the build process.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the case? The existing unicode tables are autogenerated and are checked in. Generating this file during the build would probably involve writing a Maven plugin.

@rschlussel-zz
Copy link
Collaborator

Can you explain why you don't just use Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart()? They are constant time operations.

@cmf
Copy link
Author

cmf commented Jul 28, 2016

I looked for a way to do this, but it seemed like a much more complex change. There is no Regexp.Op or Machine.Frag that supports anything like this, and I was less confident of being able to make the change without breaking anything. If that is the preferred approach I can attempt that - any pointers would be appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants