Add support for Java Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart() character classes via \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart}. #21

cmf · 2016-07-28T03:52:19Z

This change takes a brute-force approach: it generates the ranges accepted by these methods by testing every code point. MakeJavaCategories.java generates the source file containing the category ranges. I've added simple tests for the parsing and matching; let me know if you'd like more coverage of any particular aspect. Since it reuses the technique already used for the current Unicode character ranges and shouldn't affect any existing code paths, the change should be fairly safe.
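For readers unfamiliar with the brute-force technique described above, here is a minimal sketch of it. This is not the contents of MakeJavaCategories.java; the class name `CodePointRanges` and the method `collect` are hypothetical, and the real generator additionally writes the ranges out as a Java source file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper: walks every Unicode code point and collapses
// consecutive accepted code points into inclusive [lo, hi] ranges.
final class CodePointRanges {
  static List<int[]> collect(IntPredicate accepts) {
    List<int[]> ranges = new ArrayList<>();
    int start = -1; // -1 means no range is currently open
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      if (accepts.test(cp)) {
        if (start < 0) {
          start = cp; // open a new range
        }
      } else if (start >= 0) {
        ranges.add(new int[] {start, cp - 1}); // close the open range
        start = -1;
      }
    }
    if (start >= 0) {
      // The last range runs to the end of the code space.
      ranges.add(new int[] {start, Character.MAX_CODE_POINT});
    }
    return ranges;
  }

  public static void main(String[] args) {
    // The method references resolve to the int (code point) overloads.
    List<int[]> start = collect(Character::isJavaIdentifierStart);
    List<int[]> part = collect(Character::isJavaIdentifierPart);
    System.out.printf("start: %d ranges, part: %d ranges%n", start.size(), part.size());
  }
}
```

The range counts depend on the Unicode version of the JDK that runs the generator, which is one reason the output is treated as generated data rather than hand-written code.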

I only generated these two ranges since they're the ones I need, but it would be trivial to add the rest of the Character.is* ranges if required. I suspect most of them are already covered reasonably well by the existing Unicode ranges.

One thing I learned from this change - Character.isJavaIdentifierPart(0) == true - who knew?
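A quick way to see why this holds, as a small sketch (the class name `IdentifierPartZero` is hypothetical): `Character.isJavaIdentifierPart` also accepts "identifier-ignorable" characters, and U+0000 is one of them, while `Character.isJavaIdentifierStart` rejects it.

```java
// Hypothetical demo class; it only calls documented java.lang.Character methods.
final class IdentifierPartZero {
  // U+0000 is an ISO control character that is not whitespace,
  // so Character treats it as ignorable within an identifier.
  static boolean nulIsIgnorable() {
    return Character.isIdentifierIgnorable(0);
  }

  static boolean nulIsPart() {
    return Character.isJavaIdentifierPart(0);
  }

  static boolean nulIsStart() {
    return Character.isJavaIdentifierStart(0);
  }

  public static void main(String[] args) {
    System.out.println("ignorable: " + nulIsIgnorable());
    System.out.println("part:      " + nulIsPart());
    System.out.println("start:     " + nulIsStart());
  }
}
```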

I haven't signed a CLA - let me know if this looks ok and I'll do so.

googlebot · 2016-07-28T03:52:20Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please let us know the company's name.

cursive-ide · 2016-07-28T04:02:19Z

Oh, one other thing - MakeJavaCategories is currently under a source root, so it will be included in the distributed artefact. If you'd prefer, I can move that under the test hierarchy somewhere.

rschlussel-zz · 2016-07-28T11:55:55Z

java/com/google/re2j/JavaCategoryTables.java

+
+package com.google.re2j;
+
+// AUTOGENERATED by MakeJavaCategories.java - do not modify


You shouldn't check in autogenerated files. They should be generated as part of the build process.

cmf · 2016-07-28

Is this the case? The existing Unicode tables are autogenerated and checked in. Generating this file during the build would probably involve writing a Maven plugin.

rschlussel-zz · 2016-07-28T12:01:31Z

Can you explain why you don't just use Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart()? They are constant time operations.

cmf · 2016-07-28T13:21:11Z

I looked for a way to do this, but it seemed like a much more complex change. There is no Regexp.Op or Machine.Frag that supports anything like this, and I was less confident of being able to make the change without breaking anything. If that is the preferred approach I can attempt that - any pointers would be appreciated.
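For comparison, this is roughly what calling the two methods directly looks like outside a regex engine. It is only a sketch: the class `IdentifierCheck` and its methods are hypothetical, and it does not show the harder part discussed above, which is wiring a predicate like this into RE2J's Regexp.Op and Machine code. The `java.util.regex` pattern is included because the JDK's own regex engine accepts the same `\p{javaJavaIdentifierStart}` and `\p{javaJavaIdentifierPart}` names.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing direct Character calls with the
// equivalent java.util.regex character classes.
final class IdentifierCheck {
  private static final Pattern IDENTIFIER =
      Pattern.compile("\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*");

  // Direct approach: one constant-time Character call per code point.
  // Note: this checks the character rules only; it does not reject keywords.
  static boolean isIdentifierDirect(String s) {
    if (s.isEmpty()) {
      return false;
    }
    int first = s.codePointAt(0);
    if (!Character.isJavaIdentifierStart(first)) {
      return false;
    }
    int i = Character.charCount(first);
    while (i < s.length()) {
      int cp = s.codePointAt(i);
      if (!Character.isJavaIdentifierPart(cp)) {
        return false;
      }
      i += Character.charCount(cp); // step over surrogate pairs correctly
    }
    return true;
  }

  // Regex approach using the JDK engine's character class names.
  static boolean isIdentifierRegex(String s) {
    return IDENTIFIER.matcher(s).matches();
  }
}
```

Both methods should agree on inputs such as `foo_1`, `$x`, `1foo`, and `a-b`; the table-based approach in this pull request aims to give RE2J the same answers without adding a new kind of matcher instruction.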

Add support for Java Character.isJavaIdentifierStart() and Character.…

7edf5b7

…isJavaIdentifierPart() character classes via \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart}.

googlebot added the cla: no label Jul 28, 2016

rschlussel-zz reviewed Jul 28, 2016