Skip to content

Add grapheme_levenshtein function. #18087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

youkidearitai
Copy link
Contributor

Measure levenshtein for grapheme cluster unit.
RFC: https://wiki.php.net/rfc/grapheme_levenshtein

@TimWolla TimWolla changed the title [Require RFC][Draft] Add grapheme_levenshtein function. Add grapheme_levenshtein function. Mar 16, 2025
@youkidearitai youkidearitai marked this pull request as ready for review April 16, 2025 00:41
@youkidearitai
Copy link
Contributor Author

This RFC was accepted. I will go to implementation.

@devnexen
Copy link
Member

Nice, could you possibly rebase with last master state to redo the CI ? cheers.

@youkidearitai
Copy link
Contributor Author

I rebased and force pushed. I'm waiting for CI end.

@devnexen
Copy link
Member

should be fine, thanks. Looking good but I ll go back at it in couple of hours.

Copy link
Member

@devnexen devnexen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM minus the remark

if (ustring1) {
efree(ustring1);
}
RETURN_FALSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that false is being returned on failure, rather than throwing an exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, Userland can uses intl_get_error_message function after this function if this error. Therefore, this block is returns false.

@youkidearitai
Copy link
Contributor Author

Hmm... Windows CI is failed.


unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE];
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE];
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the error checks for these calls and the next ones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error checks seem to be still missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also don't need the void* cast, you can remove that.

RETURN_THROWS();
}

zend_long *p1, *p2, *tmp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where possible, I think it's better to merge the declaration and assignment as it reduces the area/scope where something is valid/can be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The declarations and assignments are still not merged

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second review round.
As a more general comment: the code can probably be simplified even more by using goto-style error handling.

Comment on lines 1070 to 1098
UStringSearch *srch = usearch_open(ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2, "", NULL, &ustatus2);
if (U_FAILURE(ustatus2)) {
intl_error_set_code(NULL, ustatus2);
intl_error_set_custom_msg(NULL, "Error usearch_open", 0);
ubrk_close(bi1);
ubrk_close(bi2);

efree(ustring1);
efree(ustring2);

efree(p1);
efree(p2);
RETURN_FALSE;
}
usrch_pos = usearch_first(srch, &ustatus2);
if (U_FAILURE(ustatus2)) {
intl_error_set_code(NULL, ustatus2);
intl_error_set_custom_msg(NULL, "Error usearch_first", 0);
ubrk_close(bi1);
ubrk_close(bi2);

efree(ustring1);
efree(ustring2);

efree(p1);
efree(p2);
RETURN_FALSE;
}
usearch_close(srch);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire comparison is too complex and inefficient.
You can just do something like this (I can't use a suggestion because the changes are not centralised):

diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index c4d0c3987a8..c0f21f1ad7e 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -1041,6 +1041,9 @@ PHP_FUNCTION(grapheme_levenshtein)
 		RETURN_FALSE;
 	}
 
+	UCollator *collator = ucol_open("", &ustatus1);
+	// TODO: handle error
+
 	p1 = safe_emalloc(strlen_2 + 1, sizeof(zend_long), 0);
 	p2 = safe_emalloc(strlen_2 + 1, sizeof(zend_long), 0);
 
@@ -1052,7 +1055,6 @@ PHP_FUNCTION(grapheme_levenshtein)
 	int32_t current2 = 0;
 	int32_t pos1 = 0;
 	int32_t pos2 = 0;
-	int32_t usrch_pos = 0;
 
 	while (true) {
 		current1 = ubrk_current(bi1);
@@ -1067,37 +1069,8 @@ PHP_FUNCTION(grapheme_levenshtein)
 			if (pos2 == UBRK_DONE) {
 				break;
 			}
-			UStringSearch *srch = usearch_open(ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2, "", NULL, &ustatus2);
-			if (U_FAILURE(ustatus2)) {
-				intl_error_set_code(NULL, ustatus2);
-				intl_error_set_custom_msg(NULL, "Error usearch_open", 0);
-				ubrk_close(bi1);
-				ubrk_close(bi2);
-
-				efree(ustring1);
-				efree(ustring2);
-
-				efree(p1);
-				efree(p2);
-				RETURN_FALSE;
-			}
-			usrch_pos = usearch_first(srch, &ustatus2);
-			if (U_FAILURE(ustatus2)) {
-				intl_error_set_code(NULL, ustatus2);
-				intl_error_set_custom_msg(NULL, "Error usearch_first", 0);
-				ubrk_close(bi1);
-				ubrk_close(bi2);
-
-				efree(ustring1);
-				efree(ustring2);
-
-				efree(p1);
-				efree(p2);
-				RETURN_FALSE;
-			}
-			usearch_close(srch);
 
-			if (usrch_pos != USEARCH_DONE) {
+			if (ucol_strcoll(collator, ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2) == UCOL_EQUAL) {
 				c0 = p1[i2];
 			} else {
 				c0 = p1[i2] + cost_rep;
@@ -1118,6 +1091,8 @@ PHP_FUNCTION(grapheme_levenshtein)
 		p2 = tmp;
 	}
 
+	ucol_close(collator);
+
 	ubrk_close(bi1);
 	ubrk_close(bi2);
 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't intend to use collation. Please give me some time to judge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion. I could not find problem. Use your code.


unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE];
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE];
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error checks seem to be still missing.

RETURN_THROWS();
}

zend_long *p1, *p2, *tmp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The declarations and assignments are still not merged

int32_t ustring2_len = 0;

UErrorCode ustatus1 = U_ZERO_ERROR;
UErrorCode ustatus2 = U_ZERO_ERROR;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just use a single status variable that you reuse all the time, that makes things a bit less complex.


unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE];
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE];
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also don't need the void* cast, you can remove that.

if (U_FAILURE(ustatus1)) {
intl_error_set_code(NULL, ustatus1);

intl_error_set_custom_msg(NULL, "Error on ubrk_setText on ustring1", 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message could probably be improved (same below)
e.g. "Error on ubrk_setText for argument #1 ($string1)"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants