-
Notifications
You must be signed in to change notification settings - Fork 7.8k
Add grapheme_levenshtein function. #18087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This RFC was accepted. I will go to implementation. |
Nice, could you possibly rebase with last master state to redo the CI ? cheers. |
dd9a015
to
5fc7475
Compare
I rebased and force pushed. I'm waiting for CI end. |
should be fine, thanks. Looking good but I ll go back at it in couple of hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM minus the remark
if (ustring1) { | ||
efree(ustring1); | ||
} | ||
RETURN_FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that false is being returned on failure, rather than throwing an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, Userland can uses intl_get_error_message
function after this function if this error. Therefore, this block is returns false.
Hmm... Windows CI is failed. |
ext/intl/grapheme/grapheme_string.c
Outdated
|
||
unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the error checks for these calls and the next ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error checks seem to be still missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also don't need the void*
cast, you can remove that.
ext/intl/grapheme/grapheme_string.c
Outdated
RETURN_THROWS(); | ||
} | ||
|
||
zend_long *p1, *p2, *tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where possible, I think it's better to merge the declaration and assignment as it reduces the area/scope where something is valid/can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The declarations and assignments are still not merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second review round.
As a more general comment: the code can probably be simplified even more by using goto-style error handling.
ext/intl/grapheme/grapheme_string.c
Outdated
UStringSearch *srch = usearch_open(ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2, "", NULL, &ustatus2); | ||
if (U_FAILURE(ustatus2)) { | ||
intl_error_set_code(NULL, ustatus2); | ||
intl_error_set_custom_msg(NULL, "Error usearch_open", 0); | ||
ubrk_close(bi1); | ||
ubrk_close(bi2); | ||
|
||
efree(ustring1); | ||
efree(ustring2); | ||
|
||
efree(p1); | ||
efree(p2); | ||
RETURN_FALSE; | ||
} | ||
usrch_pos = usearch_first(srch, &ustatus2); | ||
if (U_FAILURE(ustatus2)) { | ||
intl_error_set_code(NULL, ustatus2); | ||
intl_error_set_custom_msg(NULL, "Error usearch_first", 0); | ||
ubrk_close(bi1); | ||
ubrk_close(bi2); | ||
|
||
efree(ustring1); | ||
efree(ustring2); | ||
|
||
efree(p1); | ||
efree(p2); | ||
RETURN_FALSE; | ||
} | ||
usearch_close(srch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The entire comparison is too complex and inefficient.
You can just do something like this (I can't use a suggestion because the changes are not centralised):
diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index c4d0c3987a8..c0f21f1ad7e 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -1041,6 +1041,9 @@ PHP_FUNCTION(grapheme_levenshtein)
RETURN_FALSE;
}
+ UCollator *collator = ucol_open("", &ustatus1);
+ // TODO: handle error
+
p1 = safe_emalloc(strlen_2 + 1, sizeof(zend_long), 0);
p2 = safe_emalloc(strlen_2 + 1, sizeof(zend_long), 0);
@@ -1052,7 +1055,6 @@ PHP_FUNCTION(grapheme_levenshtein)
int32_t current2 = 0;
int32_t pos1 = 0;
int32_t pos2 = 0;
- int32_t usrch_pos = 0;
while (true) {
current1 = ubrk_current(bi1);
@@ -1067,37 +1069,8 @@ PHP_FUNCTION(grapheme_levenshtein)
if (pos2 == UBRK_DONE) {
break;
}
- UStringSearch *srch = usearch_open(ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2, "", NULL, &ustatus2);
- if (U_FAILURE(ustatus2)) {
- intl_error_set_code(NULL, ustatus2);
- intl_error_set_custom_msg(NULL, "Error usearch_open", 0);
- ubrk_close(bi1);
- ubrk_close(bi2);
-
- efree(ustring1);
- efree(ustring2);
-
- efree(p1);
- efree(p2);
- RETURN_FALSE;
- }
- usrch_pos = usearch_first(srch, &ustatus2);
- if (U_FAILURE(ustatus2)) {
- intl_error_set_code(NULL, ustatus2);
- intl_error_set_custom_msg(NULL, "Error usearch_first", 0);
- ubrk_close(bi1);
- ubrk_close(bi2);
-
- efree(ustring1);
- efree(ustring2);
-
- efree(p1);
- efree(p2);
- RETURN_FALSE;
- }
- usearch_close(srch);
- if (usrch_pos != USEARCH_DONE) {
+ if (ucol_strcoll(collator, ustring1 + current1, pos1 - current1, ustring2 + current2, pos2 - current2) == UCOL_EQUAL) {
c0 = p1[i2];
} else {
c0 = p1[i2] + cost_rep;
@@ -1118,6 +1091,8 @@ PHP_FUNCTION(grapheme_levenshtein)
p2 = tmp;
}
+ ucol_close(collator);
+
ubrk_close(bi1);
ubrk_close(bi2);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't intend to use collation. Please give me some time to judge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggestion. I could not find problem. Use your code.
ext/intl/grapheme/grapheme_string.c
Outdated
|
||
unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error checks seem to be still missing.
ext/intl/grapheme/grapheme_string.c
Outdated
RETURN_THROWS(); | ||
} | ||
|
||
zend_long *p1, *p2, *tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The declarations and assignments are still not merged
ext/intl/grapheme/grapheme_string.c
Outdated
int32_t ustring2_len = 0; | ||
|
||
UErrorCode ustatus1 = U_ZERO_ERROR; | ||
UErrorCode ustatus2 = U_ZERO_ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just use a single status variable that you reuse all the time, that makes things a bit less complex.
ext/intl/grapheme/grapheme_string.c
Outdated
|
||
unsigned char u_break_iterator_buffer1[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
unsigned char u_break_iterator_buffer2[U_BRK_SAFECLONE_BUFFERSIZE]; | ||
bi1 = grapheme_get_break_iterator((void*)u_break_iterator_buffer1, &ustatus1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also don't need the void*
cast, you can remove that.
ext/intl/grapheme/grapheme_string.c
Outdated
if (U_FAILURE(ustatus1)) { | ||
intl_error_set_code(NULL, ustatus1); | ||
|
||
intl_error_set_custom_msg(NULL, "Error on ubrk_setText on ustring1", 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message could probably be improved (same below)
e.g. "Error on ubrk_setText for argument #1 ($string1)"
Measure levenshtein for grapheme cluster unit
1da6e4b
to
6f41f98
Compare
Measure levenshtein for grapheme cluster unit.
RFC: https://wiki.php.net/rfc/grapheme_levenshtein