-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
panic in analysis/tokenizers/icu #185
Comments
I isolated the problem to text that was saved in another encoding - not UTF-8. As a result it had an illegal UTF-8 character. Is there some way avoid panicking in this scenario? |
Thanks for tracking it down. I plan on reviewing the code tomorrow, I took a quick look but couldn't quite remember why it works the way it does. Can you supply the problematic text? Obviously we should try not to panic in this case, but it might end up in form of an optional utf-8 validation step. Right now we expect/trust input to always be valid utf-8. Obviously thats not always the case or something that can be guaranteed. CC'ing @steveyen because I think he will run into this soon too. |
The problematic text was like this:
(That is supposed to represent a code point 96 embedded there.) This is an en-dash in Windows-1252 encoding. Note: Don't follow my example and remove |
Interesting I cannot reproduce a panic under similar conditions, though I have icu 52.1. For input:
I get the token stream:
This is obviously wrong, but it didn't panic. It could be that icu 54.1 behaves differently, or it could be that in my test I'm somehow just getting lucky. I have to review the icu api to see what we can do differently. |
I've also opened up a related issue, because Bleve's handling of invalid utf-8 goes beyond just this one issue: #186 |
Hi, I had the same exception and fixed it by adding // #include "unicode/ucnv.h"
func init() {
C.ucnv_setDefaultName(C.CString("UTF-8"))
} to the icu package. On my windows machine it always treated the input as ANSI not as UTF-8... Hope that helps anyone :) |
Thanks, yes that is related. The current implementation relies on the default being UTF-8, which then leads to another class of bugs, where the input is valid utf-8, but your default isn't set to utf-8. We can either set the default to utf-8 (as you suggest) or use another function which explicitly treats the input as utf-8. |
Now that the ICU tokenizer is part of blevex, moving this issue there: blevesearch/blevex#34 |
I am using bleve to index some Danish text using the
github.com/blevesearch/bleve/analysis/language/da
package. I get awhich occurs at
running on Max OSX. I have
icu4c 54.1
installed using brew.I noticed that
U_BUFFER_OVERFLOW_ERROR
is being ignored (lines 85 and 93). When I modify:to be
then the panic disappears. I haven't isolated what specific text causes the panic yet, so I haven't been able to provide an example to reproduce the panic.
The text was updated successfully, but these errors were encountered: