
panic in analysis/tokenizers/icu #185

Closed
chowey opened this issue Apr 5, 2015 · 8 comments


chowey commented Apr 5, 2015

I am using bleve to index some Danish text using the github.com/blevesearch/bleve/analysis/language/da package. I get a

panic: runtime error: slice bounds out of range

which occurs at

github.com/blevesearch/bleve/analysis/tokenizers/icu.(*UnicodeWordBoundaryTokenizer).Tokenize(0xc208046860, 0xc20ef4f440, 0xbd, 0xbd, 0x0, 0x0, 0x0)
    /.../src/github.com/blevesearch/bleve/analysis/tokenizers/icu/boundary.go:104 +0x621

running on Mac OS X. I have icu4c 54.1 installed via brew.

I noticed that U_BUFFER_OVERFLOW_ERROR is being ignored (lines 85 and 93). When I modify:

if err > C.U_ZERO_ERROR && err != C.U_BUFFER_OVERFLOW_ERROR {

to be

if err > C.U_ZERO_ERROR {

then the panic disappears. I haven't yet isolated the specific text that causes the panic, so I can't provide a reproduction.


chowey commented Apr 6, 2015

I isolated the problem to text that was saved in another encoding - not UTF-8. As a result it had an illegal UTF-8 character.

Is there some way to avoid panicking in this scenario?


mschoch commented Apr 6, 2015

Thanks for tracking it down. I plan to review the code tomorrow; I took a quick look but couldn't quite remember why it works the way it does.

Can you supply the problematic text? Obviously we should try not to panic in this case, but the fix might end up being an optional UTF-8 validation step. Right now we expect/trust input to always be valid UTF-8, which obviously isn't always the case or something that can be guaranteed.

CC'ing @steveyen because I think he will run into this soon too.


chowey commented Apr 6, 2015

The problematic text was like this:

something\x96something

(That is supposed to represent the byte 0x96 embedded there.)

This is an en-dash in Windows-1252 encoding.

Note: Don't follow my example and remove U_BUFFER_OVERFLOW_ERROR in the conditional. It completely breaks indexing. My bad for not further testing before posting.


mschoch commented Apr 6, 2015

Interesting, I cannot reproduce a panic under similar conditions, though I have icu 52.1.

For input:

"something\x96something"

I get the token stream:

analysis.TokenStream{
    {
        Start:    0,
        End:      9,
        Term:     []byte("something"),
        Position: 1,
        Type:     analysis.AlphaNumeric,
    },
    {
        Start:    12,
        End:      21,
        Term:     []byte("mething\x00\x00"),
        Position: 2,
        Type:     analysis.AlphaNumeric,
    },
}

This is obviously wrong, but it didn't panic. It could be that icu 54.1 behaves differently, or it could be that in my test I'm somehow just getting lucky. I have to review the icu api to see what we can do differently.


mschoch commented Apr 6, 2015

I've also opened up a related issue, because Bleve's handling of invalid utf-8 goes beyond just this one issue: #186

@boombuler

Hi,

I had the same panic and fixed it by adding

// #include "unicode/ucnv.h"
import "C"

func init() {
    // Make ICU treat input as UTF-8 instead of the platform's default codepage.
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

to the icu package. On my windows machine it always treated the input as ANSI not as UTF-8... Hope that helps anyone :)


mschoch commented Apr 20, 2015

Thanks, yes that is related. The current implementation relies on the default converter being UTF-8, which leads to another class of bugs where the input is valid UTF-8 but the default isn't set to UTF-8.

We can either set the default to UTF-8 (as you suggest) or use another function which explicitly treats the input as UTF-8.


mschoch commented May 17, 2017

Now that the ICU tokenizer is part of blevex, moving this issue there: blevesearch/blevex#34

@mschoch mschoch closed this as completed May 17, 2017
abhinavdangeti added a commit that referenced this issue Dec 14, 2023
* 86b1161 Mohd Shaad Khan | MB-59447: support nested/chunked vectors (#185)
* 0f89c37 Aditi Ahuja | MB-60084: Change index merge procedure to account for existing index type. (#197)