
panic in analysis/tokenizers/icu #185

Closed
chowey opened this issue Apr 5, 2015 · 8 comments


chowey commented Apr 5, 2015

I am using bleve to index some Danish text using the github.com/blevesearch/bleve/analysis/language/da package. I get a

panic: runtime error: slice bounds out of range

which occurs at

github.com/blevesearch/bleve/analysis/tokenizers/icu.(*UnicodeWordBoundaryTokenizer).Tokenize(0xc208046860, 0xc20ef4f440, 0xbd, 0xbd, 0x0, 0x0, 0x0)
    /.../src/github.com/blevesearch/bleve/analysis/tokenizers/icu/boundary.go:104 +0x621

running on Mac OS X. I have icu4c 54.1 installed via brew.

I noticed that U_BUFFER_OVERFLOW_ERROR is being ignored (lines 85 and 93). When I modify:

if err > C.U_ZERO_ERROR && err != C.U_BUFFER_OVERFLOW_ERROR {

to be

if err > C.U_ZERO_ERROR {

then the panic disappears. I haven't yet isolated the specific text that causes the panic, so I can't provide a reproduction.


chowey commented Apr 6, 2015

I isolated the problem to text that was saved in another encoding - not UTF-8. As a result it had an illegal UTF-8 character.

Is there some way to avoid panicking in this scenario?


mschoch commented Apr 6, 2015

Thanks for tracking it down. I plan to review the code tomorrow; I took a quick look but couldn't quite remember why it works the way it does.

Can you supply the problematic text? Obviously we should try not to panic in this case, but the fix might end up being an optional UTF-8 validation step. Right now we expect/trust input to always be valid UTF-8, which obviously isn't always the case or something that can be guaranteed.

CC'ing @steveyen because I think he will run into this soon too.


chowey commented Apr 6, 2015

The problematic text was like this:

something\x96something

(That is supposed to represent the byte 0x96 embedded there.)

This is an en-dash in Windows-1252 encoding.

Note: Don't follow my example and remove U_BUFFER_OVERFLOW_ERROR in the conditional. It completely breaks indexing. My bad for not further testing before posting.


mschoch commented Apr 6, 2015

Interesting, I cannot reproduce a panic under similar conditions, though I have icu 52.1.

For input:

"something\x96something"

I get the token stream:

analysis.TokenStream{
    {
        Start:    0,
        End:      9,
        Term:     []byte("something"),
        Position: 1,
        Type:     analysis.AlphaNumeric,
    },
    {
        Start:    12,
        End:      21,
        Term:     []byte("mething\x00\x00"),
        Position: 2,
        Type:     analysis.AlphaNumeric,
    },
}

This is obviously wrong, but it didn't panic. It could be that icu 54.1 behaves differently, or it could be that in my test I'm somehow just getting lucky. I have to review the icu api to see what we can do differently.


mschoch commented Apr 6, 2015

I've also opened up a related issue, because Bleve's handling of invalid utf-8 goes beyond just this one issue: #186

@boombuler

Hi,

I had the same panic and fixed it by adding

// #include "unicode/ucnv.h"
import "C"

func init() {
    // Make ICU treat input as UTF-8 instead of the platform's default codepage.
    C.ucnv_setDefaultName(C.CString("UTF-8"))
}

to the icu package. On my windows machine it always treated the input as ANSI not as UTF-8... Hope that helps anyone :)


mschoch commented Apr 20, 2015

Thanks, yes that is related. The current implementation relies on the default converter being UTF-8, which leads to another class of bugs where the input is valid UTF-8 but the default isn't set to UTF-8.

We can either set the default to UTF-8 (as you suggest) or use another function which explicitly treats the input as UTF-8.


mschoch commented May 17, 2017

Now that the ICU tokenizer is part of blevex, moving this issue there: blevesearch/blevex#34

@mschoch mschoch closed this as completed May 17, 2017
abhinavdangeti added a commit that referenced this issue Dec 14, 2023
* 86b1161 Mohd Shaad Khan | MB-59447: support nested/chunked vectors (#185)
* 0f89c37 Aditi Ahuja | MB-60084: Change index merge procedure to account for existing index type. (#197)