Tokens beginning with # cause a crash when using count files #1

Open
mjwillson opened this issue Feb 11, 2015 · 0 comments

(Reporting this here as well as at https://code.google.com/p/mitlm/issues/detail?id=44, in case GitHub gets more attention these days.)

The crash only happens if the n-gram order is higher than 1, and only if the # occurs at the start of a token.

I'm guessing this is because the counts loader interprets a # at the beginning of a line in a text counts file as a comment and skips that line, so a unigram beginning with a # is missing from the term dictionary when it's encountered in a later bigram.
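To make the guess above concrete, here is a toy loader in Python. It is not mitlm's code: `load_counts`, its `skip_hash_comments` switch, and the treatment of `<s>` and `</s>` as always known are all hypothetical, chosen only to show how skipping lines that start with # would leave a token out of the vocabulary while higher-order n-grams still refer to it.

```python
# Hypothetical sketch of the suspected failure mode; NOT mitlm's actual loader.
SENTENCE_MARKERS = {"<s>", "</s>"}  # assumed to be known without a unigram entry


def load_counts(lines, skip_hash_comments=True):
    """Return the set of tokens used in higher-order n-grams but never
    registered as unigrams."""
    vocab = set()
    missing = set()
    for line in lines:
        if not line.strip():
            continue
        if skip_hash_comments and line.startswith("#"):
            continue  # assumed behaviour: the whole line is treated as a comment
        ngram, _count = line.rsplit(None, 1)  # last field is the count
        tokens = ngram.split()
        if len(tokens) == 1:
            vocab.add(tokens[0])
        else:
            for tok in tokens:
                if tok not in vocab and tok not in SENTENCE_MARKERS:
                    missing.add(tok)
    return missing


# Same n-grams as the counts file in the reproduction steps.
SAMPLE = [
    "<s>\t1",
    "a\t1",
    "#hashtag\t1",
    "<s> a\t1",
    "a #hashtag\t1",
    "#hashtag </s>\t1",
    "<s> a #hashtag\t1",
    "a #hashtag </s>\t1",
]

print(load_counts(SAMPLE))                            # {'#hashtag'}
print(load_counts(SAMPLE, skip_hash_comments=False))  # set()
```

With comment skipping on, the unigram line for #hashtag is dropped, yet "a #hashtag" still gets read because that line starts with "a". That would also fit the observation that the crash needs an order above 1: with unigrams only, nothing later refers to the skipped token.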

What steps will reproduce the problem?

$ estimate-ngram -wc counts -text <(echo 'a #hashtag')
0.001   Loading corpus /dev/fd/63...
0.002   Smoothing[1] = ModKN
0.002   Smoothing[2] = ModKN
0.002   Smoothing[3] = ModKN
0.002   Set smoothing algorithms...
0.002   Saving counts to counts...

$ cat counts
<s>     1
a       1
#hashtag        1
<s> a   1
a #hashtag      1
#hashtag </s>   1
<s> a #hashtag  1
a #hashtag </s> 1

$ estimate-ngram -counts counts -wl lm.arpa
0.001   Loading counts counts...
estimate-ngram: src/NgramModel.cpp:800: void mitlm::NgramModel::_ComputeBackoffs(): Assertion `allTrue(backoffs != NgramVector::Invalid)' failed.
Aborted (core dumped)

What version of the product are you using? On what operating system?

Built from the latest master on GitHub, running on Ubuntu 14.04.1.
