What steps will reproduce the problem?
Create a large counts file in which there is an ngram (e.g. "foo bar baz") whose suffix ngram ("bar baz") doesn't exist earlier in the file.
Run estimate-ngram -wl lm.arpa -counts counts on it.
Note: this doesn't reproduce consistently for me with smaller counts files, but it reproduces fairly consistently with larger (or at least medium-sized) files.
What is the expected output? What do you see instead?
Ideally I'd expect it to allow a language model to be built in this case, even if that means skipping the ngram in question or making some assumption about the count for the missing suffix (e.g. the same count as the higher-order ngram).
I realise that these missing suffixes won't occur if I use MITLM itself to compute the counts from a corpus. However, when dealing with large amounts of count-based source data from other tools or sources (e.g. MapReduce jobs), these kinds of constraints can be violated accidentally due to data corruption or bugs beyond your control, so it would be convenient if MITLM could cope gracefully with these cases.
Alternatively if this is a WONTFIX then it would be good to at least document what the constraint is on acceptable input for counts files, and give a more friendly error message if the constraint is violated, so people know how to fix up their input files in order to get MITLM to work.
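In the meantime, here is a rough sketch of a pre-check for finding the offending lines before running estimate-ngram. It is not part of MITLM. It assumes each line of the counts file is an ngram followed by its count as the last whitespace-separated field, and that the constraint is the one described above (an ngram's suffix must appear earlier in the file). The function name find_missing_suffixes is my own.

```python
from typing import Iterable, List, Tuple


def find_missing_suffixes(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, ngram) for each ngram whose suffix was not seen earlier.

    Assumed format (not confirmed against MITLM): "w1 w2 ... wn <count>",
    with the count as the last whitespace-separated field.
    """
    seen = set()      # ngrams encountered so far, stored as tuples of words
    problems = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line: nothing to check
        words = tuple(fields[:-1])  # drop the trailing count
        # Unigrams have no suffix ngram, so only check order 2 and above.
        if len(words) > 1 and words[1:] not in seen:
            problems.append((lineno, " ".join(words)))
        seen.add(words)
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_counts.py counts
    with open(sys.argv[1], encoding="utf-8") as f:
        for lineno, ngram in find_missing_suffixes(f):
            print(f"line {lineno}: suffix of '{ngram}' not seen earlier")
```

This keeps every ngram in memory, so for very large counts files it would need a more compact structure or a sort-based pass.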
Currently what you see is:
What version of the product are you using? On what operating system?
Built from latest github master, Ubuntu 14.04.1