Cost cutoff #783
Seems like a bug to me. Please recall the original usage of cost-cutoff: most (almost all) parses can be done using disjuncts with a cost of two or less. The exceptions are sentences which cannot be parsed unless one or more words are skipped. Of these, some are extremely long, and take more than 30 seconds to parse (or used to .. things are faster now). The 30 seconds then triggered the timeout, and went into panic mode. In panic mode, disjuncts with a cost of 3 are allowed -- the hope/desire/goal is that this "looser" setting will result in some parse, any parse, so that panic mode can report something. Several things have changed since those bad old days: first, the parser is faster. Second, the dictionary is better. Thus, the number of sentences that trigger the panic parse is much, much lower than ever before. The main, unanswered question is this: which is more accurate: a parse with skipped words, and all link costs < 2.7, or a parse with fewer skipped words, and some link costs > 2.7? The answer is unknown. It depends strongly on the sentence: is it a junky sentence? Or is it a good sentence, while the dictionary is incomplete? If it's a junk sentence, just raise the costs and hope for the best. If it's a good sentence, but the dictionary is bad, then .. ??? |
See table 2 and the article itself for discussion of why [...]. So it seems something is wrong in the code, or I have some misunderstanding. |
I would like to soon push my patch to the SAT-parser for handling disjunct cost. Hence I would like to understand the rationale behind the apparent contradiction between the actual used disjunct cost and its cost for the purpose of cutoff (which is often lower). BTW, someone apparently also wondered about that; see the comment in the code. Or maybe the SAT code should just be fixed to behave the same? |
Hmm. Confusing. For link-grammar, I am trying to interpret cost as if it were minus the log likelihood (i.e. the entropy or mutual information of something). Viz, cost = -log p, where p is the probability of the connector or disjunct or whatever. (I am currently writing a paper that shows in painful detail how link-grammar disjuncts are a whole lot like adagram n-grams; I'm hoping that this direct comparison will help clarify the role of cost.) Anyway, the point is that probabilities multiply, therefore log-probabilities (costs) add. I'm not quite sure how to interpret the maxcost function in this context. Perhaps it's behaving kind-of-like the log of a conditional probability? Perhaps we should throw away maxcost, and use only cost? See, e.g., the comments on lines 204-209 of [...]. Re: the paper that you cite: it's interesting. But keep in mind the following: [...]
All of this is basically saying, in an abstract way, "gee, that's nice, but why should I choose max instead of addition? What's the conceptual reason for using one instead of the other?" And the answer to that question is, right now, "I don't know". I'm a bit suspicious of [...] |
Ooops. Clarification/correction to above:
|
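To spell out the log-probability reading from the comments above (this is an interpretation only, assuming the parts of a disjunct are independent; it is not something the code itself states):

```latex
\text{cost}(d) = -\log P(d)
\qquad
P(d_1 \wedge d_2) = P(d_1)\,P(d_2)
\;\Longrightarrow\;
\text{cost}(d_1 \wedge d_2) = \text{cost}(d_1) + \text{cost}(d_2)
% by contrast, the max rule corresponds to
\max\bigl(\text{cost}(d_1), \text{cost}(d_2)\bigr) = -\log \min\bigl(P(d_1), P(d_2)\bigr)
```

Under this reading, only the additive rule composes like a probability; the max rule looks at just the least likely part and discards the rest.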
So I will do it in a way that is compatible with how it is done now in the classic parser. |
Nothing to do on this issue for now, so I'm closing it. |
By looking at the expressions generated by problematic db dictionaries, I realized that building disjuncts could be sped up if the expressions were simplified. I identified two places in which this can be done:
Constructing expressions in [...]

So I implemented an expression simplifier, and indeed even one pass on the expression toplevel leads to a significant speedup of the [...]. But I got one result different in the [...].

When looking at that more, I found an inconsistency with the current implementation of cost-cutoff. Example (cost-cutoff=2.7):

```c
234  for (c1 = c; c1 != NULL; c1 = c1->next)
235  {
236      c1->cost += e->cost;
237      /* c1->maxcost = MAX(c1->maxcost,e->cost); */
238      /* Above is how Dennis had it. Someone changed it to below.
239       * However, this can sometimes lead to a maxcost that is less
240       * than the cost ! -- which seems wrong to me ... seems Dennis
241       * had it right!?
242       */
243      c1->maxcost += e->cost;
244      /* Note: The above computation is used as a saving shortcut in
245       * build_clause(). If it is changed here, it needs to be changed
246       * there too. */
247  }
```

If I use the commented-out way (line 237 instead of line 243), the problem goes away.

EDIT: It indeed goes away for this sentence (I tested it), but my thought that it depends on the place the cost is distributed is not correct, since the cost of AND is additive. So I don't know why it apparently "fixes" something.

3 BTWs: [...]
EDIT: As said in the EDIT above, it will not help. |
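For concreteness, here is a tiny standalone sketch (made-up numbers, not library code) of how the two variants in the quoted loop diverge when an outer cost `e->cost` is folded into a clause that has already accumulated some cost:

```c
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* A clause that already accumulated cost 1.0 (so maxcost is 1.0),
	 * now absorbing an outer expression cost of 2.0. */
	double cost = 1.0, maxcost = 1.0, e_cost = 2.0;

	double cost_after       = cost + e_cost;       /* additive cost: 3.0 */
	double maxcost_dennis   = MAX(maxcost, e_cost);/* line-237 variant: 2.0 */
	double maxcost_additive = maxcost + e_cost;    /* line-243 variant: 3.0 */

	printf("cost               = %.1f\n", cost_after);
	printf("maxcost (Dennis)   = %.1f  (less than the cost!)\n", maxcost_dennis);
	printf("maxcost (additive) = %.1f\n", maxcost_additive);

	/* With cost_cutoff = 2.7, a maxcost <= cost_cutoff test passes under
	 * the Dennis variant even though the true cost is 3.0 -- matching the
	 * over-cost disjuncts reported in this issue. */
	return 0;
}
```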
The way I see it, both [...]. If this causes [...] |
I'm confused regarding the needed fix. But using Dennis's way seems to "fix" it for a reason I don't understand yet. |
BTW -- a related issue in the sql-backed dicts -- the ones that I sent you will have exactly the same disjunct appearing multiple times, with different costs, each in a different UNKNOWN-WORD. In the end, only one of the UNKNOWN-WORD choices will get selected, and it will always be the one with the lowest cost; so all the duplicates are just wasted storage and CPU. Fixing this at dict creation time seems inelegant... mostly because I want each unknown-word to correspond one-to-one with an existing class, rather than creating a special unknown-word-class with the high-cost disjuncts removed. |
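A sketch of the kind of deduplication that would avoid this waste. The `Disjunct` layout and the `same_connector_sequences()` predicate below are assumptions for illustration, not the library's actual definitions (the real library has its own duplicate-elimination pass):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Disjunct_s Disjunct;
struct Disjunct_s
{
	double cost;
	Disjunct *next;
	/* ... connector lists, word string, etc. ... */
};

/* Hypothetical predicate: true when both disjuncts have identical
 * connector sequences (so the copies differ at most in cost). */
bool same_connector_sequences(const Disjunct *a, const Disjunct *b);

/* Keep only the lowest-cost copy of each duplicate disjunct.
 * Quadratic for clarity; hashing the connector sequence would do better. */
static Disjunct *keep_cheapest_duplicates(Disjunct *list)
{
	for (Disjunct *d = list; d != NULL; d = d->next)
	{
		Disjunct **pn = &d->next;
		while (*pn != NULL)
		{
			if (same_connector_sequences(d, *pn))
			{
				if ((*pn)->cost < d->cost) d->cost = (*pn)->cost;
				*pn = (*pn)->next; /* unlink the duplicate */
			}
			else pn = &(*pn)->next;
		}
	}
	return list;
}
```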
I agree (I looked at the file) it is bogus, precisely because the cost has to be distributive. The 'Dennis' variant kind-of-ish might have sort-of made sense a long time ago, when everything was simpler, but as things are now, it doesn't make any kind of reasonable sense. So, yes, rip that out. If the Dennis-thing "fixes" [...] |
In an old post above I said:
So the cost is additive and doesn't depend on how outer costs are distributed, but maxcost does. Possible solutions: [...]
It seems so to me too. |
Just now I tested using [...]. Here are quick benchmarks: [...]
|
I would like to get rid of max-cost, and argue that the original article was faulty. The original max-cost is an idea that almost seems to kind-of make sense, but it fails to fit correctly when one thinks of the linkage as a kind of Bayesian network (or Markov network, whatever). In the network model, the cost is just minus the log-probability, making unlikely arrangements improbable. But for probability to work "as normal", i.e. follow the Kolmogorov axioms, the cost has to be distributive, and the max-cost idea is just this weird thing to the side. At least, that's how it strikes me. BTW, I just pushed a change to fix [...] |
I posted the above comments before I saw your last one; I'm now confused; hang on. |
Oh ..
How did you change line 176? I haven't quite figured it out, but it seems all of [...] |
Here is the changed loop:
|
My change to line 177 is not optimal. It should actually be completely commented out for this test. |
A correct (hopefully) efficiency shortcut for this loop:

```c
for (c4 = c2; c4 != NULL; c4 = c4->next)
{
	//double maxcost = MAX(c3->maxcost,c4->maxcost);
	//if (maxcost + e->cost > ct->cost_cutoff) continue;
	c = pool_alloc(ct->Clause_pool);
	c->cost = c3->cost + c4->cost;
	if (c->cost > ct->cost_cutoff) continue;
	//c->maxcost = maxcost;
	c->c = catenate(c3->c, c4->c, ct->Tconnector_pool);
	c->next = c_head;
	c_head = c;
}
```

However, it can be implemented in a much more efficient way. |
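For what it's worth, one small further tweak under the same assumptions (reusing the variable names from the loop above; a sketch only): testing the cost before drawing from the pool, so a rejected combination doesn't consume a pool slot:

```c
for (c4 = c2; c4 != NULL; c4 = c4->next)
{
	double cost = c3->cost + c4->cost;
	if (cost > ct->cost_cutoff) continue; /* reject before allocating */
	c = pool_alloc(ct->Clause_pool);
	c->cost = cost;
	c->c = catenate(c3->c, c4->c, ct->Tconnector_pool);
	c->next = c_head;
	c_head = c;
}
```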
What are these? |
basic:
Unparsable after the cost-cutoff change (need -cost=3):
fixes:
Unparsable after the cost-cutoff change (need -cost=3):
Unparsable after the cost-cutoff change (need -cost=3.3):
Additional batch: failures:
|
Thanks. I'll have to look at that. Clearly, "goes" has a problem, I guess. |
I'm now encountering this [...]. Since many AND expressions have [...]:
But with this simplification: [...]
So my question is how to proceed.
|
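For reference, a minimal sketch of the kind of one-pass AND/OR flattening under discussion. The `Exp` layout below (type, cost, child/sibling pointers) is an assumption for illustration, not the library's actual definition:

```c
#include <stddef.h>

typedef enum { CONNECTOR_type, AND_type, OR_type } Exp_type;

typedef struct Exp_s Exp;
struct Exp_s
{
	Exp_type type;
	double cost;          /* cost attached to this node */
	Exp *operand_first;   /* first child (AND/OR nodes only) */
	Exp *operand_next;    /* next sibling in the parent's operand list */
};

/* Splice [A & [B & C] & D] into [A & B & C & D] whenever an inner node
 * has the same type as its parent and carries no cost of its own: a
 * zero-cost wrapper changes nothing but bloats disjunct building. */
static void flatten_same_type(Exp *e)
{
	if (e->type == CONNECTOR_type) return;

	for (Exp **opd = &e->operand_first; *opd != NULL; )
	{
		Exp *child = *opd;
		flatten_same_type(child); /* simplify bottom-up */

		if (child->type == e->type && child->cost == 0.0 &&
		    child->operand_first != NULL)
		{
			/* Replace the child by its own operand list. */
			Exp *last = child->operand_first;
			while (last->operand_next != NULL) last = last->operand_next;
			last->operand_next = child->operand_next;
			*opd = child->operand_first;
		}
		else
			opd = &(*opd)->operand_next;
	}
}
```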
A current parse:
The disjunct that has a too-high cost: [...] The reason for this cost=3: [...]
|
I can analyze the source of the high cost for other sentences too, if that would help. |
As part of cleaning up the cost-handling code, I would like to define the following in [...]
|
I think I may be missing something here: are the word-strings of these different UNKNOWN-WORDs all exactly equal to [...] |
I think it's that. It's been a few years, and I forget what I did there. It seemed like a pretty good idea at the time, but there was some combinatorial explosion somewhere that took hours to parse just one sentence. Some sentences took seconds or a minute, and some just took many hours. |
I ask because I don't understand this:
The only place that I know of in which the lower cost is selected is when eliminating duplicate disjuncts. Anyway, merging the disjuncts of different words, as I did in my (still-WIP) sentence-generation branch (where a merged disjunct contains a list of words, each with the cost of its original disjunct), saves much processing. Not only does it drastically reduce the number of disjuncts for word positions that have several different word-strings, but it also allows for a great tracon compression.
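To illustrate the idea of such a merged disjunct (the field names are hypothetical; this is a sketch of the concept, not the WIP branch's code):

```c
typedef struct Connector_s Connector; /* opaque here */

/* A per-word cost entry for a shared disjunct. */
typedef struct Word_cost_s
{
	const char *word_string;   /* e.g. one of several UNKNOWN-WORD choices */
	double cost;               /* cost of the original per-word disjunct */
	struct Word_cost_s *next;
} Word_cost;

/* One disjunct shared by several words at the same sentence position:
 * the connector sequence (the tracon candidate) is stored only once. */
typedef struct Merged_disjunct_s
{
	Connector *left, *right;   /* shared connector lists */
	Word_cost *words;          /* the words that may use this disjunct */
} Merged_disjunct;
```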
When I checked it then, I found a bottleneck in building the disjuncts. It seems this can be improved very much (but I have not started implementing this yet). There was also an inefficiency in the sql-dict expressions, for reasons that I didn't check, and when I applied an expression-simplification function, it significantly reduced the time of building disjuncts. |
I really do forget the details. I could recover them from the diary/notes. What I remember is that the code was learning the grammar for some small number of words (dozens?) but the rest were not being classified. So the idea was to create unknown-word classes that corresponded to the known words. One problem was that there wasn't (isn't) any code for compressing disjuncts. So, instead of having a dictionary with [...]

The code for compressing link-types (merging many different link types into fewer common classes) is incomplete and experimental and poorly measured. Thus, thousands of link types are produced, as is a huge variety of disjuncts that are combinations of these types. So the number of disjuncts is explosively large. The goal of random languages is to be able to actually measure the quality of the above last step. Trying to measure it against the English 4.0.dict is not a viable way forward. |
I'm closing this defect because it's long and has wandered a bit too much. In its place: [...]
|
It was confusing, broken, and prevented good parses from happening. See opencog#1450, opencog#1449, opencog#783, and opencog#1453.
Background
I'm trying to fix the SAT-parser disjunct-cost calculations to get accurate linkage disjunct and sentence costs, and have mostly succeeded with that. (See issue #188 on ignoring the costly-nulls cost there.)
(BTW, I started again to investigate the SAT inaccurate-disjunct-cost problem due to the WIP dialects support (issue #402). For inspecting the modified expressions I needed to fix `print_expression()` (PR #781), and due to the needed investigation of expressions I realized I may have a fix to the said SAT-parser problem.)

When validating the disjunct costs as reported by my said SAT-parser fix, I inspected these costs as reported by the classic parser. I found an apparent problem in the cost cutoff as applied by the classic parser, or in my understanding of it - in any case, one of them needs a fix...
The apparent problem in the classic parser
I noted that sometimes disjuncts which have `disjunct->cost > cost_cutoff` are used in the parse. This is because `build_disjunct()` uses the criterion `cl->maxcost <= cost_cutoff` but sets the actual cost of the disjunct to `cl->cost` (which seems to be the correct cost), which is often greater than `cl->maxcost`.

Example
Test sentence: What's up?
There are examples (other sentences) of even larger disjunct costs being used (with cost-max=2.7).
The disjunct in question (for `'s.v`) is: `[Ss- & dWV- & [K+ & [()]]]`

The `& [()]]` is due to expression pruning. The cost is indeed 3.

What is maxcost, and why can it apparently be set to be less than the cost?
(My fix to the SAT-parser also somehow neglects to apply this same cutoff.)
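Reduced to pseudocode-ish C, the reported inconsistency looks like this (a paraphrase of the description above, with `build_one_disjunct` as a hypothetical stand-in, not the literal library source):

```c
/* The cutoff is tested against cl->maxcost, but the disjunct carries
 * cl->cost. Whenever cl->maxcost <= cost_cutoff < cl->cost (as in the
 * "What's up?" example: cost 3 vs. cutoff 2.7), an over-cost disjunct
 * slips through. */
if (cl->maxcost <= cost_cutoff)
{
	Disjunct *d = build_one_disjunct(cl); /* hypothetical helper */
	d->cost = cl->cost;                   /* may exceed cost_cutoff! */
}
```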