Cost cutoff #783

Closed

ampli opened this issue Jun 3, 2018 · 32 comments
@ampli
Member

ampli commented Jun 3, 2018

Background

I'm trying to fix the SAT-parser disjunct cost calculations to get accurate linkage disjunct and sentence costs, and have mostly succeeded. (See issue #188 regarding the costly-nulls cost being ignored there.)

(BTW, I started to investigate the SAT parser's inaccurate disjunct cost problem again because of the WIP on dialects support (issue #402). For inspecting the modified expressions I needed to fix print_expression() (PR #781), and while investigating the expressions I realized I may have a fix for the said SAT-parser problem.)

When validating the disjunct costs as reported by my SAT-parser fix, I inspected these costs as reported by the classic parser. I found an apparent problem in the cost cutoff as applied by the classic parser, or in my understanding of it - in any case, one of them needs a fix...

The apparent problem in the classic parser

I noted that sometimes disjuncts which have disjunct->cost > cost_cutoff are used in the parse.
This is because build_disjunct() uses the criterion cl->maxcost <= cost_cutoff but sets the actual cost of the disjunct to cl->cost (which seems to be the correct cost), which is often greater than cl->maxcost.
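
For illustration, here is a tiny standalone snippet (not the library code, just the arithmetic of the situation described above) showing a clause whose maxcost passes a 2.7 cutoff even though its summed cost is 3:

    /* Toy illustration only -- not build_disjunct() itself. */
    #include <stdio.h>

    typedef struct { double cost; double maxcost; } Clause;

    int main(void)
    {
        const double cost_cutoff = 2.7;
        Clause cl = { /* cost */ 3.0, /* maxcost */ 2.0 };

        /* The cutoff test looks at maxcost... */
        if (cl.maxcost <= cost_cutoff)
            /* ...but the disjunct then carries the (larger) summed cost. */
            printf("kept: disjunct cost = %.2f > cutoff %.2f\n",
                   cl.cost, cost_cutoff);
        return 0;
    }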

Example

Test sentence: What's up?

...
cost-max     Largest cost to be considered                  2.70
...
	Linkage 3, cost vector = (UNUSED=0 DIS= 3.00 LEN=4)

    +---------Xp---------+
    +---->WV---->+       |
    +-->Ws--+Ss*w+--K-+  |
    |       |    |    |  |
LEFT-WALL what 's.v up.r ?

            LEFT-WALL    0.000  hWs+ hWV+ Xp+ 
                 what    0.000  Ws- Ss*w+ 
                 's.v    3.000  Ss- dWV- K+ 
                 up.r    0.000  K- 
                    ?    0.000  Xp- RW+ 
           RIGHT-WALL    0.000  RW- 

There are examples (in other sentences) of even larger disjunct costs being used (with cost-max=2.7).

The disjunct in question (for 's.v) is:
[Ss- & dWV- & [K+ & [()]]]
The & [()] is due to expression pruning. The cost is indeed 3.
What is maxcost, and why can it apparently be set to less than the cost?

(My fix to the SAT-parser also somehow neglects to apply this same cutoff.)

@linas
Member

linas commented Jun 8, 2018

What is maxcost, and why can it apparently be set to less than the cost?

Seems like a bug to me; build_disjunct() would appear to be mis-applying the cutoff if-test.

Please recall the original usage of cost-cutoff: most (almost all) parses can be done using disjuncts with a cost of two or less. The exceptions are sentences which cannot be parsed unless one or more words are skipped. Of these, some are extremely long, and take more than 30 seconds to parse (or used to .. things are faster now). The 30 seconds then triggered the timeout, and the parser went into panic mode. In panic mode, disjuncts with a cost of 3 are allowed -- the hope/desire/goal is that this "looser" setting will result in some, any parse, so that panic mode can report something.

Several things have changed since those bad old days: first, the parser is faster. Second, the dictionary is better. Thus, the number of sentences that trigger the panic parse is much, much lower than ever before.

The main, unanswered question is this: which is more accurate: a parse with skipped words and all link costs < 2.7, or a parse with fewer skipped words and some link costs > 2.7?

The answer is unknown. It depends strongly on the sentence: is it a junk sentence? Or is it a good sentence, while the dictionary is incomplete? If it's a junk sentence, just raise the costs and hope for the best. If it's a good sentence, but the dictionary is bad, then .. ???

@ampli
Member Author

ampli commented Jun 9, 2018

build_clause() maintains two totally different disjunct costs - one for the purpose of the cutoff and one for computing the sentence metrics.

See Table 2 and the article itself for a discussion of why max() is used. According to it, a disjunct cost is the maximum cost of its conjuncts, and it seems this is what maxcost in build_clause() does for AND_type. However, the code still maintains a disjunct cost which is the sum of the conjunct costs, to be used for the sentence cost.
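
For concreteness, here is a toy standalone snippet (not the library code) contrasting the two aggregations, max for the cutoff value and sum for the reported disjunct cost:

    /* Toy illustration only: three conjuncts with costs 1, 0 and 2. */
    #include <stdio.h>

    int main(void)
    {
        const double conjunct_cost[] = { 1.0, 0.0, 2.0 };
        double sum = 0.0, max = 0.0;

        for (int i = 0; i < 3; i++)
        {
            sum += conjunct_cost[i];
            if (conjunct_cost[i] > max) max = conjunct_cost[i];
        }
        /* Prints: maxcost = 2.0 (passes a 2.7 cutoff), cost = 3.0 (exceeds it). */
        printf("maxcost = %.1f, cost = %.1f\n", max, sum);
        return 0;
    }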

So it seems something is wrong in the code, or I have some misunderstanding.

@ampli
Member Author

ampli commented Jun 23, 2018

I would like to soon push my patch to the SAT-parser for handling disjunct cost.

Hence I would like to understand the rationale behind the apparent contradiction between the actually used disjunct cost and its cost for the purpose of the cutoff (which is often lower). BTW, someone else apparently also wondered about that; see the comment in the code.

Or maybe the SAT code should just be fixed to behave the same?

@linas
Member

linas commented Jul 11, 2018

Hmm. Confusing. For link-grammar, I am trying to interpret cost as if it were minus the log likelihood (i.e. the entropy or mutual information of something). Viz, cost = -log p, where p is the probability of the connector or disjunct or whatever. (I am currently writing a paper that shows in painful detail how link-grammar disjuncts are a whole lot like adagram n-grams; I'm hoping that this direct comparison will help clarify the role of cost.)

Anyway, the point is that probabilities multiply, therefore log-probabilities (costs) add. I'm not quite sure how to interpret the maxcost function in this context. Perhaps it's behaving kind-of like the log of a conditional probability? Perhaps we should throw away maxcost, and use only cost? See, e.g., the comments on lines 204-209 of prepare/build-disjuncts.c. I don't know. Perhaps I'll know when I get done writing the above-mentioned paper, and perhaps the analysis there won't answer this question.
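
(Spelling out the additivity: with cost = -log p, and under the assumption that the conjoined parts are independent events, the cost of a conjunction is the sum of the costs:

    \mathrm{cost}(A \wedge B) = -\log\bigl(p(A)\,p(B)\bigr) = -\log p(A) - \log p(B) = \mathrm{cost}(A) + \mathrm{cost}(B)

which is the "costs add" statement above.)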

Re: the paper that you cite: it's interesting. But keep in mind the following:

  1. Probability is essentially exactly the same thing as measure theory; see, for example, the Kolmogorov axioms. This is useful to understand, because measure theory talks about the sizes of sets, and their intersections and their unions, and this view helps avoid almost all of the confusion that the word "probability" tends to cause. Probabilities are just measures of sets.

  2. It is "well-known" that the logical operators and, or correspond to set-intersection and set-union. This can be seen in the Borel sets of measure theory, which come from the abstract mathematical definition of a Boolean algebra (or rather, from Borel sets as providing a "representation" of a Boolean algebra, or perhaps a "model" of a Boolean algebra).

  3. Measures are additive. Viz, m(A or B) = m(A) + m(B) when A and B = empty-set (that is, when A intersect B = empty-set). One can generalize, and replace additivity by subadditivity. In the paper that you reference, conditions 1, 2, 3 of section 4.1 on page 2458 are essentially the "usual" sub-additive axioms. (Here, m is the measure, essentially the same thing as p, the probability.)

All of this is basically saying, in an abstract way, "gee, that's nice, but why should I choose max, instead of addition? What's the conceptual reason for using one instead of the other?" and the answer to that question is, right now, "I don't know". I'm a bit suspicious of max, but I can't strongly opine on it right now.

@linas
Member

linas commented Jul 11, 2018

Ooops. Clarification/correction to above:

  1. If costs are log of probabilities, then I am mis-stating something in bullet 3. The conclusion is still right, because it is "well-known" that the log function is "convex", in exactly the sense that you want it to be when working with log-probabilities.

@ampli
Member Author

ampli commented Sep 22, 2018

I would like to soon push my patch to the SAT-parser for handling disjunct cost.

So I will do it in a way that is compatible with how it is done now in the classic parser.

@ampli ampli mentioned this issue Oct 2, 2018
@ampli
Member Author

ampli commented Nov 11, 2018

Nothing to do on this issue for now, so I'm closing it.

@ampli ampli closed this as completed Nov 11, 2018
@ampli
Member Author

ampli commented Jun 28, 2019

By looking at the expressions generated by problematic db dictionaries, I realized that building disjuncts could be sped up if the expressions were simplified.

I identified two places in which this can be done:

  • In read-sql.c, add OR expressions at the same level as previous ones.
  • For the efficiency of the regular dicts, maybe modify expression_prune() to remove unnecessary levels of parens.

Constructing expressions in read-sql.c can be improved to use long E-List chains (on my TODO list).
But I thought it would be interesting to test whether removing unnecessary levels of parens can speed up building disjuncts even for the current en/ru dicts.

So I implemented an expression simplifier, and indeed even one pass on the expression toplevel leads to a significant speedup of the en batch benchmarks (I did it per sentence, and I also plan to check doing it when reading the dict). (No speedup for the ru dict - I still need to investigate that.)

But I got one different result in the fixes batch (when I simplified all levels):
the sentence Like I said, she'll do it! doesn't currently parse, but after expression simplification it gets parsed. This is an effect of the cost-cutoff, since even now it gets parsed fine with -cost=3.

When looking at that more, I found an inconsistency in the current implementation of cost-cutoff:
when applying a cost to an AND expression, the result depends on where you distribute the cost.

Example (cost-cutoff=2.7): [ [A+] & [[B+]] ]
The result depends on whether it is interpreted as [[A+]] & [[B+]] or as [A+] & [[[B+]]], due to this code (and its corresponding shortcut):

    234         for (c1 = c; c1 != NULL; c1 = c1->next)
    235         {
    236                 c1->cost += e->cost;
    237                 /* c1->maxcost = MAX(c1->maxcost,e->cost);  */
    238                 /* Above is how Dennis had it. Someone changed it to below.
    239                  * However, this can sometimes lead to a maxcost that is less
    240                  * than the cost ! -- which seems wrong to me ... seems Dennis
    241                  * had it right!?
    242                  */
    243                 c1->maxcost += e->cost;
    244                 /* Note: The above computation is used as a saving shortcut in
    245                  * build_clause(). If it is changed here, it needs to be changed
    246                  * there too. */
    247         }

If I use the commented-out way (line 237 instead of line 243), the problem of course goes away, because then it doesn't matter where the cost is distributed. So I think the current method of doing that is indeed incorrect (and even unstable - it may give different results in different situations) and should be fixed.

EDIT: It indeed goes away for this sentence (I tested it), but my thought that it depends on where the cost is distributed is not correct, since the cost of an AND is additive. So I don't know why it apparently "fixes" something.

Three BTWs:

  1. If the sentence Like I said, she'll do it! is correct, then tweaking the dict costs may be needed.
  2. The word "simplifier" (used above) is not in the dict and is not regex-guessed, so it gets spell-corrected. I have a list of missing words, most of them "technical" like this one.
  3. The SAT parser still has several cost differences from the classic one; I think that maybe some of them are due to this problem (of cost redistribution).

My proposal: Change build_clause() to use the maxcost implementation of Dennis.

EDIT: As said in the EDIT above, this will not help.
So it now seems to me that the concept of maxcost may be bogus and just the cost should be used.

@ampli ampli reopened this Jun 28, 2019
@linas
Member

linas commented Jun 28, 2019

The way I see it, both [[A+]] & [[B+]] and also [A+] & [[[B+]]] should be the same as [[[[A+ & B+]]]], and all of these variants should be interchangeable (i.e. should be the same as A+ & [[[[B+]]]], etc.).

If this causes Like I said, she'll do it! to fail, that is a distinct issue -- I can fix that one in the english dict, as needed.

@ampli
Member Author

ampli commented Jun 28, 2019

I'm confused regarding the needed fix.
See the "EDIT" in the original post above.
My thought that it depends on where the cost is distributed is not correct, since the cost of an AND is additive.

But using the way of Dennis seems to "fix" it, for a reason I don't understand yet.
So for now I don't have a proposed fix, and will continue to investigate this.

@linas
Member

linas commented Jun 28, 2019

BTW -- a related issue in the sql-backed dicts -- the ones that I sent you will have exactly the same disjunct appearing multiple times, with different costs, each in a different UNKNOWN-WORD. In the end, only one of the UNKNOWN-WORD choices will get selected, and it will always be the one with the lowest cost; so all the duplicates are just wasted storage and CPU.

Fixing this at dict creation time seems inelegant... mostly because I want each unknown-word to correspond one-to-one with an existing class, rather than creating a special unknown-word class with the high-cost disjuncts removed.

@linas
Member

linas commented Jun 28, 2019

the concept of maxcost may be bogus

I agree (I looked at the file) that it is bogus, precisely because the cost has to be distributive. The 'Dennis' variant kind-of-ish might have sort-of made sense a long time ago, when everything was simpler, but as things are now, it doesn't make any kind of reasonable sense. So, yes, rip that out.

If the Dennis-thing "fixes" Like I said, she'll do it!, that is only an "accident", a side-effect of how the dictionary is structured. I will look at this now.

@ampli
Member Author

ampli commented Jun 28, 2019

In an old post above I said:

See table2 and the article itself for discussion of why max() is used. According to it, a disjunct cost is the maximum cost of its conjuncts, and it seems this is what maxcost in build_clause() does for AND_type. However, the code still maintains a disjunct cost which is the sum of the conjunct costs, to be used for the sentence cost.

So the cost is additive and doesn't depend on how outer costs are distributed, but maxcost depends on the maximum conjunct cost, which does depend on the actual outer cost distribution, which may happen by chance. So my original observation (of dependency on the actual place of distribution) seems to be correct.
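
To put numbers on it, here is the earlier [ [A+] & [[B+]] ] example worked out by hand (assuming each bracket pair adds cost 1, and taking the AND's cutoff value to be the maximum of its conjuncts' costs, per the article's convention):

    \text{[[A+]] \& [[B+]]}:\quad \mathrm{cost} = 2 + 2 = 4,\quad \mathrm{maxcost} = \max(2,2) = 2 \le 2.7
    \text{[A+] \& [[[B+]]]}:\quad \mathrm{cost} = 1 + 3 = 4,\quad \mathrm{maxcost} = \max(1,3) = 3 > 2.7

The summed cost is the same either way, but with a 2.7 cutoff one reading survives and the other does not.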

Possible solutions:

  1. Distribute the outer cost of an AND expression evenly among its conjuncts.
    The code above needs a change in order to support that (a change in line 243, and a similar one in line 177):
    c1->maxcost += e->cost / num_conjuncts
    However, I'm not sure it will solve the consistency problem (i.e. that the result doesn't depend on the order in which costs are mentioned and distributed in the expression). I guess it can be mathematically proved whether it does so or not, but that is beyond my ability...

  2. Use disjunct cost as the cut-off value.

If the Dennis-thing "fixes" Like I said, she'll do it!, that is only an "accident", a side-effect of how the dictionary is structured.

It seems to me so too.

@ampli
Member Author

ampli commented Jun 28, 2019

Just now I tested using cost instead of maxcost for the cost-cutoff.
Since maxcost <= cost, fewer disjuncts may be included. As a result, the basic and fixes batches have more errors. However, they run faster... and the fix-long batch runs much faster...
(I can provide the list of sentences that become incorrect.)

Here are quick benchmarks:

corpus-fixes:
Branches: master / cost-cutoff
master 4.97 / cost-cutoff 4.75: 4.63%
Error: Different number of errors (384 != 398)

corpus-basic:
Branches: master / cost-cutoff
master 1.74 / cost-cutoff 1.69: 2.96%
Error: Different number of errors (82 != 83)

corpus-fix-long:
Branches: master / cost-cutoff
master 45.27 / cost-cutoff 36.91: 22.65%

@linas
Member

linas commented Jun 28, 2019

I would like to get rid of max-cost, and argue that the original article was faulty. The original max-cost is an idea that almost seems to kind-of make sense, but it fails to fit correctly when one thinks of the linkage as a kind of Bayesian network (or Markov network, whatever). In the network model, the cost is just minus the log-probability, making unlikely arrangements improbable. But for probability to work "as normal", i.e. follow the Kolmogorov axioms, the cost would have to be distributive, and the max-cost idea is just this weird thing to the side. At least, that's how it strikes me.

BTW, I just pushed a change to fix Like I said, she'll do it! and also BTW, I published version 5.6.2 four days ago, but forgot to update github, so there's some branching mess there now.

@linas
Member

linas commented Jun 28, 2019

I posted the above comments before I saw your last; I'm now confused; hang on.

@linas
Member

linas commented Jun 28, 2019

Oh ..

using cost instead of maxcost

How did you change line 176? I haven't quite figured it out, but it seems all of build_clause() suddenly gets a lot simpler.

@ampli
Member Author

ampli commented Jun 28, 2019

Here is the changed loop:

				for (c4 = c2; c4 != NULL; c4 = c4->next)
				{
					//double maxcost = MAX(c3->maxcost,c4->maxcost);
					if (e->cost > ct->cost_cutoff) continue;

					c = pool_alloc(ct->Clause_pool);
					c->cost = c3->cost + c4->cost;
					//c->maxcost = maxcost;
					c->c = catenate(c3->c, c4->c, ct->Tconnector_pool);
					c->next = c_head;
					c_head = c;
				}

@ampli
Member Author

ampli commented Jun 28, 2019

My change to line 177 is not optimal.
However, it doesn't change the correctness as long as a too-low cost is used there (this line is for efficiency only).

It should actually be completely commented out for this test.
I did that, and the efficiency didn't get noticeably worse.

@ampli
Member Author

ampli commented Jun 28, 2019

A correct (hopefully) efficiency shortcut for this loop:

				for (c4 = c2; c4 != NULL; c4 = c4->next)
				{
					//double maxcost = MAX(c3->maxcost,c4->maxcost);
					//if (maxcost + e->cost > ct->cost_cutoff) continue;

					c = pool_alloc(ct->Clause_pool);
					c->cost = c3->cost + c4->cost;
					if (c->cost > ct->cost_cutoff) continue;
					//c->maxcost = maxcost;
					c->c = catenate(c3->c, c4->c, ct->Tconnector_pool);
					c->next = c_head;
					c_head = c;
				}

However, it can be implemented in a much more efficient way.
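
For example (a sketch only, not a tested patch), the cost test can be moved before the allocation, so that rejected combinations never touch the clause pool:

				for (c4 = c2; c4 != NULL; c4 = c4->next)
				{
					double cost = c3->cost + c4->cost;     /* plain additive cost */
					if (cost > ct->cost_cutoff) continue;  /* reject before allocating */

					c = pool_alloc(ct->Clause_pool);
					c->cost = cost;
					c->c = catenate(c3->c, c4->c, ct->Tconnector_pool);
					c->next = c_head;
					c_head = c;
				}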

@linas
Member

linas commented Jun 29, 2019

Error: Different number of errors (384 != 398)
Error: Different number of errors (82 != 83)

What are these?

@ampli
Member Author

ampli commented Jun 29, 2019

basic:
Currently parsable but unparsable (as desired) after the cost-cutoff change:

*My some female friends were angered by the hearings

Unparsable after the cost-cutoff change (need -cost=3):

I gave him for his birthday a very expensive present
Monday sounds good for the meeting

fixes:
Currently parsable but unparsable (I guess as desired) after the cost-cutoff change:

*"To" It's to let; there's a notice on the door

Unparsable after the cost-cutoff change (need -cost=3):

Here goes nothing.
I ran 10 fewer miles than Ben.
I ran 10 more miles than Ben.
Maybe it works now...
That car goes 90 MPH.
There goes one of the strongest competitors this sport has seen.
There goes the biggest loser ever.
There goes the cutest guy ever!
There goes the neighborhood!
There went the cutest guy ever!
We collected HTLV-I infected T-cells.
we went asdfing and qrwerting

Unparsable after the cost-cutoff change (need -cost=3.3):

B: Press button firmly.
Step B. Press button firmly.
Step B: Press button firmly.

Additional batch: failures:

Being part of the natural world and a proper object of scientific study, X is predictable on the basis of X's preferences and information, which are in turn the result of X's nature and nurture.
Far-carrying voice a loud, clear, vibrant, fluty string of double notes; also a single repeated musical yelp.
In one case, the court indicated that it would intervene to prevent a breach by the visitor of the rules of natural justice : see Bently v. Bishop of Ely (1729) 1 Barn.K.B. 192.
The men who were acquitted, Danial Winter, who's 19, and Wisdom Smith, who's also 19, had nothing to say after the judge directed they should be found not guilty.
They tiptoe down and see that he's got it open, all right, but he's lying there dead.
To have to sit in an office all day playing the same records -- all of which are awful -- over and over and over again -- well, it's not funny, is it ?
Well at least that proves e knows where e's goin', whispered Tommy.

@linas
Member

linas commented Jun 29, 2019

Thanks. I'll have to look at that. Clearly, "goes" has a problem, I guess.

@ampli
Member Author

ampli commented Jul 9, 2019

I'm now encountering this maxcost problem when I try to further simplify expressions in expression_prune(), in order to have less total work in further simplifications there, and especially in build_clause().

Since many AND expressions have [()]x terms, where x is a cost, I added this simplification:
[(... & [()]x & ...)]y -> [(... & ...)]x+y
However, this may change maxcost, so, for example, the following sentence doesn't parse any more.
Currently, without this simplification:

linkparser> Monday sounds good for the meeting
Found 4 linkages (4 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 3.00 LEN=8)

    +------>WV------>+             +-----J-----+
    +-->Wd---+--Ss*s-+--Paf--+-MVp-+    +--Dmu-+
    |        |       |       |     |    |      |
LEFT-WALL Monday sounds.v good.a for.p the meeting.g

            LEFT-WALL    0.000  hWd+ hWV+ RW+
               Monday    3.000  Wd- Ss*s+
             sounds.v    0.000  Ss- dWV- Pa+
               good.a    0.000  Paf- @MV+
                for.p    0.000  MVp- J+
                  the    0.000  D+
            meeting.g    0.000  Dmu- J-
           RIGHT-WALL    0.000  RW-

But with this simplification:

linkparser> Monday sounds good for the meeting
No complete linkages found.
Found 2 linkages (2 had no P.P. violations) at null count 2
	Linkage 1, cost vector = (UNUSED=2 DIS= 6.02 LEN=20)

    +-------------------->Wa---------------------+
    |        +-----------------AN----------------+
    |        |       +-------------AN------------+
    |        |       |       +---------A---------+
    |        |       |       |                   |
LEFT-WALL Monday sounds.n good.a [for] [the] meeting.n

            LEFT-WALL    0.000  hWa+ RW+
               Monday    2.000  AN+
             sounds.n    2.000  AN+
               good.a    0.000  A+
            meeting.n    2.020  @A- @AN- Wa-
           RIGHT-WALL    0.000  RW-

Press RETURN for the next linkage.
linkparser> !cost=3
cost-max set to  3.00
linkparser> Monday sounds good for the meeting
Found 4 linkages (4 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 3.00 LEN=8)

    +------>WV------>+             +-----J-----+
    +-->Wd---+--Ss*s-+--Paf--+-MVp-+    +--Dmu-+
    |        |       |       |     |    |      |
LEFT-WALL Monday sounds.v good.a for.p the meeting.g

            LEFT-WALL    0.000  hWd+ hWV+ RW+
               Monday    3.000  Wd- Ss*s+
             sounds.v    0.000  Ss- dWV- Pa+
               good.a    0.000  Paf- @MV+
                for.p    0.000  MVp- J+
                  the    0.000  D+
            meeting.g    0.000  Dmu- J-
           RIGHT-WALL    0.000  RW-

So my question is how to proceed:

  1. Don't do such simplifications for now.
  2. Make the cut-off according to disjunct cost only (without maxcost).
    2.1 Let some sentences not be parsed until the dict is fixed.
    2.2 Fix the dict first.

@ampli
Member Author

ampli commented Jul 9, 2019

[...] "goes" has a problem [...]
Here are my findings:
From the dict:

<verb-wall>: ((dWV- or dCV- or dIV-) & {VC+}) or [()];
goes.v:
  (<verb-y-s> & (<vc-go> or ({[[O*t+]]} & {@MV+} & <verb-wall>)))
  or (<verb-and-s-> & <vc-go>)
  or (<vc-go> & <verb-and-s+>);

A current parse:

linkparser> That car goes 90 MPH.
Found 2 linkages (2 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 5.00 LEN=9)

    +-----------------Xp-----------------+
    +---------->WV--------->+            |
    +------>Wd-------+      +---Opt--+   |
    |         +-Dsu*c+-Ss*s-+    +-ND+   |
    |         |      |      |    |   |   |
LEFT-WALL that.j-d car.n goes.v 90 MPH.u .

            LEFT-WALL    0.000  hWd+ hWV+ Xp+
             that.j-d    1.000  D*u+
                car.n    0.000  Ds**c- Wd- Ss*s+
               goes.v    3.000  Ss- dWV- O*t+
                   90    0.000  ND+
                MPH.u    1.000  ND- Op-
                    .    0.000  Xp- RW+
           RIGHT-WALL    0.000  RW-

The disjunct that has a too-high cost: goes.v 3.000 Ss- dWV- O*t+

The reason for this cost=3:
({[[O*t+]]} & {@MV+} & <verb-wall>))) matches, where [[O*t+]] is cost=2 and <verb-wall> is cost=2 due to the [()] term, which was added with this comment:

+% There are some other such connectors that don't quite fit this pattern:
+% AF, Z, and in many cases B (for example TOt+ & B+). For this reason, we
+% have to have a costly null [()] below, although we would really really

@ampli
Member Author

ampli commented Jul 9, 2019

linkparser> Maybe it works now...
Found 4 linkages (4 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS= 3.06 LEN=12)

    +---------------Xx---------------+
    +-------->WV------->+            |
    +----->Wd------+    |            |
    |        +<COa<+-Ss-+--MVp-+     |
    |        |     |    |      |     |
LEFT-WALL maybe.e it works.v now.r ....y

            LEFT-WALL    0.060  hWd+ hWV+ Xx+ RW+
              maybe.e    3.000  dCOa+
                   it    0.000  @hCO- Wd- Ss+
              works.v    0.000  Ss- dWV- @MV+
                now.r    0.000  MVp-
                ....y    0.000  Xx-
           RIGHT-WALL    0.000  RW-
someday.e sometime.e maybe.e
afterwards.e afterward.e worldwide.e nationwide.e
statewide.e world-wide.e nation-wide.e state-wide.e industrywide.e
the_world_over:
  <directive-opener>
  or ({Xd- & Xc+} & (MVp- or E+))
  or (Wa- & {Wa+});

% The use of COa here needs to be carefully re-examined;
%   it is used much too freely.
% COa+ is used to block links to COd-
% Xc+ & Ic+: connect to imperatives (infinitive verbs): "Anyhow, don't"
% Wc- & Xc+ & Qd+: subject-object inversion: "anyhow, am I right?"
%       This gets a fairly stiff cost if the comma is missing.
<directive-opener>:
  {[[Wa-]]} &
    ((Xc+ & Ic+) or
    (Wc- & (Xc+ or [()]1.2) & Qd+) or
    ({Xd-} & (Xc+ or [[()]]) & [dCOa+]));

I can analyze the source of the high cost for other sentences too if this may help.

@ampli
Member Author

ampli commented Jul 10, 2019

As part of cleaning up the cost-handling code, I would like to define the following in dict-structures.h:

  1. Define a cost type (currently "double") so it could be easily changed to float if desired.
    I propose cost_t.
  2. Define a print format (currently 2, 3, 4, or 6 digits after the decimal point are printed in various places).
    I propose const unsigned int cost_precision = 3;.
  3. Define a function costs_eq(cost1, cost2) (currently cost1-cost2 is compared to an inconsistent value, 1E-4 or 1E-5).
    I propose const cost_t cost_epsilon = 1E-5;.
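
A minimal sketch of what these could look like (illustrative only, using the names proposed above; not actual dict-structures.h content):

    /* Sketch of the proposed definitions. */
    #include <math.h>
    #include <stdbool.h>

    typedef double cost_t;                        /* 1. switchable to float later   */
    static const unsigned int cost_precision = 3; /* 2. digits after decimal point  */
    static const cost_t cost_epsilon = 1E-5;      /* 3. single tolerance everywhere */

    static inline bool costs_eq(cost_t a, cost_t b)
    {
        return fabs(a - b) < cost_epsilon;
    }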

@ampli ampli mentioned this issue Jul 12, 2019
@ampli ampli mentioned this issue Jul 20, 2019
@ampli
Member Author

ampli commented Feb 21, 2021

BTW -- a related issue in the sql-backed dicts -- the ones that I sent you will have exactly the same disjunct appearing multiple times, with different costs, each in a different UNKNOWN-WORD. In the end, only one of the UNKNOWN-WORD choices will get selected, and it will always be the one with the lowest cost; so all the duplicates are just wasted storage and CPU.

I think I may be missing something here: are the word-strings of these different UNKNOWN-WORD entries all exactly equal to UNKNOWN-WORD? Or are they UNKNOWN-WORD.a, UNKNOWN-WORD.n, etc.?

@linas
Member

linas commented Feb 22, 2021

Or are they UNKNOWN-WORD.a, UNKNOWN-WORD.n, etc.?

I think it's that. It's been a few years, and I forget what I did there. It seemed like a pretty good idea at the time, but there was some combinatoric explosion somewhere that took hours to parse just one sentence. Some sentences took seconds or a minute, and some just took many hours.

@ampli
Member Author

ampli commented Feb 22, 2021

Or are they UNKNOWN-WORD.a, UNKNOWN-WORD.n, etc.?

I think it's that.

I ask because I don't understand this:

it will always be the one with the lowest cost

The only place that I know of in which the lower cost is selected is when eliminating duplicate disjuncts.
If different word-strings at the same position have the same disjunct with different costs, both are used in different linkages, and the order of these linkages is not guaranteed.

Anyway, merging the disjuncts of different words, as I did in my (yet WIP) sentence-generation branch (where a merged disjunct contains a list of words, each with the cost of its original disjunct), saves much processing. Not only does it drastically reduce the number of disjuncts for word positions that have several different word-strings, but it also allows for great tracon compression.
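
Roughly, the layout being described is something like this (hypothetical names; the actual WIP branch may differ):

    /* Hypothetical sketch of a merged disjunct shared by several
     * word-strings at the same word position, each keeping its own cost. */
    typedef struct word_cost_s
    {
        const char *word_string;   /* e.g. "UNKNOWN-WORD.n" */
        double cost;               /* cost of the original, unmerged disjunct */
        struct word_cost_s *next;
    } word_cost;

    typedef struct merged_disjunct_s
    {
        /* connector lists etc. omitted */
        word_cost *words;          /* all word-strings sharing this disjunct */
    } merged_disjunct;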

Some sentences took seconds or a minute, and some just took many hours.

When I checked it then, I found a bottleneck in building the disjuncts. It seems this can be improved very much (but I have not started implementing it yet). There was also an inefficiency in the sql-dict expressions, for reasons that I didn't check, and when I applied an expression-simplification function, it significantly reduced the time of building the disjuncts.

@linas
Member

linas commented Feb 23, 2021

I really do forget the details. I could recover them from the diary/notes. What I remember is that the code was learning the grammar for some small number of words (dozens?) but the rest were not being classified. So the idea was to create unknown-word classes that corresponded to the known words.

One problem was that there wasn't (isn't) any code for compressing disjuncts. So, instead of having a dictionary with @A- & <stuff>, there was <stuff> or (A- & stuff) or (A- & A- & A- & stuff) or ... Worse, there would be holes, e.g. maybe the two-A- case was missing by random chance.

The code for compressing link-types (merging many different link types into fewer common classes) is incomplete and experimental and poorly-measured. Thus, thousands of link types are produced, as is a huge variety of disjuncts that are combinations of these types. So the number of disjuncts is explosively large.

The goal of random languages is to be able to actually measure the quality of the above last step. Trying to measure it against the English 4.0.dict is not a viable way forward.

@linas
Member

linas commented Feb 24, 2023

I'm closing this defect because it's long and has wandered a bit too much. In its place:

@linas linas closed this as completed Feb 24, 2023
linas added a commit to linas/link-grammar that referenced this issue Feb 25, 2023
It was confusing, broken, and prevented good parses from happening.
See opencog#1450 opencog#1449 and opencog#783 and opencog#1453