Get rid of max-cost #1453
There are also things that get a cost >= 4 through
The current panic-parse parameters are indeed not useful for finding a parse in a shorter time.
Yes. For backwards compat, it would be a no-op.
The discussion and confusion in #783 about what Dennis had done indicate that it's a poorly formed concept, at least for the English dict. The English parses seem to work better if it is set to infinity. For the Atomese dicts, it is perhaps ... not such a bad concept, as there, some outrageously large costs show up, and could be discarded.
!? Hmm. Interesting question. I'm not sure it's a good idea to turn dialects on and off based on some cost threshold. OK, how about a proposal B:
Arghh. Why is this so hard? I just set
OK, here's what happened. After removing costs, this sentence from corpus-basic parses:
Why? Because
and
and apparently Doing this gives
Then, setting the headline dialect to 99 allows corpus-basic to parse with only one additional error. That error is due to I will remove
So
Have you looked again at the document I referred you to there, which talks about the reason for such a cost computation?
Does it include the "headline" parsing? In the old dict, it was commented out because it often allows bad parses. But with "infinite cost" it will be enabled permanently. On the other hand, if you remove it, how could one enable headline parsing (which you may want to try if there is no parse otherwise)?
Right now they can be turned on/off individually, just by mentioning their name.
Of course; it seems fine to me...
I guess this may be fine too. But you may need to adjust some dictionary costs accordingly.
What should we do with
The increase in the cost cutoff is intended so that more sentences get parsed. However, when sentences marked with And indeed:
This is very simple. The
BTW, internally, costs of ~10000 are used for dialect handling. From link-grammar/dict-common/dialect.h:

```
#define DIALECT_COST_MAX      9999.0F /* Less than that is a real cost */
#define DIALECT_COST_DISABLE 10000.0F
#define DIALECT_SUB          10001.0F /* Sub-dialect (a vector name) */
#define DIALECT_SECTION      10002.0F /* Section header (a vector name) */
```
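To illustrate, a minimal C sketch, not the actual LG code, of how these sentinel values partition the float cost axis (the constants are copied from the header quoted above):

```c
#define DIALECT_COST_MAX      9999.0F /* Less than that is a real cost */
#define DIALECT_COST_DISABLE 10000.0F
#define DIALECT_SUB          10001.0F /* Sub-dialect (a vector name) */
#define DIALECT_SECTION      10002.0F /* Section header (a vector name) */

typedef enum { REAL_COST, COST_DISABLE, SUB_DIALECT, SECTION_HEADER, UNKNOWN } cost_kind;

/* Classify a float found in a dialect table. Ordinary dictionary costs
 * sit below DIALECT_COST_MAX; the sentinels are exact float constants,
 * so exact equality comparison is safe here. */
static cost_kind classify_cost(float c)
{
    if (c < DIALECT_COST_MAX)      return REAL_COST;
    if (c == DIALECT_COST_DISABLE) return COST_DISABLE;
    if (c == DIALECT_SUB)          return SUB_DIALECT;
    if (c == DIALECT_SECTION)      return SECTION_HEADER;
    return UNKNOWN;
}
```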
This one? https://www.semanticscholar.org/paper/Predicate-Expression-Cost-Functions-to-Guide-Search-Bottaci/52f5feca45ea50dc2eaba649a885ff66af575234 It does not seem to be relevant to our situation.

I'm not sure if I should provide a long reply to this, but you seem to enjoy long replies, so what the heck. That paper is about evolutionary search for test data; it seems to describe what we now call "program fuzzing". Although that paper uses the terms "disjunct" and "conjunct", it does not seem to be relevant to what we do with LG. Let's review the concept of natural language parsing.
The above was the situation in 1995. Since then, some new ideas:
At the same time, there are other forces at work:
The current questions are:
My personal answer to the second question is "sure, maybe yes, probably, I guess so, let's try it, eff around and find out."

My personal mental picture of the situation is that it looks kind-of-like Morse theory, except in extremely high dimensions. Each human is coming at grammar from a slightly different direction. Each bifurcation is driven by phonetic discrimination, by the ability or inability to understand and articulate.

Great. What does this have to do with max-cost, and the paper you reference? Well, the disjuncts and conjuncts that LG uses are a fragment of what is called "linear logic": it's a fragment, because some of the logical operators of linear logic are missing. Also, LG is not left-right symmetric; it is instead an associative algebra, and so resembles a pregroup grammar https://en.wikipedia.org/wiki/Pregroup_grammar -- and the resemblance is not accidental; LG appears to be isomorphic to combinatory categorial grammar. See https://github.com/opencog/atomspace/blob/master/opencog/sheaf/docs/ccg.pdf for details.

What should the cost of.... OK, next comment. The above is too long. I don't know why I wrote it. You already understand this stuff.
Some cost questions:
The second question does not make sense, because the two terms result in two different parses. No suggestion from the Bottaci paper applies here. For the first question, we have five possibilities. The cost of
The answer to this question serves two purposes:
If the preferred theory is information theory, and costs are log-probabilities, then I cannot think of any good reason why other formulas could be used, or should be explored. There is nothing in the Bottaci paper that seems appropriate for our current situation: Bottaci is talking about logical operators and De Morgan's laws; the LG operators are NOT logical operators, and they do NOT obey De Morgan's laws. Bottaci is talking about statistical sampling; we are not doing statistical sampling.... I don't know where else to go with this. Yes, tiny details like this can sometimes be very important; but here, I just don't see it. Looks like a cul-de-sac to me.
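For concreteness, a tiny self-contained C sketch of that identity, with made-up probabilities: if cost = -log2(p), then summing the costs of the disjuncts used in a parse is exactly -log2 of the product of their probabilities (assuming independence).

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical probabilities of the three disjuncts used in a parse. */
    double p[] = { 0.5, 0.25, 0.125 };
    double cost_sum = 0.0, prob_product = 1.0;

    for (int i = 0; i < 3; i++) {
        cost_sum     += -log2(p[i]); /* cost of one disjunct */
        prob_product *= p[i];        /* joint probability */
    }

    /* Both lines print 6: adding log-probability costs is
     * the same as multiplying probabilities. */
    printf("sum of costs        = %g\n", cost_sum);
    printf("-log2(product of p) = %g\n", -log2(prob_product));
    return 0;
}
```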
BTW, I think that the LG system should also have another type of cost: an integer one that is applied "locally" to disjuncts during tokenization, and not summed up (using log or any other function) over the sentence. I have tried to make this point several times over the last few years, but you have never told me your opinion on it.
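A minimal sketch of what that proposal might look like (hypothetical fields and threshold; nothing like this exists in LG today):

```c
#include <stdbool.h>

/* Sketch of the proposal above: a float cost summed over the sentence
 * as usual, plus an integer "local" cost consulted only during
 * tokenization, and never aggregated over the parse. */
typedef struct {
    float cost;       /* summed over the parse, log-probability style */
    int   local_cost; /* compared against a threshold, never summed */
} disjunct_costs;

/* Drop a disjunct at tokenization time if its local cost is too high. */
static bool keep_at_tokenization(const disjunct_costs *d, int local_threshold)
{
    return d->local_cost <= local_threshold;
}
```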
But a common thing is that all the connectors of a disjunct must match for the disjunct to be considered matched.
I agree that this is preferable for combining the costs of disjuncts. As far as I know, the LG code also computes the disjunct cost by adding together the costs of its connectors. The question here is what the cost cutoff means: does it apply to disjuncts, or to connectors? If to connectors, then the max function results. But that has a property that looks improper: disjuncts that have a "high" total cost still participate.
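To spell out the difference, a short C sketch (hypothetical helpers, not the LG sources): applying the cutoff to the summed disjunct cost, versus applying it per connector, which is the same as testing the maximum connector cost.

```c
#include <stdbool.h>

/* Cutoff on the disjunct: the SUM of its connector costs
 * must stay at or below the cutoff. */
static bool passes_sum(const float *cost, int ncon, float cutoff)
{
    float total = 0.0f;
    for (int i = 0; i < ncon; i++) total += cost[i];
    return total <= cutoff;
}

/* Cutoff on each connector: equivalent to testing the MAX connector
 * cost. A disjunct whose total cost is "high" can still pass, as long
 * as no single connector exceeds the cutoff. */
static bool passes_max(const float *cost, int ncon, float cutoff)
{
    for (int i = 0; i < ncon; i++)
        if (cost[i] > cutoff) return false;
    return true;
}
```

For example, three connectors of cost 1.9 each pass a cutoff of 2.7 under the max reading, but fail under the sum reading (total 5.7), which is exactly the "improper" behavior noted above.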
It was confusing, broken, and prevented good parses from happening. See opencog#1450, opencog#1449, opencog#783, and opencog#1453.
My apologies; maybe I did not notice that you were making a proposal, or asking a question. Maybe I did not understand what you were saying. Maybe I was not able to develop a strong opinion, one way or the other. Maybe I was in a hurry, and just forgot to respond. Open a new discussion/issue to clearly articulate the idea. I still might not respond, and I might not respond clearly. My general attitude will be "does this make sense from an information-theory perspective?"

Also, as conversations with Anton Kolonin @akolonin over the last year have made clear, we do not have any good high-quality theories or systems for text segmentation. This includes segmentation of Chinese, splitting of Indo-European languages into morphemes+suffixes, detection of sentence boundaries, etc. Without some clear, good theory of how this can work, it will be hard to decide whether this or that modification to LG tokenization is a good idea.
What I meant is that the upside-down ampersand is missing, the question mark is missing, etc. https://en.wikipedia.org/wiki/Linear_logic and https://plato.stanford.edu/entries/logic-linear/ -- linear logic describes tensor algebras; natural language is tensor-like, but is not the full thing. (LG connectors can be understood as tensor indexes, and LG disjuncts can be understood as tensors. However, not all of the required axioms for tensors apply.) I find this interesting, but it's another cul-de-sac.
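One way to draw the analogy concretely (a sketch of the correspondence only, not a description of the LG internals): write a minus connector as a lower index and a plus connector as an upper index, so a word carrying the disjunct A- & B+ is a tensor

$$T \in V_A^{*} \otimes V_B, \qquad T_a{}^{b},$$

and an LG link from its B+ to a matching B- on a later word is an index contraction, $T_a{}^{b}\, U_b{}^{c}$; a complete linkage is then a full contraction down to a scalar. The missing axioms are what keep this an analogy rather than an identification.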
I got tired of agonizing over this, and merged #1456 -- that thing did not help English in any way, and it was not needed for Atomese, so I just got rid of it.
As of this time, after merging the last few pull reqs #1456 and #1455, the code base implements "proposal B" of comment #1453 (comment)
There is a long, old discussion about max-cost in #783, which suggests that getting rid of the concept of max-cost is a good idea. Newer evidence, in #1449, confirms this.
My memory of max-cost was that, originally, it was related to panic-parsing, and that setting it to a higher value would somehow make panic parsing "find something, anything". But this seems untrue. In the English dict, there is only one thing with three brackets: [[[@Xc-]]] that would be chopped by max-cost. (There is nothing with [FOO+]3.0 because the panic-parse stuff predates the floating-point costs.)

Rather than continuing conversations in the already-long #783, let's just do it here. Or rather, let's just .. get rid of it, and be done with it.
As a side-effect, maybe get rid of panic-parse, also. Maybe.
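For readers without the dict notation in their head, a small illustration of the two cost syntaxes named above ("foo" and "bar" are hypothetical entries, not from the real dict):

```
% Each bracket pair adds 1.0 to the cost, so the triple-bracketed
% expression below carries cost 3.0:
foo: A+ or [[[@Xc-]]];

% The newer floating-point syntax attaches an explicit cost instead:
bar: [FOO+]3.0;
```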