link-generator -l en uses unexpectedly high memory and CPU #1501
Comments
Here it is on a fast Linux machine (high-end i9 with 64GB memory). I added
You can see that it took about 136s of CPU time. So it seems you just need to wait some more.
Thank you, you're right, it did complete after around five minutes.
I guess if this is expected behavior then there is no bug. It seemed like a bug to me because with Lithuanian or German it returned results instantly. Using up all my memory and CPU for over a minute with no output seemed like an infinite loop allocating unbounded amounts of memory. I had read in #1020 that the code originated in 1995 and that, while memory usage was a concern then, it isn't anymore on modern machines. It says there that machines with hundreds of megabytes of memory or less would be memory-constrained, so I didn't think that my 16GB machine would be. I had also seen this comment in the source reinforcing that: link-grammar/link-grammar/utilities.c Lines 368 to 372 in 3a26127
It is not a "bug" - it is due to a huge number of disjuncts. Most of the CPU time could be saved by randomly discarding most disjuncts before the "Eliminated duplicate disjuncts" step. Additional speed and usefulness could be added by an API to select desired features for the generated sentences (like ending punctuation only as the last word, and the type and amount of punctuation in the middle). See below for an example of the currently available constraints. I also have numerous efficiency fixes that I would like to send (as time permits).
As you can see, it is not an infinite loop, as it eventually finishes (even if you try to generate longer sentences). My "time" prompt said it used 17758976 KB of memory - about 17GB. That is more memory than you have, which may cause page swapping (indicated by the 92% CPU that you got, instead of the ~99.5% that I got). Now for an example of using the 4 types of constraints that are implemented just now, demonstrating the speedup:
The step "Finished creating list" too ~13 seconds (it can be optimized to almost nothing). So the actual generation took only ~4.5 seconds (code optimizations can drastically reduce it). The type of constraints that were used here:
More attributes, like optional words, or the ability to specify a mix of attributes, may be added, but it seems desirable to do it in a more general way. One of these ways is to add filtering plugins that can apply attributes to any desired group of words (including individual words and the whole sentence). BTW, I think that the repeating template-replacing words or word classes in the generated sentences may hint at a bug.
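To make the random-discard idea mentioned earlier in this comment concrete, here is a minimal C sketch of thinning a disjunct list before duplicate elimination. The struct and function names are invented for illustration; this is not the actual link-grammar Disjunct structure or pruning code.

```c
#include <stdlib.h>

struct disjunct {              /* simplified stand-in, not the LG struct */
    struct disjunct *next;
    /* ... connectors, cost, etc. ... */
};

/* Keep each disjunct with probability keep_prob; free the rest.
 * Returns the new list head. */
static struct disjunct *sample_disjuncts(struct disjunct *head, double keep_prob)
{
    struct disjunct **prev = &head;
    while (*prev != NULL) {
        if ((double)rand() / RAND_MAX < keep_prob) {
            prev = &(*prev)->next;          /* keep this one */
        } else {
            struct disjunct *victim = *prev;
            *prev = victim->next;           /* unlink and discard */
            free(victim);
        }
    }
    return head;
}
```

In practice the keep probability would presumably be chosen to bring the disjunct count under some fixed budget rather than being a hard-coded constant.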
Right, I see that now. I was explaining what my impression was at the time that I filed the bug report.
FYI, I have recently started making plans to rethink/rework/redo the generator and how generation works; however, I won't be able to start anything for months. The ideas are sufficiently vague that I can't explain them without a lot of effort. Something something more flexible word order. I would like the link-generator to use the same user interface as link-parser: so that one could type in "John * * * by the Red Sea.", have the generator fill in those blanks, and then get a prompt so I can do it again. I would also like to be able to say something like "the first * should be verb-like and the second * should be adverb-like", or even "adverb-like denoting larger or bigger" or "adverb-like approximate synonym for bigger", but I haven't yet tried to imagine a detailed API for that.
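A minimal sketch of what that interactive loop might look like, purely for flavor: `generate_fill()` below is hypothetical (no such function exists in the link-grammar API today), and the stub merely echoes the template so that the program runs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for "fill in the '*' wildcards of a template
 * sentence and return one generated realization".  This stub only echoes
 * the template; a real implementation would call into the generator. */
static char *generate_fill(const char *template_sentence)
{
    return strdup(template_sentence);
}

int main(void)
{
    char line[1024];

    for (;;) {
        printf("generator> ");
        fflush(stdout);
        if (!fgets(line, sizeof line, stdin)) break;   /* EOF ends the session */
        line[strcspn(line, "\n")] = '\0';
        if (line[0] == '\0') continue;

        char *out = generate_fill(line);   /* e.g. "John * * * by the Red Sea." */
        printf("%s\n", out);
        free(out);
    }
    return 0;
}
```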
So this comment:
Yes. The following PDFs give a general feel and inspiration for the "more general way": https://kahanedotfr.files.wordpress.com/2017/01/mtt-handbook2003.pdf It gives the general flavor. How to actually do it, fill in all the precise bits and pieces, and create something usable rather than theoretical - that's the hard part. I don't know.
Now you can get a one-time (non-editable) prompt by
I also don't know how to jump to such sophistication. I meant a much more basic and simpler implementation (by several degrees), which MTT may inspire, and which may implement some MTT-like "easy" ideas.
For that (and for implementing an MTT interface), we need an external dictionary that lists each word's grammatical features. Or maybe a
About five years ago, I spoke to someone who proudly proclaimed how their company is able to do so well because they implemented MTT (as a proprietary product in their software stack). So, there is certainly a "proof of concept". The "biggest" idea in MTT is the "lexical function" (LF in Wikipedia), and if you keep your nose to the grindstone, it's not hard to figure out how to implement that; it would even be fun. But ... of course one needs to have some "external dictionary", which means creating that dictionary. There are two ways to do this:
As I hope is clear, the hand-crafted version will never do. That leaves option two: machine learning. But how? How would that work? I'm not sure. I think I know how to get there, but ... of course, there are two ways to machine learn:
Of course, 2.A. is relatively easy, but it is also a kind of dead-end. Option B is harder. I'm pursuing option B. Let me explain ... (next post)
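For readers who haven't run into MTT before, a lexical function maps a keyword to the word that conventionally expresses some meaning alongside it. A tiny illustrative lookup table, with classic textbook LF values; the encoding is mine, not anything from MTT software or the external dictionary discussed above:

```c
#include <stdio.h>
#include <string.h>

/* Magn = "intensifier"; Oper1 = "support verb taking the keyword as object". */
struct lf_entry { const char *function; const char *keyword; const char *value; };

static const struct lf_entry lf_table[] = {
    { "Magn",  "rain",      "heavy" },   /* Magn(rain)   = heavy ("heavy rain")      */
    { "Magn",  "smoker",    "heavy" },   /* Magn(smoker) = heavy ("heavy smoker")    */
    { "Oper1", "attention", "pay"   },   /* Oper1(attention) = pay ("pay attention") */
    { "Oper1", "decision",  "make"  },   /* Oper1(decision)  = make ("make a decision") */
};

static const char *lf_lookup(const char *function, const char *keyword)
{
    for (size_t i = 0; i < sizeof lf_table / sizeof lf_table[0]; i++)
        if (strcmp(lf_table[i].function, function) == 0 &&
            strcmp(lf_table[i].keyword, keyword) == 0)
            return lf_table[i].value;
    return NULL;
}

int main(void)
{
    printf("Magn(rain) = %s\n", lf_lookup("Magn", "rain"));
    return 0;
}
```

The interesting part, of course, is not the lookup but filling in such a table for an entire lexicon, which is exactly the dictionary-creation problem described above.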
I think I can bootstrap up to MTT. I've gone half-way up the bootstrap, and it works well, but I ran into maintainability problems. The bootstrap is this:
I've done the above. It more-or-less works. The problem is that it runs as large batch jobs, taking days or weeks to run. The training framework is fragile. Bugs in the code mean throwing away datasets and starting from scratch. Very tedious. To fix that, I'm redesigning the pipeline, going back to square one, and creating a data-flow architecture instead of a batch process. But this is a lot of work. (I've got no help.) If/when I get this working, I can resume the journey:
I said "I think it'll work" because the above has been "done already" by assorted academics and grad students working in hand-crafted perl scripts and java monstrosities. As usual, that code is badly-written, obsolete, and can no longer be found on the internet, if it was ever published to begin with. What I'm doing is trying to create a single unified framework that can integrate this, "organically", instead of being a bunch of stove-piped java classes and perl scripts. However, doing this is ... hard. It's like that country-western song: "one step forward and two steps back, you can't get too far like that". That's the plan. It made sense 5 or 10 years ago. Hidden under the surface are a bunch of other technical and philosophical and design and engineering questions, some meta-level questions, and I'm struggling with the meta-level questions now. So when I say "read about MTT" I don't mean "jump to start coding now", but more like "day-dream about how that could be done." and more importantly "what's wrong with brute-force MTT aka option 1 or 2.A." I dunno. |
@ampli Here's one more, for your horror and delight. The demo located here: https://github.com/opencog/sensory/blob/master/examples/filesys.scm creates a link-grammar style API into a "sensori-motor" or "perception-action" environment. The possible "perceptions" are described as connectors, and so are the possible "actions". The "environment" is supposed to be the "external world", something outside of the "agent". For this demo, this external world is the unix file system, and the (eventual) goal of the demo is to build an agent that bounces around the file system, doing things. There's another one, for IRC chat.

For example, the unix "ls" command becomes a disjunct; in that disjunct are a couple of "+" direction connectors that must connect in order for the command to run, and the "-" connector must be connected to something that can accept the output of "ls" (which is a text stream). If the agent wants to perform some action in this external world, it just needs to hook up appropriate stuff to each of the connectors, and then say "go". Data will then flow across the links. So, each LG link becomes somewhat like a unix pipe, with data flowing whichever way. The pipes are typed, to describe the kind of data that flows in them. That is, the LG connector rules must be obeyed, in order to maintain compatibility for the data flowing in those pipes. The demo is horribly ugly and complicated, because LG is not actually being used. Instead, the demo painfully dissects each disjunct (called a ...).

There's also an IRC chatbot demo there; the only thing the chatbot does is echo. You can think of the chatbot as an LG parse of one word: LEFT-WALL is the input, a processing node sits in the middle, and RIGHT-WALL is the output. The processing node just copies from input to output, so it's an echobot.

Why am I mentioning this here? I envision having agents "do stuff" by hooking together these "pipes", using LG-style connector-type matching rules. Some of the processing stages could be wild-cards, and so ...

Anyway, this is all still a daydream. An experiment in progress. Maybe doing things this way will turn out to be interesting and useful, and maybe it will turn out to be a dumb trick. Not sure. The demos work. I'm still a ways away from actually being able to use LG to perform the hookups.
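A hypothetical sketch of the "commands as disjuncts, links as typed pipes" idea, with invented connector names (DIR, TXT) and structures; none of this is taken from the actual opencog/sensory code:

```c
#include <stdio.h>
#include <string.h>

struct connector {
    const char *type;   /* the "pipe type", e.g. "DIR" or "TXT" */
    char direction;     /* '+' offers data, '-' requires data   */
};

/* "ls" as a disjunct: it must be handed a directory, and it offers a text stream. */
static const struct connector ls_disjunct[] = {
    { "DIR", '-' },
    { "TXT", '+' },
};

/* Two connectors can be linked (a pipe can be laid) when their types match
 * and their directions are opposite -- the usual LG matching rule, simplified. */
static int can_link(const struct connector *a, const struct connector *b)
{
    return strcmp(a->type, b->type) == 0 && a->direction != b->direction;
}

int main(void)
{
    struct connector text_sink = { "TXT", '-' };   /* something that consumes text */
    printf("ls output -> text sink: %s\n",
           can_link(&ls_disjunct[1], &text_sink) ? "link OK" : "no link");
    return 0;
}
```

Once every connector of a disjunct is satisfied, the "command" can run and data flows across the resulting links, which is the pipe analogy in the comment above.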
Could you create an LG dictionary on the fly, parse (or generate) a "sentence", and then use the linkage?
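For the parse-and-use-the-linkage part, the public C API already covers it; here is a minimal sketch along the lines of the standard link-grammar example program (error handling trimmed). Creating the dictionary itself on the fly from a string, rather than loading "en", would need a different dictionary constructor and is not shown here:

```c
#include <stdio.h>
#include <stdbool.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Parse_Options opts = parse_options_create();
    Dictionary dict = dictionary_create_lang("en");
    if (dict == NULL) {
        fprintf(stderr, "Cannot open the English dictionary\n");
        return 1;
    }

    Sentence sent = sentence_create("John swam in the Red Sea.", dict);
    int num_linkages = sentence_parse(sent, opts);
    if (num_linkages > 0) {
        Linkage linkage = linkage_create(0, sent, opts);   /* use the first linkage */
        char *diagram = linkage_print_diagram(linkage, true, 80);
        printf("%s\n", diagram);
        linkage_free_diagram(diagram);
        linkage_delete(linkage);
    }

    sentence_delete(sent);
    dictionary_delete(dict);
    parse_options_delete(opts);
    return 0;
}
```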
I said:
By external dictionary, I meant something like a database derived from Wiktionary and other similar resources.
How would it know what any of "verb-like", "adverb-like", or "adverb-like denoting larger" etc. means? I don't even have any idea how it would know what a simple "verb" is or what tenses are, unless you tell it or it uses deep learning.
Being able to modify dictionaries on the fly is needed. Not a problem: the LG atomese backend already does this, with the "main" dictionaries being stored in the AtomSpace, and small subsets getting placed into the LG
Dunno. Maybe. Haven't gotten that far. Anyway, link-crossing is "not a problem": if you want A to cross B, then just write
I wish to avoid Wiktionary, for multiple reasons. I'm interested in systems that learn this info on their own.
Solutions range from cheap and easy to complicated and powerful. The dumb cheap solution is just to use the existing
Bingo. Except that there are many other kinds of learning besides deep learning. Deep learning is currently winning due to certain ease-of-scaling, ease-of-compute issues. I think some of the other learning algos could also scale, but perhaps DL has sucked all the air out of the room.
I've built link-grammar 5.12.4 from source on macOS 12. If I run
link-generator -l en
to specify English, since it defaults to Lithuanian (#1499), it hangs with 100% CPU use while consuming all memory. I cancel it after about a minute, when it has consumed over 15 GB of memory (my computer has 16 GB of real RAM) and started allocating ever more swap space. In case it is relevant, I am using
--disable-pcre2 --with-regexlib=c
since it cannot build when using pcre2 (#1495).