Avoid the use of ListLink with many outgoing nodes #164
Comments
? What does this mean? In what way is it "too big"? Does it not fit in RAM? Does it take too long to load? Is PLN too slow? Is the pattern miner too slow?
I did not design this system, and I don't understand the data flow. If I "check here", I can clearly see that the ListLink is being used to wrap together a collection of related results. Why should things be wrapped together? I don't know -- I have to assume some other system needs them wrapped up like this. Which system needs them wrapped up is not documented. What it does with this wrapped-up data is not documented. I mean, you could change
I will create a pull request right now to make it easier to get rid of the ListLink, without actually getting rid of it. You could even turn them on and off with a simple flag. It's important to write up at least a few paragraphs that define who the users of
As discussed in issue MOZI-AI#164. Untested -- I have not tried running this code; I think it's correct but I might have made a mistake, or missed a spot where a change was needed.
I should say they are interested in experimenting with the ~800 genes only for now.
Two main reasons that we used ListLink
But as of now, we only have a few GroundedSchemaNodes, so it will be easier to get rid of the ListLink, I think.
Mainly it's @mjsduncan, Nil, Manhin ... who are working to find interesting patterns in the annotation result for the ~800 genes, using PLN and the pattern miner. Recently Nil wrote this code, https://github.com/ngeiswei/reasoning-bio-as-xp, for experimenting. The data format they expect is Atomese, but there should not be a giant ListLink with many outgoing nodes, which requires us to get rid of the ListLinks we used to wrap many EvaluationLinks in one.
See #165 -- this moves the ListLink from There are two or three different/related solutions that are possible.
All three approaches seem to be equally complicated. My personal preference would be to use Values, but I won't complain (very much) if you don't use them.
Per discussion in issue MOZI-AI#164. I doubt this makes any impact on performance. It does simplify the code slightly.
As discussed in MOZI-AI#164, this helps avoid some of the pollution of the AtomSpace with spurious ListLinks.
@linas to be clear, the arity in question was on the order of tens of thousands. Replacing the atomese
@ngeiswei OK, so here is what I see: There are many places where a half-dozen searches are performed, to find gene interaction and protein expression and cellular location and what-not, and the result of these searches are a half-dozen "things" wrapped in a ListLink. I have no clue why those things are being wrapped in a ListLink; I assume some down-stream module needs them in these bite-size chunks. I was proposing that the structure of these bite-sized chunks be redesigned to be more meaningful. Most things that I see wrapped in a ListLink seem to contain only half-a-dozen things. Now, maybe one of those "things" is actually a list of thousands of other "things", but that is non-obvious from the code. It is also possible that something, somewhere else, is wrapping the entire collection of these bite-sized chunks in one giant ListLink. If so, it should be hunted down and "fixed". But again, without knowing why it was wrapped in the first place, I don't know how to "fix" it. |
OK, well, so I looked. @ngeiswei there is this spot right here: annotation-scheme/annotation/biogrid.scm Lines 60 to 61 in 26243ad
The two ListLinks there wrap up lists with tens of thousands of entries in them.
Also here: annotation-scheme/annotation/gene-go.scm Line 44 in 26243ad
and here: annotation-scheme/annotation/rna.scm Line 50 in 26243ad
and here: annotation-scheme/annotation/gene-pathway.scm Lines 68 to 69 in 26243ad
and here: annotation-scheme/annotation/main.scm Line 75 in 26243ad
and I think that's it. All of these are immediately followed by a call to I'm trying to think about this problem generically/abstractly, right now, like "what would SQL do" or "what would other graph DBs do" or "what would JavaScript do" or "what would R do" if it were in this situation. I'm not sure. I mean, the old rough-n-ready solution would be to just say:
so that each "thing" is tagged. Now you get 10K MemberLinks .... I don't know if that is "better" or not...
Yeah, that's also what I recommended.
The outer |
@Habush what is this JSON format you speak of? Is it documented somewhere? Atomese was designed to look "JSON-like", but it's possible the design is not really good enough, and at any rate, very few active users think of it as being JSON-like (with you being almost the sole exception). (Kind-of unrelated to this, but maybe we should have generic atomese->json and json->atomese export/import functions ... see issue #179 for a proposal.)
So, based on Nil's recommendation, the result of each annotation will have a MemberLink to the ConceptNode that names the annotation. For example, the following biogrid results will each have a MemberLink to (ConceptNode "biogrid_interaction_annotation")?
I was explaining the reason why the outer ListLinks ( But like @ngeiswei suggested and @tanksha demonstrated above, this could be replaced with
I agree. Atomese needs to be exported into popular data formats for interoperability with other systems. We can start with JSON and then add XML and YAML as we move forward.
BTW @linas this, in a way, relates to what you suggested a while ago: to decouple the annotation and the parser code, and then develop the annotation module into its own separate system.
My NLP databases have tens of millions of atoms in them. A few thousand is a drop in the bucket. Your databases ... I'm not sure, but I'm guessing that your default datasets have around a million atoms (I will measure this shortly), so a few thousand more doesn't affect anything. Large ListLinks do not hurt the AtomSpace, and it might be slightly more efficient to have one large ListLink. But it is clear that guile is allergic to large lists, and seems to choke on them. As it is, the current annotation code spends almost half its CPU time in guile GC, which is crazy-huge, and I don't know why it's so big. So the MemberLink proposal is good/better, certainly better for Nil.
@Habush I think the existing parser code can be modified to get group information from the MemberLink (as it was getting from the ListLink previously) |
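To make the MemberLink proposal and the parser point above concrete, here is a minimal sketch in plain Python. It is not the OpenCog API: atoms are modelled as nested tuples, and the helper names (`wrap_in_list`, `tag_with_members`, `group_by_annotation`), the gene names and the "interacts_with" predicate are all hypothetical, chosen only for illustration. Only the "biogrid_interaction_annotation" tag comes from the discussion. It shows that tagging each result with its own small MemberLink keeps every link at arity 2, and that the grouping is still recoverable by whoever consumes the output.

```python
from collections import defaultdict

# Atoms modelled as nested tuples: (type, *outgoing). Purely illustrative,
# not the AtomSpace API.
def concept(name):
    return ("ConceptNode", name)

def gene(name):
    return ("GeneNode", name)

def interaction(a, b):
    # Hypothetical predicate name, used only for this sketch.
    return ("EvaluationLink",
            ("PredicateNode", "interacts_with"),
            ("ListLink", gene(a), gene(b)))

def wrap_in_list(results):
    """Old shape: one link whose arity grows with the number of results."""
    return ("ListLink", *results)

def tag_with_members(results, annotation_name):
    """Proposed shape: one arity-2 MemberLink per result."""
    tag = concept(annotation_name)
    return [("MemberLink", r, tag) for r in results]

def group_by_annotation(atoms):
    """Recover the grouping from MemberLinks, as a parser might."""
    groups = defaultdict(list)
    for atom in atoms:
        if atom[0] == "MemberLink":
            _, element, (_, name) = atom
            groups[name].append(element)
    return dict(groups)

# Hypothetical gene names.
results = [interaction("GENE_A", "GENE_B"), interaction("GENE_A", "GENE_C")]
wrapped = wrap_in_list(results)
tagged = tag_with_members(results, "biogrid_interaction_annotation")
groups = group_by_annotation(tagged)
```

With tens of thousands of results, `wrapped` would be a single link with tens of thousands of outgoing atoms, while `tagged` stays a flat collection of small links that can be streamed or filtered one at a time.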
The MOZI database I've been loading, from late Dec 2019 has a total of 6.7 million atoms in it -- 6757334 to be precise. The breakdown is:
so -- half-a-million ConceptNodes, almost 2 million MemberLinks, almost 2 million EvaluationLinks, 400K proteins, 50K genes ... This is using up 3.8 GBytes RSS on my system, 4.3 GBytes virtual, so that works out to about 600 bytes/Atom for the MOZI dataset. This will fit in a mid/hi-range laptop from yesteryear, and even a 10-year-old desktop. Or even a modern cellphone or maybe a high-end Raspberry Pi, just barely. (You'd need 8GB to leave room for other stuff.) (In my NLP datasets, it's more like 1.5 KBytes/Atom, but those have a different kind of structure, including more statistics data on each atom.) So ... 20K more MemberLinks won't change much at all. That would be 1% of all MemberLinks, and 0.3% of all links.
@tanksha is this still an issue? Please see the linked issues.
pull req #165 that ameliorates this is still unmerged ...
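The back-of-the-envelope numbers above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: it assumes decimal gigabytes, rounds "almost 2 million MemberLinks" up to 2 million, and, because the per-type breakdown is not reproduced here, takes the last ratio against the total atom count.

```python
# Figures quoted in the comment above.
total_atoms = 6_757_334
rss_bytes = 3.8e9    # 3.8 GBytes resident (decimal GB assumed)
virt_bytes = 4.3e9   # 4.3 GBytes virtual (decimal GB assumed)

bytes_per_atom_rss = rss_bytes / total_atoms
bytes_per_atom_virt = virt_bytes / total_atoms

# "almost 2 million" MemberLinks, rounded to 2 million (assumption).
member_links = 2_000_000
extra_member_links = 20_000

share_of_member_links = extra_member_links / member_links
share_of_all_atoms = extra_member_links / total_atoms

# Rough extra memory, using the average RSS cost per atom as an estimate.
extra_megabytes = extra_member_links * bytes_per_atom_rss / 1e6

print(f"bytes/atom (RSS):     {bytes_per_atom_rss:.0f}")
print(f"bytes/atom (virtual): {bytes_per_atom_virt:.0f}")
print(f"extra MemberLinks:    {share_of_member_links:.1%} of MemberLinks, "
      f"{share_of_all_atoms:.1%} of all atoms")
print(f"extra memory:         about {extra_megabytes:.0f} MB")
```

The two per-atom figures bracket the "about 600 bytes/Atom" estimate, and the extra MemberLinks cost on the order of ten megabytes against a multi-gigabyte dataset.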
@linas @Habush
The Scheme result of the annotation-service is supposed to be used for pattern miner and PLN work. As the bio-atomspace is too big to work with, we wanted to have a smaller version by annotating the ~800 genes from the MOSES model. The problem with the annotation result is that it has additional ListLinks (which we used, for example, to return more than one hypergraph under certain conditions).
The ListLinks cause errors when the Scheme result is used as a dataset for the pattern miner code here: https://github.com/ngeiswei/reasoning-bio-as-xp
A ListLink is not expected to have many outgoing nodes. What should we do to get rid of these ListLinks?