Avoid the use of ListLink with many outgoing nodes #164

tanksha · 2020-03-17T16:19:01Z

The scheme result of the annotation-service is supposed to be used for Pattern miner and PLN works. As the bio-atomspace is too big to work with, we wanted to have a smaller version by annotating the ~800 genes from the mosses model. The problem with the annotation result is it has additional ListLink's ( which we used For example to return >1 hypergraphs on certain condition

check here

The ListLink causes error (when the scheme result is used as a dataset for the Pattern miner code here https://github.com/ngeiswei/reasoning-bio-as-xp)

a ListLink is considered not not have many outgoing nodes. What should we do to get rid of these ListLinks?

linas · 2020-03-17T18:45:57Z

bio-atomspace is too big to work with

? What does this mean? in what way is it "too big"? Does it not fit in RAM? Takes too long to load? PLN is too slow? Pattern miner is too slow?

The problem with the annotation result is it has additional ListLink's (check here)

I did not design this system, and I don't understand the data flow. If I "check here", I can clearly see that the ListLink is being used to wrap together a collection of related results. Why should things be wrapped together? I don't know -- I have to assume some other system needs them wrapped up like this. Which system needs them wrapped up is not documented. What it does with this wrapped-up data is not documented.

I mean, you could change generate-result to return an ordinary list, but I don't know what that might break or not break.

What should we do to get rid of these ListLinks?

I will create a pull request right now to make it easier to get rid of the ListLink, without actually getting rid of it. You could even turn them on and off with a simple flag.

It's important to write up at least a few paragraph define who the users of annotate-scheme are, and what data formats they expect, and why.

As discussed in issue MOZI-AI#164. Untested -- I have not tried running this code; I think it's correct but I might have made a mistake, or missed a spot where a change was needed.

tanksha · 2020-03-17T19:17:43Z

? What does this mean? in what way is it "too big"? Does it not fit in RAM? Takes too long to load? PLN is too slow? Pattern miner is too slow?

I should say they are interested in experimenting with the ~800 genes only for now.

I can clearly see that the ListLink is being used to wrap together a collection of related results. Why should things be wrapped together? I don't know -- I have to assume some other system needs them wrapped up like this.

Two main reasons that we used ListLink

A function in GroundedSchemaNode calls, should return an opencog object (which makes us use ListLink instead of scheme list)

But as of now, we only have few GroundedSchemaNode so, it will be easier to get rid of the ListLink I think.

The result of annotate-genes function is wrapped with ListLink because we were having issue loading back the result into atomspace (as it is was a scheme list we were using)

It's important to write up at least a few paragraph define who the users of annotate-scheme are, and what data formats they expect, and why.

Mainly its @mjsduncan , Nil, Manhin ... are working to find interesting patterns from the ~800 genes annotation result using PLN and pattern miner. Recently Nil wrote this code here https://github.com/ngeiswei/reasoning-bio-as-xp to do experimenting.

The data format they expect is atomese but there should not be a giant ListLink with many outgoing nodes, which require us to get rid of the ListLinks we used to wrap many EvaluationLinks in one.

linas · 2020-03-17T19:18:07Z

I will create a pull request right now

See #165 -- this moves the ListLink from generate-result to the users of generate-result, which should make it easier to remove, if desired. There are many many other uses of ListLink all through that code, and until there is some general understanding of why these are needed, or what they do, they cannot be removed.

There are two or three different/related solutions that are possible.

Instead of using nested (ListLink (ListLink (ListLink ...))) structures to return data, you could create JSON structures with labelled fields holding the annotated data.
Instead of using nested (ListLinks...), use scheme records. My understanding is that they look JSON-like, in being structured data records (kind-of-like C/C++ structs or classes, or like database records).
Store annotation results as Atomese Values on specific Atoms. This has several advantages over the above two. One advantage is that this allows memoization: either a given value is there, and you can immediately use it, or it is not there, and so it needs to be computed. Like JSON, values are named fields, so you can access fields by name. Since Values live with the Atoms, they can be saved/restored: you can have memoization over long periods of time.

All three approaches seem to be equally complicated. My personal preference would be to use Values, but I won't complain (very much) if you don't use them.

linas · 2020-03-17T19:22:17Z

GroundedSchemaNode

There are only five of these left. They could be eliminated. #166 and #167

Anyway, this is not the reason why ListLinks are all over the place.

Per discussion in issue MOZI-AI#164. I doubt this makes any impact on performance. It does simplify the code slightly.

As discussed in MOZI-AI#164, this helps avoid some of the pollution of the AtomSpace with spurious ListLinks.

ngeiswei · 2020-03-19T15:39:03Z

@linas to be clear the arity in question was in order of tenth of thousands. Replacing the atomese List by scheme list changed nothing, meaning it is IMO a limitation with guile rather the atomspace. Nevertheless these large list links can and should be avoided as it's been agreed.

linas · 2020-03-19T15:53:12Z

@ngeiswei OK, so here is what I see: There are many places where a half-dozen searches are performed, to find gene interaction and protein expression and cellular location and what-not, and the result of these searches are a half-dozen "things" wrapped in a ListLink. I have no clue why those things are being wrapped in a ListLink; I assume some down-stream module needs them in these bite-size chunks. I was proposing that the structure of these bite-sized chunks be redesigned to be more meaningful.

Most things that I see wrapped in a ListLink seem to contain only half-a-dozen things. Now, maybe one of those "things" is actually a list of thousands of other "things", but that is non-obvious from the code.

It is also possible that something, somewhere else, is wrapping the entire collection of these bite-sized chunks in one giant ListLink. If so, it should be hunted down and "fixed". But again, without knowing why it was wrapped in the first place, I don't know how to "fix" it.

linas · 2020-03-19T15:59:04Z

OK, well, so I looked. @ngeiswei there is this spot right here:

annotation-scheme/annotation/biogrid.scm

Lines 60 to 61 in 26243ad

    
           [res (ListLink (ConceptNode "biogrid-interaction-annotation") 
        
                          (ListLink result))])

The two ListLink's there wrap up lists with tens-of-thousands of entries in them

linas · 2020-03-19T16:13:48Z

Also here:

annotation-scheme/annotation/gene-go.scm

Line 44 in 26243ad

[res (ListLink (ConceptNode "gene-go-annotation") (ListLink result))]

and here:

annotation-scheme/annotation/rna.scm

Line 50 in 26243ad

(let ([res (ListLink (ConceptNode "rna-annotation") rna)])

and here:

annotation-scheme/annotation/gene-pathway.scm

Lines 68 to 69 in 26243ad

    
           [res (ListLink (ConceptNode "gene-pathway-annotation") 
        
                          (ListLink result))])

and here:

annotation-scheme/annotation/main.scm

Line 75 in 26243ad

(res (ListLink (ConceptNode "main") info))

and I think that's it. All of these are immediately followed by a call to write-to-file So this is something about how file-data is represented/marked up.

I'm trying to think about this problem generically/abstractly, right now, like "what would SQL do" or "what would other graph db's do" or "what would javascript do" or "what would R do". if it was in this situation. I'm not sure.

I mean, the old rough-n-ready solution would be to just say:

(for-each
    (lambda (thing)
         (write-to-file 
              (MemberLink thing (Concept "biogrid-interaction-annotation"))))
   list-of-things)

so that each "thing" is tagged. Now you get 10K MemberLinks .... I don't know if that is "better" or not...

ngeiswei · 2020-03-19T17:38:16Z

Now you get 10K MemberLinks

Yeah, that's also what I recommended.

Habush · 2020-03-19T18:06:29Z

I have no clue why those things are being wrapped in a ListLink;

The outer ListLinks (like the one you pointed above) are used for identifying from which annotation database (in terms of the code function, i.e whether gene-go, biogrid..etc) the results are coming from while parsing them to JSON.

linas · 2020-03-19T18:11:53Z

@Habush what is this json format you speak of? Is it documented somewhere? Atomese was designed to look "json-like" but it's possible the design is not really good enough, and at any rate, very few active users think of it as being json-like, (with you being almost the sole exception).

(Kind-of unrelated to this, but maybe we should have a generic atomese->json and json->atomese export/import functions.... see issue #179 for proposal)

tanksha · 2020-03-19T18:14:25Z

So, Based on Nil's recommendation, the result of each annotation will have a MemberLink to the conceptNode (which is the name of the annotation) for example the following biogrid results will each have a MemberLink to the conceptNode (ConceptNode "biogrid_interaction_annotation") ?

   (MemberLink
      (EvaluationLink
         (PredicateNode "has_pubmedID")
         (ListLink
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink
                  (GeneNode "MAP2K4")
                  (GeneNode "MAP3K7")
               )
            )
            (ConceptNode "pubmed:12429732")
         )
      )
   (ConceptNode "biogrid_interaction_annotation")   
   )

   (MemberLink
      (EvaluationLink
         (PredicateNode "has_pubmedID")
         (ListLink
            (EvaluationLink
               (PredicateNode "interacts_with")
               (ListLink
                  (GeneNode "MAP2K4")
                  (GeneNode "MAP3K7")
               )
            )
            (ConceptNode "pubmed:12556533")
         )
      )
   (ConceptNode "biogrid_interaction_annotation")
   )
.....

Habush · 2020-03-19T18:41:12Z

@Habush what is this json format you speak of? Is it documented somewhere?

I was explaining the reason why the outer ListLinks ((ListLink (Concept "biogrid-interaction) ....)) are being used right now. It is to help the parser group nodes and edges. And by group I mean simply add a group field to JSON objects.

But like @ngeiswei suggested and @tanksha demonstrated above, this could be replaced with MemberLinks. Although as you said, I am also unsure how having thousands of MemberLinks would fare better than huge ListLinks in terms of performance.

(Kind-of unrelated to this, but maybe we should have a generic atomese->json and json-atomese export/import functions....)

I agree. Atomese needs to be exported into popular data formats for interoperability with other systems. We can start with JSON and then add XML, YAML as we move forward.

Habush · 2020-03-19T18:52:18Z

BTW @linas this, in a way relates, to what you suggested a while ago: to decouple the annotation and the parser code and the develop the annotation module into its own separate system.

linas · 2020-03-19T19:12:55Z

having thousands of MemberLinks would fare better than huge ListLinks in terms of performance.

My NLP databases have dozens of millions of atoms in them. A few thousand is a drop in the bucket. Your databases ... I'm not sure, but I'm guessing that your default datasets have around a million atoms (I will measure this shortly) so a few thousand more doesn't affect anything.

Large ListLinks do not hurt the atomspace, and it might be slightly more efficient to have one large ListLink. Buit it is clear that guile is allergic to large lists, and seems to choke on them. As-is the current annotation code spends almost half it's CPU time in guile GC which is crazy-huge, and I don't know why its so big. So the MemberLink proposal is good/better, certainly better for Nil.

tanksha · 2020-03-19T19:21:35Z

@Habush I think the existing parser code can be modified to get group information from the MemberLink (as it was getting from the ListLink previously)

linas · 2020-03-19T20:57:11Z

The MOZI database I've been loading, from late Dec 2019 has a total of 6.7 million atoms in it -- 6757334 to be precise. The breakdown is:

scheme@(guile-user)> (cog-report-counts)
$2 = ((ConceptNode . 454779) (PredicateNode . 12) 
(ListLink . 1925554) (MemberLink . 1850528) 
(AndLink . 98788) (EvaluationLink . 1887530) 
(InheritanceLink . 122184) (GeneNode . 49050) 
(MoleculeNode . 368909))

so -- half-a-million ConceptNodes, almost 2 million MemberLinks, almost 2 million EvaluationLinks, 400K proteins, 50K genes ... this is using up 3.8GBytes RSS on my system, 4.3 GBytes virtual, so that works out to about 600 bytes/Atom for the MOZI dataset. This will fit in a mid/hi-range laptop from yesteryear, and even a 10-year-old desktop. Or even a modern cellphone or maybe a high-end raspberry-pi, just barely (You'd need 8GB for room for other stuff)

(In my NLP datasets, its more like 1.5KBytes/Atom, but those have a different kind of structure, including more statistics data on each atom.)

So ... 20K more MemberLinks won't change much at all. That would be 1% of all MemberLinks, and 0.3% of all links.

Habush · 2020-08-24T11:39:29Z

@tanksha is this a still issue? Please see the linked issues

linas · 2020-08-24T14:54:16Z

pull req #165 that ameliorates this is still unmerged ...

linas mentioned this issue Mar 17, 2020

Change location of where ListLink wrapping occurs. #165

Open

linas added a commit to linas/annotation-scheme that referenced this issue Mar 17, 2020

Get rid of a GroundedSchemaNode

44ac359

Per discussion in issue MOZI-AI#164. I doubt this makes any impact on performance. It does simplify the code slightly.

linas mentioned this issue Mar 17, 2020

Get rid of a GroundedSchemaNode #166

Merged

linas added a commit to linas/annotation-scheme that referenced this issue Mar 17, 2020

Get rid of a GroundedSchemeNode

8dde206

As discussed in MOZI-AI#164, this helps avoid some of the pollution of the AtomSpace with spurious ListLinks.

This was referenced Mar 17, 2020

Get rid of a GroundedSchemeNode #167

Merged

Get rid of a grounded schema node #168

Merged

get rid of a grounded schema #169

Merged

Remove GrouondedSchema in add-go-info #174

Merged

Simplify find-similar-gene #175

Merged

linas mentioned this issue Mar 19, 2020

atomese->json and json->atomese #179

Open

Habush mentioned this issue Apr 24, 2020

Use channels for communication between parser and annotation functions #197

Closed

Habush mentioned this issue May 19, 2020

Add STRING db Support #207

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid the use of ListLink with many outgoing nodes #164

Avoid the use of ListLink with many outgoing nodes #164

tanksha commented Mar 17, 2020

linas commented Mar 17, 2020

tanksha commented Mar 17, 2020

linas commented Mar 17, 2020

linas commented Mar 17, 2020 •

edited

Loading

ngeiswei commented Mar 19, 2020 •

edited

Loading

linas commented Mar 19, 2020

linas commented Mar 19, 2020

linas commented Mar 19, 2020 •

edited

Loading

ngeiswei commented Mar 19, 2020

Habush commented Mar 19, 2020

linas commented Mar 19, 2020 •

edited

Loading

tanksha commented Mar 19, 2020

Habush commented Mar 19, 2020

Habush commented Mar 19, 2020

linas commented Mar 19, 2020

tanksha commented Mar 19, 2020

linas commented Mar 19, 2020 •

edited

Loading

Habush commented Aug 24, 2020

linas commented Aug 24, 2020

Avoid the use of ListLink with many outgoing nodes #164

Avoid the use of ListLink with many outgoing nodes #164

Comments

tanksha commented Mar 17, 2020

linas commented Mar 17, 2020

tanksha commented Mar 17, 2020

linas commented Mar 17, 2020

linas commented Mar 17, 2020 • edited Loading

ngeiswei commented Mar 19, 2020 • edited Loading

linas commented Mar 19, 2020

linas commented Mar 19, 2020

linas commented Mar 19, 2020 • edited Loading

ngeiswei commented Mar 19, 2020

Habush commented Mar 19, 2020

linas commented Mar 19, 2020 • edited Loading

tanksha commented Mar 19, 2020

Habush commented Mar 19, 2020

Habush commented Mar 19, 2020

linas commented Mar 19, 2020

tanksha commented Mar 19, 2020

linas commented Mar 19, 2020 • edited Loading

Habush commented Aug 24, 2020

linas commented Aug 24, 2020

linas commented Mar 17, 2020 •

edited

Loading

ngeiswei commented Mar 19, 2020 •

edited

Loading

linas commented Mar 19, 2020 •

edited

Loading

linas commented Mar 19, 2020 •

edited

Loading

linas commented Mar 19, 2020 •

edited

Loading