Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filteringTerms: Add excluded flag #63

Open
mbaudis opened this issue Mar 27, 2023 · 21 comments
Open

filteringTerms: Add excluded flag #63

mbaudis opened this issue Mar 27, 2023 · 21 comments

Comments

@mbaudis
Copy link
Member

mbaudis commented Mar 27, 2023

There is currently no way to exclude specific ontologyTerms in filter queries although this would be desirable to expand query options (especially regarding the limitation of not having Booleans) and to stay in line with Phenopackets (which has an excluded flag e.g. in PhenotypicFeature.

(There is a current workaround to use some custom filter design, e.g. pre-pending an ontologyTerm in a request with ! - but this is non-standard & only would work for individual Beacons.)

Proposal:

  OntologyFilter:
    type: object
    description: Filter results to include records that contain a specific ontology
      term.
    required:
      - id
    properties:
      id:
        type: string
        description: >-
          Term ID to be queried, using CURIE syntax where possible.
        example: HP:0002664
      includeDescendantTerms:
        type: boolean
        default: true
        description: >-
          Define if the Beacon should implement the ontology hierarchy,
          thus query the descendant terms of `id`.
      similarity:
        type: string
        enum:
          - exact
          - high
          - medium
          - low
        default: exact
        description: >-
          Allow the Beacon to return results which do not match the filter
          exactly, but do match to a certain degree of similarity. The Beacon defines
          the semantic similarity model implemented and how to apply the thresholds
          of 'high', 'medium' and 'low' similarity.
      scope:
        type: string
        description: The entry type to which the filter applies
        example: biosamples
      excluded:
        description: >-
          Flag to indicate whether the filtering term was observed or not. The default is `false`,
          _i.e._ will be selected for **existence** of the term.
        type: boolean
        default: false
@mbaudis
Copy link
Member Author

mbaudis commented Jun 3, 2024

Use case examples

All breast adenocarcinoma samples by NCIt (NCIT:C5214), excluding triple-negative (NCIT:C71732)

This is a simple use case where a diagnosis is queried w/ the explicit exclusion of one specific subtype. In principle this could be achieved by an AND query for all wanted codes at the same hierarchy level as NCIT:C71732 but this requires understanding of the alternative codes and branch structures etc. Also more Beacon-like: "answer a simple query as good as possible".

filters:
  - id: "NCIT:C5214"
  - id: "NCIT:C71732"
    excluded: true

GET:

?filters=NCIT:C5214,!NCIT:C71732

COVID19 in a non-diabetic patient with cardiovascular risk factors

This is based on an example provided inn the Phenopackets documentation.

filters:
  - id: "NCIT:C2985"                # Diabetes Mellitus
    excluded: true
  - id: "NCIT:C34830"               # Cardiomyopathy"
  - id: "NCIT:C80389"               # Chronic Kidney Disease, Stage 3"
  - id: "NCIT:C3283"                # Obesity"
  - id: "MONDO:0100096"             # COVID-19"

GET:

?filters=!NCIT:C2985,NCIT:C34830,NCIT:C80389,NCIT:C3283,MONDO:0100096

@redmitry
Copy link
Collaborator

redmitry commented Jun 4, 2024

Hi all,

Just a naïve question. May the exclusion (negation) be included in filtering logic along with AND & OR?.
I mean do we need this special case for negation?

Cheers,

D.

@mbaudis
Copy link
Member Author

mbaudis commented Jun 4, 2024

@redmitry These are actually 2 questions.

May the exclusion (negation) be included in filtering logic along with AND & OR?

If Boolean queries are developed than negated filters would be included in that scenario; just treat them as another filter, with a flag.

do we need this special case for negation

In the Booleans? If allowing Boolean combinations a "normal" query (some filters connected by AND) are just a special case - so yes. It is not so much about dreaming up scenarios where this needs to happen (e.g. retrieving 2 groups of results in a single query); excluded is just a potential flag of any filter (such as similarity).

In any case adding the excluded flag as an option seems straightforward (and is IMO clearer as the procedural similarity flag...1).

Footnotes

  1. One actually could re-use similarity like "similarity": "NONE" or such but this would introduce a new special option & not be in line w/ e.g. Phenopackets - so please forget I wrote this.

@redmitry
Copy link
Collaborator

redmitry commented Jun 4, 2024

In any case adding the excluded flag as an option seems straightforward (and is IMO clearer as the procedural similarity flag...

I never understood the exact meaning of similarity for practical implementation.

From the usecase I understand that we want all but triple negative breast cancers. Without exclusion we would need repeat too many identifiers one by one.

?filters=NCIT:C5214,!NCIT:C71732 looks good, but for the POST I would use generic logical query...

"query": {
    "filters": [ 
         {"id": "NCIT:C5214"}, {"id": "NCIT:C71732"}
    ],
    "filters_logic": "NCIT:C5214 & !NCIT:C71732"
}

I mean that we may use "filters_logic" (similar to @anuradhawick "queryExpression": "(A|B)&(C&D|E&F)&!G").
I would prefer to repeat identifiers for robustness...

Cheers,

D.

@gsfk
Copy link
Collaborator

gsfk commented Jun 5, 2024

If I understand correctly, the proposal is to be able to make a query and get back results where a particular ontology is absent... is that correct? But this is quite a bit different from the phenopackets excluded field, where a term is present, but marked as "excluded": "true" (ie this condition was specifically ruled out). Obviously terms that are excluded in this sense will not be missing from the data.

Is "missing" or "absent" a less confusing name for this feature? Especially for those of us already using phenopackets with beacon.

@anuradhawick
Copy link

@redmitry happy with your implementation, but not sure if it would extend to support includeDescendentTerms, Alphanumeric filters and so on. Especially with LLMs in horizon similarity searches will take prominence searching "Melanoma or a related condition" in a vector space. In this case it would be nice to still have filters as a JSON object with other attributes such as descendent terms, similarity LOW/HIGH etc.

But keen to see where it lands :)

@gsfk one area this approach shines is when Cancer X is covered by datasets from HPO and SNOMED ontologies. Logical expressions makes querying this magically easy. Also when there are mutually exclusive conditions like "from Australia" or "from New Zealand" kind of queries are nearly impossible in one go without some logic.

If we can achieve logic, I think we cover all the bases for "missing", etc.

@dbujold
Copy link

dbujold commented Jun 6, 2024

I'll echo @gsfk point, it's not clear here whether we're talking about filtering on properties that are explicitly mentioned as excluded (as Phenopackets does), or if we're talking about adding a flag that checks whether a property is absent from a given record.

@mbaudis
Copy link
Member Author

mbaudis commented Jun 6, 2024

O.k., several convincing arguments although they fall into different camps: 1. the excluded flag would be ambiguous inits use since it doesn't define an explicit test so far (on could make this the default behavior, though - which in case of the diagnosis test would require to see if the excluded one is part of the existing filters etc.) but this requires to many conditions to agree on. The 2. argument is that a full conditional query model already solves such issues.

So excluded is too specific, but an expression of a negating condition (absent, missing, excluded) seems good - unless solving this through the query language (which IMO is still something "emergent").

negated: true? ... which also is a clearer expression of the filter value's intent, less tied to the implementation.

Btw. the comment by @gsfk Gordon actually points to a potential need for a specific excluded: true solution in the PXF sense...

@gsfk
Copy link
Collaborator

gsfk commented Jun 18, 2024

Surprisingly the name "negated" still seems ambiguous to me, in the same way as above (does it mean "negate this particular term in my query" or "this term is negated in the data")? "Negated" is actually the original name of the phenopackets excluded field.

@mbaudis
Copy link
Member Author

mbaudis commented Jun 26, 2024

@gsfk While I started w/ pointing to PXF it really is different when using the term negation here. So probably negated or absent is the best choice: One expresses that the term should not be found w/o the requirement that the subject has been explicitly tested.

All: Comments on absent?!

@redmitry
Copy link
Collaborator

redmitry commented Jun 26, 2024

I implemented a possible prototype and still do not understand the difference between "excluded" and "negation".

"query": {
  "filters": [
    { "id": "NCIT:C9335" }
  ],
  "filtersLogic": "NCIT:C9335 & !NCIT:C5213"

does this query do what we expect with "excluded"?

D.

@anuradhawick
Copy link

@redmitry i think your solution should imply exclusion. It looks really good and intuitive.

Can we use indices in logic expression? There's a slim chance same ontology term or filter appearing in different scopes. For example "NCIT:123456" filter could be present across multiple scopes. So if it does this ontology code in filter logic becomes ambiguous.

Also complex queries like

A & B | !A & C has to repeat same term multiple times.

What are your thoughts on that?

@redmitry
Copy link
Collaborator

redmitry commented Jun 26, 2024

The scopes are defined in the filters[] array. The "filtersLogic" only defines the logic. I use "ids" (or better should be "labels") in "filtersLogic" because we may reuse the same filter in a complex query.
The question is should "NCIT:C5213" (a subclass of "NCIT:C9335") be in "filters" also? By default descendant terms are enabled.

@mbaudis
Copy link
Member Author

mbaudis commented Jun 26, 2024

@redmitry Some notes:

  • filters[0].id should be NCIT:C9335, not double-prefixed
  • not sure how the filter logic would work

Regarding the flag names: As Gordon pointed out, excluded has a bit of a "tested for that explicitly" meaning, stronger than "absent". So e.g. in the case of testing against male sex the !male could be interpreted as
* absent: either female or not indicated / no value etc.
* excluded: only females match
So even if we could say that the flag applies to filter, not the data, it seems sensible to use "this term is absent" instead of "this term has been excluded".

Also, I really would prefer to separate the discussion of the filter flag from "how does this get implemented" ¯\_(ツ)_/¯

@redmitry
Copy link
Collaborator

redmitry commented Jun 26, 2024

@mbaudis

filters[0].id should be NCIT:C9335, not double-prefixed

fixed (typo).

This is a question of interpretation and what actually be generated for execution.

  1. For alphanumeric filters "! Weight > 25" is interpreted as "Weight <= 25"
  2. For ontological filters "!NCIT:C5213" "disease ne NCIT:C5213". We may include "and exists"

Note that in a query disease $in: ["NCIT:C9335", "NCIT:C4671", ...] and $ne NCIT:C5213 checks the disease field existence in the first "and" operand.

@mbaudis
Copy link
Member Author

mbaudis commented Jun 26, 2024

@redmitry My take here is:

  • We should not use this option for filters with numeric expressions since we already have other ways to do this here
  • We should stick to the "not seen" way of interpretation (absent), not to the "formally excluded". While there are cases for the latter it represents a specific subset of the former and goes (IMO) a bit too much in trying to be overly precise (and then losing on implementations). This seems to be in line with/ some addtl. emails (@jrambla & Tony).

@jrambla
Copy link
Contributor

jrambla commented Jul 11, 2024

My 2 cents:

  1. We already have the excluded flag in phenotypicFeature schema, Its meaning is stated in PXF and "inherited" by Beacon.
  2. We don't have a syntax for expressing that excluded in the query. @tb143 we could need one
  3. I understand that we are planning to add the NOT boolean operator together with OR logic. @tb143 please confirm
  4. If we adopt the NOT operator, we could not be able to avoid users adding them to any filter (ontology or alphanum), and we shouldn't. The user should be allowed to say ! Weight > 25" as synonym for Weight <= 25. Whatever is clear to the user should be fine to us.
  5. I guess that, in any case, some sanityzation would be required in the implementations to avoid queries like Age< 250 that will return everything or playing with boolean logic to trick the safety measures, But this should go into the implementation not in the spec.

@tb143
Copy link
Collaborator

tb143 commented Jul 23, 2024

Following a Filters Scout discussion on this issue, I will try to summarise what we are asking for votes on (👍,👎 or leave +1,-1).

Question: Do you agree that the ontology filter query options should include an excluded flag? Beacon will reuse the Phenopackets definition of excluded, essentially "tested for, but not found/observed".

Some considerations that have emerged from following the discussion:

  • You are voting on the principle of including an ontology filter excluded flag in the spec. The addition of other (related) flags - or how excluded could get implemented - can be taken up in separate issues/discussions.
  • This proposal is independent of other parallel proposals and discussions, for example AND/OR filter logic. If complementary proposals are accepted by the Scout, examples will be given of how they could work together.

@jrambla
Copy link
Contributor

jrambla commented Jul 23, 2024

+1

1 similar comment
@mbaudis
Copy link
Member Author

mbaudis commented Oct 29, 2024

+1

@jrambla
Copy link
Contributor

jrambla commented Oct 30, 2024

I also insist is that "excluded" is not an operator but a different property to take into account.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants