Use explain API for term frequency analysis #1718

Open
lukavdplas opened this issue Dec 2, 2024 · 3 comments
Labels
backend: changes to the django backend
code quality: code & performance improvements that do not affect user functionality
visualisation: changes to visualisation features

Comments

@lukavdplas
Contributor

lukavdplas commented Dec 2, 2024

To calculate results in the term frequency graph, the backend uses the search API to get matching documents for a query, and then runs a Python function to count the number of matches within each document. That function relies on the termvectors and analyze APIs from Elasticsearch.

As an alternative, we could use the explain API, which gives more detailed information about how the relevance score of a document was computed, including the absolute number of matches for each term. (Setting "explain": true in the search request returns the same data.)

I expect that this would be a lot more efficient (and thus faster) than our current method. However, while the _explanation output is readable, it's not trivial to write a function that extracts the absolute number of matches from it. (Definitely doable, though.)

Example

Here is the query I made:

{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "einddoel \"stip op de horizon\"",
          "fields": ["speech"]
        }
      },
      "filter": [
      ]
    }
  },
  "track_total_hits": true,
  "size": 5,
  "explain": true,
  "_source": false,
  "fields": ["speech", "speech.length"]
}
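
For reference, the same request body could be assembled in Python before being sent to the search API. This is only a sketch: `build_explain_search` is a hypothetical helper name, not a function from the existing backend, and it assumes the field has a `.length` subfield, as `speech` does in the example above.

```python
import json
from typing import Any, Dict


def build_explain_search(query_text: str, field: str, size: int = 5) -> Dict[str, Any]:
    """Build a search body that asks Elasticsearch to explain each hit.

    Hypothetical helper: it mirrors the JSON query above and is not part
    of the existing backend.
    """
    return {
        "query": {
            "bool": {
                "must": {
                    "simple_query_string": {
                        "query": query_text,
                        "fields": [field],
                    }
                },
                "filter": [],
            }
        },
        "track_total_hits": True,
        "size": size,
        # Adds an "_explanation" object to every hit in the response.
        "explain": True,
        # Skip the full document source; request only the fields we need.
        "_source": False,
        # Assumes a "<field>.length" subfield exists, as in the example.
        "fields": [field, f"{field}.length"],
    }


body = build_explain_search('einddoel "stip op de horizon"', "speech")
print(json.dumps(body, indent=2))
```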

And here is a document in the results:

{
    "_id": "nl.proc.ob.d.h-tk-20122013-65-9.1.6.12",
    "_score": 39.17556,
    "fields": {
        "speech.length": [
            75
        ],
        "speech": [
            "Ik heb het niet zozeer over benchmarking. Het gaat mij erom dat je docenten een einddoel geeft: wat moet een kind hebben geleerd aan het einde van groep 8? Dat hebben we nu niet gedefinieerd in het basisonderwijs in Nederland. Daardoor heeft ook niemand dat einddoel in zicht. Wij praten hier heel vaak over het zetten van een stip op de horizon. Die stip op de horizon moet je zetten, anders kom je er niet."
        ]
    },
    "_explanation": {
        "value": 39.17556,
        "description": "sum of:",
        "details": [
            {
            "value": 12.398435,
            "description": "weight(speech:einddoel in 27808) [PerFieldSimilarity], result of:",
            "details": [
                {
                "value": 12.398435,
                "description": "score(freq=2.0), computed as boost * idf * tf from:",
                "details": [
                    {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                    },
                    {
                    "value": 7.2490835,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                        {
                        "value": 473,
                        "description": "n, number of documents containing term",
                        "details": []
                        },
                        {
                        "value": 666126,
                        "description": "N, total number of documents with field",
                        "details": []
                        }
                    ]
                    },
                    {
                    "value": 0.7774296,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                        {
                        "value": 2,
                        "description": "freq, occurrences of term within document",
                        "details": []
                        },
                        {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                        },
                        {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                        },
                        {
                        "value": 72,
                        "description": "dl, length of field (approximate)",
                        "details": []
                        },
                        {
                        "value": 237.72823,
                        "description": "avgdl, average length of field",
                        "details": []
                        }
                    ]
                    }
                ]
                }
            ]
            },
            {
            "value": 26.777126,
            "description": "weight(speech:\"stip op de horizon\" in 27808) [PerFieldSimilarity], result of:",
            "details": [
                {
                "value": 26.777126,
                "description": "score(freq=2.0), computed as boost * idf * tf from:",
                "details": [
                    {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                    },
                    {
                    "value": 15.655978,
                    "description": "idf, sum of:",
                    "details": [
                        {
                        "value": 7.6807604,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 307,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        },
                        {
                        "value": 0.7116165,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 326968,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        },
                        {
                        "value": 0.18140924,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 555612,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        },
                        {
                        "value": 7.082192,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 559,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        }
                    ]
                    },
                    {
                    "value": 0.7774296,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                        {
                        "value": 2,
                        "description": "phraseFreq=2.0",
                        "details": []
                        },
                        {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                        },
                        {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                        },
                        {
                        "value": 72,
                        "description": "dl, length of field (approximate)",
                        "details": []
                        },
                        {
                        "value": 237.72823,
                        "description": "avgdl, average length of field",
                        "details": []
                        }
                    ]
                    }
                ]
                }
            ]
            }
        ]
    }
}
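
To give an idea of what the extraction could look like, here is a minimal sketch that walks the `_explanation` tree and collects the match count per term or phrase. It relies on the description strings seen in the output above ("weight(field:term in N)", "freq, occurrences of term within document", "phraseFreq=..."); those strings are an assumption about the format and may differ between Elasticsearch versions or query types. The names `iter_nodes` and `match_counts` are made up for this example, and the sample input is a trimmed copy of the explanation above.

```python
import re
from typing import Any, Dict, Iterator

# Assumed format: weight(speech:einddoel in 27808) [PerFieldSimilarity], result of:
WEIGHT_PATTERN = re.compile(r'^weight\((?P<field>[^:]+):(?P<term>.+) in \d+\)')
# Assumed descriptions of the node that holds the raw match count.
FREQ_PATTERN = re.compile(r'^(?:freq, occurrences of term within document|phraseFreq=)')


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("details", []):
        yield from iter_nodes(child)


def match_counts(explanation: Dict[str, Any]) -> Dict[str, int]:
    """Map each term or phrase in the explanation to its match count."""
    counts: Dict[str, int] = {}
    for node in iter_nodes(explanation):
        weight = WEIGHT_PATTERN.match(node.get("description", ""))
        if not weight:
            continue
        term = weight["term"].strip('"')
        for sub in iter_nodes(node):
            if FREQ_PATTERN.match(sub.get("description", "")):
                # Cast to int, assuming whole-number counts as in the example.
                counts[term] = counts.get(term, 0) + int(sub["value"])
                break
    return counts


# Trimmed copy of the explanation above: only the nodes the function reads.
example = {
    "value": 39.17556,
    "description": "sum of:",
    "details": [
        {"value": 12.398435,
         "description": "weight(speech:einddoel in 27808) [PerFieldSimilarity], result of:",
         "details": [
             {"value": 12.398435,
              "description": "score(freq=2.0), computed as boost * idf * tf from:",
              "details": [
                  {"value": 0.7774296,
                   "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                   "details": [
                       {"value": 2,
                        "description": "freq, occurrences of term within document",
                        "details": []}]}]}]},
        {"value": 26.777126,
         "description": "weight(speech:\"stip op de horizon\" in 27808) [PerFieldSimilarity], result of:",
         "details": [
             {"value": 26.777126,
              "description": "score(freq=2.0), computed as boost * idf * tf from:",
              "details": [
                  {"value": 0.7774296,
                   "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                   "details": [
                       {"value": 2,
                        "description": "phraseFreq=2.0",
                        "details": []}]}]}]},
    ],
}

print(match_counts(example))  # → {'einddoel': 2, 'stip op de horizon': 2}
```

Matching on description strings is fragile, so a real implementation would probably need tests against the Elasticsearch version in use, plus handling for query types (wildcards, fuzzy matches) that this sketch does not cover.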
lukavdplas added the code quality, backend, and visualisation labels Dec 2, 2024
@BeritJanssen
Contributor

Will the explain API not suffer from the problem that Elasticsearch calculates values based on shards of data, and not the whole dataset? I do think it would be great to piggy-back on existing Lucene/Elasticsearch analysis, but we'd need to make sure that we get accurate results for the number of matches of a term.

@lukavdplas
Contributor Author

I proposed using the explain API to get the absolute number of matches within a single document. I don't see how sharding would affect that value.

@BeritJanssen
Contributor

BeritJanssen commented Dec 12, 2024

Ah, I get it. I thought you were thinking of collecting term frequency over all documents that way.

2 participants