Use explain API for term frequency analysis #1718

lukavdplas · 2024-12-02T16:12:22Z

To calculate results in the term frequency graph, the backend uses the search API to get the documents matching a query, and then runs a Python function that counts the number of matches within each document (using the termvectors and analyze APIs from Elasticsearch).

As an alternative, we might use the explain API, which gives more detailed information on how the relevance score of a document was computed, including the absolute number of matches for each term. (You can also use "explain": true in the search request for the same data.)

I expect that this would be a lot more efficient (and thus faster) than our current method. However, I should note that the _explanation output is readable, but it's not trivial to write a function that will extract the absolute number of matches from it. (Definitely doable, though.)

Example

Here is the query I made:

{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "einddoel \"stip op de horizon\"",
          "fields": ["speech"]
        }
      },
      "filter": []
    }
  },
  "track_total_hits": true,
  "size": 5,
  "explain": true,
  "_source": false,
  "fields": ["speech", "speech.length"]
}

And here is a document in the results:

{
    "_id": "nl.proc.ob.d.h-tk-20122013-65-9.1.6.12",
    "_score": 39.17556,
    "fields": {
        "speech.length": [
            75
        ],
        "speech": [
            "Ik heb het niet zozeer over benchmarking. Het gaat mij erom dat je docenten een einddoel geeft: wat moet een kind hebben geleerd aan het einde van groep 8? Dat hebben we nu niet gedefinieerd in het basisonderwijs in Nederland. Daardoor heeft ook niemand dat einddoel in zicht. Wij praten hier heel vaak over het zetten van een stip op de horizon. Die stip op de horizon moet je zetten, anders kom je er niet."
        ]
    },
    "_explanation": {
        "value": 39.17556,
        "description": "sum of:",
        "details": [
            {
            "value": 12.398435,
            "description": "weight(speech:einddoel in 27808) [PerFieldSimilarity], result of:",
            "details": [
                {
                "value": 12.398435,
                "description": "score(freq=2.0), computed as boost * idf * tf from:",
                "details": [
                    {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                    },
                    {
                    "value": 7.2490835,
                    "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details": [
                        {
                        "value": 473,
                        "description": "n, number of documents containing term",
                        "details": []
                        },
                        {
                        "value": 666126,
                        "description": "N, total number of documents with field",
                        "details": []
                        }
                    ]
                    },
                    {
                    "value": 0.7774296,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                        {
                        "value": 2,
                        "description": "freq, occurrences of term within document",
                        "details": []
                        },
                        {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                        },
                        {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                        },
                        {
                        "value": 72,
                        "description": "dl, length of field (approximate)",
                        "details": []
                        },
                        {
                        "value": 237.72823,
                        "description": "avgdl, average length of field",
                        "details": []
                        }
                    ]
                    }
                ]
                }
            ]
            },
            {
            "value": 26.777126,
            "description": "weight(speech:\"stip op de horizon\" in 27808) [PerFieldSimilarity], result of:",
            "details": [
                {
                "value": 26.777126,
                "description": "score(freq=2.0), computed as boost * idf * tf from:",
                "details": [
                    {
                    "value": 2.2,
                    "description": "boost",
                    "details": []
                    },
                    {
                    "value": 15.655978,
                    "description": "idf, sum of:",
                    "details": [
                        {
                        "value": 7.6807604,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 307,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        },
                        {
                        "value": 0.7116165,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 326968,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        },
                        {
                        "value": 0.18140924,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 555612,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        },
                        {
                        "value": 7.082192,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                            "value": 559,
                            "description": "n, number of documents containing term",
                            "details": []
                            },
                            {
                            "value": 666126,
                            "description": "N, total number of documents with field",
                            "details": []
                            }
                        ]
                        }
                    ]
                    },
                    {
                    "value": 0.7774296,
                    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details": [
                        {
                        "value": 2,
                        "description": "phraseFreq=2.0",
                        "details": []
                        },
                        {
                        "value": 1.2,
                        "description": "k1, term saturation parameter",
                        "details": []
                        },
                        {
                        "value": 0.75,
                        "description": "b, length normalization parameter",
                        "details": []
                        },
                        {
                        "value": 72,
                        "description": "dl, length of field (approximate)",
                        "details": []
                        },
                        {
                        "value": 237.72823,
                        "description": "avgdl, average length of field",
                        "details": []
                        }
                    ]
                    }
                ]
                }
            ]
            }
        ]
    }
}
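
The extraction mentioned earlier could be sketched roughly as follows. This is a minimal sketch, not the backend's implementation: the names `term_frequencies`, `find_frequency` and `WEIGHT_PATTERN` are hypothetical, and it assumes that the description strings look like the ones in the output above ("weight(field:term in N)", "freq, occurrences of term within document", "phraseFreq=2.0"). Those strings are meant for human readers and may differ between Elasticsearch versions and query types, so the patterns would need checking against real output.

```python
import re
from typing import Any, Dict, Optional

# Assumption: term-level nodes are described as
# 'weight(<field>:<term or "phrase"> in <doc number>) [...]', as in the example.
WEIGHT_PATTERN = re.compile(r'^weight\((?P<field>[^:]+):(?P<term>.+) in \d+\)')


def find_frequency(node: Dict[str, Any]) -> Optional[float]:
    """Depth-first search for the first 'freq' or 'phraseFreq' node below `node`."""
    description = node.get('description', '')
    if description.startswith('freq,') or description.startswith('phraseFreq='):
        return node['value']
    for child in node.get('details', []):
        found = find_frequency(child)
        if found is not None:
            return found
    return None


def term_frequencies(explanation: Dict[str, Any]) -> Dict[str, int]:
    """Map each matched term or phrase to its number of occurrences in the document.

    Counts are summed if the same term appears under several weight nodes
    (for example when a query covers more than one field).
    """
    counts: Dict[str, int] = {}
    stack = [explanation]
    while stack:
        node = stack.pop()
        match = WEIGHT_PATTERN.match(node.get('description', ''))
        if match:
            freq = find_frequency(node)
            if freq is not None:
                term = match.group('term')
                counts[term] = counts.get(term, 0) + int(freq)
        else:
            # Not a term-level node (e.g. 'sum of:'), so keep descending.
            stack.extend(node.get('details', []))
    return counts


# A trimmed copy of the _explanation above, keeping only the nodes the sketch reads.
example = {
    'value': 39.17556, 'description': 'sum of:', 'details': [
        {'value': 12.398435,
         'description': 'weight(speech:einddoel in 27808) [PerFieldSimilarity], result of:',
         'details': [{'value': 0.7774296, 'description': 'tf, computed as ... from:', 'details': [
             {'value': 2, 'description': 'freq, occurrences of term within document', 'details': []}]}]},
        {'value': 26.777126,
         'description': 'weight(speech:"stip op de horizon" in 27808) [PerFieldSimilarity], result of:',
         'details': [{'value': 0.7774296, 'description': 'tf, computed as ... from:', 'details': [
             {'value': 2, 'description': 'phraseFreq=2.0', 'details': []}]}]},
    ],
}

print(term_frequencies(example))
```

Note that the keys are the terms as they appear in the explanation, so a phrase keeps its surrounding quotes, and terms are presumably in their analyzed form rather than as typed by the user.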

BeritJanssen · 2024-12-11T09:55:47Z

Will the explain API not suffer from the problem that Elasticsearch calculates values based on shards of data, and not the whole dataset? I do think it would be great to piggy-back on existing Lucene/Elasticsearch analysis, but we'd need to make sure that we get accurate results for the number of matches of a term.

lukavdplas · 2024-12-11T14:11:09Z

I proposed using the explain API to get the absolute number of matches within a document. I don't see how using shards would affect that value?

BeritJanssen · 2024-12-12T08:43:11Z

Ah, I get it. I thought you were thinking of collecting term frequency over all documents that way.
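
To make the distinction in this exchange concrete, the score for "einddoel" in the example can be recomputed from the values in the explanation. The sketch below uses the two formulas quoted in the output; the function names `bm25_idf` and `bm25_tf` are made up for illustration. As far as I know, n, N and avgdl are index statistics that Elasticsearch computes per shard with the default search type, while freq is counted within the single document, so it feeds into the score without depending on those statistics.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as given in the explanation: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf as given in the explanation: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Values copied from the "einddoel" branch of the explanation.
boost = 2.2
idf = bm25_idf(n=473, N=666126)                               # index statistics
tf = bm25_tf(freq=2, k1=1.2, b=0.75, dl=72, avgdl=237.72823)  # freq is per document
score = boost * idf * tf

print(round(idf, 4), round(tf, 4), round(score, 3))
```

The results agree with the reported values (7.2490835, 0.7774296 and 12.398435) up to floating-point rounding.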

lukavdplas added the labels "code quality" (code & performance improvements that do not affect user functionality), "backend" (changes to the django backend), and "visualisation" (changes to visualisation features) on Dec 2, 2024
