This fork is dedicated to implementing custom extractors in WannaDB. When a user selects a custom span within a document for an attribute, WannaDB currently searches only for exact matches in all remaining documents. This fork provides the necessary code, multiple extractors, and the corresponding evaluation code to change this behavior by searching for semantically and syntactically similar nuggets in all remaining documents.
The main work is in `matching/custom_match_extraction.py`, which implements the abstract base class `BaseCustomMatchExtractor`. It defines the structure that all extractors build on; see below for a full list of the implemented extractors. In `wannadb_api.py`, the extractor in use can be changed by setting the `find_additional_nuggets` attribute of the matching pipeline. For all extractors except the `FAISS` extractor, a `ParallelWrapper` is provided in `matching/custom_match_extraction.py`. This class can be wrapped around the extractor initialization, which makes the extractor invocation data-parallel by distributing the remaining documents over a team of threads. Note that `requirements.txt` has changed as well: the faiss-cpu library has been added as a requirement.
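
To illustrate how an abstract extractor and a data-parallel wrapper could fit together, here is a minimal sketch. It is not the code from `matching/custom_match_extraction.py`: the `__call__` signature, the `documents` dictionary, the `Match` tuple shape, and all class names ending in `Sketch` are assumptions made for this example.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# (document name, start index, end index) of a matched span; assumed shape.
Match = Tuple[str, int, int]


class BaseExtractorSketch(ABC):
    """Hypothetical stand-in for BaseCustomMatchExtractor."""

    @abstractmethod
    def __call__(self, span: str, documents: Dict[str, str]) -> List[Match]:
        ...


class ExactExtractorSketch(BaseExtractorSketch):
    """Finds every exact occurrence of the span in every document."""

    def __call__(self, span: str, documents: Dict[str, str]) -> List[Match]:
        matches: List[Match] = []
        if not span:
            return matches
        for name, text in documents.items():
            start = text.find(span)
            while start != -1:
                matches.append((name, start, start + len(span)))
                start = text.find(span, start + 1)
        return matches


class ParallelWrapperSketch(BaseExtractorSketch):
    """Splits the documents into chunks and runs the wrapped extractor per chunk."""

    def __init__(self, extractor: BaseExtractorSketch, num_threads: int = 4):
        self.extractor = extractor
        self.num_threads = max(1, num_threads)

    def __call__(self, span: str, documents: Dict[str, str]) -> List[Match]:
        items = list(documents.items())
        # Round-robin split so every thread gets a similar number of documents.
        chunks = [dict(items[i::self.num_threads]) for i in range(self.num_threads)]
        with ThreadPoolExecutor(max_workers=self.num_threads) as pool:
            results = pool.map(lambda chunk: self.extractor(span, chunk), chunks)
            # Sort so the output does not depend on how documents were chunked.
            return sorted(m for chunk_result in results for m in chunk_result)
```

In this sketch the wrapper exposes the same call interface as the extractor it wraps, so the two are interchangeable from the caller's point of view, for example `ParallelWrapperSketch(ExactExtractorSketch(), num_threads=2)`.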
The custom extractors are integrated into the WannaDB workflow as follows. Once the user selects a custom span as the match for an attribute, the `custom-match` message is handled in the matching phase (`matching.py`): the nugget pipeline is run on the span, and the resulting nugget is added as the match for that attribute. The `__call__` method of the configured extractor is then invoked with this nugget, which serves as a template for extracting similar spans from the remaining documents. The extractor returns a list of tuples, where each tuple identifies a document in which a match was found, along with the start and end index of the matching span. The nugget pipeline is then run for every such span, new nuggets are created, and the matching is updated.
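
The flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in `matching.py`: the `Nugget` and `Matching` classes, the `handle_custom_match` function, and the toy `first_exact_match` extractor are all hypothetical names, and the real nugget pipeline does considerably more than slicing out the span text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Match = Tuple[str, int, int]  # (document name, start, end); assumed shape


@dataclass
class Nugget:
    """Hypothetical, simplified nugget: a span inside one document."""
    document: str
    start: int
    end: int
    text: str


@dataclass
class Matching:
    """Hypothetical container mapping an attribute to its matched nuggets."""
    by_attribute: Dict[str, List[Nugget]] = field(default_factory=dict)

    def add(self, attribute: str, nugget: Nugget) -> None:
        self.by_attribute.setdefault(attribute, []).append(nugget)


def handle_custom_match(
    attribute: str,
    document: str,
    start: int,
    end: int,
    documents: Dict[str, str],
    extractor: Callable[[str, Dict[str, str]], List[Match]],
    matching: Matching,
) -> Matching:
    # 1. Turn the user's span into a nugget and record it as a match.
    template = Nugget(document, start, end, documents[document][start:end])
    matching.add(attribute, template)
    # 2. Run the extractor on the remaining documents only.
    remaining = {n: t for n, t in documents.items() if n != document}
    # 3. Create one nugget per returned (document, start, end) tuple.
    for name, s, e in extractor(template.text, remaining):
        matching.add(attribute, Nugget(name, s, e, remaining[name][s:e]))
    return matching


def first_exact_match(span: str, documents: Dict[str, str]) -> List[Match]:
    """Toy extractor: first exact occurrence of the span per document."""
    return [
        (name, text.find(span), text.find(span) + len(span))
        for name, text in documents.items()
        if span in text
    ]
```

Because the extractor is passed in as a plain callable, any extractor that returns `(document, start, end)` tuples could be substituted without touching the rest of this sketch.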
- `ExactCustomMatchExtractor`: Extracts exact matches of the annotated span from the other documents. Corresponds to the status quo of WannaDB.
- `QuestionAnsweringCustomMatchExtractor`: Prompts the pretrained question answering LLM `deepset/roberta-base-squad2` to extract a phrase similar to the selected span. This retrieves one candidate for each remaining document, which is classified as a match if its extraction score exceeds a threshold.
- `WordNetSimilarityCustomMatchExtractor`: Leverages WordNet, a semantic and lexical network that captures relationships between concepts, to extract words that are semantically similar to the selected span. To this end, the Wu-Palmer similarity between the match and each token of the remaining documents is computed. This measure is twice the depth of the first common ancestor of the two concepts (their least common subsumer), divided by the sum of the depths of the two concepts. If a high similarity is found, a span corresponding to the n-gram structure of the input span is extracted around the match.
- `FaissSemanticSimilarityExtractor`: Extracts spans that are semantically and syntactically similar to the match using the FAISS library, which keeps runtime low even with a large number of documents and tokens. To this end, the embedding of every token is computed once and indexed using `FAISS`. If the embedding of a single token is found to be similar to the whole query, it is examined further by matching it against the n-gram structure of the query. A threshold determines whether a candidate n-gram is similar enough to be classified as a match.
- `SpacySimilarityExtractor`: Similar to the FAISS extractor, this extractor computes the cosine similarity between the custom match and all tokens of the remaining documents, and extracts a similar span corresponding to the n-gram structure of the query. The main distinction is that a spaCy model is used to embed all tokens. Important: if this extractor is to be used, the spaCy model `en_core_web_md` needs to be loaded beforehand, since the kernel requires a restart.
- `NgramCustomMatchExtractor`: An older approach to custom extraction that works similarly to the `SpacySimilarityExtractor`. The main difference is that SBERT is used as the embedding model. However, the inference times are too high to be considered practical.
- `ForestCustomMatchExtractor`: This extractor is based on regex synthesis, where positive and negative examples of an attribute are used to produce a regex that extracts spans syntactically similar to the custom span. To this end, FOREST is used and integrated into WannaDB. However, because multiple examples are required for each attribute and many attributes lack close syntactic similarity, the synthesizer might not terminate. For this reason, the extractor has been removed from the main branch, but it is preserved on the `sb/forest-extractor` branch.
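
The two-step idea shared by the embedding-based extractors (filter single tokens against the query, then score n-gram windows around the surviving tokens) can be sketched as follows. This is only an illustration under strong assumptions: the character-trigram `embed` function is a toy stand-in for the FAISS, spaCy, or SBERT embeddings, and the function names and both threshold values are hypothetical rather than taken from the fork.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: counts of character trigrams (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_spans(
    query: str,
    text: str,
    token_threshold: float = 0.3,  # assumed value for the single-token filter
    span_threshold: float = 0.6,   # assumed value for accepting an n-gram
) -> List[Tuple[int, int]]:
    """Return (start, end) character spans in `text` that resemble `query`."""
    n = len(query.split())
    if n == 0:
        return []
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    query_vec = embed(query)
    accepted: Dict[Tuple[int, int], float] = {}
    for i, (s, e) in enumerate(tokens):
        # Step 1: cheap single-token filter against the whole query.
        if cosine(embed(text[s:e]), query_vec) < token_threshold:
            continue
        # Step 2: score every n-token window that contains the candidate token.
        for first in range(max(0, i - n + 1), min(i, len(tokens) - n) + 1):
            start, end = tokens[first][0], tokens[first + n - 1][1]
            score = cosine(embed(text[start:end]), query_vec)
            if score >= span_threshold:
                accepted[(start, end)] = score
    return sorted(accepted)
```

With a real embedding model, the single-token filter in step 1 is the part that an index such as FAISS would accelerate, since it has to be evaluated for every token of every remaining document.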