Home
This document describes how to integrate a new plugin into the FARO core project. A plugin works on two inputs, both of which have to be provided: the plain text and its language.
All plugins are expected to map to one of a set of configurable entities. Several default entities already exist, and they can be extended if necessary (for further details, please check the conf/commons.yml file).
Next, a plugin template can be found here. It is recommended to follow this template when integrating plugins, since it allows you to benefit from all available detectors and other utilities.
The plugin orchestrator discovers a plugin through a folder named after it. The folder may contain these files:
- context.txt: this file may contain a word list context for each language (these words help us confirm and strengthen the confidence in the detection of sensitive information).
- context-left.txt: a context applied only to those words placed on the left side of the detected entities.
- context-right.txt: a context applied only to those words placed on the right side of the detected entities.
- context.yaml: config file for context. It includes both left and right span in characters to search for context words.
- entrypoint.py: defines the plugin entrypoint interface. This interface must be implemented to allow for plugin discovery.
- pattern.py: defines regular expressions (both lax and strict) as well as any possible validation method.
- <lang_dir>: each additional language requires a language directory with its own pattern structure (mimicking this folder structure). This directory should be named according to ISO 639-1.
- test: this folder should contain all the unit tests required to validate the plugin.
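As a rough illustration of the layout above, the following sketch reports which of the listed files a plugin folder contains. The helper `describe_plugin_dir` and its return shape are hypothetical and not part of FARO; only the file names come from the list above.

```python
from pathlib import Path
import tempfile

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_FILES = (
    "entrypoint.py", "pattern.py", "context.txt",
    "context-left.txt", "context-right.txt", "context.yaml",
)


def describe_plugin_dir(plugin_dir):
    """Return the expected files that are present or missing, plus language dirs."""
    plugin_dir = Path(plugin_dir)
    present = [name for name in EXPECTED_FILES if (plugin_dir / name).is_file()]
    missing = [name for name in EXPECTED_FILES if name not in present]
    # Assumption: any two-letter lowercase subdirectory is an ISO 639-1 language dir.
    lang_dirs = sorted(
        p.name for p in plugin_dir.iterdir()
        if p.is_dir() and len(p.name) == 2 and p.name.isalpha() and p.name.islower()
    )
    return {"present": present, "missing": missing, "lang_dirs": lang_dirs}


# Build a throwaway plugin folder to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "address_demo"
    (demo / "es").mkdir(parents=True)
    (demo / "test").mkdir()
    for name in ("entrypoint.py", "pattern.py"):
        (demo / name).write_text("")
    layout = describe_plugin_dir(demo)

print(layout)
```

Only entrypoint.py and pattern.py are created here, so the context files show up as missing, which is acceptable for a plugin that relies on strict patterns and validation.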
Next, let's see how to create a plugin, using the bitcoin address detector plugin as an example (it can be found at plugins/address_bitcoin). This plugin follows the template described above.
Several requirements have to be met in order to integrate a plugin into FARO successfully. These requirements are highlighted in the following sections using the bitcoin example.
The entrypoint.py file is mandatory for successful integration because the FARO orchestrator instantiates the entrypoint class, calling its constructor, and then calls its run method.
The example below shows the entrypoint. The constructor sets both the language and the text that the plugin operates on. In addition, the run method must be implemented so that the orchestrator can run the plugin.
# _CWD and PluginPatternEntrypointBase come from the plugin template
# (their definition and import are not shown in this excerpt).
MANIFEST = {
    "key": "FINANCIAL_DATA",       # target entity
    "is_lang_dependent": False     # no per-language patterns
}


class PluginEntrypoint(PluginPatternEntrypointBase):
    def __init__(self, text, lang='uk'):
        super().__init__(_CWD, MANIFEST["is_lang_dependent"], MANIFEST["key"], lang, text)

    def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None,
               validate_dict=None):
        # Only the validated entities are returned.
        return super().output(validate_dict=validate_dict)

    def run(self):
        return super().run()
Additional elements are strongly recommended to satisfy integration requirements.
One of these elements is the MANIFEST. The MANIFEST is a good way to set two important required parameters: the plugin's target entity and its language dependency.
The target entity (key) is necessary to generate the output dictionary, because a plugin has to be classified under one of the entities present in conf/commons.yml. If a plugin's output does not fit any of these entities, a new entity has to be created. The output dictionary is generated by the default output method, which can be overridden.
In the Bitcoin plugin example, the entity "FINANCIAL_DATA" is used as the key.
Language dependency is also configured in the MANIFEST, through the is_lang_dependent parameter. If it is True, the plugin base class (PluginPatternEntrypointBase) will try to load language patterns according to the folder structure defined above. The name of each language directory must be its ISO 639-1 code. For example, es, pt, en, etc.
In the example, is_lang_dependent is False, since the plugin has no dependency on language.
The output method defines what the plugin returns. It receives four objects, which are described below.
- unconsolidated_lax_dict: entities detected by the lax regular expressions from pattern.lax_regexp() that were not confirmed by context.
- consolidated_lax_dict: entities detected by the lax regular expressions from pattern.lax_regexp() that were confirmed by context.
- strict_ent_dict: entities detected by the strict regular expressions from the pattern.strict_regexp() method.
- validate_dict: entities detected by either lax or strict regular expressions that passed the pattern.validate() method.
It is recommended to return the objects with the highest degree of reliability that the plugin's properties allow.
If the plugin uses a validation function, return only validate_dict. If the plugin does not use a validation function, return strict_ent_dict (if it has pattern.strict_regexp()) and/or consolidated_lax_dict (if it has pattern.lax_regexp()).
As a general rule, returning unconsolidated_lax_dict is not recommended, because it can contain many false positives. This object can be useful when you want to collect as much information as possible at the cost of reliability, or when it is not possible to use context.
In the example, we use the validate_dict object since the plugin uses a validation function.
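The selection rule above can be sketched as a standalone function. This is only an illustration: `choose_output` and its `has_validation` flag are hypothetical, and the sketch assumes each argument is a plain dict keyed by the detected text, which may not match FARO's internal structures. In a real plugin the choice is made by passing the selected objects to `super().output(...)`, as in the entrypoint example.

```python
def choose_output(unconsolidated_lax_dict=None, consolidated_lax_dict=None,
                  strict_ent_dict=None, validate_dict=None, has_validation=False):
    """Pick the most reliable results available (hypothetical helper)."""
    if has_validation:
        # A validation function exists: trust only validated entities.
        return dict(validate_dict or {})
    # No validation: combine strict matches and context-confirmed lax matches.
    merged = {}
    for source in (strict_ent_dict, consolidated_lax_dict):
        merged.update(source or {})
    # unconsolidated_lax_dict is deliberately ignored (too many false positives).
    return merged


with_validation = choose_output(
    strict_ent_dict={"A1": "strict"}, validate_dict={"A1": "validated"},
    has_validation=True,
)
without_validation = choose_output(
    unconsolidated_lax_dict={"X9": "lax"},
    consolidated_lax_dict={"B2": "lax+context"},
    strict_ent_dict={"A1": "strict"},
)
print(with_validation)     # {'A1': 'validated'}
print(without_validation)  # {'A1': 'strict', 'B2': 'lax+context'}
```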
The pattern.py file defines the regular expressions and the validation method. At least one regular expression, strict or lax, must be defined.
from stdnum.bitcoin import is_valid


class PluginPattern(PluginPatternBase):
    def strict_regexp(self):
        return {
            "STRICT_REG_BITCOIN_P2PKH_P2SH_V0": r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}",
            "STRICT_REG_BITCOIN_BECH32_V0": r"(bc1)[a-zA-HJ-NP-Z0-9]{25,39}"
        }

    def validate(self, ent):
        return is_valid(ent)

    def __init__(self, cwd=_CWD, lax_regexp=None, strict_regexp=None):
        lax_regexp = self.lax_regexp() if lax_regexp is None else lax_regexp
        strict_regexp = self.strict_regexp() if strict_regexp is None else strict_regexp
        super().__init__(cwd, lax_regexp, strict_regexp)
In the code shown, two strict regular expressions are defined.
Each entry is made up of two parts: the name and the regular expression itself. The name uses the following nomenclature:
<STRICT/LAX>_REG_<entity>_<regex_version>
Internally, the name makes it possible to know which regular expression detected each entity. The naming convention also allows regular expressions to be versioned, so that modifications or enhancements can be tracked.
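A name following this nomenclature can be split back into its parts with a small parser. The sketch below is not FARO code; `parse_regex_name` is a hypothetical helper, and it assumes the version is always the final `V<number>` token, as in the two names used by the bitcoin plugin.

```python
import re

# Assumption: names are upper-case and end with a version token such as V0.
NAME_RE = re.compile(
    r"^(?P<kind>STRICT|LAX)_REG_(?P<entity>[A-Z0-9_]+)_(?P<version>V\d+)$"
)


def parse_regex_name(name):
    """Split '<STRICT/LAX>_REG_<entity>_<regex_version>' into its three parts."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match.groupdict()


parsed = parse_regex_name("STRICT_REG_BITCOIN_P2PKH_P2SH_V0")
print(parsed)
# {'kind': 'STRICT', 'entity': 'BITCOIN_P2PKH_P2SH', 'version': 'V0'}
```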
The validation function receives a detected entity and should return a boolean (True or False).
For validation, the example uses the stdnum package (available in FARO), specifically the is_valid function of its bitcoin module.
The stdnum package lets you parse, validate and reformat standard numbers and codes. It covers a large collection of number formats and validation mechanisms, so its use is recommended.
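To show how a strict regular expression and a validation function work together, here is a self-contained sketch that does not depend on FARO or stdnum. It uses the P2PKH/P2SH expression from the example and a hand-written Base58Check checksum test as a stand-in for `is_valid`; the function names are hypothetical, and Bech32 (`bc1...`) addresses are not covered.

```python
import hashlib
import re

# Strict expression copied from the plugin example above.
STRICT_P2PKH_P2SH = re.compile(r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}")

_B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58check_is_valid(address):
    """Stand-in validator: decode Base58 and verify the 4-byte checksum."""
    number = 0
    for char in address:
        index = _B58_ALPHABET.find(char)
        if index < 0:
            return False  # character outside the Base58 alphabet
        number = number * 58 + index
    # Each leading '1' encodes one leading zero byte.
    zero_bytes = len(address) - len(address.lstrip("1"))
    raw = b"\x00" * zero_bytes + number.to_bytes((number.bit_length() + 7) // 8, "big")
    if len(raw) != 25:  # 1 version byte + 20 hash bytes + 4 checksum bytes
        return False
    payload, checksum = raw[:-4], raw[-4:]
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4] == checksum


def detect_addresses(text):
    """Keep only the regex candidates that pass validation."""
    return [c for c in STRICT_P2PKH_P2SH.findall(text) if base58check_is_valid(c)]


sample = ("valid: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2 "
          "tampered: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN3")
detected = detect_addresses(sample)
print(detected)
```

Both strings match the regular expression, but only the first has a consistent checksum, which is why a validation function makes it reasonable to return validate_dict alone.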
Context is an important complement to lax regular expressions, especially when there is no validation function.
Lax regular expressions are permissive by design and can generate many false positives. Context is a list of words that are searched for before or after the detected entity, and it helps reduce the number of false positives.
The phone plugin is a good example. It only uses lax regular expressions and has no validation function.
If only the regular expression were used, every number matching it would be reported. By requiring a context word before or after the match, most false positives are eliminated.
For example, in the text "phone 888-999-000" the context word would be "phone".
Context files:
- context-left.txt: list of words to search for before the detected entity.
- context-right.txt: list of words to search for after the detected entity.
- context.txt: list of words to search for both before and after the detected entity.
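The idea behind context can be sketched in a few lines. This is not FARO's implementation: the lax pattern, the `find_with_context` helper and the simple substring test are assumptions made for illustration, with `span` playing the role of the character window configured in context.yaml.

```python
import re

# Illustrative lax pattern, not the one used by the FARO phone plugin.
PHONE_LAX = re.compile(r"\d{3}-\d{3}-\d{3}")


def find_with_context(text, pattern, left_words=(), right_words=(), span=20):
    """Return (match, confirmed) pairs; confirmed means a context word is nearby."""
    results = []
    for match in pattern.finditer(text):
        # Character windows immediately before and after the match.
        left = text[max(0, match.start() - span):match.start()].lower()
        right = text[match.end():match.end() + span].lower()
        confirmed = (any(word in left for word in left_words)
                     or any(word in right for word in right_words))
        results.append((match.group(), confirmed))
    return results


text = "My phone 888-999-000 is new. Order 111-222-333 shipped."
hits = find_with_context(text, PHONE_LAX, left_words=["phone"])
print(hits)
# [('888-999-000', True), ('111-222-333', False)]
```

In FARO terms, the confirmed match corresponds to an entry in consolidated_lax_dict, while the unconfirmed one would only appear in unconsolidated_lax_dict.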
If a plugin has a language dependency, that is, its regular expressions are only valid for a certain language or country, you need to create a new directory for each language.
The name of this directory must be the ISO 639-1 language code. For example, es, pt, en, etc.
A good example is the phone plugin (plugins/phone), so it is recommended to review it.
Each language directory holds its own pattern.py with the corresponding lax regular expressions, together with its own context files.
Remember that for language dependencies to be used, the "is_lang_dependent" parameter in the MANIFEST of the entrypoint file must be True.
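The directory lookup described above might look roughly like the following sketch. The function `resolve_pattern_dir` is hypothetical, and falling back to the plugin root when no language directory exists is an assumption, not documented FARO behaviour.

```python
from pathlib import Path
import tempfile


def resolve_pattern_dir(plugin_dir, lang, is_lang_dependent):
    """Return the folder whose pattern.py and context files should be used."""
    plugin_dir = Path(plugin_dir)
    if is_lang_dependent:
        lang_dir = plugin_dir / lang  # e.g. plugins/phone/es
        if lang_dir.is_dir():
            return lang_dir
    # Assumption: fall back to the plugin root when there is no language folder.
    return plugin_dir


with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "phone_demo"
    (plugin / "es").mkdir(parents=True)
    spanish = resolve_pattern_dir(plugin, "es", is_lang_dependent=True).name
    fallback = resolve_pattern_dir(plugin, "pt", is_lang_dependent=True).name
    ignored = resolve_pattern_dir(plugin, "es", is_lang_dependent=False).name

print(spanish, fallback, ignored)  # es phone_demo phone_demo
```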
It is possible and advisable to use unit tests to verify that the plugin works correctly.
The structure of the test directory is as follows:
- test_<name_plugin>: Main test file.
- data: Directory of documents to test the plugin.
Here is an example of a unit test of the bitcoin address plugin.
import unittest
from pathlib import Path

from plugins.address_bitcoin.entrypoint import PluginEntrypoint, MANIFEST

CWD = Path(__file__).parent
INPUT_PATH = CWD / "data"
FILE_NAME = "document.txt"
GROUND_TRUTH_RESULT = ["1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2", "3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy",
                       "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq"]


def load_file(file_path):
    # Read the whole document as a single line of text.
    with open(INPUT_PATH / file_path, "r", encoding='utf8') as f_stream:
        return [f_stream.read().replace('\n', '')]


class AddressBitcoinTest(unittest.TestCase):
    def test_for_address_bitcoin(self):
        text = load_file(FILE_NAME)
        address_bitcoin_plugin = PluginEntrypoint(text=text)
        plugin_data = address_bitcoin_plugin.run()
        results = list(plugin_data[MANIFEST['key']])
        # Same number of detections as expected ...
        self.assertEqual(len(results), len(GROUND_TRUTH_RESULT))
        # ... and no element present in only one of the two sets.
        diff_list = (set(results) ^ set(GROUND_TRUTH_RESULT))
        self.assertEqual(len(diff_list), 0)


if __name__ == '__main__':
    unittest.main()
The constant GROUND_TRUTH_RESULT is used to record the expected results.
The constant FILE_NAME is assigned the name of the document to be analyzed.
Each test function is defined within the test class. It is recommended to review the existing unit tests in the test/ directory.