Home
This document describes how to integrate a new plugin into the FARO core project. To run, a plugin has to be given both the plain text and its language.
All plugins are expected to map to one of a set of configurable entities. Several default entities already exist, and they can be extended if necessary (for further details, please check the conf/commons.yml file).
Next, a plugin template can be found here. It is recommended to follow this template when integrating plugins, since it lets you benefit from all available detectors and other utilities.
The plugin orchestrator requires a folder named after the plugin in order to discover it. It may contain these files:
- entrypoint.py: defines the plugin entrypoint interface. This interface must be implemented to allow for plugin discovery.
- pattern.py: defines regular expressions (both lax and strict) as well as any possible validation method.
- context.txt: this file may contain a word list context for each language (these words help us confirm and strengthen the confidence in the detection of sensitive information).
- context-left.txt: a context applied only to those words placed on the left side of the detected entities.
- context-right.txt: a context applied only to those words placed on the right side of the detected entities.
- context.yaml: config file for context. It includes both left and right span in characters to search for context words.
- <lang_dir>: additional languages would require a lang directory which includes its own pattern structure (mimicking this folder structure). This directory should be named according to ISO 639-1
- test: this folder should contain all unit tests required to validate a plugin.
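The layout above can be checked with a small helper before a plugin is handed to the orchestrator. The following is only a sketch under assumptions: `inspect_plugin_dir` and `EXPECTED_FILES` are hypothetical names introduced here for illustration and are not part of FARO; only the file names come from the list above.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_FILES = (
    "entrypoint.py",
    "pattern.py",
    "context.txt",
    "context-left.txt",
    "context-right.txt",
    "context.yaml",
)


def inspect_plugin_dir(plugin_dir):
    """Return (present, missing) lists of expected file names in plugin_dir."""
    plugin_dir = Path(plugin_dir)
    present = [name for name in EXPECTED_FILES if (plugin_dir / name).is_file()]
    missing = [name for name in EXPECTED_FILES if name not in present]
    return present, missing
```

Since most of these files are optional, a missing entry is not necessarily an error; the helper only reports what is there.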
Next, to understand how to create a plugin, we will use the detection of a bitcoin wallet as an example and walk you through all the needed steps. You can see the finished example here. We have used the already mentioned template to jumpstart its construction.
Several requirements have to be met in order to successfully integrate a plugin in FARO. These requirements will be highlighted in the following sections using the bitcoin example.
The implementation of entrypoint.py is required for successful integration because the FARO orchestrator will instantiate it and call both its constructor and its run method.
The example below shows the constructor. The constructor sets both the language and the text that the plugin operates on. In addition, the run method must be implemented so that the orchestrator can run the plugin.
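# _CWD and PluginPatternEntrypointBase are defined/imported elsewhere in entrypoint.py (not shown here).
MANIFEST = {
    "key": "FINANCIAL_DATA",       # targeted entity
    "is_lang_dependent": False     # no language-specific patterns
}


class PluginEntrypoint(PluginPatternEntrypointBase):

    def __init__(self, text, lang='uk'):
        super().__init__(_CWD, MANIFEST["is_lang_dependent"], MANIFEST["key"], lang, text)

    def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None,
               validate_dict=None):
        # Only the validated entities are passed on to the default output method.
        return super().output(validate_dict=validate_dict)

    def run(self):
        return super().run()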
Additional elements are strongly recommended to satisfy integration requirements.
One of these elements is the MANIFEST. The MANIFEST is a good way to set two important required parameters: the plugin's targeted entity and its language dependency.
The targeted entity (key) is necessary to generate the output dictionary, because a plugin has to be classified under one of the entities present in conf/commons.yaml. If a plugin's output does not fit any of these entities, a new entity has to be created.
The output dictionary is generated by the default output method, which you can override to tune the output to your needs.
For this example we will use the entity "FINANCIAL_DATA" as the key.
Language dependency can also be configured in MANIFEST using the is_lang_dependent parameter. If True, the plugin base class (PluginPatternEntrypointBase) will try to load language patterns according to the folder structure defined above. The name of this directory must follow the ISO 639-1 language nomenclature, for example es, pt or en.
In our example, the parameter is_lang_dependent is False, since the plugin has no dependency on language.
In the output method of the class you define the output that the plugin returns. This method receives four objects, described below:
- unconsolidated_lax_dict: Entities detected by the lax regular expressions from pattern.lax_regexp() that were not validated by context.
- consolidated_lax_dict: Entities detected by the lax regular expressions from pattern.lax_regexp() that were validated by context.
- strict_ent_dict: Entities detected by the strict regular expressions from the pattern.strict_regexp() method.
- validate_dict: Entities detected by either the lax or the strict regular expressions that were validated by the pattern.validate() method.
It is recommended to return the objects with the highest degree of confidence that the plugin can provide, to avoid polluting the output with false positives.
If the plugin uses a validation function, return only validate_dict. If the plugin does not use a validation function, return strict_ent_dict (if it has pattern.strict_regexp()) and/or consolidated_lax_dict (if it has pattern.lax_regexp()).
As a general rule, it is not recommended to return the unconsolidated_lax_dict object, because it could return many false positives. This object can be useful when you want to collect as much information as possible while sacrificing data reliability or when it is not possible to use context.
In the example, we use the validate_dict object since the plugin uses a validation function.
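The selection rule described above can be sketched as a standalone function. This is only an illustration under assumptions: `select_output` and its `has_validation` flag are hypothetical and not part of FARO, and the four objects are treated as plain dictionaries whose internal structure is not specified here.

```python
def select_output(consolidated_lax_dict=None, strict_ent_dict=None,
                  validate_dict=None, has_validation=False):
    """Pick the most trustworthy detections, following the rule of thumb above."""
    if has_validation:
        # A validation function exists: only validated entities are returned.
        return dict(validate_dict or {})
    # No validation function: combine strict matches with context-confirmed lax matches.
    # Assumption: the dictionaries can simply be merged key by key.
    selected = {}
    selected.update(strict_ent_dict or {})
    selected.update(consolidated_lax_dict or {})
    return selected
```

Note that unconsolidated_lax_dict is deliberately not a parameter here, which mirrors the general advice not to return it.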
The pattern.py file is where regular expressions and validation are defined. At least one regular expression, strict or lax, must be defined.
from stdnum.bitcoin import is_valid

# _CWD and PluginPatternBase are defined/imported elsewhere in pattern.py (not shown here).


class PluginPattern(PluginPatternBase):

    def strict_regexp(self):
        # Keys follow the naming convention <STRICT/LAX>_REG_<entity>_<regex_version>
        return {
            "STRICT_REG_BITCOIN_P2PKH_P2SH_V0": r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}",
            "STRICT_REG_BITCOIN_BECH32_V0": r"(bc1)[a-zA-HJ-NP-Z0-9]{25,39}"
        }

    def validate(self, ent):
        # Returns True only for well-formed bitcoin addresses
        return is_valid(ent)

    def __init__(self, cwd=_CWD, lax_regexp=None, strict_regexp=None):
        lax_regexp = self.lax_regexp() if lax_regexp is None else lax_regexp
        strict_regexp = self.strict_regexp() if strict_regexp is None else strict_regexp
        super().__init__(cwd, lax_regexp, strict_regexp)
In the code shown, two strict regular expressions are defined.
Each entry is made up of two parts: the regular expression itself and its name, which uses the following nomenclature:
<STRICT/LAX>_REG_<entity>_<regex_version>
Internally, this allows us to know which regular expression has matched which entity. The naming convention also allows regular expressions to be versioned, so that improvements and fixes can be added.
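As an illustration of the naming convention, a name can be split back into its parts with a regular expression. This is a minimal sketch and not FARO code: `parse_regex_name` is a hypothetical helper, and the version format `V` followed by digits is an assumption based on the `V0` suffix used in the example.

```python
import re

# Assumption: versions look like V0, V1, ... as in the bitcoin example.
NAME_PATTERN = re.compile(
    r"^(?P<kind>STRICT|LAX)_REG_(?P<entity>[A-Z0-9_]+)_(?P<version>V\d+)$"
)


def parse_regex_name(name):
    """Split a regex name into its kind, entity and version parts."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match.groupdict()


parts = parse_regex_name("STRICT_REG_BITCOIN_P2PKH_P2SH_V0")
# parts == {"kind": "STRICT", "entity": "BITCOIN_P2PKH_P2SH", "version": "V0"}
```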
The validation function receives a discovered entity and should return a boolean (True or False).
For validation, the is_valid function from the bitcoin module of the stdnum package (available in FARO as a dependency) is used.
The stdnum module allows you to parse, validate and reformat standard numbers and codes in different formats. It contains a large collection of number formats and a very complete list of validation mechanisms, so its use is recommended when possible.
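To make the match-then-validate flow concrete, here is a self-contained sketch that uses only the standard library. It is not how FARO or stdnum implement validation: `base58check_is_valid` and `find_valid_addresses` are hypothetical helpers, and the checksum routine is a stand-in that only covers legacy Base58Check addresses (those starting with 1 or 3), not Bech32 ones.

```python
import hashlib
import re

# Strict pattern copied from the plugin example above.
STRICT_REG_BITCOIN_P2PKH_P2SH_V0 = r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}"

_BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58check_is_valid(address):
    """Stand-in validator: verify the Base58Check checksum of a legacy address."""
    number = 0
    for char in address:
        index = _BASE58.find(char)
        if index < 0:
            return False  # character outside the Base58 alphabet
        number = number * 58 + index
    try:
        raw = number.to_bytes(25, "big")  # 1 version + 20 hash + 4 checksum bytes
    except OverflowError:
        return False
    payload, checksum = raw[:-4], raw[-4:]
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def find_valid_addresses(text):
    """Collect strict-regex candidates and keep only those that validate."""
    candidates = re.findall(STRICT_REG_BITCOIN_P2PKH_P2SH_V0, text)
    return [candidate for candidate in candidates if base58check_is_valid(candidate)]
```

In the plugin itself the validator role is played by stdnum.bitcoin.is_valid; the point of the sketch is only that a string matching the strict pattern can still be rejected by validation.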
Context is an important part of lax regular expressions, especially when there is no validation function.
Lax regular expressions can lead to many false positives. Context is a dictionary of words that are searched for before, after or on both sides of the detected entity, and it helps reduce the number of false positives.
The phone plugin is a good example. It only uses lax regular expressions and has no validation function.
If only the regular expression were used, every number that matches it would be reported. By requiring a context word before or after the match, many of these false positives are eliminated.
For example, in the text "phone 888-999-000" the context word would be phone.
Context files:
- context-left.txt: List of words to search before the detected entity.
- context-right.txt: List of words to search after the detected entity.
- context.txt: List of words to search before and after an entity.
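The idea of a left context can be sketched in a few lines. This is an illustration only, not FARO's implementation: `confirm_with_left_context` is a hypothetical helper, the phone pattern is a simplified stand-in, and the case-insensitive substring comparison and the `left_span` default are assumptions (the real spans are configured in context.yaml).

```python
import re


def confirm_with_left_context(text, pattern, context_words, left_span=20):
    """Keep only matches preceded by a context word within left_span characters."""
    confirmed = []
    for match in re.finditer(pattern, text):
        # Window of characters immediately to the left of the match.
        window = text[max(0, match.start() - left_span):match.start()].lower()
        if any(word.lower() in window for word in context_words):
            confirmed.append(match.group())
    return confirmed


text = "phone 888-999-000 and invoice 123-456-789"
phones = confirm_with_left_context(text, r"\d{3}-\d{3}-\d{3}", ["phone"], left_span=10)
# Only the number preceded by "phone" is kept; the invoice number is discarded.
```

A right context would work the same way with a window taken after `match.end()`.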
If a plugin has a language dependency, that is, its regular expressions are only valid for a certain language or country, you need to create a new folder for that language.
The name of this folder must follow the language nomenclature (ISO 639-1). For example, es, pt, en, etc.
An example would be the phone plugin, so please check it out.
Each language folder has its own pattern.py with the corresponding lax regular expressions, as well as its own context files.
Remember that for language dependencies to be used, the variable is_lang_dependent in the entrypoint.py file must be True.
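The folder lookup implied by is_lang_dependent can be pictured as follows. This is a sketch under assumptions rather than the behaviour of PluginPatternEntrypointBase: `resolve_pattern_dir` is a hypothetical helper, and falling back to the plugin root when no language folder exists is an assumption made for illustration.

```python
from pathlib import Path


def resolve_pattern_dir(plugin_dir, lang, is_lang_dependent):
    """Pick the folder whose pattern.py and context files would be used."""
    plugin_dir = Path(plugin_dir)
    if is_lang_dependent:
        lang_dir = plugin_dir / lang  # e.g. <plugin>/es for Spanish (ISO 639-1)
        if lang_dir.is_dir():
            return lang_dir
    # Assumption: use the plugin root when no language folder applies.
    return plugin_dir
```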
It is possible and advisable to use unit tests to verify that the plugin works correctly.
The structure of the test directory is as follows:
- test_<name_plugin>: Main test file.
- data: Directory of documents to test the plugin.
Here is an example of a unit test of the bitcoin address plugin.
import unittest
from pathlib import Path

from plugins.address_bitcoin.entrypoint import PluginEntrypoint, MANIFEST

CWD = Path(__file__).parent
INPUT_PATH = CWD / "data"
FILE_NAME = "document.txt"
GROUND_TRUTH_RESULT = ["1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2", "3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy",
                       "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq"]


def load_file(file_path):
    with open(INPUT_PATH / file_path, "r", encoding='utf8') as f_stream:
        return [f_stream.read().replace('\n', '')]


class AddressBitcoinTest(unittest.TestCase):

    def test_for_address_bitcoin(self):
        text = load_file(FILE_NAME)
        address_bitcoin_plugin = PluginEntrypoint(text=text)
        plugin_data = address_bitcoin_plugin.run()
        results = list(plugin_data[MANIFEST['key']])
        self.assertEqual(len(results), len(GROUND_TRUTH_RESULT))
        # The symmetric difference is empty only if both sets hold the same addresses
        diff_list = (set(results) ^ set(GROUND_TRUTH_RESULT))
        self.assertEqual(len(diff_list), 0)


if __name__ == '__main__':
    unittest.main()
The constant GROUND_TRUTH_RESULT is used to record the expected results.
FILE_NAME contains the name of the document to be analyzed.
Within the test class, each of the test functions is defined. It is recommended to check FARO unit tests for inspiration if needed.