Jose Torres Velasco edited this page Feb 5, 2021 · 12 revisions

How to create a new FARO plugin

This document describes how to integrate a plug-in with the FARO core project. To run, a plug-in must be given both the plain text to analyze and its language.

All plug-ins are expected to output a set of configurable entities. Several default entities already exist, and they can be extended if necessary (for further details, check the conf/commons.yml file).

A plug-in template can be found here [ZIP PLUGIN]. Following this template is recommended, since it lets a plug-in benefit from all the available detectors and other utilities.

Plugin template folder structure

The plug-in orchestrator discovers a plug-in through a folder named after it. The folder may contain these files:

  • context.txt: this file may contain a word list context for each language (these words help confirm and strengthen the detected entities).
  • context-left.txt: a context applied only to those words placed on the left side of the detected entities.
  • context-right.txt: a context applied only to those words placed on the right side of the detected entities.
  • context.yaml: config file for contexts. It includes both left and right span in characters to search for context words.
  • entrypoint.py: defines the plug-in entrypoint interface. This interface must be implemented to allow for plug-in discovery.
  • pattern.py: defines the regular expressions (both lax and strict) as well as any validation method.
  • <lang_dir>: each additional language requires its own language directory, which contains its own pattern structure (mirroring this folder structure). The directory should be named with the language's ISO 639-1 code.
  • test: this folder should contain all the unit tests required to validate the plug-in.
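
The layout above can be checked with a few lines of Python. This is only a sketch: inspect_plugin_dir is a hypothetical helper, not part of FARO, and it assumes that entrypoint.py is the only file every plug-in must have and that any two-letter lowercase sub-directory is a language folder.

```python
from pathlib import Path

# File names come from the template description; the required/optional split is an assumption.
REQUIRED_FILES = ("entrypoint.py",)
OPTIONAL_FILES = ("pattern.py", "context.txt", "context-left.txt",
                  "context-right.txt", "context.yaml")


def inspect_plugin_dir(plugin_dir):
    """Hypothetical helper: report which template files a plug-in folder has."""
    plugin_dir = Path(plugin_dir)
    entries = list(plugin_dir.iterdir())
    files = {entry.name for entry in entries if entry.is_file()}
    return {
        "missing": [name for name in REQUIRED_FILES if name not in files],
        "optional_present": [name for name in OPTIONAL_FILES if name in files],
        # Two-letter lowercase folders are treated as ISO 639-1 language directories.
        "languages": sorted(
            entry.name for entry in entries
            if entry.is_dir() and len(entry.name) == 2
            and entry.name.isalpha() and entry.name.islower()
        ),
    }
```

Calling inspect_plugin_dir("plugins/address_bitcoin") from the repository root would list any missing required file and the language folders found.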

How to create a plug-in

Next, let's see how to create a plug-in, using the bitcoin address detector plug-in as an example (it can be found at plugins/address_bitcoin). This plug-in follows the utils/pattern_template layout described above.

Several requirements have to be met in order to successfully integrate a plug-in into FARO. These requirements are highlighted next using the bitcoin example.


File entrypoint.py

The entrypoint.py file is mandatory for successful integration, because the FARO orchestrator instantiates the class it defines and calls both its constructor and its run method.

The example below shows the constructor, which sets the language and the text the plug-in operates on. In addition, the run method must be implemented so that the orchestrator can run the plug-in.

MANIFEST = {
    "key": "FINANCIAL_DATA",
    "is_lang_dependent": False
}


class PluginEntrypoint(PluginPatternEntrypointBase):
    def __init__(self, text, lang='uk'):
        super().__init__(_CWD, MANIFEST["is_lang_dependent"], MANIFEST["key"], lang, text)

    def output(self, unconsolidated_lax_dict=None, consolidated_lax_dict=None, strict_ent_dict=None,
               validate_dict=None):
        return super().output(validate_dict=validate_dict)

    def run(self):
        
        return super().run()
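
To make the contract concrete, here is a minimal, self-contained sketch of how an orchestrator could drive such an entrypoint. Both FakePluginEntrypoint and run_plugins are hypothetical stand-ins written for this page, not FARO classes, and the shape of the returned dictionary (entity key mapped to a dictionary of detections) is an assumption.

```python
class FakePluginEntrypoint:
    """Stand-in for a plug-in's PluginEntrypoint (NOT the FARO base class)."""

    MANIFEST = {"key": "FINANCIAL_DATA", "is_lang_dependent": False}

    def __init__(self, text, lang="uk"):
        self.text = text
        self.lang = lang

    def run(self):
        # A real plug-in would apply its patterns here; this stub only
        # returns an empty result under the plug-in's entity key.
        return {self.MANIFEST["key"]: {}}


def run_plugins(entrypoint_classes, text, lang):
    """Hypothetical orchestrator loop: construct each plug-in, then call run()."""
    results = {}
    for entrypoint_cls in entrypoint_classes:
        plugin = entrypoint_cls(text=text, lang=lang)
        for key, entities in plugin.run().items():
            # Merge detections from plug-ins that target the same entity.
            results.setdefault(key, {}).update(entities)
    return results
```

The point of the sketch is only that the constructor receives the text and language, and that run takes no arguments and returns the plug-in output.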

Additional elements are strongly recommended to satisfy integration requirements.

One of these elements is the MANIFEST. It is a convenient place to set two required parameters: the plug-in's target entity and its language dependency.

The target entity (key) is needed to generate the output dictionary, because every plug-in has to be classified under one of the entities present in conf/commons.yml. If a plug-in's output does not fit any of these entities, a new entity has to be created. The output dictionary is generated by the default output method, which can be overridden.

For the example of the Bitcoin plug-in, we use the entity "FINANCIAL_DATA" as the key.

Language dependency is also configured in the MANIFEST, through the is_lang_dependent parameter. If it is true, the plug-in base class (PluginPatternEntrypointBase) will try to load language patterns according to the folder structure defined above. Each language directory must be named with its ISO 639-1 code, for example es, pt or en.

In the example, is_lang_dependent is False, since the plug-in has no dependency on language.

The output method defines what the plug-in returns. It receives four objects, described below.

  • unconsolidated_lax_dict: entities detected by the lax regular expressions from pattern.lax_regexp() that were not confirmed by context.
  • consolidated_lax_dict: entities detected by the lax regular expressions from pattern.lax_regexp() that were confirmed by context.
  • strict_ent_dict: entities detected by the strict regular expressions from pattern.strict_regexp().
  • validate_dict: entities detected by either the lax or the strict regular expressions that passed pattern.validate().

It is recommended to return the objects with the highest degree of reliability that the plug-in can produce.

If the plug-in uses a validation function, return only validate_dict. If it does not, return strict_ent_dict (if it defines pattern.strict_regexp()) and/or consolidated_lax_dict (if it defines pattern.lax_regexp()).

As a general rule, returning unconsolidated_lax_dict is not recommended, because it can contain many false positives. It can still be useful when you want to collect as much information as possible at the cost of reliability, or when context cannot be used.

In the example, we use the validate_dict object since the plug-in uses a validation function.
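
The selection rule described above can be sketched as a small function. select_output and its has_validation flag are hypothetical names used only for illustration, and the sketch assumes each object is a flat dictionary; the real structures handled by the FARO base class may differ.

```python
def select_output(unconsolidated_lax_dict=None, consolidated_lax_dict=None,
                  strict_ent_dict=None, validate_dict=None,
                  has_validation=False):
    """Hypothetical helper: pick the most reliable result set available."""
    if has_validation:
        # Validated entities are the most trustworthy, so return only those.
        return dict(validate_dict or {})
    # Without validation, combine context-confirmed lax matches and strict matches.
    selected = {}
    selected.update(consolidated_lax_dict or {})
    selected.update(strict_ent_dict or {})
    # unconsolidated_lax_dict is deliberately ignored: it tends to hold false positives.
    return selected
```

A plug-in that needs maximum recall could instead merge unconsolidated_lax_dict as well, accepting the extra false positives.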


File pattern.py

This file defines the regular expressions and the validation function. At least one regular expression, strict or lax, must be defined.

from stdnum.bitcoin import is_valid

class PluginPattern(PluginPatternBase):

    def strict_regexp(self):
        return {
            "STRICT_REG_BITCOIN_P2PKH_P2SH_V0": r"[13][a-km-zA-HJ-NP-Z0-9]{26,33}",
            "STRICT_REG_BITCOIN_BECH32_V0": r"(bc1)[a-zA-HJ-NP-Z0-9]{25,39}"
        }


    def validate(self, ent):
        return is_valid(ent)

    def __init__(self, cwd=_CWD, lax_regexp=None, strict_regexp=None):
        lax_regexp = self.lax_regexp() if lax_regexp is None else lax_regexp
        strict_regexp = self.strict_regexp() if strict_regexp is None else strict_regexp
        super().__init__(cwd, lax_regexp, strict_regexp)

In the code shown, two strict regular expressions are defined.

Each entry is made up of two parts: the regular expression itself and its name, which uses the following nomenclature:

<STRICT/LAX>_REG_<entity>_<regex_version>

Internally, the name makes it possible to know which regular expression detected each entity. The naming convention also allows regular expressions to be versioned, so that modifications or enhancements can be tracked.
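
As an illustration of the nomenclature, the following sketch splits a name into its three parts. parse_regex_name is a hypothetical helper, and the assumption that the version suffix always looks like V0, V1, and so on is taken from the bitcoin example rather than from a FARO specification.

```python
import re

# Assumes the version suffix is "V" followed by digits, as in the bitcoin example.
NAME_RE = re.compile(r"^(?P<kind>STRICT|LAX)_REG_(?P<entity>.+)_(?P<version>V\d+)$")


def parse_regex_name(name):
    """Hypothetical helper: split a pattern name into kind, entity and version."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the naming convention")
    return match.groupdict()
```

For the first pattern of the bitcoin plug-in, parse_regex_name("STRICT_REG_BITCOIN_P2PKH_P2SH_V0") gives kind STRICT, entity BITCOIN_P2PKH_P2SH and version V0.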

The validation function receives a discovered entity and should return a boolean (True or False).

For validation, the example uses the is_valid function from the bitcoin module of the stdnum package (available in FARO).

The stdnum package parses, validates and reformats standard numbers and codes. It covers a large collection of number formats and validation mechanisms, so its use is recommended.
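
To show what a validation function of this kind can look like, here is a standard-library sketch of the Base58Check checksum test that legacy bitcoin addresses (those starting with 1 or 3) must pass. It is not the stdnum implementation, the function name is hypothetical, and it is simplified: it does not handle bc1 (Bech32) addresses and does not verify the version byte.

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def is_valid_base58check(address):
    """Return True if `address` decodes to 25 bytes with a correct checksum."""
    number = 0
    for char in address:
        index = BASE58_ALPHABET.find(char)
        if index == -1:
            return False  # character outside the Base58 alphabet
        number = number * 58 + index
    try:
        # 1 version byte + 20 payload bytes + 4 checksum bytes
        raw = number.to_bytes(25, "big")
    except OverflowError:
        return False  # too large to be a legacy address
    payload, checksum = raw[:21], raw[21:]
    # The checksum is the first 4 bytes of a double SHA-256 of the payload.
    expected = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    return checksum == expected
```

Like validate in the plug-in, the function takes one detected entity and returns a boolean, so strings that merely match the regular expression but fail the checksum are discarded.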


The context

Context is an important part of lax regular expressions, especially when there is no validation function.

Lax regular expressions are permissive by design and can generate many false positives. Context is a list of words that are searched for before or after the detected entity, and it helps reduce the number of false positives.

The phone plug-in is a good example. It only uses lax regular expressions and has no validation function.

If only the regular expression were used, every number that matches it would be reported. Requiring a context word before or after the match filters out most false positives.

For example, in the text "phone 888-999-000" the context word is "phone".

Context files:

  • context-left.txt: list of words to search for before the detected entity.
  • context-right.txt: list of words to search for after the detected entity.
  • context.txt: list of words to search for both before and after the detected entity.
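
The idea can be sketched in a few lines. Everything here is illustrative: LAX_PHONE is a made-up pattern, find_with_context is a hypothetical helper, the span defaults are arbitrary, and plain substring matching on lowercase words is a simplification of whatever FARO does internally.

```python
import re

LAX_PHONE = re.compile(r"\d{3}-\d{3}-\d{3}")  # hypothetical lax pattern


def find_with_context(text, pattern, left_words=(), right_words=(),
                      left_span=20, right_span=20):
    """Keep only matches with a context word inside the given character spans."""
    confirmed = []
    for match in pattern.finditer(text):
        # Windows of left_span/right_span characters around the match.
        left = text[max(0, match.start() - left_span):match.start()].lower()
        right = text[match.end():match.end() + right_span].lower()
        if (any(word in left for word in left_words)
                or any(word in right for word in right_words)):
            confirmed.append(match.group())
    return confirmed
```

Words from context.txt would be passed as both left_words and right_words, while the two one-sided files feed only their own side.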

Language dependence (language pattern)

If a plug-in has a language dependency, that is, its regular expressions are only valid for a certain language or country, you need to create a new directory for each language.

Each directory must be named with the language's ISO 639-1 code, for example es, pt or en.

The phone plug-in (plugins/phone) is a good example, so reviewing it is recommended.

Each language directory holds its own pattern.py, with the corresponding lax regular expressions, and its own context files.

Remember that language directories are only used when "is_lang_dependent" in the MANIFEST of entrypoint.py is true.
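
A minimal sketch of the lookup this implies is shown below. resolve_pattern_dir is a hypothetical helper, and falling back to the plug-in root when no folder exists for the language is an assumption, not documented FARO behaviour.

```python
from pathlib import Path


def resolve_pattern_dir(plugin_dir, lang, is_lang_dependent):
    """Hypothetical helper: choose the folder whose pattern.py should be loaded."""
    plugin_dir = Path(plugin_dir)
    if not is_lang_dependent:
        # Language-independent plug-ins always use the patterns in their root folder.
        return plugin_dir
    lang_dir = plugin_dir / lang
    # Assumed fallback: use the root folder when the language has no directory.
    return lang_dir if lang_dir.is_dir() else plugin_dir
```

With is_lang_dependent set to True and an es folder present, a Spanish document would be matched against the patterns and contexts in that folder.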


Test

It is possible and advisable to use unit tests to verify that the plug-in works correctly.

The structure of the test directory is as follows:

  • test_<name_plug-in>: Main test file.
  • data: Directory of documents to test the plug-in.

Here is an example of a unit test of the bitcoin address plug-in.

import unittest
from pathlib import Path
from plugins.address_bitcoin.entrypoint import PluginEntrypoint, MANIFEST

CWD = Path(__file__).parent
INPUT_PATH = CWD / "data"
FILE_NAME = "document.txt"
GROUND_TRUTH_RESULT = ["1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2", "3J98t1WpEZ73CNmQviecrnyiWrnqRhWNLy",
                       "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq"]


def load_file(file_path):
    with open(INPUT_PATH / file_path, "r", encoding='utf8') as f_stream:
        return [f_stream.read().replace('\n', '')]


class AddressBitcoinTest(unittest.TestCase):

    def test_for_address_bitcoin(self):
        text = load_file(FILE_NAME)
        address_bitcoin_plugin = PluginEntrypoint(text=text)
        plugin_data = address_bitcoin_plugin.run()
        results = list(plugin_data[MANIFEST['key']])
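        # Same number of detections as expected...
        self.assertEqual(len(results), len(GROUND_TRUTH_RESULT))
        # ...and no address that appears in only one of the two lists.
        diff_list = set(results) ^ set(GROUND_TRUTH_RESULT)
        self.assertEqual(len(diff_list), 0)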


if __name__ == '__main__':
    unittest.main()

The constant GROUND_TRUTH_RESULT is used to record the expected results.

The constant FILE_NAME is assigned the name of the document to be analyzed.

Each test function is defined inside the test class. Reviewing the existing unit tests in the test/ directory is recommended.
