Connector Developer Guide

This guide describes how developers can write new connectors for Presidio. Connectors allow users to anonymize their data as it moves between different data sources. This guide reviews a few key Connector concepts and then describes how to create connectors and schedule the anonymization process between them.

Core concepts

To anonymize data moving from one system to another, define Connectors for the systems to pull data from and push data to. There are two types of Connectors:

  1. Source - data source from which data is read.
  2. Sink - anonymized data output target.

After having both sink and source connectors defined, the next step is to trigger the Scheduler. The Scheduler allows you to read your data periodically from the Source Connector, anonymize it and export it to Sink Connector.

How to Create New Data Source Connector

The project already supports sources such as object storage and streams. To reuse existing code for a similar source, or to create an entirely new one, follow the relevant steps below:

Object Storage Source Connector

Object storage is a computer data storage architecture that manages data as objects. The following object storage services are currently supported:

  • AWS S3
  • Azure Blob Storage
  • Azure Data Lake Storage Gen2

To implement a new object storage source connector while reusing existing code, follow these steps:

  1. Implement the scanner interface - The scanner interface represents a storage data source. It requires implementing three methods:

    1. Init - initializes the connection to the storage data source.
    2. ScanFunc - the function executed on each scanned item.
    3. Scan(fn ScanFunc) - walks over the items to scan and executes ScanFunc on each one.

    Implement the scanner interface using a struct under presidio-collector/cmd/presidio-collector/scanner. For example, see: storage scanner

  2. Add configuration - create templates for the source and object storage configurations and service definitions in the presidio-genproto repo.

  3. Add to Factory - Add your newly created scanner to the storage factory's CreateScanner method.
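The scanner contract can be sketched as a small Go program. The interface shape below is assumed from the description above (the authoritative definitions live in presidio-collector), and memScanner is a hypothetical in-memory scanner used only to illustrate the contract; a real connector would list objects from S3, Blob Storage, and so on.

```go
package main

import "fmt"

// ScanFunc is executed on every item found in the storage source
// (assumed signature; the real one lives in presidio-collector).
type ScanFunc func(item string) error

// Scanner mirrors the three methods described above.
type Scanner interface {
	Init()                  // open the connection to the storage source
	Scan(fn ScanFunc) error // walk the items and run fn on each one
}

// memScanner is a hypothetical in-memory scanner for illustration only.
type memScanner struct {
	items []string
}

func (s *memScanner) Init() {}

func (s *memScanner) Scan(fn ScanFunc) error {
	for _, item := range s.items {
		if err := fn(item); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &memScanner{items: []string{"doc-1.txt", "doc-2.txt"}}
	s.Init()
	_ = s.Scan(func(item string) error {
		fmt.Println("scanning", item)
		return nil
	})
}
```

A real implementation would open the storage connection in Init and page through object listings in Scan, but the calling pattern stays the same.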

Stream Source Connector

A stream is data that is continuously generated by different sources.

To implement a new stream source connector while reusing existing code, follow these steps:

  1. Implement the stream interface - The stream interface represents a stream data source. It requires implementing three methods:
    1. ReceiveFunc - defines how a received event should be processed.
    2. Receive - reads an event from the stream.
    3. Send - sends an event to the stream.

    Implement the stream interface using a struct under pkg/stream. For example, see: kafka
  2. Add configuration - create templates for the source and stream configurations and service definitions in the presidio-genproto repo.
  3. Add to Factory - Add your newly created stream to the streams factory's CreateStream method.
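The stream contract can be sketched the same way. The interface and the ReceiveFunc signature below are assumptions modeled on the description above, and chanStream is a hypothetical channel-backed stream used for illustration; a real connector would wrap Kafka, Event Hubs, or similar.

```go
package main

import "fmt"

// ReceiveFunc defines how a received event is processed
// (assumed signature, modeled on the description above).
type ReceiveFunc func(partition string, sequence string, data string) error

// Stream mirrors the methods the guide lists.
type Stream interface {
	Receive(fn ReceiveFunc) error // read events and hand each to fn
	Send(message string) error    // write an event to the stream
}

// chanStream is a hypothetical channel-backed stream for illustration.
type chanStream struct {
	events chan string
}

func (s *chanStream) Send(message string) error {
	s.events <- message
	return nil
}

func (s *chanStream) Receive(fn ReceiveFunc) error {
	for {
		select {
		case msg := <-s.events:
			if err := fn("0", "0", msg); err != nil {
				return err
			}
		default:
			return nil // drained for this sketch; a real stream blocks
		}
	}
}

func main() {
	s := &chanStream{events: make(chan string, 10)}
	_ = s.Send(`{"text": "call me at 555-1234"}`)
	_ = s.Receive(func(partition, sequence, data string) error {
		fmt.Println("received:", data)
		return nil
	})
}
```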

New Source Connector

Contributions of additional source connector types are more than welcome. To implement a new type of data Source Connector, follow these steps:

  1. Create implementation - Create a new directory under pkg and implement the new data source connector. You can use the existing connectors as examples.
  2. Create configuration - create templates for the source and any additional configurations and service definitions in the presidio-genproto repo.
  3. Create Factory - Create a new directory under presidio-collector/cmd/presidio-collector/ and add a factory like those of the other connector types.

How to Create New Data Sink

All data sinks need to implement the Datasink interface. The Datasink interface represents the different data output types. It requires implementing the following methods:

  • Init - initializes the datasink.
  • WriteAnalyzeResults - writes the analyzer results to the specified datasink.
  • WriteAnonymizeResults - writes the anonymized response to the specified datasink.
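A minimal sketch of the Datasink contract, with assumed simplified signatures (the real methods take presidio-genproto result types). consoleDatasink is a hypothetical sink that just prints, standing in for a real storage, stream, or database sink.

```go
package main

import "fmt"

// Datasink mirrors the three methods listed above
// (assumed simplified signatures).
type Datasink interface {
	Init()
	WriteAnalyzeResults(results []string, path string) error
	WriteAnonymizeResults(text string, path string) error
}

// consoleDatasink is a hypothetical sink used only for illustration.
type consoleDatasink struct{}

func (d *consoleDatasink) Init() {}

func (d *consoleDatasink) WriteAnalyzeResults(results []string, path string) error {
	for _, r := range results {
		fmt.Printf("analyze result for %s: %s\n", path, r)
	}
	return nil
}

func (d *consoleDatasink) WriteAnonymizeResults(text string, path string) error {
	fmt.Printf("anonymized %s: %s\n", path, text)
	return nil
}

func main() {
	var sink Datasink = &consoleDatasink{} // compile-time interface check
	sink.Init()
	_ = sink.WriteAnalyzeResults([]string{"PHONE_NUMBER at 11-19"}, "doc-1.txt")
	_ = sink.WriteAnonymizeResults("call me at <PHONE_NUMBER>", "doc-1.txt")
}
```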

The project already supports sinks such as object storage, databases, and streams. To reuse existing code for a similar sink, or to create a new one, follow the relevant steps below:

Object Storage Sink Connector

Creating an object storage sink connector is similar to creating an object storage source connector. In addition to the steps listed for the storage source connector, implement the following functions in your storage scanner struct:

  1. CreateContainer(name string) - creates a container/bucket reference.
  2. New(kind string, config stow.Config, concurrencyLimit int) - initializes a new storage instance.
  3. PutItem(name string, content string, container stow.Container) - writes a new item to the container.

For example, see: storage
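The three functions can be sketched with simplified, assumed types (the real ones take stow.Config and stow.Container from the stow library, and New also takes a concurrency limit). Everything here is an in-memory stand-in for illustration.

```go
package main

import "fmt"

// container is a simplified stand-in for stow.Container.
type container struct {
	name  string
	items map[string]string
}

// storage is a hypothetical, pared-down storage type.
type storage struct {
	containers map[string]*container
}

// New initializes a new storage instance
// (kind selection and concurrency are omitted in this sketch).
func New(kind string) *storage {
	return &storage{containers: map[string]*container{}}
}

// CreateContainer returns a reference to the named container/bucket,
// creating it if it does not exist yet.
func (s *storage) CreateContainer(name string) *container {
	if c, ok := s.containers[name]; ok {
		return c
	}
	c := &container{name: name, items: map[string]string{}}
	s.containers[name] = c
	return c
}

// PutItem writes a new item into the container.
func (s *storage) PutItem(name string, content string, c *container) error {
	c.items[name] = content
	return nil
}

func main() {
	st := New("memory")
	c := st.CreateContainer("anonymized-output")
	_ = st.PutItem("doc-1.txt", "call me at <PHONE_NUMBER>", c)
	fmt.Println("stored items:", len(c.items))
}
```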

Stream Sink Connector

Creating a stream sink connector is similar to creating a stream source connector. In addition to the steps listed for the stream source connector:

  1. Implement producer - To allow messages to be written to the stream, define a NewProducer method in your stream struct.
  2. Add to Factory - Add your newly created stream sink to the stream New method.
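A hedged sketch of the producer step: kafkaStream and its fields are hypothetical, and the producer here is a stub that just echoes. In a real connector, NewProducer would construct your Kafka (or other) client library's producer, and Send would publish through it.

```go
package main

import "fmt"

// kafkaStream is a hypothetical stream struct; a real one would hold
// broker addresses and a client from your streaming library of choice.
type kafkaStream struct {
	topic    string
	producer func(message string) error
}

// NewProducer wires up the producing side so messages can be written
// to the stream. Here it installs a stub; the real method would
// create the client library's producer.
func (s *kafkaStream) NewProducer() {
	s.producer = func(message string) error {
		fmt.Printf("produced to %s: %s\n", s.topic, message)
		return nil
	}
}

// Send writes an event to the stream via the producer.
// NewProducer must be called first.
func (s *kafkaStream) Send(message string) error {
	return s.producer(message)
}

func main() {
	s := &kafkaStream{topic: "anonymized-events"}
	s.NewProducer()
	_ = s.Send("call me at <PHONE_NUMBER>")
}
```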

DB Sink Connector

Presidio supports the following DBs:

  • MSSQL
  • MySQL
  • PostgreSQL

To support additional databases:

  1. Implement datasink interface - create a new struct under presidio-datasink/cmd/presidio-datasink/database. For example, see: database
  2. Add to Factory - Add your newly created DB sink to the datasink-factory.
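A database sink can be sketched as a struct that turns write calls into INSERT statements. The table name, column names, and the injected execFunc below are all assumptions for illustration; a real sink would call database/sql (or an ORM) with the driver for your database.

```go
package main

import "fmt"

// execFunc abstracts the database call so the sketch stays
// driver-agnostic; a real sink would use database/sql.
type execFunc func(query string, args ...interface{}) error

// dbDatasink is a hypothetical database sink writing anonymized
// output into a results table (names assumed).
type dbDatasink struct {
	tableName string
	exec      execFunc
}

func (d *dbDatasink) Init() {}

func (d *dbDatasink) WriteAnonymizeResults(text string, path string) error {
	query := fmt.Sprintf(
		"INSERT INTO %s (path, anonymized_text) VALUES (?, ?)", d.tableName)
	return d.exec(query, path, text)
}

func main() {
	sink := &dbDatasink{
		tableName: "anonymize_results",
		exec: func(query string, args ...interface{}) error {
			fmt.Println(query, args) // stand-in for db.Exec
			return nil
		},
	}
	sink.Init()
	_ = sink.WriteAnonymizeResults("call me at <PHONE_NUMBER>", "doc-1.txt")
}
```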

New Sink Connector

Contributions of additional sink connector types are more than welcome. To implement a new type of data Sink Connector, follow these steps:

  1. Implement required writing capabilities - create a new directory under pkg and add a struct implementing the required writing capabilities.
  2. Create configuration - create templates for the sink and any additional configurations and service definitions in the presidio-genproto repo.
  3. Implement datasink interface - create a new directory under presidio-datasink/cmd/presidio-datasink and a struct which implements the datasink interface methods.
  4. Add to Factory - Add your newly created sink to the datasink-factory.

Scheduler

The Scheduler allows you to read your data periodically from the Source Connector, anonymize it and export it to Sink Connector.

For more information about scheduler usage, see the scheduler readme.

Scheduling is currently supported for object storage and streams.

To support additional sources and sinks in the scheduler, follow these steps:

  1. Create configuration - create any additional configurations, templates and service definitions in the presidio-genproto repo, as follows:
    • CronJobApiRequest - represents the request to the API HTTP service.
    • CronJobRequest - represents the request to the scheduler service via gRPC.
    • CronJobResponse - represents the response from the scheduler service.
    • Support the new job in the init() method.
    • Add an Apply method to SchedulerServiceClient to trigger a new scanning cron job and return whether it was triggered successfully.
    • Create a Handler to handle the job request.
    • Add the newly added Apply method and the Handler to _SchedulerService_serviceDesc. Use the existing storage scanner/stream as an example.
  2. Create job - Create a new job directory under presidio-api/cmd/presidio-api/api and create the job implementation, including how it should be scheduled. For example, see: scanner-job
  3. Support new cron job creation in the existing API - Add the newly created job to the validateTemplate method.