This guide describes how developers can write new connectors for Presidio. Connectors allow users to anonymize data as it moves between different data sources. This guide reviews a few key Connector concepts and then describes how to create connectors and schedule the anonymization process between them.
To anonymize data between one system and another, define Connectors for the systems to pull the data from or push data to. There are two types of Connectors:
- Source - data source from which data is read.
- Sink - anonymized data output target.
Once both source and sink connectors are defined, the next step is to trigger the Scheduler. The Scheduler allows you to read your data periodically from the Source Connector, anonymize it and export it to the Sink Connector.
The project already supports sources such as object storage and streams. To reuse existing code for a similar source, or to create a new one, follow the steps below on how to create:
Object storage is a computer data storage architecture that manages data as objects. Currently the following object storage services are supported:
- AWS S3
- Azure Blob Storage
- Azure Data Lake Storage Gen2
To implement a new object storage source connector and reuse existing code, follow these steps:
- Implement the scanner interface - The scanner interface represents a storage data source. It requires the implementation of three methods:
    - `Init` - initializes the connection to the storage data source.
    - `ScanFunc` - the function executed on each scanned item.
    - `Scan(fn ScanFunc)` - walks over the items to scan and executes `ScanFunc` on each item.

  Implement the scanner interface using a struct under `presidio-collector/cmd/presidio-collector/scanner`. For example see: storage scanner (a minimal sketch follows these steps).
- Add configuration - create templates for source and object storage configurations and service definitions in the presidio-genproto repo.
- Add to Factory - Add your newly created scanner in the storage factory `CreateScanner` method.
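Below is a minimal, self-contained sketch of what such a scanner could look like. The `Scanner` and `ScanFunc` definitions here are simplified stand-ins for the actual interface under `presidio-collector/cmd/presidio-collector/scanner`, and the item listing is hard-coded purely for illustration.

```go
package scanner

// ScanFunc is the callback executed on each scanned item
// (simplified stand-in for the real definition).
type ScanFunc func(item interface{}) error

// Scanner is a simplified stand-in for the scanner interface described above.
type Scanner interface {
	Init()
	Scan(fn ScanFunc) error
}

// myStorageScanner is a hypothetical scanner over an object store container.
type myStorageScanner struct {
	containerName string
	items         []string // placeholder for items discovered in the container
}

// Init initializes the connection to the storage data source.
func (s *myStorageScanner) Init() {
	// In a real scanner: dial the storage provider and open the container.
	s.items = []string{"folder/a.txt", "folder/b.txt"}
}

// Scan walks over the items to scan and executes ScanFunc on each one.
func (s *myStorageScanner) Scan(fn ScanFunc) error {
	for _, item := range s.items {
		if err := fn(item); err != nil {
			return err
		}
	}
	return nil
}
```

The storage factory's `CreateScanner` method would then return this struct whenever the collector template points at the corresponding storage kind.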
A stream is data that is continuously generated by different sources.
To implement a new stream source connector and reuse existing code, follow these steps:
- Implement the stream interface - The stream interface represents a stream data source. It requires the implementation of three methods:
    - `ReceiveFunc` - defines how a received event should be processed.
    - `Receive` - reads an event from the stream.
    - `Send` - sends an event to the stream.

  Implement the stream interface using a struct under `pkg/stream`. For example see: kafka (a minimal sketch follows these steps).
- Add configuration - create templates for source and stream configurations and service definitions in the presidio-genproto repo.
- Add to Factory - Add your newly created stream under the streams factory `CreateStream` method.
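For illustration, here is a hedged, self-contained sketch of a struct satisfying an interface shaped like the one described above. The interface, the `ReceiveFunc` signature and the in-memory channel are simplified stand-ins; a real connector under `pkg/stream` would wrap a broker client such as Kafka or Event Hubs.

```go
package stream

// ReceiveFunc defines how a received event should be processed
// (simplified stand-in for the real definition).
type ReceiveFunc func(partition string, sequence string, data string) error

// Stream is a simplified stand-in for the stream interface.
type Stream interface {
	Receive(fn ReceiveFunc) error
	Send(message string) error
}

// channelStream is a toy stream backed by an in-memory channel,
// standing in for a real broker client.
type channelStream struct {
	events chan string
}

// NewChannelStream creates the toy stream with a buffered channel.
func NewChannelStream(buffer int) *channelStream {
	return &channelStream{events: make(chan string, buffer)}
}

// Send writes an event to the stream.
func (s *channelStream) Send(message string) error {
	s.events <- message
	return nil
}

// Receive reads events from the stream and invokes fn on each one.
func (s *channelStream) Receive(fn ReceiveFunc) error {
	for event := range s.events {
		if err := fn("0", "0", event); err != nil {
			return err
		}
	}
	return nil
}
```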
Contributions of additional types of source connectors are more than welcome. To implement a new type of data Source Connector, follow these steps:
- Create implementation - Create a new directory under `pkg` and implement the new data source connector. You can take the existing connectors as an example.
- Create configuration - create templates for source and any additional configurations and service definitions in the presidio-genproto repo.
- Create Factory - Create a new directory under `presidio-collector/cmd/presidio-collector/` and add a factory like the factories for the other connector types (see the sketch after these steps).
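To illustrate the factory step, here is a hypothetical sketch of a factory that returns a scanner implementation based on a kind string. All types and kind names are stand-ins; the real factories inspect the collector templates defined in presidio-genproto.

```go
package factory

import "fmt"

// Scanner is a simplified stand-in for the collector's scanner interface.
type Scanner interface {
	Init()
	Scan(fn func(item interface{}) error) error
}

// noopScanner is a placeholder implementation used only to keep the sketch compilable.
type noopScanner struct{ kind string }

func (s *noopScanner) Init()                                      {}
func (s *noopScanner) Scan(fn func(item interface{}) error) error { return nil }

// CreateScanner is a hypothetical factory returning the scanner that matches
// the configured source kind.
func CreateScanner(kind string) (Scanner, error) {
	switch kind {
	case "s3", "azureblob":
		return &noopScanner{kind: kind}, nil // the existing storage scanner in the real project
	case "mynewsource":
		return &noopScanner{kind: kind}, nil // your newly added connector
	default:
		return nil, fmt.Errorf("unsupported source kind: %s", kind)
	}
}
```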
All data sinks need to implement the Datasink interface. The Datasink interface represents the different data output types. It requires the implementation of the following methods:
- `Init` - initializes the datasink.
- `WriteAnalyzeResults` - writes the analyzer results to the specified datasink.
- `WriteAnonymizeResults` - writes the anonymized response to the specified datasink.
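The shape of the interface looks roughly like the sketch below. The parameter types are illustrative stand-ins for the generated presidio-genproto types, so check the presidio-datasink service for the exact signatures.

```go
package datasink

// AnalyzeResult and AnonymizeResponse are placeholders for the
// generated presidio-genproto types.
type AnalyzeResult struct {
	Field string
	Score float32
}

type AnonymizeResponse struct {
	Text string
}

// Datasink is a simplified stand-in for the Datasink interface described above.
type Datasink interface {
	// Init initializes the datasink (open connections, create tables, etc.).
	Init()
	// WriteAnalyzeResults writes the analyzer results for a given input path.
	WriteAnalyzeResults(results []*AnalyzeResult, path string) error
	// WriteAnonymizeResults writes the anonymized response for a given input path.
	WriteAnonymizeResults(result *AnonymizeResponse, path string) error
}
```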
The project already supports sinks such as object storage, databases and streams. To reuse existing code for a similar sink, or to create a new one, follow the steps below on how to create:
Creating an object storage sink connector is similar to creating an object storage source connector. In addition to the steps listed for the storage source connector, implement the following functions in your storage scanner class:
- `CreateContainer(name string)` - creates a container/bucket reference.
- `New(kind string, config stow.Config, concurrencyLimit int)` - initializes a new storage instance.
- `PutItem(name string, content string, container stow.Container)` - writes a new item to the container.

For example see: storage
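Here is a hedged sketch of how these functions could be implemented on top of the stow library that the signatures above reference. The wrapper struct and error handling are simplified; use the existing storage implementation as the reference for the real wiring.

```go
package storage

import (
	"strings"

	"github.com/graymeta/stow"
	// Register the providers you need, e.g.:
	// _ "github.com/graymeta/stow/s3"
	// _ "github.com/graymeta/stow/azure"
)

// API is a hypothetical wrapper around a stow Location.
type API struct {
	location         stow.Location
	concurrencyLimit int
}

// New initializes a new storage instance for the given provider kind.
func New(kind string, config stow.Config, concurrencyLimit int) (*API, error) {
	location, err := stow.Dial(kind, config)
	if err != nil {
		return nil, err
	}
	return &API{location: location, concurrencyLimit: concurrencyLimit}, nil
}

// CreateContainer creates a container/bucket reference.
func (a *API) CreateContainer(name string) (stow.Container, error) {
	return a.location.CreateContainer(name)
}

// PutItem writes a new item with the given content into the container.
func (a *API) PutItem(name string, content string, container stow.Container) error {
	_, err := container.Put(name, strings.NewReader(content), int64(len(content)), nil)
	return err
}
```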
Creating a stream sink connector is similar to creating a stream source connector. In addition to the steps listed for the stream source connector:
- Implement producer - To allow messages to be written to the stream, define a `NewProducer` method in your stream struct (see the sketch after these steps).
- Add to Factory - Add your newly created stream sink under the stream factory `New` method.
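As an illustration, here is a hedged sketch of a producer for a Kafka-backed stream. The sarama client is used purely for the example and may not be the Kafka library the project uses; the struct and field names are also illustrative.

```go
package stream

import "github.com/Shopify/sarama"

// kafkaStream is a hypothetical stream struct, e.g. created as
// &kafkaStream{topic: "anonymized-events"} before calling NewProducer.
type kafkaStream struct {
	topic    string
	producer sarama.SyncProducer
}

// NewProducer configures the stream struct for writing, so that Send can
// publish anonymized events to the topic.
func (k *kafkaStream) NewProducer(brokers []string) error {
	config := sarama.NewConfig()
	config.Producer.Return.Successes = true // required by SyncProducer
	producer, err := sarama.NewSyncProducer(brokers, config)
	if err != nil {
		return err
	}
	k.producer = producer
	return nil
}

// Send publishes a message to the configured topic.
func (k *kafkaStream) Send(message string) error {
	_, _, err := k.producer.SendMessage(&sarama.ProducerMessage{
		Topic: k.topic,
		Value: sarama.StringEncoder(message),
	})
	return err
}
```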
Presidio supports the following DBs:
- MSSQL
- MySQL
- PostgreSQL
To support additional databases:
- Implement datasink interface - by creating a new struct under `presidio-datasink/cmd/presidio-datasink/database`. For example see: database (a sketch follows these steps).
- Add to Factory - Add your newly created DB sink under datasink-factory
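As an illustration of the database step, below is a hedged sketch of a datasink struct for a new database, using the standard `database/sql` package with a hypothetical driver, table and columns. The existing database sink is the reference for the project's actual persistence layer, and `AnonymizeResponse` stands in for the generated presidio-genproto type.

```go
package database

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of a PostgreSQL driver
)

// AnonymizeResponse stands in for the generated presidio-genproto type.
type AnonymizeResponse struct {
	Text string
}

// dbDatasink is a hypothetical datasink writing anonymized results to a database.
type dbDatasink struct {
	connectionString string
	db               *sql.DB
}

// Init opens the database connection.
func (d *dbDatasink) Init() {
	db, err := sql.Open("postgres", d.connectionString)
	if err != nil {
		log.Fatal(err)
	}
	d.db = db
}

// WriteAnonymizeResults persists the anonymized response for a given input path.
// WriteAnalyzeResults would follow the same pattern for the analyzer results.
func (d *dbDatasink) WriteAnonymizeResults(result *AnonymizeResponse, path string) error {
	_, err := d.db.Exec(
		"INSERT INTO anonymize_results (path, anonymized_text) VALUES ($1, $2)", // hypothetical table
		path, result.Text,
	)
	return err
}
```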
Contributions of additional types of sink connectors are more than welcome. To implement a new type of data Sink Connector, follow these steps:
- Implement required writing capabilities - by creating a new directory under `pkg` and adding a struct implementing the required writing capabilities.
- Create configuration - create templates for the sink and any additional configurations and service definitions in the presidio-genproto repo.
- Implement datasink interface - by creating a new directory under `presidio-datasink/cmd/presidio-datasink` and a struct which implements the datasink interface methods (see the sketch after these steps).
- Add to Factory - Add your newly created sink under datasink-factory
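For a brand-new sink type, the end result is simply another struct satisfying the Datasink interface. Below is a hedged, self-contained sketch of a toy sink that writes results to local files; the result types again stand in for the generated presidio-genproto types and the file layout is arbitrary.

```go
package localfile

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Placeholder result types standing in for the presidio-genproto types.
type AnalyzeResult struct {
	Field string
	Score float32
}

type AnonymizeResponse struct {
	Text string
}

// fileDatasink is a toy sink that writes results into a base directory.
type fileDatasink struct {
	baseDir string
}

// Init makes sure the output directory exists.
func (f *fileDatasink) Init() {
	_ = os.MkdirAll(f.baseDir, 0o755)
}

// WriteAnalyzeResults writes the analyzer results as JSON next to the input name.
func (f *fileDatasink) WriteAnalyzeResults(results []*AnalyzeResult, path string) error {
	data, err := json.Marshal(results)
	if err != nil {
		return err
	}
	out := filepath.Join(f.baseDir, filepath.Base(path)+".analyzed.json")
	return os.WriteFile(out, data, 0o644)
}

// WriteAnonymizeResults writes the anonymized text next to the input name.
func (f *fileDatasink) WriteAnonymizeResults(result *AnonymizeResponse, path string) error {
	out := filepath.Join(f.baseDir, filepath.Base(path)+".anonymized.txt")
	return os.WriteFile(out, []byte(result.Text), 0o644)
}
```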
The Scheduler allows you to read your data periodically from the Source Connector, anonymize it and export it to the Sink Connector.
For more information about scheduler usage: scheduler readme.
Currently, scheduling is supported for object storage and streams.
To support additional sources and sinks in the scheduler, follow these steps:
- Create configuration - create any additional configurations, templates and service definitions in the presidio-genproto repo as follows:
    - `CronJobApiRequest` - represents the request to the API HTTP service.
    - `CronJobRequest` - represents the request to the scheduler service via gRPC.
    - `CronJobResponse` - represents the response from the scheduler service.
- Support the new job in the `init()` method.
- Add an `Apply` method to `SchedulerServiceClient` to trigger a new scanning cron job and return whether it was triggered successfully (see the sketch after these steps).
- Create a `Handler` to handle the job request.
- Add the newly added `Apply` method and the `Handler` to `_SchedulerService_serviceDesc`. Use the existing storage scanner/stream as an example.
- Create job - Create a new job directory under `presidio-api/cmd/presidio-api/api` and create the job implementation, including how it should be scheduled. For example see: scanner-job
- Support new cron job creation in the existing API - Add the newly created job in the `validateTemplate` method.
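To show how the pieces fit together, here is a hedged sketch of triggering a cron job through the scheduler's gRPC client. The request fields and the client interface are stand-ins: the real `CronJobRequest` message is defined in presidio-genproto and the generated `SchedulerServiceClient` is what actually exposes `Apply`.

```go
package scheduler

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// Stand-ins for the generated presidio-genproto messages.
type CronJobRequest struct {
	Name           string // hypothetical field
	CronExpression string // hypothetical field
}

type CronJobResponse struct{}

// SchedulerServiceClient is a stand-in for the generated gRPC client
// exposing the Apply method described above.
type SchedulerServiceClient interface {
	Apply(ctx context.Context, in *CronJobRequest, opts ...grpc.CallOption) (*CronJobResponse, error)
}

// applyCronJob shows the call pattern: build the request and trigger the job,
// returning an error if the job could not be scheduled.
func applyCronJob(client SchedulerServiceClient) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	request := &CronJobRequest{
		Name:           "scan-my-new-source", // hypothetical job name
		CronExpression: "*/5 * * * *",        // hypothetical schedule: every 5 minutes
	}

	_, err := client.Apply(ctx, request)
	return err
}
```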