As mentioned in the Basic Concepts Overview, a Stream is the atomic unit for reading data from a Source. A stream can read data from anywhere: a relational database, an API, or even a scraped web page (although that might be stretching the limits of what a connector should do).
To implement a stream, there are two minimum requirements:
1. Define the stream's schema
2. Implement the logic for reading records from the underlying data source
Your connector must describe the schema of each stream it can output using JSONSchema.
The simplest way to do this is to describe the schema of your streams using one .json file per stream. You can also dynamically generate the schema of your stream in code, or you can combine both approaches: start with a .json file and dynamically add properties to it.
The schema of a stream is the return value of Stream.get_json_schema.
By default, Stream.get_json_schema reads a .json file in the schemas/ directory whose name is equal to the value of the Stream.name property. In turn, Stream.name by default returns the name of the class in snake case. Therefore, if you define a stream as class EmployeeBenefits(HttpStream), the default behavior will look for a file called schemas/employee_benefits.json. You can override any of these behaviors as needed.
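To make the default lookup concrete, the sketch below mimics the naming convention in plain Python. The helpers camel_to_snake and default_schema_path are hypothetical names introduced only for illustration; the CDK's own implementation may differ in its details.

```python
import re
from pathlib import Path


def camel_to_snake(class_name: str) -> str:
    """Convert a CamelCase class name to snake_case (hypothetical helper)."""
    # Insert an underscore before every uppercase letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


def default_schema_path(class_name: str) -> Path:
    """Return the schema file the default behavior would look for (illustrative)."""
    return Path("schemas") / f"{camel_to_snake(class_name)}.json"


class EmployeeBenefits:  # in a real connector this would subclass HttpStream
    pass


# Show the file that corresponds to the class name.
print(default_schema_path(EmployeeBenefits.__name__).as_posix())
```

Renaming the class, or overriding the name property, changes which file is looked up, so the schema file name has to be kept in step with the stream name.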
Important note: any objects referenced via $ref should be placed in the shared/ directory in their own .json files.
If you'd rather define your schema in code, override Stream.get_json_schema in your stream class to return a dict describing the schema using JSONSchema.
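As a minimal sketch of a schema defined entirely in code, the class below builds the dict from a field mapping. The Employees class and its fields mapping are assumptions made for illustration, and the class is a plain stand-in so the snippet runs without the CDK installed; a real stream would subclass the CDK's Stream.

```python
from typing import Any, Dict


class Employees:  # stand-in; a real stream would subclass the CDK's Stream
    # Hypothetical field names and JSON types, used only for illustration.
    fields = {"id": "integer", "name": "string"}

    def get_json_schema(self) -> Dict[str, Any]:
        """Build the JSONSchema for this stream in code instead of reading a file."""
        return {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "object",
            "properties": {
                # Allow null alongside the declared type for each field.
                field: {"type": ["null", json_type]}
                for field, json_type in self.fields.items()
            },
        }
```

Generating the schema this way can be handy when the set of fields is only known at runtime, for example when it is discovered from the data source itself.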
Place a .json file in the schemas folder containing the basic schema, as described in the static schemas section. Then, override Stream.get_json_schema to run the default behavior, edit the returned value, and return the edited value:
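def get_json_schema(self):
    # Start from the static schema loaded from the schemas folder.
    schema = super().get_json_schema()
    # Add the dynamically determined property under "properties" (type shown is illustrative).
    schema["properties"]["dynamically_determined_property"] = {"type": "string"}
    return schema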
The only method required to implement a Stream is Stream.read_records. Given some information about how the stream should be read, this method should output an iterable object containing records from the data source. We recommend using generators, as they are very memory-efficient.
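The sketch below illustrates the generator approach. The Employees class and its in-memory _rows list are hypothetical stand-ins for a real data source, and the class does not subclass the CDK's Stream so that the snippet runs on its own. The parameter names follow the CDK's read_records signature as we understand it, with defaults added here for convenience; check the CDK reference for the authoritative version.

```python
from typing import Any, Iterable, List, Mapping, Optional


class Employees:  # stand-in; a real stream would subclass the CDK's Stream
    # Hypothetical in-memory "data source" standing in for a database or API.
    _rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

    def read_records(
        self,
        sync_mode: Any = None,
        cursor_field: Optional[List[str]] = None,
        stream_slice: Optional[Mapping[str, Any]] = None,
        stream_state: Optional[Mapping[str, Any]] = None,
    ) -> Iterable[Mapping[str, Any]]:
        # Yield one record at a time so the full result set never sits in memory.
        for row in self._rows:
            yield dict(row)


# Records are produced lazily, only as the caller iterates.
for record in Employees().read_records():
    print(record)
```

Because the method is a generator, nothing is read until the caller starts iterating, which keeps memory use flat even for large result sets.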
We highly recommend implementing incremental sync when feasible. See the incremental streams page for more information.