operationalize for ongoing labeling, training, matching streams #203
tomdavidson started this conversation in Ideas
This is more or less a generalized wish list based on how I want to use Zingg in an ongoing, streaming-based architecture that is quite a bit different from a batch job in Airflow (please provide feedback if anything seems out of whack).
(related #176 #133 )
"Cloud Native"
Packaged as a portable, production-quality OCI image, ready to be added to an SOA or MSA system and/or interacted with via API and CLI.
Built-in Dapr client. Dapr provides platform-agnostic interfaces to configuration, events, state, secrets, etc. - all the stuff we need for production cloud native services. When deployed to ACS or k8s, the Dapr sidecar is available in the platform. For a VM or local dev we can offer a configured Docker Compose file with Dapr, PostgreSQL, and Redis, or perhaps a Helm chart and k3d. (A minimal client bootstrap sketch follows this section.)
(related #44 )
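A minimal sketch of what the built-in Dapr client could look like at startup, assuming the Dapr Java SDK (`io.dapr:dapr-sdk`). The timeout value and the idea of blocking on the sidecar before starting work are illustrative, not settled choices:

```java
import io.dapr.client.DaprClient;
import io.dapr.client.DaprClientBuilder;

public final class ZinggDaprBootstrap {

    public static void main(String[] args) throws Exception {
        // The builder talks to whatever sidecar the platform provides
        // (k8s/ACS injected sidecar, or the docker-compose / k3d setup for local dev).
        try (DaprClient dapr = new DaprClientBuilder().build()) {
            // Wait until the sidecar is reachable so Zingg never races its own platform services.
            dapr.waitForSidecar(10_000).block();

            // From here on, pub/sub, state, config, and secrets are all reached through the
            // same client, regardless of which backing components are configured.
            System.out.println("Dapr sidecar is up; Zingg services can start.");
        }
    }
}
```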
Persistence
All state and configuration are external, with no requirement for filesystem persistence.
Persist the clustering info, a configured record ID field, and the matching fields in its own database appropriate for matching, but do not replicate all the data that passes through. Emit the affected records (not just the new ones) via Dapr pub/sub. (A sketch of this flow follows this section.)
Store labels and models in the Dapr state service (versioned KV), which could use the same database as the clustered records. (Related #200)
Matching can use any version of the model, selected via Dapr's config service. This facilitates A/B testing of models as well.
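A sketch of the persistence flow described above, using the Dapr Java SDK. The component names (`matchstore`, `zingg-pubsub`), the topic name, and the `ClusterAssignment` shape are hypothetical placeholders:

```java
import io.dapr.client.DaprClient;
import io.dapr.client.DaprClientBuilder;

import java.util.List;

public final class ClusterPersistence {

    /** Hypothetical projection: record id, cluster id, and only the configured match fields. */
    public record ClusterAssignment(String recordId, String clusterId, List<String> matchFieldValues) {}

    private static final String STATE_STORE = "matchstore";    // versioned KV component (assumption)
    private static final String PUBSUB = "zingg-pubsub";       // pub/sub component (assumption)
    private static final String AFFECTED_TOPIC = "records-affected";

    private final DaprClient dapr = new DaprClientBuilder().build();

    public void persistAndEmit(List<ClusterAssignment> affected) {
        for (ClusterAssignment a : affected) {
            // Persist only cluster info + record id + match fields, not the full pass-through record.
            dapr.saveState(STATE_STORE, a.recordId(), a).block();
        }
        // Emit every affected record (not just the new ones) so downstream consumers can react.
        dapr.publishEvent(PUBSUB, AFFECTED_TOPIC, affected).block();
    }
}
```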
APIs
Use Dapr pub/sub endpoints for control events such as "label updated" and "new model requested". Complete management of models, labels, training, etc., plus fetching data. This facilitates automated and guided positive feedback loops that produce new model versions, CLI clients, web apps that make labeling accessible to non-engineers, and so on. (A sketch of the control endpoints follows this section.)
Leverage Dapr pub/sub for data I/O / Zingg pipes.
An API proxy (could be a separate service or built in) providing an OAS3 RESTful interface, AsyncAPI, gRPC, GraphQL endpoints, etc. I am partial to GraphQL and AsyncAPI.
Expose a liveness HTTP endpoint to check whether Zingg is running and healthy, separate from a readiness HTTP endpoint to check whether Zingg is ready to process.
(related #34 )
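One way the pub/sub control endpoints and the liveness/readiness split could look, sketched with the Dapr Spring Boot integration (`io.dapr:dapr-sdk-springboot`). Topic names, the pubsub component name, paths, and the string payload type are assumptions:

```java
import io.dapr.Topic;
import io.dapr.client.domain.CloudEvent;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ControlPlaneController {

    // Control event: a label was added or changed; could trigger incremental retraining.
    @Topic(name = "label-updated", pubsubName = "zingg-pubsub")
    @PostMapping("/events/label-updated")
    public ResponseEntity<Void> onLabelUpdated(@RequestBody CloudEvent<String> event) {
        // event.getData() carries the label payload published by a CLI client or web app.
        return ResponseEntity.ok().build();
    }

    // Control event: request training of a new model version.
    @Topic(name = "new-model-requested", pubsubName = "zingg-pubsub")
    @PostMapping("/events/new-model-requested")
    public ResponseEntity<Void> onNewModelRequested(@RequestBody CloudEvent<String> event) {
        return ResponseEntity.ok().build();
    }

    // Liveness: the process is up and healthy.
    @GetMapping("/healthz")
    public ResponseEntity<Void> live() {
        return ResponseEntity.ok().build();
    }

    // Readiness: the process can actually accept work (model loaded, sidecar reachable, ...).
    @GetMapping("/ready")
    public ResponseEntity<Void> ready() {
        return ResponseEntity.ok().build();
    }
}
```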
Observability
Expose Prometheus-style metrics on records processed, labeled, matched, unsure/possible, cluster count, cluster size, JVM metrics, etc. These can be parsed from logs too, but a Prometheus endpoint is preferred. (A sketch follows this section.)
With Dapr we get relayed OpenTelemetry traces, but Zingg can add its own spans in areas of high interest.
Emit OpenLineage data to a lineage Dapr pub/sub topic.
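A sketch of the metrics side using Micrometer's Prometheus registry, which is one common JVM option and not necessarily the one Zingg would pick; the metric names are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.concurrent.atomic.AtomicLong;

public final class ZinggMetrics {

    private final PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    private final Counter processed = registry.counter("zingg_records_processed_total");
    private final Counter matched   = registry.counter("zingg_records_matched_total");
    private final Counter unsure    = registry.counter("zingg_records_unsure_total");
    private final AtomicLong clusters = registry.gauge("zingg_cluster_count", new AtomicLong(0));

    public void recordMatch()  { processed.increment(); matched.increment(); }
    public void recordUnsure() { processed.increment(); unsure.increment(); }
    public void setClusterCount(long n) { clusters.set(n); }

    /** Body for a /metrics endpoint in the Prometheus text exposition format. */
    public String scrape() {
        return registry.scrape();
    }
}
```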
Configuration Niceties
Beyond using the Dapr config service, some things can be simplified.
Ability to attach the clustering info and pass the complete record through without needing to configure all the pass-through fields. Zingg does not need to know what the entire record is, only the fields that should be used for ER.
JSON Schema of the configuration for validation and for editing (related #183). (A validation sketch is below.)
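For the JSON Schema point, a sketch of config validation using one possible library (networknt's json-schema-validator); the schema and config file names are placeholders:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.networknt.schema.JsonSchema;
import com.networknt.schema.JsonSchemaFactory;
import com.networknt.schema.SpecVersion;
import com.networknt.schema.ValidationMessage;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

public final class ConfigValidator {

    public static Set<ValidationMessage> validate(Path schemaFile, Path configFile) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonSchemaFactory factory = JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7);

        // A published JSON Schema lets editors autocomplete the config and lets CI reject bad configs early.
        JsonSchema schema = factory.getSchema(Files.newInputStream(schemaFile));
        JsonNode config = mapper.readTree(Files.readString(configFile));
        return schema.validate(config); // empty set == valid
    }
}
```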
CLI
With strict, OAS-compliant APIs, tools such as Restish and auto-generated SDKs can be used easily.
The CLI should be a client with config options, and when talking to Zingg in k8s it should be able to leverage k8s local port-forwarding to reach the Zingg service API. (A sketch follows this section.)
(related #187 )
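Assuming the CLI is just a thin API client, reaching Zingg inside k8s could be as simple as running `kubectl port-forward` and pointing the client at localhost; the port and path below are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class ZinggCliClient {

    public static void main(String[] args) throws Exception {
        // e.g. kubectl port-forward svc/zingg 8080:8080  (run separately, or spawned by the CLI)
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/ready"))
                .GET()
                .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("zingg readiness: HTTP " + response.statusCode());
    }
}
```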