-
Notifications
You must be signed in to change notification settings - Fork 20
Home
Welcome to our Wiki Pages!
Thank you for visiting. Here, we'll explore various use case scenarios for utilizing or setting up cuallee.
Cuallee is designed for teams operating in hybrid environments, where they need to develop using in-memory data structures and then test their quality rules within the same code base, against larger datasets. Additionally, cuallee provides an excellent interface for planning migrations between technologies like Databricks and Snowflake. It allows the definition of consistent checks across different implementations of dataframe data structures.
At its core, cuallee emphasizes creating meaningful and representative tests against evolving data contracts, especially under the governance of different teams or organizations.
Why did we create a software that offers functionality similar to pydeequ
, great-expectations
, and soda-core
? While these tools are inspiring and have lowered the barrier for better data quality in the complex ecosystem of data platforms, they are primarily geared towards enterprise, collaborative, and licensed-supported software.
Cuallee's operational model, from day one, aimed to be a transparent, reliable, and robust solution for large-scale data sources and team profiles. It focuses on portability and ease of use. Differing from the low-code
movement that foster authoring data quality checks without coding experience or through configuration formats like YAML or proprietary meta-languages.
Cuallee championed the idea of maintaining a code-centric approach to quality definitions, focusing on areas where data defects are statistically more likely to appear. This includes the logistics of moving data and the transformation phase, where business logic is needed to process original sources. This approach aligns with the concept of safeguarding data through its life-cycle via unit tests, as originally proposed by the deequ framework.
There were three main reasons for cuallee becoming a standalone solution rather than just a contribution to the pydeequ framework:
- Scala development: Cuallee is written entirely in Python, which benefits from a larger development ecosystem, especially considering Python's dominance in the data ecosystem.
- Computation model: The architecture of initializing a pydeequ session encountered friction during the implementation of quality rules, influencing memory footprint and requiring additional jar files for use in PySpark.
-
Extensibility to other data structures: Cuallee addresses the widely adopted workflow for data experimentation, where teams often start their journeys in
notebooks
or local environments before transitioning to packaged software releases, subject to more stringent criteria like vulnerability management, release management, quality control, telemetry, etc. Plus, the transition toout-of-memory
frameworks like PySpark for distributed execution.
In summary, cuallee
provides a Python-centric, transparent, and robust solution for data quality testing, particularly suited for teams operating in hybrid environments or undergoing technology migrations.