Eliminate Need for User-Facing Schema #94

Open
mdr223 opened this issue Jan 30, 2025 · 4 comments · May be fixed by #104
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@mdr223
Collaborator

mdr223 commented Jan 30, 2025

While some users may want to use Schemas as a nice way to organize fields, others simply want to specify field names, types, and descriptions as part of their PZ program. (See the new .add_column(...) syntax.)

I think we should continue to support Schemas, especially for under-the-hood operations.

Rather than eliminate the Schema class, this issue simply proposes that we rewrite our demos / quickstart(s) to use non-Schema syntax. This issue would also add a page in our documentation which shows users how to use the Schema class should they choose to do so.

To summarize, the acceptance criteria for this issue are:

  1. The majority of user-facing demos / quickstart(s) use code which does not make use of the Schema syntax.
  2. We add a demo which explicitly shows how to use the Schema syntax as a convenience for power users (e.g. demos/schema-demo.py).
  3. Code passes unit tests and demos work end-to-end. This ensures that our internals are no longer dependent upon user-provided Schemas.
@chjuncn
Collaborator

chjuncn commented Feb 3, 2025

I agree we don't need to introduce Schema in the demos; the new syntax doesn't need it.

Question: if Schema is not user-facing anymore, what kind of demo are you expecting?

Further Improvement Options

Instead of Schema, why not use protobuf (https://protobuf.dev/getting-started/pythontutorial/), which is commonly used for data transformation?

  1. For simple use cases, users just use [{name, type, desc}] to define what they want.
  2. For more advanced use cases, users define their own protobuf for their own data.
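For illustration, the simple case in point 1 could look like the minimal sketch below. It assumes each field is a plain dict with `name`, `type`, and `desc` keys, as described above; the `validate_field_specs` helper and the example fields are hypothetical and not part of PZ.

```python
# Hypothetical helper for illustration only; not part of PZ.
REQUIRED_KEYS = {"name", "type", "desc"}


def validate_field_specs(specs):
    """Check that every spec is a dict with name, type, and desc, and that names are unique."""
    seen = set()
    for spec in specs:
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            raise ValueError(f"field spec {spec!r} is missing keys: {sorted(missing)}")
        if not isinstance(spec["name"], str) or not spec["name"].isidentifier():
            raise ValueError(f"invalid field name: {spec['name']!r}")
        if not isinstance(spec["type"], type):
            raise TypeError(f"field {spec['name']!r} must use a Python type, got {spec['type']!r}")
        if spec["name"] in seen:
            raise ValueError(f"duplicate field name: {spec['name']!r}")
        seen.add(spec["name"])
    return specs


# Example fields (assumed, for illustration).
email_fields = [
    {"name": "sender", "type": str, "desc": "Email address of the sender"},
    {"name": "subject", "type": str, "desc": "Subject line of the email"},
]

validate_field_specs(email_fields)
```

The point of the sketch is that the simple case needs nothing beyond built-in Python lists and dicts.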

Why do I propose Protobuf?

  1. Based on my understanding, the current Schema lets users freely define their data format, which is exactly what protobuf is for. Currently I don't see what extra benefits Schema provides. (I may have missed something.)
  2. When there is a widely used tool like protobuf, we shouldn't reinvent the wheel unless we have to.
  3. For advanced use cases, we don't need to explain why users should use protobuf, because there are already many use cases and plenty of documentation out there. Plus, if users are engineers, it's highly likely they already know how to use protobuf.

@chjuncn
Collaborator

chjuncn commented Feb 3, 2025

I refactored the demos as a follow-up to the syntax update in #84. This issue will need more work, so I'm unassigning myself for now.

@chjuncn chjuncn removed their assignment Feb 3, 2025
@mdr223
Collaborator Author

mdr223 commented Feb 3, 2025

Hi @chjuncn, thanks for taking the lead on this issue and refactoring the demos to not rely on the Schema class!

I think the answers to the two points/questions you raised above boil down to one idea: we want to make the experience for new users as simple as possible.

Given this goal, I hope my answers below will make sense.

Question: if Schema is not user-facing anymore, what kind of demo are you expecting?

Even if we don't have Schema in our beginner demos, I still see Schema playing two roles in the system:

  1. Internally, we may want to represent groups of fields computed in a single .[sem_]add_columns() as belonging to a Schema. We can then leverage the class's .union(), .project(), .add_new_fields(), etc. methods to more easily manage DataRecord fields.
  2. Power users may want to explicitly reason about Schemas. For those of us with DB backgrounds it makes sense to think about Schemas, so I don't think we need to eliminate the Schema class entirely. Ultimately, my hope is that methods like .sem_add_columns() will work with both lists of dictionaries and Schemas as input.
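To make the first role concrete, here is a minimal sketch of how grouping fields lets union and projection be handled in one place. `Field` and `FieldGroup` are hypothetical stand-ins written for this example; they are not the PZ Schema class, and its actual `.union()` / `.project()` signatures may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: type
    desc: str


class FieldGroup:
    """Hypothetical stand-in for a Schema: an ordered group of uniquely named fields."""

    def __init__(self, fields):
        self._fields = {}
        for field in fields:
            if field.name in self._fields:
                raise ValueError(f"duplicate field name: {field.name!r}")
            self._fields[field.name] = field

    def field_names(self):
        return list(self._fields)

    def union(self, other):
        """Combine two groups; identical fields merge, conflicting definitions raise."""
        merged = dict(self._fields)
        for name, field in other._fields.items():
            if name in merged and merged[name] != field:
                raise ValueError(f"conflicting definitions for field {name!r}")
            merged[name] = field
        return FieldGroup(merged.values())

    def project(self, names):
        """Keep only the named fields, in the order requested."""
        missing = [n for n in names if n not in self._fields]
        if missing:
            raise KeyError(f"unknown fields: {missing}")
        return FieldGroup(self._fields[n] for n in names)


# Fields already on a record, plus fields computed by one add-columns step (assumed examples).
base = FieldGroup([
    Field("filename", str, "Name of the source file"),
    Field("contents", str, "Raw contents of the file"),
])
added = FieldGroup([
    Field("sender", str, "Email address of the sender"),
    Field("subject", str, "Subject line of the email"),
])

combined = base.union(added)
print(combined.field_names())
print(combined.project(["sender", "subject"]).field_names())
```

Under these assumptions, the fields added by a single step stay together as one group, so merging them into a record's existing fields or projecting them back out is a single method call.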
Instead of Schema, why not use protobuf?

Similar to point 2 above, for new users I think we primarily want to showcase the ability to use simple Python lists & dictionaries to specify Dataset transformations. If we'd like to use Protobuf in addition to the Schema class (or as a replacement) for advanced users, then I think that is a discussion we can have with the broader group. My two cents (based on the Python tutorial for protobuf that you shared) is that Protobuf may be overkill for what we need in PZ, especially if it requires a compilation step (although perhaps this isn't always needed?). In any case, it's something we can discuss with Mike et al. down the road.

@chjuncn
Collaborator

chjuncn commented Feb 4, 2025

Makes sense! I totally agree. To summarize:

  1. We support both list[dict] and Schema as input for .(sem_)add_columns(), with Schema reserved for advanced use cases.
  2. Let's consider using Protobuf if we find that Schema is no longer a good fit. That will need more discussion down the road; for now it is not prioritized.
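One way point 1 might be handled is to normalize both input forms to a single internal representation before doing any work. The sketch below is only an illustration under that assumption: `FieldSpec`, `SimpleSchema`, and `normalize_columns` are hypothetical names, not PZ APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: type
    desc: str


@dataclass(frozen=True)
class SimpleSchema:
    """Hypothetical stand-in for a Schema: a fixed tuple of field specs."""
    fields: tuple


def normalize_columns(cols):
    """Accept either a list of {name, type, desc} dicts or a SimpleSchema; return a list of FieldSpec."""
    if isinstance(cols, SimpleSchema):
        return list(cols.fields)
    if isinstance(cols, list):
        return [FieldSpec(name=c["name"], type=c["type"], desc=c["desc"]) for c in cols]
    raise TypeError(f"expected a list of dicts or a SimpleSchema, got {type(cols).__name__}")


# The same column expressed both ways (assumed example).
as_dicts = [{"name": "sender", "type": str, "desc": "Email address of the sender"}]
as_schema = SimpleSchema(fields=(FieldSpec("sender", str, "Email address of the sender"),))

print(normalize_columns(as_dicts) == normalize_columns(as_schema))
```

With this shape, beginner demos can pass plain lists of dicts while power users pass a Schema, and everything downstream sees one representation.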
