
Refactor, fix, and optimizer filters/rules #1794

Open
wants to merge 4 commits into dsb/depickle

Conversation

@dsblank (Member) commented Nov 22, 2024:

This PR does three things:

  1. Unrolls the recursive call of filter.apply() and splits out single checks into filter.apply_to_one() (see the sketch after this list)
  2. Uses JSON data where possible, passing the data rather than objects. If an object is needed, it will not be recreated more than once.
  3. Adds an optimizer based on rule.map sets
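
A minimal sketch of the apply()/apply_to_one() split in point 1 (illustrative only, not the actual Gramps classes; the "gender" and "handle" keys are just example fields of the raw JSON data):

class ExampleRule:
    def apply_to_one(self, db, data):
        """Check a single raw-data record; return True on a match."""
        return data["gender"] == 1  # illustrative check against the raw data

    def apply(self, db, records):
        """Iterative driver: a plain loop over raw records, no recursion."""
        return [data["handle"] for data in records if self.apply_to_one(db, data)]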

@dsblank self-assigned this Nov 22, 2024
"""
Apply the rule. Return True on a match.
"""
obj = self.get_object(data)
# FIXME: probably can do the rest without object:
if not self.match_substring(0, obj.get_name()):
A contributor commented:

Suggested change:
- if not self.match_substring(0, obj.get_name()):
+ if not self.match_substring(0, data["name"]):

- def apply(self, db, person):
+ def apply_to_one(self, db, data):
+     person = self.get_object(data)
+     # FIXME: can probably do all of this without object:
@stevenyoungs (Contributor) commented Nov 30, 2024:

use data["birth_ref_index"] to avoid constructing the object?

@@ -145,6 +145,8 @@ def apply(self, db, place):

# now we know at least one is given in the filter and is valid

place = self.get_object(data)
A contributor commented:

Can you use data["lat"] and data["long"] to avoid constructing the object?
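
For illustration only, the suggested change amounts to something like this sketch (the surrounding rule code is not shown in the diff above):

# read the coordinates straight from the raw data ...
latitude = data["lat"]
longitude = data["long"]
# ... instead of constructing the Place object first via self.get_object(data)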

@dsblank (Member, Author) commented Nov 30, 2024:

Thanks @stevenyoungs for looking this over and the feedback. I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same. Many places in Gramps use the raw data for speed, including tools, views, displayers, and importers. It would be a shame if filters couldn't do the same.

@stevenyoungs (Contributor) commented:

> Thanks @stevenyoungs for looking this over and the feedback. I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same. Many places in Gramps use the raw data for speed, including tools, views, displayers, and importers. It would be a shame if filters couldn't do the same.

No problem. I'm also keen to see the benefits of more standard storage and see what opportunities it unlocks. The recent discussion of GQL is an interesting option.
I've learnt a lot from following your changes and hopefully my comments have had a minor benefit to the overall quality of the PR.

@Nick-Hall (Member) commented:

> hopefully my comments have had a minor benefit to the overall quality of the PR.

Your feedback is very useful. It always helps for another person to review the code. I very much appreciate your contributions.

@Nick-Hall (Member) commented:

> I'm hoping that @Nick-Hall is open to these three ideas (fixes, using JSON data, and the optimizer) because all three make the filter system so much faster while keeping the API almost the same.

With the new JSON format, I think that we can regard the raw data as providing lightweight objects. The format mirrors the Gramps objects and is therefore much easier to understand and more resilient to updates.

I've been doing some research today to see if we can make the dict format more object-like. For example, data.gender would perhaps be neater than data["gender"]. The dataclasses introduced in Python 3.7 look interesting, but I don't think that they are of much help to us. Using a JSON parser written in C may be an option in the future, but the ones I was looking at deserialized to dict structures.

I am open to using JSON data in more core code, including filters. We should probably discuss where the raw format is acceptable. Third-party code should be careful when using the raw format. It may change between feature releases, not just major releases, but this has always been the case.

An optimizer seems like a good idea. I haven't reviewed your code yet though.

@dsblank (Member, Author) commented Dec 2, 2024:

> I've been doing some research today to see if we can make the dict format more object-like. For example, data.gender would perhaps be neater than data["gender"].

How about just wrapping our raw data with DataDict(data):

class DataDict(dict):
    def __getattr__(self, key):
        value = self[key]
        if isinstance(value, dict):
            return DataDict(value)
        elif isinstance(value, list):
            return DataList(value)
        else:
            return value
        
class DataList(list):
    def __getitem__(self, position):
        value = super().__getitem__(position)
        if isinstance(value, dict):
            return DataDict(value)
        elif isinstance(value, list):
            return DataList(value)
        else:
            return value

It allows exactly what you describe, is low-cost, and doesn't require any more code than that above.
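
For example (a hypothetical snippet; the keys below are illustrative, not the exact Gramps raw-data schema), the wrapper turns key lookups into attribute access, including through nested dicts and lists:

raw = {
    "gender": 1,
    "primary_name": {"first_name": "Ada", "surname_list": [{"surname": "Lovelace"}]},
}
data = DataDict(raw)

data.gender                                # 1, same as data["gender"]
data.primary_name.first_name               # "Ada"; nested dicts come back wrapped
data.primary_name.surname_list[0].surname  # "Lovelace"; list items are wrapped too

One caveat: __getattr__ is only consulted when normal attribute lookup fails, so a key that collides with an existing dict attribute (for example "items" or "values") would still need data["items"]-style access.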

> I am open to using JSON data in more core code, including filters. We should probably discuss where the raw format is acceptable. Third-party code should be careful when using the raw format. It may change between feature releases, not just major releases, but this has always been the case.

Agreed. The rule of thumb has been: use the raw data where it is needed for speed or ease of representation.

@stevenyoungs (Contributor) commented:

The only downside is that we leak the internal representation of the objects into the higher layers of code. But since those layers, in part, already use the raw format, this PR does not make a material difference.

@dsblank marked this pull request as ready for review on December 3, 2024 11:57
@dsblank (Member, Author) commented Dec 3, 2024:

> The only downside is that we leak the internal representation of the objects into the higher layers of code.

For many attributes of an object, it isn't so much "internal" now that we are switching to JSON. For example, this is the Person object API to get the parent family list:

person.parent_family_list # object attribute
person.get_parent_family_handle_list() # function call

With the above DataDict (plus some additional code) and the new JSON format, it could look like this (where person is really JSON data with a wrapper):

person.parent_family_list  # JSON access
person.get_parent_family_handle_list() # function call, instantiates object

The new version would hide the fact that the function call creates the object (and does so only once). And because the JSON data mirrors the actual attribute names, the calling code doesn't have to change at all.
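
As a rough sketch of that idea (building on the DataDict above; object_from_json() is a placeholder name, not an actual Gramps API), the wrapper could serve plain attributes straight from the JSON data and delegate method calls to a lazily built, cached object:

class DataWrapper(DataDict):
    def __getattr__(self, key):
        try:
            # e.g. person.parent_family_list -> straight JSON access
            return super().__getattr__(key)
        except KeyError:
            # e.g. person.get_parent_family_handle_list -> delegate to the
            # full object, which is created on first use and then reused
            return getattr(self._cached_object(), key)

    def _cached_object(self):
        if "_obj" not in self.__dict__:
            self.__dict__["_obj"] = object_from_json(self)  # placeholder factory
        return self.__dict__["_obj"]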
