-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the parse function to accept an entity id #189
Conversation
…if none is supplied
@@ -371,8 +371,13 @@ parse: | |||
author: .//meta[@name="author"]/@content | |||
publishedAt: .//*[@class="date"]/text() | |||
description: .//meta[@property="og:description"]/@content | |||
keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest we use syntax similar to ftm mappings for consistency. Something like https://github.com/alephdata/aleph/blob/main/mappings/md_companies.yml#L15-L17
So the keys section will look like:
keys:
- title
- author
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keys
section needs to be updated now I think?
memorious/operations/parse.py
Outdated
hashlib.md5(key_string.encode("utf-8")).hexdigest() | ||
!= hashlib.md5("".encode("utf-8")).hexdigest() | ||
): | ||
entity_id = hashlib.md5(key_string.encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use make_id
from memorious.helpers.key
instead of making the key using hashlib.
memorious/operations/parse.py
Outdated
for key, value in properties.items(): | ||
properties_dict[key] = html.xpath(value) | ||
|
||
data["entity_id"] = hashlib.md5(data["url"].encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably better and more consistent to make keys out of the supplied keys only instead of falling back on a default. We should instead raise an error in case of null keys so that the user can notice and change their keying strategy.
memorious/operations/aleph.py
Outdated
countries: list[str] = list(context.params.get("countries", [])) | ||
mime_type: str = context.params.get("mime_type", "") | ||
|
||
context.log.warn(languages) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stray warning
memorious/operations/aleph.py
Outdated
published_at=data.get("published_at"), | ||
headers=data.get("headers", {}), | ||
keywords=data.get("keywords", []), | ||
) | ||
|
||
if data.get("aleph_folder_id"): | ||
meta["parent"] = {"id": data.get("aleph_folder_id")} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using 2 different styles here. We are treating meta as a object in the line above and as a dict here. I think we should be consistent and use a single style throughout the function.
And imo we can merge _create_meta_object
and _create_document_metadata
together into just one function and only set whatever metadata is available.
memorious/logic/meta.py
Outdated
@@ -0,0 +1,24 @@ | |||
# from typing_extensions import TypedDict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is not generic enough to be its own module. We should put it in the aleph operation module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's not a module. It's the definition of a type. I would type definitions seperate from modules that actually do stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting the type definition near the relevant code is the convention we're already using in other places. See for example https://github.com/alephdata/followthemoney/blob/master/followthemoney/schema.py#L24
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I'll move the meta file into the operations module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually meant the Meta type definition should live in operations/aleph.py and not in a separate file since it's very specific to the aleph_emit operations and unlikely to be reused anywhere else.
memorious/operations/parse.py
Outdated
@@ -88,22 +92,41 @@ def parse_for_metadata(context, data, html): | |||
if value is not None: | |||
data[key] = value | |||
break | |||
meta_paths.update(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
memorious/operations/parse.py
Outdated
temp_key = "".join(properties[key]) | ||
|
||
if not temp_key == "": | ||
return make_id(temp_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_id
can take multiple arguments. You don't need to join them beforehand.
@@ -371,8 +371,13 @@ parse: | |||
author: .//meta[@name="author"]/@content | |||
publishedAt: .//*[@class="date"]/text() | |||
description: .//meta[@property="og:description"]/@content | |||
keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keys
section needs to be updated now I think?
@@ -45,6 +45,10 @@ pipeline: | |||
author: .//meta[@name="author"]/@content | |||
publishedAt: .//*[@class="date"]/text() | |||
description: .//meta[@property="og:description"]/@content | |||
keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be the method needs to be changed to use the built-in parse
method instead of a custom Python method to match the documentation example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. The problem with doing that though is that the scraper won't be able to extract the body of the article, which is why the custom script exists. I guess, as it's an example it doesn't really matter too much, but that is why we have a difference.
Personally I don't have an issue with having the documentation not match the example in the repo
memorious/operations/meta.py
Outdated
from typing import Optional, TypedDict | ||
|
||
|
||
class MetaBase(TypedDict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetaBase vs Meta difference is no longer used. So I guess it's safe to merge these two together now?
Could you fix the merge conflict too? I think it's from the changes Simon made to fix a couple of issues in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the merge conflict wasn't fixed the way it should be. So the code is left in a broken state. A test for the aleph operations should have caught this. But we don't have any tests for it. May be it's worth adding one now since we are making changes to the aleph_emit operations.
memorious/operations/aleph.py
Outdated
from memorious.logic.context import Context | ||
|
||
|
||
class Meta(MetaBase, total=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetaBase is undefined
make_key(collection_id, foreign_id, content_hash) | ||
) | ||
|
||
if document_id: | ||
context.log.info("Skip aleph upload: %s", foreign_id) | ||
data["aleph_id"] = document["id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document
is undefined
This died |
This includes a tweak to the parse function so that it generates an entity id before creating an entity. There are two ways in which this can occur