Extract links_videos (and links_sounds?) #287

Open
tokee opened this issue May 3, 2022 · 1 comment

Comments

@tokee
Collaborator

tokee commented May 3, 2022

The links_images field is very useful for reverse image search and for showing thumbnails as part of a search result. Similar links_videos and maybe links_sounds fields would have comparable benefits.

Unfortunately, video is messy to extract, as embedding it has historically been hacked together in different ways. Using iframe was popular at one point:

<iframe width="560" height="315" src="http://example.com/43j5jtfrh398" frameborder="0" >
</iframe>

The problem here is that the only indication that the iframe contains a video, rather than an image, an HTML page, or something else, is the URL of the video, and that URL is in no way guaranteed to have a usable extension. Some ideas:

  1. Only populate links_videos with "guaranteed" videos, i.e. those with known video extensions
  2. Index all iframe#src and move the resolve logic to the GUI, first extracting all the URLs, then requesting their content_type_norm-field
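A minimal sketch of idea 1, assuming a hand-picked and non-exhaustive set of video extensions. The names `VIDEO_EXTENSIONS`, `has_video_extension` and `split_iframe_urls` are hypothetical and only illustrate the check. URLs that fail it are the ones idea 2 would have to resolve later through content_type_norm.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: a non-exhaustive list of extensions treated as "guaranteed" video.
VIDEO_EXTENSIONS = {".mp4", ".webm", ".ogv", ".mov", ".avi", ".flv", ".mkv"}


def has_video_extension(url: str) -> bool:
    """Return True if the path part of the URL ends in a known video extension."""
    path = urlsplit(url).path  # drops the query string and fragment
    return PurePosixPath(path).suffix.lower() in VIDEO_EXTENSIONS


def split_iframe_urls(urls):
    """Split iframe#src URLs into (guaranteed videos, unresolved URLs)."""
    videos = [u for u in urls if has_video_extension(u)]
    unresolved = [u for u in urls if not has_video_extension(u)]
    return videos, unresolved
```

With this check the iframe example above ends up in the unresolved list, since `http://example.com/43j5jtfrh398` has no extension at all.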

If method 2 is used, it might be better to have a field links_resources with all inlined resources (except images). That would also catch sounds and make it possible to e.g. check if a page was iframed from somewhere.
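A rough sketch of what collecting links_resources could look like, using only the Python standard library. The tag-to-attribute mapping in `RESOURCE_ATTRS` is an assumption about which elements count as inlined resources, and `ResourceLinkParser` and `extract_links_resources` are hypothetical names, not part of any existing indexer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: tags that inline a resource, mapped to the attribute holding its URL.
# <img> is left out on purpose, since links_images already covers it.
RESOURCE_ATTRS = {
    "iframe": "src",
    "embed": "src",
    "object": "data",
    "video": "src",
    "audio": "src",
    "source": "src",
}


class ResourceLinkParser(HTMLParser):
    """Collects absolute URLs of inlined non-image resources."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links_resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative URLs against the URL of the page itself.
            self.links_resources.append(urljoin(self.base_url, value))


def extract_links_resources(html, base_url):
    parser = ResourceLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links_resources
```

The GUI could then look up content_type_norm for each collected URL to decide whether it is a video, a sound, or an ordinary page.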

@thomasegense
Contributor

thomasegense commented May 3, 2022

Also, don't forget the simple "<video>" and "<audio>" tags.

And while we are at it, maybe also add the <iframe> tag.
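As an illustration of the tag-based approach, here is a small sketch that sorts URLs by the enclosing element, so nested `<source>` children are attributed to their `<video>` or `<audio>` parent. `MediaLinkParser` and the `links_iframes` list are hypothetical names used only for this example. links_videos and links_sounds mirror the field names proposed in this issue.

```python
from html.parser import HTMLParser


class MediaLinkParser(HTMLParser):
    """Sorts src URLs into videos, sounds and iframes based on the tag."""

    def __init__(self):
        super().__init__()
        self.links_videos = []
        self.links_sounds = []
        self.links_iframes = []  # hypothetical field, not proposed in the issue
        self._media_stack = []   # currently open <video>/<audio> elements

    def _target(self, tag):
        return self.links_videos if tag == "video" else self.links_sounds

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag in ("video", "audio"):
            self._media_stack.append(tag)
            if src:
                self._target(tag).append(src)
        elif tag == "source" and self._media_stack and src:
            # A <source> child inherits the media type of its parent element.
            self._target(self._media_stack[-1]).append(src)
        elif tag == "iframe" and src:
            self.links_iframes.append(src)

    def handle_endtag(self, tag):
        if self._media_stack and self._media_stack[-1] == tag:
            self._media_stack.pop()
```

A `<source>` outside of any `<video>` or `<audio>` element is ignored here, since the tag alone does not say whether it refers to a video or a sound.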
