Clarification of playback index requirements #69

Open
anjackson opened this issue Oct 5, 2021 · 18 comments

Comments

@anjackson

When playing back web archives using ReplayWeb.page, we can get very high quality playback, and I think this is down to:

  • how you index POST (and other non-GET???) requests.
  • your fuzzy matching rules.

For large-scale web archives, we need to find ways to support this via OutbackCDX or SolrWayback/webarchive-discovery. Can you give us a clear summary of what we need to do to ensure that pywb (and future ReplayWeb.page?) can achieve the same level of playback quality?

@anjackson
Author

I think we know some of this, but I'm after a clarification. For example (this is based on some comments from @ato, thanks!):

For OutbackCDX, we can populate it with records generated using pywb's cdx-indexer tool, or another tool that meets that specification. Where is that specification? And does this need a recent version of OutbackCDX? (I seem to remember there was some issue with POST request indexing before)? And is the cdx-indexer one up-to-date? I thought you'd changed it recently and it wasn't everywhere yet?

For matching, OutbackCDX has a -y command-line for loading a pywb fuzzy matching file (rules.yaml). Is this the same fuzzy matching ruleset/approach as ReplayWeb.page? Do we have to worry about keeping the match rules in sync?

@ikreymer
Member

ikreymer commented Oct 6, 2021

The short answer is that populating an index produced from cdx-indexer or cdxj-indexer should work when added to OutbackCDX, as long as prefix queries work.

The longer answer is that the process of converting a request/response pair to a URL for inclusion in the cdx(j) index should definitely be documented better. It is currently implemented across several libraries, in both Python and JavaScript.
The implementations currently exist in:
python:

The actual fuzzy matching is done on the client side of the index fetch, either in wabac.js or in pywb, by performing a prefix query against the index server. The fuzzy matching is implemented slightly differently in wabac.js and pywb, but the end results are mostly the same. The domain-specific rules exist at the client level, as the prefix-based query allows for more flexibility in how the fuzzy matching is done.

(An alternative that avoids prefix queries is to create 'fake cdx' entries that can be queried with an exact match. That makes the dependency on the index a bit stricter, so this approach is only used when prefix querying is not available.)
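
To illustrate the client-side approach, here is a minimal Python sketch of fuzzy matching over the results of a prefix query. It is not the wabac.js or pywb implementation: the `IGNORED_PARAMS` rule, the function names and the example URLs are all assumptions made up for this sketch, and real rule sets are domain specific.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical rule: query parameters to ignore when comparing URLs
# (for example cache-busters). Real rules are domain specific.
IGNORED_PARAMS = {"_", "cb", "timestamp"}


def significant_query(url):
    """Return the query parameters that matter for matching, sorted."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return sorted((k, v) for k, v in pairs if k not in IGNORED_PARAMS)


def fuzzy_match(requested_url, prefix_results):
    """Return the first candidate from a prefix query whose significant
    query parameters equal those of the requested URL, or None."""
    wanted = significant_query(requested_url)
    for candidate in prefix_results:
        if significant_query(candidate) == wanted:
            return candidate
    return None


# Stand-in for what an index server might return for a prefix query
# on "https://example.com/api?".
results = [
    "https://example.com/api?id=1&_=111",
    "https://example.com/api?id=2&_=222",
]
match = fuzzy_match("https://example.com/api?id=2&_=999", results)
```

Because the rule lives in the client, changing `IGNORED_PARAMS` needs no re-indexing; the index server only has to answer prefix queries.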

Currently, connecting ReplayWeb.page to OutbackCDX is not yet fully supported, but it definitely could be, especially in combination with nla/outbackcdx#79. For this, it should be possible to use cdx created via cdxj-indexer/cdx-indexer.

But definitely the transformation to post-aware URL should be documented somewhere as a first step.

@anjackson
Author

Thanks for clarifying @ikreymer - so, the pywb rules are now supported by OutbackCDX but it is not strictly necessary to deploy them there, right? i.e. you can do it if it looks like yielding better performance, but it's not a requirement?

@anjackson
Author

anjackson commented Oct 11, 2021

I've just been experimenting with this, so I'm going to drop some notes here (I guess I'll want to review and add on top of webrecorder/pywb#588 ?).

Firstly, support for POST parameters was added to OutbackCDX here, and as such, OutbackCDX >= 0.8.0 is needed to support this feature.

It's also worth noting that the OutbackCDX implementation does not use the urlkey provided by the indexer, but instead calculates its own key and adds the __wb_post_data parameters on the end, if there are any. This means you won't see any sign of __wb_method=POST, and if the post data is empty (__wb_post_data=&...) it will also be ignored. For both these reasons, CDX lines pulled out of OutbackCDX will not be 100% identical to the lines posted to it.

Also note that at the current time, webarchive-discovery does not support this convention.

@ato

ato commented Oct 11, 2021

I suspect that nla/outbackcdx#91 is not needed anymore and was based on old pywb indexing behaviour that has since changed. The cdx-indexer tool at least now seems to add the POST-encoding to the original URL field, not just the canonicalised URL, so the CDX server doesn't need to do anything special to handle it. I haven't confirmed it does that with __wb_post_data, as all the examples I have handy are JSON bodies, but skimming the pywb code I don't see __wb_post_data treated any differently from the other query string modifications.

@ato

ato commented Oct 11, 2021

Actually, I'm looking at an old replayweb.page WARC where it altered the WARC-Target-URI, so I've likely just confused myself.

@ato

ato commented Oct 11, 2021

Yeah OK, indeed the pywb cdx-indexer and cdxj-indexer only append to the canonicalised URL field and leave the original URL untouched. OutbackCDX's normal behaviour is to discard the canonicalised field, as it uses its own urlkey generation.

PR 91, which copies the __wb_post_data parameter back to the original URL, is not enough, as JSON requests don't even use that field but rather create a synthetic query string based on the JSON structure. To properly implement that strategy, it would need to look for __wb_method= and copy it and everything after it.
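
The "look for __wb_method= and copy everything after it" strategy could be sketched roughly as follows. This is only an illustration in Python, not the OutbackCDX or pywb code: the function name is hypothetical, the example values are made up, and the choice of `?` versus `&` as the joining character is an assumption.

```python
def copy_post_suffix(urlkey, original_url):
    """Copy '__wb_method=' and everything after it from the
    canonicalised urlkey onto the original URL."""
    marker = "__wb_method="
    pos = urlkey.find(marker)
    if pos == -1:
        # No POST-encoding present, so leave the URL as it is.
        return original_url
    suffix = urlkey[pos:]
    # Assumption: join with '&' if the URL already has a query string.
    separator = "&" if "?" in original_url else "?"
    return original_url + separator + suffix


# Made-up example values, not real index records.
print(copy_post_suffix(
    "com,example)/graphql?__wb_method=POST&query=abc",
    "https://example.com/graphql",
))
```

Note that the copied suffix is whatever the indexer wrote into the urlkey, so it arrives in canonicalised form.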

What our custom indexing code at NLA does (and what I wrongly thought cdx-indexer did) is add the extra parameters to the original URL field. That then works fine with pywb reading it and with all versions of OutbackCDX.

@ikreymer
Member

Also, (and yes this needs to be documented and will be soon), cdxj-indexer and pywb cdx-indexer also add method and requestBody fields, which you can also use directly, if generating url key differently, eg. <url-key>&__wb_method=<method>&<requestBody>

@anjackson
Author

Is that just &<requestBody> or &requestBody=<requestBody>?

@anjackson
Author

I think there are two separate(ish) problems: One is making sure OutbackCDX+pywb/cdxj-indexer is as good at playback as pywb alone. The second is that there are unresolved issues about how to handle other cases, beyond what pywb does now.

On the former question, I think I'm right in saying that, at the current moment, I can't use pywb/cdxj-indexer with OutbackCDX and have it work as well as pywb works with its own local index. I can use cdxj-indexer and Outback will keep the __wb_post_data, which doesn't cover everything, but works okay. Alternatively, I could attempt to copy @ato and add parameters to the original URL when indexing, but this is not implemented in pywb/cdx-indexer. It's not clear to me right now whether that approach covers more or less than the __wb_post_data approach.

@ikreymer
Member

Yes, it seems currently OutbackCDX isn't quite in sync with the latest indexing, but should be fixable.
Currently the requestBody in the cdx contains the converted JSON post data, so it can be appended directly.
If I understand @ato's comment above, it sounds like if the CDX url field contains the fully converted form, eg. <url>&__wb_method=<method>&<requestBody>, it should then work. Maybe this is the solution for CDX outputs, not CDXJ as well.. (I'm not sure if OutbackCDX supports CDXJ inputs yet or not).
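
As a rough sketch of that fully converted form, the following Python reads one CDXJ line and folds the method and requestBody fields into the URL. The field names come from this thread; the line layout (urlkey, timestamp, JSON block), the sample record, the function name and the `?`/`&` joining rule are assumptions for illustration only.

```python
import json


def post_aware_url(cdxj_line):
    """Build <url>&__wb_method=<method>&<requestBody> from a CDXJ line."""
    # Assumed layout: "<urlkey> <timestamp> <json fields>".
    _urlkey, _timestamp, raw_json = cdxj_line.split(" ", 2)
    fields = json.loads(raw_json)
    url = fields["url"]
    method = fields.get("method")
    if not method:
        # Assumption: records without a method field are plain GETs.
        return url
    separator = "&" if "?" in url else "?"
    combined = f"{url}{separator}__wb_method={method}"
    body = fields.get("requestBody")
    if body:
        # requestBody is already converted, so it is appended directly.
        combined += "&" + body
    return combined


# Made-up sample record.
line = "com,example)/graphql?__wb_method=post&query=abc 20211011000000 " + json.dumps(
    {"url": "https://example.com/graphql", "method": "POST", "requestBody": "query=abc"}
)
print(post_aware_url(line))
```

This also answers the earlier question in sketch form: the body is appended as `&<requestBody>`, not as `&requestBody=<requestBody>`.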

@ikreymer
Member

So maybe we just need an option that, instead of writing separate "url", "requestBody" and "method" fields, combines them into the URL field (initially I was just hesitant to not store the original URL at all). This would be an option for cdxj output and a requirement for cdx output with POST canonicalization, since cdx can't have new fields added.

@ato

ato commented Oct 20, 2021

I just updated to Pywb 2.6.0 and also implemented form-urlencoded POST request encoding in our indexing pipeline (previously we were only doing it for JSON requests). While testing that I discovered that Pywb wasn't actually passing the POST-encoded version of the url through to OutbackCDX. It turns out the pages I was testing earlier didn't care too much about getting the correct graphql response as long as they just got any response, so it was "working" purely by accident. 🤦

For now I seem to have gotten things working by patching XmlQueryIndexSource to prefer params['alt_url'] (which has the POST data encoded into it) over params['url']. But I'm not at all confident that's the correct solution.
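
The idea behind that patch, reduced to a sketch on a plain dictionary: prefer the POST-encoded URL when it is present and fall back to the plain one otherwise. This is not the actual pywb code or the actual patch; only the `alt_url` and `url` keys come from the comment above, and the function name and sample values are made up.

```python
def lookup_url(params):
    """Return the POST-encoded URL if present, otherwise the plain URL."""
    return params.get("alt_url") or params["url"]


# Made-up example parameters.
params = {
    "url": "https://example.com/graphql",
    "alt_url": "https://example.com/graphql?__wb_method=POST&query=abc",
}
print(lookup_url(params))
```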

@anjackson
Author

I'm working on our indexer and re-reading this thread, and still struggling to know what to do. I'm using cdxj-indexer (specifically the CDX11Indexer class) and POSTing the resulting records to OutbackCDX. I don't think that's enough, but I don't really understand.

Should we try to hold a call to work out what the details should be? @ikreymer @ato ?

@kaij

kaij commented Feb 28, 2022

I am the author of #91 and am updating to pywb 2.6.x while using OutbackCDX. I currently see the following points:

  1. Indexing: improved HTTP method handling comes with new parameter output by cdx-indexer - see POST request handling and indexing improvements pywb#636. Incompatible with Support load of multiple WARC files #91 which would need a rewrite to handle this and append it to the original url (mentioned in Clarification of playback index requirements #69 (comment))
  2. Replay: originally, I opened a companion pull request (Add url_post template parameter for remote cdx api pywb#587) which added the correct fields in a new template variable "url_post". This variable had to be used in the config.yaml instead of the normal url.

If I understand @ikreymer correctly, (1) could also be solved in OutbackCDX by integrating the method and requestBody fields from the output of cdx-indexer (from pywb 2.6.x). (Aside question: how can I get these fields? I tried cdx-indexer -a - p ... and I don't see any additional field - is this only for cdxj-indexer?) [Update: Q: can OutbackCDX now fully handle CDXJ?]

For 2) I could imagine a re-implementation of my original pywb pull request webrecorder/pywb#587 which adds the method and requestBody to the url_post variable (the "_post" suffix is just for compatibility to not break existing systems which are not indexing POST requests).

It would be the original outbackcdx and pywb pull requests adjusted to the new requirements. I could try to make these changes, but if there's a simpler solution I'm more than happy.

@ato @ikreymer @anjackson What do you think about this?

PS. I just realized that this discussion is maybe taking place in the wrong project (replayweb.page). Should we split up and move to pywb and outbackcdx?

@ato

ato commented Jun 12, 2023

> (1) could also be solved in OutbackCDX by integrating the method and requestBody fields from the output of cdx-indexer [Update: Q: can OutbackCDX now fully handle CDXJ?]

OutbackCDX can now store arbitrary CDXJ fields (currently this is gated behind the --index-version 5 option). It also now implements pywb-style copying of method and requestBody into the urlkey field, although I'm thinking about maybe in future instead using separate index key fields.

> For 2) I could imagine a re-implementation of my original pywb pull request webrecorder/pywb#587 which adds the method and requestBody to the url_post variable (the "_post" suffix is just for compatibility to not break existing systems which are not indexing POST requests).

I think it might be nicer to actually add method and requestBody as query parameters to the CDX server API.

  1. This gives the server flexibility as to how it internally indexes them. For example a server backed by Solr or an SQL database isn't constrained by trying to pack everything into a traditional CDX file urlkey field and may instead opt to have dedicated indexed fields for method and requestBody.
  2. This ensures we have a way out if mixing the request body into the query string causes problems (e.g. parameter naming conflicts).
  3. This makes the API a bit more straightforward for other users (researchers etc).
  4. Older servers should just ignore the new query parameters so there should be no need to have any special configuration in the Pywb config for POST-aware index servers.
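
A client built on this proposal might construct its queries along these lines. Since this is only a proposal, the `method` and `requestBody` parameter names, the endpoint URL and the function name are all hypothetical; no existing CDX server is claimed to accept them.

```python
from urllib.parse import urlencode


def build_cdx_query(endpoint, url, method=None, request_body=None):
    """Build a CDX server query that passes the method and request body
    as separate (hypothetical) query parameters."""
    params = {"url": url}
    if method and method.upper() != "GET":
        params["method"] = method.upper()
        if request_body:
            params["requestBody"] = request_body
    # urlencode percent-encodes the values, so the request body cannot
    # collide with the target URL's own query string.
    return endpoint + "?" + urlencode(params)


# Hypothetical local index endpoint.
print(build_cdx_query(
    "http://localhost:8080/myindex",
    "https://example.com/graphql",
    method="post",
    request_body="query=abc",
))
```

A server that does not know the extra parameters would simply see an ordinary `url` lookup, which matches point 4 above.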

@kaij

kaij commented Jun 12, 2023

@ato 🚀 Really cool, I'm going to test this and report back. For me, the implemented solution makes absolute sense, many thanks! Just an idea/question: would it maybe make sense to just store the content of PUT/POST as a hash value? Could we run into size problems (length of PUT/POST data)? I also saw you're working on an index upgrader 🎉

I'm working on renewing the Dockerfile, as building the tools now fails with maven:3-eclipse-temurin-17. Do you think there's any chance of upgrading the rocksdbjni image to a newer version? (It is set to 6.20.3 in pom.xml; current is 8.1.1.1.) It is hard for me to estimate the changes at API level, but I could try.

@ato

ato commented Jun 13, 2023

> would it maybe make sense to just store the content of PUT/POST as a hash value?

I think the main downside to this option is that you can't change fuzzy matching rules (e.g. fields to ignore for matching because they contain random, time-specific or user-agent-specific data) without going back to the source WARC records, which can take a lot of processing time for large collections. I guess it also makes the records less detailed for other purposes like troubleshooting or research. Using hashes would certainly make the storage simpler though, particularly for long requests.
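
A small Python sketch of that trade-off, using made-up request bodies and a hypothetical `nonce` field: with the converted body stored, a new "ignore this field" rule can be applied at query time, whereas hashes of the full bodies can no longer be reconciled.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode


def strip_ignored(request_body, ignored):
    """Drop the fields a fuzzy matching rule says to ignore."""
    pairs = parse_qsl(request_body, keep_blank_values=True)
    return urlencode([(k, v) for k, v in pairs if k not in ignored])


def body_hash(request_body):
    """Hash of the full request body, as a hash-only index would store it."""
    return hashlib.sha256(request_body.encode("utf-8")).hexdigest()


# Two captures of the same request that differ only in a random field.
stored_a = "query=abc&nonce=111"
stored_b = "query=abc&nonce=222"

# With the bodies stored, a rule added later still makes them match.
same_after_rule = strip_ignored(stored_a, {"nonce"}) == strip_ignored(stored_b, {"nonce"})

# With only the hashes stored, the two records stay different, and the
# rule cannot be applied without re-reading the source WARC records.
same_hash = body_hash(stored_a) == body_hash(stored_b)
```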

My current personal goal is to at least make OutbackCDX compatible with the CDX/CDXJ "--post-append" indexes that the Webrecorder suite of tools are currently producing as that's already seeing quite a bit of use. I expect this is an area that's going to see more refinement and experimentation over time though. There's the case of responses that differ based on request headers -- an obvious example being content type negotiation, but I'm sure someone's made at least one site somewhere that passes essential API parameters as custom HTTP request headers.

> Could we run into size problems (length of PUT/POST data)?

The Pywb requestBody transformation truncates the converted/canonicalized requestBody to 4096 bytes.
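
For illustration only, a byte-limited truncation of this kind could look like the sketch below. It is not the pywb code: the function and constant names are made up, and treating the limit as UTF-8 bytes and dropping a partially cut trailing character are assumptions.

```python
MAX_BODY_BYTES = 4096  # limit quoted above


def truncate_body(converted_body, limit=MAX_BODY_BYTES):
    """Truncate a converted request body to at most `limit` UTF-8 bytes."""
    data = converted_body.encode("utf-8")[:limit]
    # errors="ignore" drops a multi-byte character cut in half at the end.
    return data.decode("utf-8", errors="ignore")


long_body = "a" * 5000
print(len(truncate_body(long_body)))
```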

> Do you think there's any chance of upgrading the rocksdbjni image to a newer version?

Yep. Intending to do that for the next release as well. I haven't tried the very latest release but I did try 7.x not so long ago and there weren't any API changes. I've opened an issue to remind myself. nla/outbackcdx#114
