-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Further explanation of the Range header for pagination #36
Comments
Thanks for the questions. I'll outline some things here as a start to the conversation and then update the docs once we have some sense that it has been well described. First off, my motivation was to avoid "page" related things as it doesn't really have a stable meaning with mutating data. Instead, ranges should remain stable in meaning, even if their contents change (the range 1..10 always means the same thing, vs what does page 2 ever mean). Using the range mechanism from HTTP provides us with an interesting precedent for this (though it only fully specifies how to do this with byte ranges). I think it is powerful in that it affords the possibility of status codes for invalid ranges, partial results, etc. We chose
NOTE: pinning ranges based on endpoints is currently NOT implemented in the Heroku platform api due to it being weird and difficult to optimize for. It just wasn't a priority and if you can reverse the order of the list you don't tend to need it (and it is hard to reason about/use). Hopefully that starts to explain and flesh out examples, but I'm certainly happy to address further questions you might have. I'll be happy to flesh out the guide itself, but I think it helps to hash it out in this context a bit first. Thanks! |
How would this work with non-unique attributes, like "last name" in an address book? It'd be possible for a range to contain 3 "Smith" entries and the next range would have to start over at the beginning of "Smith", including those same 3 entries:
Or perhaps worse, using the
The only way I can think of to prevent this is to just require unique attribute for ranges, which might not always be feasible. But it seems like it's either that or just warn developers that their ranges may include duplicates in such ranges and to code around it somehow. Any other ideas? A second question involves "incomplete" collections. You mention "partial content", which sounds like a 206 response code, but when would that be used? I can imagine two different scenarios that might make sense for that:
I'd lean toward the former, because it seems to match the spec a bit better, while the use case in the latter could still be achieved by looking at whether the And I could also be overthinking all this. :) |
Hmm. I think so far we have only allowed unique values here, so I hadn't given it that much thought. For simplicity (and at least the time being), continuing to have that limitation seems reasonable. Asking for a range which isn't the whole collection is one way to get a 206. The other way (in our case anyway) is not specifying a range, but having a collection for which there are more than the default range length results. Which is to say, default range is not everything, it is unspecified begin/end but with a limit (say 100). You allude to this in the second point, so I think we are on the same page. This does deviate from the spec in that we have a different default (for byte ranges, not specifying implies getting the whole thing). I think defaulting to a limit is a sensible implementation to choose (for performance reasons), as you don't want outliers with thousands of records to hold the endpoint hostage. Not to mention that there is some value in having the response time for each endpoint be at least reasonably consistent. My expectation is that you would just watch status (and check I'm not sure I follow what you mean wrt restricting |
Sorry, I think I'm not explaining myself thoroughly. The 206 cases you mention make sense to me, but I'm more wondering about a range-within-a-range almost. Imagine I'm querying entries from a log system, and I wan to get all entries from yesterday:
No matter what, I'd expect to get a 206 response, because I'm not going to get the full set of log entries that are available at this endpoint. And the Imagine there are 150 records in the range I originally requested and the first response included 100 of them. Here's what I would expect to see today, based on what I understand of the docs and interacting with the Heroku Platform API:
But if I just use that
So the biggest problem I can think of with this approach is determining whether or not the list was truncated. If your server has a hard limit of 100 items per response, your queries will likely have that limit included as well. But even then, it'd be simple enough to just query for 101 items and if you get 101 items back, you just drop the last item off and add a
So yeah, I'm basically just trying to figure out how this whole process works for consumers that genuinely only want a partial result and just need to know how to get that partial result and how to be sure they've got it all. |
Shouldn't this be "results less than z"? |
@gulopine - I would expect that you would get a 200 if you got the entire range you asked for (you got full content, in terms of your communicated desire) and only 206 if you only received a portion of the requested range. This is similar to have a request without a range works (where start and end are implied to be the first and last records). Good call on the next-range there being inappropriate though. I think 1 is not great in terms of experience. 2 definitely reveals a bug. I think we should add the endpoint to the next-range, but I'm fairly certain we fail to do so currently (we naively expect users are mostly following next-ranges we give, rather than specifying arbitrary ranges). 3 should be true also, in that if you get the whole range you ask for, it should be 200 and have no next-range (both might be broken presently, but that is how it ought to be in my mind). So, yeah, our implementation may be a bit off still, unfortunately, but I think the ideas are sound. Hopefully that clarifies the confusing edges you were running in to? Definitely some work to be done here. @BRMatt you are spot on. I've edited/updated it so that it will be clearer for future readers. |
@geemus I'm nervous about using 200 for that purpose. The spec is pretty clear that a 206 "indicates that the server is successfully fulfilling a range request for the target resource..." and without stipulating further, I take it to mean that a 206 should be used any time the response doesn't include the full scope of the resource. In particular, 206 has different caching semantics associated with it, and I'd be worried that a 200 could cause partial responses to get cached as if they were full responses, which would really mess up any subsequent requests that ask for different ranges, or even the whole resource. Other than that, I think we're on the same page. A 206 with a In the case of straight "give me everything, but in pages", it'd work like this (using
And if we request an explicit to that same resource, it would look like this instead:
Lastly, the specs seem to suggest that any time the response contains the entire range (even if a range was requested that just happened to include everything), the status should be 200, but I don't see that being practical to implement on a server. Finding out if there are more records after what you've queried is pretty easy (the 100+1 example I mentioned earlier) but determining if there were any before it as well either requires two queries and a race condition (by issuing a count along with the filtered query) or some other black magic that I'm not aware of in order to do it in a single query. So I'd suggest a reasonable simplification be that if there was no range selected (no start point and no end point, even if the Maybe I'm overthinking all of this, but I'd just really like to get it nailed down. I agree that the ideas are sound, they just need a little refinement. And I'm still trying to formulate my thoughts about the semantics of 416 Range Not Satisfiable in relation to all of this, so that'll have to wait for another day. :) |
@gulopine hmm. Good point. I guess it should be 206 and perhaps presence/absence of Next-Range is how you know instead? (seems this is also what you suggest). I agree that the "is there earlier stuff" case is weird/tricky. We originally tried to support at least portions of that and found it to be costly, confusing, etc and so dropped it in our implementation at least. I suspect we should try to do a better job of writing up the output of our discussion above so that (hopefully) it will be clearer for others in the future. I certainly welcome any help in that area you are able and willing to provide. Thanks! |
Yeah, that's what I'm suggesting, and it sounds like we're on the same page. I keep going over the specs on this, though, and one line in Section 3.1 of RFC 7233 has me worried: "A proxy MAY discard a Range header field that contains a range unit it does not understand." I had assumed the requirement would be maintain any |
@gulopine no known reports of problems here (but that may not mean much). It is a point of concern, certainly, but perhaps not worth worrying about too much if we don't know of any actual instances of it yet. |
The specification for non-byte range units states that it is the burden of the API provider to ensure that 1..10 returns resources in a meaningful order. So with that in mind, how would I indicate that I want the 11 - 20th resources ordered by date modified, versus 31 - 40th resources ordered by resource name (alphabetical)? Would we cheat and use query string or tuck it in a post body with filters as json? |
11-20 by modified, I would expect to look something like: In both cases you wouldn't inherently know those to be 11-20 or 31-40, you follow |
HI, wondering how to implement a typical pagination using "Range Header", where there should have "total page" and pages (1, 2, 3, 4, 5 ... 10, 11, 12 )presented. And through your discussion, I didn't see any way that I can enable the user to navigate to the "page 5" ( middle of the whole collection, as there's no way to get the range ) directly. Or maybe this mechanism is just applicable for "previous and next" pagination ? |
Great question. Our use case really only needed next/previous style pagination, jumping to arbitrary locations wasn't deeply considered. I imagine this could be layered on with some additional arguments, but again I hadn't given it tons of thought. Could you elaborate more on your use case so I can better understand what might be required? Thanks! |
Pagination with page links is quite common, like below (with bootstrap style): According to my experience, a common implementation to this features, a API ( for example: /books )should:
Hope this will help. |
@bjeanes Yeah, I think that article does a good job of articulating some of the reasons why we went the way we did. Drift in particular is troublesome and something we hoped to avoid. The mechanism described here avoids that by having stale semantics (asking for a particular range will always have the same meaning, though the contents might change, asking for page 5 has very little semantic meaning at all). Thanks for sharing that. @neekey I certainly recognize that the pagination style described is common, but it isn't one I'm fond of and I'm not sure it is actually very useful in many cases. I worry that trying to layer it on here might be a step backwards from what I think is a more meaningful approach, but it is problematic that it can not solve a common use case. Does that clarify? What do you think? |
@bjeanes Thanks for this good article, learned a lot. @geemus The situation is quite clear now. The approach of range-header-pagination is more semantic, and the response will be explicit, in addition, to implement this mechanism of pagination may provide better performance. So I guess for now if a pagination-style is still need, the range-header-pagination will not be the right one to choose ( As the article pointed, a "infinite scrolling" scenario will be more suitable ), and the traditional way will be quite the same as I described above ( in case someone wants to know : ) ) |
It does make me wonder how expensive giving some indication of how far through a resultset you are would be. That seems to be the primary point of page numbers (as anchoring to a boundary value as we do can offer positioning that is at least as stable as page numbers). If you know the count of all values and the count of values before or after your current position, some useful approximate metadata could be provided. I'm thinking ("200-400 of 23,000 results" style messaging). |
@geemus Right now I'm creating a API server based off much of the advice on the repo, of which I'm grateful for. I'm struggling with the pagination, mostly because, our front-end is an Angular app, which very much wants the type of Bootstrap pagination that @neekey brings up. While doing page & size would imply a I am fronting all the API resources with a Amazon CloudFront specifically, and by default differentiation of requests by headers is disabled, and you need to maintain a white-list of headers to cache by, so it might be worthwhile to call that out in comparison to using query parameters. I really like the Github style of utilizing the HTTP Link header to construct the next/previous/first/last pagination URLs, however some of these operations can be time prohibitive given the nature of finding the data on every request. So I would say that next/previous should almost always be provided, but first/last could be avoided if it was considered to be too expensive to calculate. Thoughts? |
@bjeanes, providing total/estimate counts can be very expensive as in most databases it involves a full table scan for an exact count, or estimate counts based on the query planner which execute in a much faster speed, but from my experience in Postgres, these are really only accurate on full table counts, as soon as you start to filter in anyway, the estimate is drastically incorrect. Some SQL databases have solutions based upon triggers, where you maintain an atomic count of rows in a side table for every table that might require exact counts. This is easy in a database where it is single-tenant, however a database where you have many different users all tracking the same type of content (pretty normal case in an API, where you want the total to be based on the user/customer's account) as far as I can tell, please correct me if this is incorrect, requires using partitioning tables so the estimate count is more accurate, or you can put triggers on those tables and maintain a racy-count of the total rows. |
@jamie-stackhouse yup that is my thinking too. Just thinking out loud of ways to make this style of pagination less of a compromise. Unsurprisingly, there may not be much to it... edit I was hoping that because we are already sorting on some given field we can assume it is indexed. Depending on the index type, a count of all values less than or greater than a certain pivot could be done cheaply—in theory. |
Yeah, next/prev should be provided, first/last are of more dubious value (and are often more expensive). I'm not sure how much value is gained by total/progress. I guess it would allow you to display what "page" you are on, but it still wouldn't provide for jumping to particular pages. Definitely more to think about. We were inspired by S3 in many ways, maybe I should revisit which of these mechanisms it does (or doesn't) provide. I think you just check a |
@geemus I think the utility is entirely one of UX. A good example is search results presented to a user. It's semantically relevant to know whether the result set is 100 or 1000 large. Knowing who far "through it" you are could be done client side as you iterate over the requested "pages" but knowing how many there are in total is a black box without extra metadata. |
Man, it's amazing how interesting and neat a topic can be to some and not my girlfriend! @geemus - I might be a little lost on whether you're making a pro-HTTP/Restful API argument or saying that the idea of the "grand total" and "how deep in it" you are is something lost by using the Range header. But assuming it's the latter, a couple of thoughts:
Sorry, more on point, UX is really important, especially fundamental "chemical" aspects, like size, shape, and general nature/behavior of the data or how it's served/presented. (Honestly, I am seriously over this trend of "click for 10 more!" style pagination where as the list grows and grows it goes from mysterious to frustrating to almost creepy). But this is exactly why I latched on to the Range header as a generic paginator. Because it is versatile enough to allow for any unit (pages, chapters, seconds, frames, bytes, records, People, etc) while simple (and thus fairly strict) enough to avoid "extending" to accommodate three different popular browsers bratty attempts at trendsetting or the trappings of last-gen developers trying to use it for some other purpose so that they can fit a different paradigm into a new approach and feel like they are keeping up, etc. Not because Range couldn't be abused and mutated and misused via a simple semicolon or new draft of HTTP, but because (here me out, this is the good part), it's f*ing RANGE. if you screw with it, to do something besides return X-Z of unit Q, then really, it's not range anymore. Man, that was supposed to be shorter. Second point (originally): Like I said, I totally agree with you, not just on a UX level, but on a "everything should be tangible on some level" level. This is why the HTTP response header Content-Size exists (even your browser wants some idea how big the file is going to be), and, I think, why Range has something similar: Content-Range : X-Y/Brajillion Is the issue that this doesn't quite work for APIs that use the "Status 206 : Keep Comin back for more til I say I'm done" model? maybe my girlfriend is right and I do find this all more interesting than I should... |
Sorry, all of the above was for @bjeanes But regarding 'first' and 'last' being dubious, @geemus I think it only appears as such when you are locked down by data having a starting point but no "defined" ending. But it's like that saying about one guys ass being another dogs face, an end is just the beginning backwards: Arbitrary slice: Range: name bar..foo First: Range: name a../100 (names starting at a, up to 100) Last Now if we had a better idea of "right" way to indicate sorting of Range, we could be sure that z../100 would return reverse alphabetical set, not just the few z's left. Maybe: Range: name z../-100 |
@crazytonyi I was responding to @geemus' comment:
My point is that "how much value" is certainly debatable but that there is in fact semantic significance to the size of the result set (in some cases). This might be due to UX (it's relevant to the users' actions, choices, an behaviour how much information they need to assess) or it might be because a system needs to make some programmatic decision (like, "how many API calls might I need to consume" or "how long will I spend processing this data"). |
@bjeanes Right, and I was saying I very much agree that some properties of data and/or the data service (api, protocol, etc) are so inherently fundamental that even though it can be left out or considered "less important" or "secondary", to dismiss these properties would be like disregarding music in favor of science or having all cars be black, or having a bird with no feet . The value of these things have a definite " user experience" or "presentation layer" value and arbitrariness, but they also have a model-layer value that is beyond trivial (like making decisions and interpretations regarding the data/service based on qualities such as size, shape, data format, data structure, etc). I think my somewhat adviserarial response was from my reading of the term UX. I think your emphasis on the inherit need for things like total size/count is actually true beyond (or maybe before) UX consideration and it's because of this "more primitive" value that it needs to have a place in the api and not be left for the client/browser/domain logic to deal with. I guess my main point is I agree with you on it being more than an afterthought, but that generally if something only has UX/UI value, it is actually the type of thing that should be seen as extra weight and trimmed whenever possible. |
Tricky to know how to best deal with this. I think the |
Keeping track of your place can probably be done client-side reasonably well, IMO. |
@bjeanes good point, at least you can if you start at the beginning and move through (but that is likely the common case). So maybe it is just a matter of adding some means of communicating the total? With that and knowledge of the requested max you could work out page numbering, I think? I guess I would lean toward including it as a param to |
We currently implement a custom header on a HEAD request for the resource that returns a total in the header Resource-Count, purely to make parsing easier like @geemus said. I will say, that with super attentive database design, total counts in the style of It takes a very "designed" database to make these counts accurate, and take a predictable amount of time for a multi-tenant system (what isn't that is an API). Just think that if you suggest a total count that is both fairly accurate, and quick, that the technical complexity does go up. |
@itsjamie yeah, I can see how putting it just on head would be a nice tradeoff (it is easily accessible, but it is sort of "lazily" evaluated instead of doing it all the time since it can be costly and/or inaccurate). Definitely true that actually having accurate data in a timely fashion here is rather tricky also. |
Just wanted to leave my two cents as this well-thought discussion earned high google rankings. This also builds on the sort/pagination discussion in issue 81 First I want to get out of the way that heroku added sorting in a non-spec compliant way.
Spaces aren't actually allowed in the spec:
And VCHAR defined as:
%x20 is the space character. Whether this is a practical concern is obviously another story, but I thought it's worth mentioning. So the problem of downloading large, sorted datasets would benefit with http's ability to cache partially retrieved results, but after racking my brain against it yesterday I realized the current spec (even *bis version) doesn't support sorting in the way we want to make it fit with the range header. This leaves us the options of a query parameter or a custom header - I'm opting for the latter. As for the problem of pagination, I'm not following. This key-ed approach seems to conflict with numeric pagination and I think the two would be best left separate. Keys also limit you to a single ascending unique column which is a serious consideration. Finally as @bjeanes alluded to, understanding the size of what you're requesting is not possible with keys. Say someone requests The linked article describing the issue with offset pagination was good, but I believe using keys is a niche solution. Edited: Corrected invalid points |
Could you give a more concrete example around partial caching to help me ensure I understand the issues and/or options? You also are not really able to know how many results you will get with offset (at least in the case of the final "page", which could be any size). I don't think that is inherently problematic, in our usage you are allowed to specify a maximum that you would accept, but the implication is that you will take as many as you can get as long as it does not exceed this. So if you ask for max of billions of records, we would instead just return whatever we felt was a reasonable/acceptable max (say 1000 for example) and not explicitly error. I think this is not that disimilar from the case with offset/page based mechanisms in which you for instance ask for 1000 in a place where there is only a single record (or a small number of records). There are certainly issues around this usage that we have found, but there are also issues with offset based. So although offset has more widespread usage, I'm not convinced one is inherently better or worse (yet anyway). They both have strengths and weaknesses depending on use case and dataset. |
Thanks much for pointing me to the general header section about the spaces. You're right. About not following the spec in other ways, would you mind explaining why you opted for that route as opposed to using a custom header? Some context: I'm not a heroku consumer and only joined in because I was deciding whether and how to use the range header to get sorted rows from a database. It mapped fairly well to your tackling of pagination. My partial caching comment was only in reference to the line in rfc7233 section-3.1: Regarding the number of results, surely with the offset, limit, and total size you can estimate the request size for any page including the last. I must not be understanding your point/intent, and it's also likely not important.
This helps my understanding your design, however I'm now confused since my interpretation of the range header was that the client can expect to retrieve the entire range requested - unless the entire range is shorter than what was requested. Looking around the web I'm not finding whether a 206 with a subset of the range requested is valid. It just seems weird to declare a range satisfiable, yet not satisfy that range. Regardless, if you're already choosing to not follow the spec via the range header then my confusion is misplaced. Your model is working well in a widely used production environment. I appreciate your effort to go a different route and find the practical limitations of it - definitely moves everyone forward. |
I think it could best be described by saying that it originally started as something that could be said to be quite faithful to the RFC, but as we encountered some practical limitations we made some small pragmatic changes to address them which drifted away from the original definition and usage. But we were far enough along at that point that totally moving to a different header seemed like it would cause it's own issue, so we simply continued on the path we had begun upon (for better or worse). If I had it all to do again, there would be some similarities to the current approach, but some differences as well (and I think it would probably end up being somewhat more like Range than it actually ended up being). Live and learn I suppose. Yes, you should be able to estimate the request size for any page but the last, but at least in my experience you often don't know if you are on the last page or not until AFTER you request the page, which would lend me to think that all pages are somewhat unpredictable. Does that make more sense? The final place we landed is certainly a bit weird/awkward, but it is that way because it succeeds in one of the original design tenants, which was that a range has a stable semantic meaning (even if it has unstable contents). ie if you were using this for a literal dictionary, pagination/offset would give you stable results for the 3rd grouping of 100 words (something like the 3rd page in a real life dictionary), but this doesn't really tell you anything about what appears on that page. The way we have things here, on the other hand, is intended to more closely map to the direct/human experience of thinking about these pages. ie that the page can be represented by Does that help? I've thought a little about how I might redo this given the chance, but haven't really worked through the exercise deeply enough to be confident presenting it. |
Ha - all that makes a lot of sense. And I know what you mean about thinking through such a large design decision, a ton of variables to comb through. Thanks for your time! |
Here is a proposal I wrote a little while back that would be much closer to the original Range stuff: https://gist.github.com/geemus/8dfc01bd99367346c25891c915327220 |
I appreciate the write-up. Would you mind expanding on:
I just haven't thought of transactional issues and am very interested in the problems you came across. As for your the rest - your simplified design differs slightly from mine mostly due to requirements. We also share the same unresolved question surrounding conditional requests haha. I'm considering that out of scope for the sake of moving forward. I'm currently implementing support for multiple ranges and sorting. I chose to support multiple ranges as motivation to learn how they work (very new to rest). Sorting will be done by a custom header and throws a 400 if incorrect sorting columns were requested. Nothing complicated there. After I have an initial draft of my my sqlite-to-rest library, I'll post back with a gist link detailing my implementation for future interested readers and the code will be open source. I will keep that gist related to the range header handling. Edit: Reworded/moved stuff around |
Gosh, on the train of multiple/compound ranges - I haven't really dug into http2 prior to today. I forgot that it removes the overhead associated with many connections, defeating any purpose behind multi-range requests. Not only that, but I was looking around to see if there exists a browser-compatible multipart response parser and didn't find any that were light-weight much less browser-compatible. It seems that in practice, multipart communication is consumed by servers (e.g. multipart/form-data). There are quite a few reasons not to support multi-range requests after reading a lot about the subject today, the above two being the most compelling for my use-case. Not that you were ever seriously considering support for heroku's api, but in case you were, supporting http2 (should happen eventually) and advising your users to move toward it is a better alternative. |
Very true, in general I look forward very much to http2 becoming more broadly available/usable/used. |
The link to the Heroku platform docs helps to a certain degree, but it doesn't explain what the benefits are to using this approach to pagination. Given that there's not much tooling available for using the Range header this way, it might be good to describe why it's better than just, for example, embedding page details and next/previous URLs in the JSON response.
Also, some additional detail on how pagination would work in different scenarios would be useful. The platform docs only mention how to sort and specify a maximum value (and what does ".." mean in this context?). It'd be good to see how to specify a range bounded on both ends, and also how to respond when the returned range isn't the same as what was requested (either too many results were requested or there's not enough data to fill the request range).
I expect those docs were written with the expectation that we'd play with the Heroku platform API, but for a broader-purpose doc like this, it'd be good to see some real-world use cases without having to resort to experimenting with a specific implementation.
The text was updated successfully, but these errors were encountered: