Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further explanation of the Range header for pagination #36

Open
gulopine opened this issue Jul 7, 2014 · 42 comments
Open

Further explanation of the Range header for pagination #36

gulopine opened this issue Jul 7, 2014 · 42 comments

Comments

@gulopine
Copy link

gulopine commented Jul 7, 2014

The link to the Heroku platform docs helps to a certain degree, but it doesn't explain what the benefits are to using this approach to pagination. Given that there's not much tooling available for using the Range header this way, it might be good to describe why it's better than just, for example, embedding page details and next/previous URLs in the JSON response.

Also, some additional detail on how pagination would work in different scenarios would be useful. The platform docs only mention how to sort and specify a maximum value (and what does ".." mean in this context?). It'd be good to see how to specify a range bounded on both ends, and also how to respond when the returned range isn't the same as what was requested (either too many results were requested or there's not enough data to fill the request range).

I expect those docs were written with the expectation that we'd play with the Heroku platform API, but for a broader-purpose doc like this, it'd be good to see some real-world use cases without having to resort to experimenting with a specific implementation.

@geemus
Copy link
Member

geemus commented Jul 7, 2014

Thanks for the questions. I'll outline some things here as a start to the conversation and then update the docs once we have some sense that it has been well described.

First off, my motivation was to avoid "page" related things as it doesn't really have a stable meaning with mutating data. Instead, ranges should remain stable in meaning, even if their contents change (the range 1..10 always means the same thing, vs what does page 2 ever mean).

Using the range mechanism from HTTP provides us with an interesting precedent for this (though it only fully specifies how to do this with byte ranges). I think it is powerful in that it affords the possibility of status codes for invalid ranges, partial results, etc.

We chose .. as a delimiter. Byte ranges use -, but we didn't feel that was unique enough as uuids at least also contain this value, so .. seems less ambiguous and is an established convention elsewhere for use as a delimiter. You can supply one or both ends, so if we were ranging over the alphabet, you can do things like:

  • a..z to get everything
  • a.. to get the default size worth of results greater than or equal to a
  • ..z to get the default size worth of results less than or equal to z
  • ]a.. results greater than a (not greater than or equal as above)
  • ..z[ results less than z (not less than or equal as above)

NOTE: pinning ranges based on endpoints is currently NOT implemented in the Heroku platform api due to it being weird and difficult to optimize for. It just wasn't a priority and if you can reverse the order of the list you don't tend to need it (and it is hard to reason about/use).

Hopefully that starts to explain and flesh out examples, but I'm certainly happy to address further questions you might have. I'll be happy to flesh out the guide itself, but I think it helps to hash it out in this context a bit first. Thanks!

@gulopine
Copy link
Author

gulopine commented Jul 8, 2014

How would this work with non-unique attributes, like "last name" in an address book? It'd be possible for a range to contain 3 "Smith" entries and the next range would have to start over at the beginning of "Smith", including those same 3 entries:

Content-Range: price..smith
Next-Range: smith..wallace

Or perhaps worse, using the ] approach would skip any remaining "Smith" entries altogether:

Content-Range: price..smith
Next-Range: ]smith..

The only way I can think of to prevent this is to just require unique attribute for ranges, which might not always be feasible. But it seems like it's either that or just warn developers that their ranges may include duplicates in such ranges and to code around it somehow. Any other ideas?

A second question involves "incomplete" collections. You mention "partial content", which sounds like a 206 response code, but when would that be used? I can imagine two different scenarios that might make sense for that:

  • The returned range doesn't include all the item in the collections, full stop. To my reading, this is what the spec had in mind, because it's a way to distinguish the sub-collection from the URL that conceptually covers the entire collection. So if I request a range that skips any items, I'd get a 206. If I don't request a range or if it includes everything, I'd get a 200.
  • The returned range doesn't include everything I asked for. If I request a range that happens to include 57 items, but my app only returns 50 items per batch, it could return a 206, while if the range only includes 42 items, it could send a 200. This could be useful for indicating that the client needs to use the range in the Next-Range header in a subsequent request to get the rest of the results.

I'd lean toward the former, because it seems to match the spec a bit better, while the use case in the latter could still be achieved by looking at whether the Next-Range header exists. Of course, that would mean adding a restriction to when Next-Range is used as well, though. Does that line up with what you would have in mind?

And I could also be overthinking all this. :)

@geemus
Copy link
Member

geemus commented Jul 8, 2014

Hmm. I think so far we have only allowed unique values here, so I hadn't given it that much thought. For simplicity (and at least the time being), continuing to have that limitation seems reasonable.

Asking for a range which isn't the whole collection is one way to get a 206. The other way (in our case anyway) is not specifying a range, but having a collection for which there are more than the default range length results. Which is to say, default range is not everything, it is unspecified begin/end but with a limit (say 100). You allude to this in the second point, so I think we are on the same page. This does deviate from the spec in that we have a different default (for byte ranges, not specifying implies getting the whole thing).

I think defaulting to a limit is a sensible implementation to choose (for performance reasons), as you don't want outliers with thousands of records to hold the endpoint hostage. Not to mention that there is some value in having the response time for each endpoint be at least reasonably consistent. My expectation is that you would just watch status (and check Next-Range only when you get non-200). I agree that this deviates from the spec, so perhaps it is worth further consideration (but this is where we are with it presently).

I'm not sure I follow what you mean wrt restricting Next-Range, could you elaborate? I only expect Next-Range as a header returned in responses, not something passed in requests. As such, I don't think you need restrict it (outside of not including it when there is no further content). Does that make sense?

@gulopine
Copy link
Author

Sorry, I think I'm not explaining myself thoroughly. The 206 cases you mention make sense to me, but I'm more wondering about a range-within-a-range almost. Imagine I'm querying entries from a log system, and I wan to get all entries from yesterday:

Range: timestamp 2014-07-16T00:00:00-07:00..2014-07-17T00:00:00-07:00

No matter what, I'd expect to get a 206 response, because I'm not going to get the full set of log entries that are available at this endpoint. And the Content-Range header would include the range that was actually returned. But how do I know whether the results I got are everything I requested? And how would I know what range to request next time in order to get what's left?

Imagine there are 150 records in the range I originally requested and the first response included 100 of them. Here's what I would expect to see today, based on what I understand of the docs and interacting with the Heroku Platform API:

Content-Range: timestamp 2014-07-16T00:00:00-07:00..2014-07-16T18:00:00-07:00
Next-Range: timestamp ]2014-07-16T16:00:00-07:00..

But if I just use that Next-Range as given, it would get me the next 100 results, 50 of which are outside the range I'm asking for. After all, there's a difference between "I'm requesting this range of data because I'm iterating over the entire set" and "I really only want this range of data and nothing more." So in this case, I can see a few options:

  • Simply expect me, the developer, to look at the Next-Range header, compare the first value to the maximum value I need, and if it's greater (or equal to, with the ] prefix), I'm done. Otherwise, I need to add my end range point after the .. and do another request, which then requires going through this again to figure if I have all of my data.
  • Have the server determine if there were items that got truncated, and if so, automatically add the endpoint of the range back into the Next-Range header: Next-Range: ]2014-07-16T16:00:00-07:00..2014-07-17T00:00:00-07:00. Then I, as a developer, can just check to see whether there's an endpoint on the Next-Range header. If there is, I issue a new request using that range. If not, I'm done.
  • Like above, except that the server doesn't even include the Next-Range header if the range I requested has been satisfied. This is what I meant by restricting it: restricting when it's output at all. This way, a 206 just tells me that I'm not getting everything at the endpoint. The presence of a Next-Range header on top of that tells me that I didn't even get what I requested, and gives me a way to make the next request I need. If there's a Next-Range header, I use it. Otherwise, I'm done.

So the biggest problem I can think of with this approach is determining whether or not the list was truncated. If your server has a hard limit of 100 items per response, your queries will likely have that limit included as well. But even then, it'd be simple enough to just query for 101 items and if you get 101 items back, you just drop the last item off and add a Next-Range header accordingly.

</novel>

So yeah, I'm basically just trying to figure out how this whole process works for consumers that genuinely only want a partial result and just need to know how to get that partial result and how to be sure they've got it all.

@BRMatt
Copy link

BRMatt commented Jul 18, 2014

@geemus

..z[ results greater than z (not less than or equal as above)

Shouldn't this be "results less than z"?

@geemus
Copy link
Member

geemus commented Jul 18, 2014

@gulopine - I would expect that you would get a 200 if you got the entire range you asked for (you got full content, in terms of your communicated desire) and only 206 if you only received a portion of the requested range. This is similar to have a request without a range works (where start and end are implied to be the first and last records).

Good call on the next-range there being inappropriate though. I think 1 is not great in terms of experience. 2 definitely reveals a bug. I think we should add the endpoint to the next-range, but I'm fairly certain we fail to do so currently (we naively expect users are mostly following next-ranges we give, rather than specifying arbitrary ranges). 3 should be true also, in that if you get the whole range you ask for, it should be 200 and have no next-range (both might be broken presently, but that is how it ought to be in my mind).

So, yeah, our implementation may be a bit off still, unfortunately, but I think the ideas are sound. Hopefully that clarifies the confusing edges you were running in to? Definitely some work to be done here.

@BRMatt you are spot on. I've edited/updated it so that it will be clearer for future readers.

@gulopine
Copy link
Author

@geemus I'm nervous about using 200 for that purpose. The spec is pretty clear that a 206 "indicates that the server is successfully fulfilling a range request for the target resource..." and without stipulating further, I take it to mean that a 206 should be used any time the response doesn't include the full scope of the resource. In particular, 206 has different caching semantics associated with it, and I'd be worried that a 200 could cause partial responses to get cached as if they were full responses, which would really mess up any subsequent requests that ask for different ranges, or even the whole resource.

Other than that, I think we're on the same page. A 206 with a Next-Range header means the client should issue a new request with the new Range header, while a 206 without a Next-Range header would be considered complete. That seems like it'd be pretty reasonable to implement for both client and a server, while still having a pretty solid definition.

In the case of straight "give me everything, but in pages", it'd work like this (using < to indicate requests and > to indicate responses and a collection resource with 150 records and a response size of 100):

< Range: id ..                   (or no Range header at all)
> 206 Partial Response
> Content-Range: id 1..100
> Next-Range: id ]100..          (note the lack of an endpoint)

< Range: id ]100..
> 206 Partial Response
> Content-Range: id 101..150     (note the lack of Next-Range header)

And if we request an explicit to that same resource, it would look like this instead:

< Range: id 12..148
> 206 Partial Response
> Content-Range: id 12..111
> Next-Range: id ]111..148       (note the echo of the original endpoint)

< Range: id ]111..148
> 206 Partial Response
> Content-Range: id 111..148     (note the lack of Next-Range header)

Lastly, the specs seem to suggest that any time the response contains the entire range (even if a range was requested that just happened to include everything), the status should be 200, but I don't see that being practical to implement on a server. Finding out if there are more records after what you've queried is pretty easy (the 100+1 example I mentioned earlier) but determining if there were any before it as well either requires two queries and a race condition (by issuing a count along with the filtered query) or some other black magic that I'm not aware of in order to do it in a single query. So I'd suggest a reasonable simplification be that if there was no range selected (no start point and no end point, even if the Range header was supplied), then the server could just query the whole range with the +1 trick to figure out if it has everything or not. If the +1 record is returned, then issue a 206. Otherwise, issue a 200. But again, only when the range was completely open-ended.

Maybe I'm overthinking all of this, but I'd just really like to get it nailed down. I agree that the ideas are sound, they just need a little refinement. And I'm still trying to formulate my thoughts about the semantics of 416 Range Not Satisfiable in relation to all of this, so that'll have to wait for another day. :)

@geemus
Copy link
Member

geemus commented Jul 21, 2014

@gulopine hmm. Good point. I guess it should be 206 and perhaps presence/absence of Next-Range is how you know instead? (seems this is also what you suggest). I agree that the "is there earlier stuff" case is weird/tricky. We originally tried to support at least portions of that and found it to be costly, confusing, etc and so dropped it in our implementation at least. I suspect we should try to do a better job of writing up the output of our discussion above so that (hopefully) it will be clearer for others in the future. I certainly welcome any help in that area you are able and willing to provide. Thanks!

@gulopine
Copy link
Author

Yeah, that's what I'm suggesting, and it sounds like we're on the same page. I keep going over the specs on this, though, and one line in Section 3.1 of RFC 7233 has me worried:

"A proxy MAY discard a Range header field that contains a range unit it does not understand."

I had assumed the requirement would be maintain any Range headers the proxy didn't understand, so I'm surprised to see explicit permission granted to discard them. I assume you don't have any reports of proxies discarding these, but it makes me nervous about relying a header that might get dropped in transit. Thoughts?

@geemus
Copy link
Member

geemus commented Jul 21, 2014

@gulopine no known reports of problems here (but that may not mean much). It is a point of concern, certainly, but perhaps not worth worrying about too much if we don't know of any actual instances of it yet.

@crazytonyi
Copy link

The specification for non-byte range units states that it is the burden of the API provider to ensure that 1..10 returns resources in a meaningful order. So with that in mind, how would I indicate that I want the 11 - 20th resources ordered by date modified, versus 31 - 40th resources ordered by resource name (alphabetical)? Would we cheat and use query string or tuck it in a post body with filters as json?

@geemus
Copy link
Member

geemus commented Feb 2, 2015

11-20 by modified, I would expect to look something like: Range: modified_on 02-01-2015..02-01-2016
31-40 by resource name, something like Range: name bar..foo

In both cases you wouldn't inherently know those to be 11-20 or 31-40, you follow Next-Range headers to get to here. Also, I assume that the thing is ordered by the unit, but it could probably be better explained. Does that make sense/help?

@neekey
Copy link

neekey commented Feb 21, 2015

HI, wondering how to implement a typical pagination using "Range Header", where there should have "total page" and pages (1, 2, 3, 4, 5 ... 10, 11, 12 )presented. And through your discussion, I didn't see any way that I can enable the user to navigate to the "page 5" ( middle of the whole collection, as there's no way to get the range ) directly.

Or maybe this mechanism is just applicable for "previous and next" pagination ?

@geemus
Copy link
Member

geemus commented Feb 25, 2015

Great question. Our use case really only needed next/previous style pagination, jumping to arbitrary locations wasn't deeply considered. I imagine this could be layered on with some additional arguments, but again I hadn't given it tons of thought. Could you elaborate more on your use case so I can better understand what might be required? Thanks!

@neekey
Copy link

neekey commented Feb 26, 2015

Pagination with page links is quite common, like below (with bootstrap style):

image

According to my experience, a common implementation to this features, a API ( for example: /books )should:

  • receive two parameters: page and size to indicate the page to jump and rows for per page respectively
  • the response should include:
    • total page
    • total row count
    • current page
  • what if the user navigates to a page that exceeds the range? two strategies:
    • return empty list
    • navigate the use to the nearest page by present the nearest pagination info with “current page”

Hope this will help.

@bjeanes
Copy link

bjeanes commented Feb 26, 2015

@neekey another aspect to consider here is that the page/size style of pagination, while common, has some implementation drawbacks to consider and relies on LIMIT/OFFSET equivalents in many data stores, which are not without flaws. There's a great read on this topic here.

@geemus
Copy link
Member

geemus commented Feb 26, 2015

@bjeanes Yeah, I think that article does a good job of articulating some of the reasons why we went the way we did. Drift in particular is troublesome and something we hoped to avoid. The mechanism described here avoids that by having stale semantics (asking for a particular range will always have the same meaning, though the contents might change, asking for page 5 has very little semantic meaning at all). Thanks for sharing that.

@neekey I certainly recognize that the pagination style described is common, but it isn't one I'm fond of and I'm not sure it is actually very useful in many cases. I worry that trying to layer it on here might be a step backwards from what I think is a more meaningful approach, but it is problematic that it can not solve a common use case. Does that clarify? What do you think?

@neekey
Copy link

neekey commented Feb 27, 2015

@bjeanes Thanks for this good article, learned a lot.

@geemus The situation is quite clear now. The approach of range-header-pagination is more semantic, and the response will be explicit, in addition, to implement this mechanism of pagination may provide better performance.

So I guess for now if a pagination-style is still need, the range-header-pagination will not be the right one to choose ( As the article pointed, a "infinite scrolling" scenario will be more suitable ), and the traditional way will be quite the same as I described above ( in case someone wants to know : ) )

@bjeanes
Copy link

bjeanes commented Feb 27, 2015

It does make me wonder how expensive giving some indication of how far through a resultset you are would be. That seems to be the primary point of page numbers (as anchoring to a boundary value as we do can offer positioning that is at least as stable as page numbers). If you know the count of all values and the count of values before or after your current position, some useful approximate metadata could be provided. I'm thinking ("200-400 of 23,000 results" style messaging).

@itsjamie
Copy link

@geemus Right now I'm creating a API server based off much of the advice on the repo, of which I'm grateful for. I'm struggling with the pagination, mostly because, our front-end is an Angular app, which very much wants the type of Bootstrap pagination that @neekey brings up.

While doing page & size would imply a OFFSET / LIMIT SQL approach, it could be changed to do size / since to implement the style based off modified times.

I am fronting all the API resources with a Amazon CloudFront specifically, and by default differentiation of requests by headers is disabled, and you need to maintain a white-list of headers to cache by, so it might be worthwhile to call that out in comparison to using query parameters.

I really like the Github style of utilizing the HTTP Link header to construct the next/previous/first/last pagination URLs, however some of these operations can be time prohibitive given the nature of finding the data on every request. So I would say that next/previous should almost always be provided, but first/last could be avoided if it was considered to be too expensive to calculate. Thoughts?

@itsjamie
Copy link

@bjeanes, providing total/estimate counts can be very expensive as in most databases it involves a full table scan for an exact count, or estimate counts based on the query planner which execute in a much faster speed, but from my experience in Postgres, these are really only accurate on full table counts, as soon as you start to filter in anyway, the estimate is drastically incorrect.

Some SQL databases have solutions based upon triggers, where you maintain an atomic count of rows in a side table for every table that might require exact counts.

This is easy in a database where it is single-tenant, however a database where you have many different users all tracking the same type of content (pretty normal case in an API, where you want the total to be based on the user/customer's account) as far as I can tell, please correct me if this is incorrect, requires using partitioning tables so the estimate count is more accurate, or you can put triggers on those tables and maintain a racy-count of the total rows.

@bjeanes
Copy link

bjeanes commented Feb 27, 2015

@jamie-stackhouse yup that is my thinking too. Just thinking out loud of ways to make this style of pagination less of a compromise. Unsurprisingly, there may not be much to it... edit I was hoping that because we are already sorting on some given field we can assume it is indexed. Depending on the index type, a count of all values less than or greater than a certain pivot could be done cheaply—in theory.

@geemus
Copy link
Member

geemus commented Mar 2, 2015

Yeah, next/prev should be provided, first/last are of more dubious value (and are often more expensive).

I'm not sure how much value is gained by total/progress. I guess it would allow you to display what "page" you are on, but it still wouldn't provide for jumping to particular pages. Definitely more to think about. We were inspired by S3 in many ways, maybe I should revisit which of these mechanisms it does (or doesn't) provide. I think you just check a truncated boolean to know if you should continue though, which is similar to what we have here.

@bjeanes
Copy link

bjeanes commented Apr 6, 2015

@geemus I think the utility is entirely one of UX. A good example is search results presented to a user. It's semantically relevant to know whether the result set is 100 or 1000 large. Knowing who far "through it" you are could be done client side as you iterate over the requested "pages" but knowing how many there are in total is a black box without extra metadata.

@crazytonyi
Copy link

Man, it's amazing how interesting and neat a topic can be to some and not my girlfriend!

@geemus - I might be a little lost on whether you're making a pro-HTTP/Restful API argument or saying that the idea of the "grand total" and "how deep in it" you are is something lost by using the Range header. But assuming it's the latter, a couple of thoughts:

  1. To your point on UX, I am inclined to agree, but my dirty little secret is that I don't think standards/protocols should be designed to serve humans. Not because humans aren't great or because I've gone so native that I think the end result of human consumption is trivial, but because humans are needy, fickle, never-satisfied jerks (just ask my gf). Which is to say that I totally agree that there is a fundamental value--beyond "I like my buttons plump" or "helvetica just makes me feel nice"--to having some sense of size and magnitude when approaching tons of data. Knowing there are 50,000 page results lets me know I'll probably get what I need in the first 2 pages. If there are only 5 pages, I'll make the effort to go through everything.
  2. ...buuut, humans are silly needy etc etc. And the value of something like an HTTP API (among other things) is to consolidate as much as possible the verbs/actions/syntax/language between humans and machines. Not because machines are more reliable or less needy (unfortunately, machines have their own trends and changes that you can't count on), but because machines actually represent a huge part of the client base for a data service. I don't just mean browsers and crawlers, but readers (for those who surf the web with their ears), mash-up apps, or that integration suite your boss keeps saying you have to start using since they paid 10k dollars for that free thing!

Sorry, more on point, UX is really important, especially fundamental "chemical" aspects, like size, shape, and general nature/behavior of the data or how it's served/presented. (Honestly, I am seriously over this trend of "click for 10 more!" style pagination where as the list grows and grows it goes from mysterious to frustrating to almost creepy). But this is exactly why I latched on to the Range header as a generic paginator. Because it is versatile enough to allow for any unit (pages, chapters, seconds, frames, bytes, records, People, etc) while simple (and thus fairly strict) enough to avoid "extending" to accommodate three different popular browsers bratty attempts at trendsetting or the trappings of last-gen developers trying to use it for some other purpose so that they can fit a different paradigm into a new approach and feel like they are keeping up, etc. Not because Range couldn't be abused and mutated and misused via a simple semicolon or new draft of HTTP, but because (here me out, this is the good part), it's f*ing RANGE. if you screw with it, to do something besides return X-Z of unit Q, then really, it's not range anymore.

Man, that was supposed to be shorter.

Second point (originally):

Like I said, I totally agree with you, not just on a UX level, but on a "everything should be tangible on some level" level. This is why the HTTP response header Content-Size exists (even your browser wants some idea how big the file is going to be), and, I think, why Range has something similar:

Content-Range : X-Y/Brajillion

Is the issue that this doesn't quite work for APIs that use the "Status 206 : Keep Comin back for more til I say I'm done" model?

maybe my girlfriend is right and I do find this all more interesting than I should...

@crazytonyi
Copy link

Sorry, all of the above was for @bjeanes

But regarding 'first' and 'last' being dubious, @geemus I think it only appears as such when you are locked down by data having a starting point but no "defined" ending. But it's like that saying about one guys ass being another dogs face, an end is just the beginning backwards:

Arbitrary slice:

Range: name bar..foo

First:

Range: name a../100

(names starting at a, up to 100)

Last
Range: name z../100
(names starting at end, up to 100)

Now if we had a better idea of "right" way to indicate sorting of Range, we could be sure that z../100 would return reverse alphabetical set, not just the few z's left. Maybe:

Range: name z../-100

@bjeanes
Copy link

bjeanes commented Apr 7, 2015

@crazytonyi I was responding to @geemus' comment:

I'm not sure how much value is gained by total/progress.

My point is that "how much value" is certainly debatable but that there is in fact semantic significance to the size of the result set (in some cases). This might be due to UX (it's relevant to the users' actions, choices, an behaviour how much information they need to assess) or it might be because a system needs to make some programmatic decision (like, "how many API calls might I need to consume" or "how long will I spend processing this data").

@crazytonyi
Copy link

@bjeanes Right, and I was saying I very much agree that some properties of data and/or the data service (api, protocol, etc) are so inherently fundamental that even though it can be left out or considered "less important" or "secondary", to dismiss these properties would be like disregarding music in favor of science or having all cars be black, or having a bird with no feet . The value of these things have a definite " user experience" or "presentation layer" value and arbitrariness, but they also have a model-layer value that is beyond trivial (like making decisions and interpretations regarding the data/service based on qualities such as size, shape, data format, data structure, etc).

I think my somewhat adviserarial response was from my reading of the term UX. I think your emphasis on the inherit need for things like total size/count is actually true beyond (or maybe before) UX consideration and it's because of this "more primitive" value that it needs to have a place in the api and not be left for the client/browser/domain logic to deal with.

I guess my main point is I agree with you on it being more than an afterthought, but that generally if something only has UX/UI value, it is actually the type of thing that should be seen as extra weight and trimmed whenever possible.

@geemus
Copy link
Member

geemus commented Apr 7, 2015

Tricky to know how to best deal with this. I think the foo..bar/baz helps a little with a final endpoint, but certainly doesn't provide overall context. In cases where that is desired or needed, perhaps it should live in extra params, ie foo..bar; total=200 or alternatively we could consider using other headers altogether. A count shouldn't be too expensive to calculate, though keeping track of your place in this might be more so.

@bjeanes
Copy link

bjeanes commented Apr 7, 2015

Keeping track of your place can probably be done client-side reasonably well, IMO.

@geemus
Copy link
Member

geemus commented Apr 8, 2015

@bjeanes good point, at least you can if you start at the beginning and move through (but that is likely the common case). So maybe it is just a matter of adding some means of communicating the total? With that and knowledge of the requested max you could work out page numbering, I think?

I guess I would lean toward including it as a param to Content-Range, probably just total. I like that this avoids new headers and provides some nice parity to the other stuff like max on the way in. That said, I can't think of other cases that have params like this in response headers (or at least I think of them mostly as request header things). Also, in terms of actually using it, a separate header is easier (parsing the key/value style of headers is, honestly, kind of annoying).

@itsjamie
Copy link

itsjamie commented Apr 9, 2015

We currently implement a custom header on a HEAD request for the resource that returns a total in the header Resource-Count, purely to make parsing easier like @geemus said.

I will say, that with super attentive database design, total counts in the style of SELECT COUNT(*) FROM users; can be optimized in Postgres by using the estimate counts provided by the query optimizer or using VACUUM'd data directly. However, as soon as you get into filtering with WHERE you hit major issues with the estimate counts from the query planner, in that they are wildly inaccurate. So, you're back to doing a full table scan.

It takes a very "designed" database to make these counts accurate, and take a predictable amount of time for a multi-tenant system (what isn't that is an API). Just think that if you suggest a total count that is both fairly accurate, and quick, that the technical complexity does go up.

@geemus
Copy link
Member

geemus commented Apr 9, 2015

@itsjamie yeah, I can see how putting it just on head would be a nice tradeoff (it is easily accessible, but it is sort of "lazily" evaluated instead of doing it all the time since it can be costly and/or inaccurate). Definitely true that actually having accurate data in a timely fashion here is rather tricky also.

@olsonpm
Copy link

olsonpm commented May 18, 2016

Just wanted to leave my two cents as this well-thought discussion earned high google rankings. This also builds on the sort/pagination discussion in issue 81

First I want to get out of the way that heroku added sorting in a non-spec compliant way.

The platform API’s Range headers now support descending order by appending a ; order=desc

Spaces aren't actually allowed in the spec:

 Range = byte-ranges-specifier / other-ranges-specifier
 other-ranges-specifier = other-range-unit "=" other-range-set
 other-range-set = 1*VCHAR

And VCHAR defined as:

VCHAR = %x21-7E
visible (printing) characters

%x20 is the space character. Whether this is a practical concern is obviously another story, but I thought it's worth mentioning.

So the problem of downloading large, sorted datasets would benefit with http's ability to cache partially retrieved results, but after racking my brain against it yesterday I realized the current spec (even *bis version) doesn't support sorting in the way we want to make it fit with the range header. This leaves us the options of a query parameter or a custom header - I'm opting for the latter.

As for the problem of pagination, I'm not following. This key-ed approach seems to conflict with numeric pagination and I think the two would be best left separate. Keys also limit you to a single ascending unique column which is a serious consideration. Finally as @bjeanes alluded to, understanding the size of what you're requesting is not possible with keys. Say someone requests a..z and the server is configured to respond with 403 - Requested range too large due to some limit. In the case of a record approach you are given the opportunity to append: A maximum range of 200 records is allowed. I cannot think of a reasonable way to respond if a..z results in too many records.

The linked article describing the issue with offset pagination was good, but I believe using keys is a niche solution.

Edited: Corrected invalid points

@geemus
Copy link
Member

geemus commented May 19, 2016

.. vs - is not spec compliant, but is less ambiguous for our usage. I'm less sure I would agree about the issue of spaces. I think the ; should delineate between the range-unit portion of the header and optional parameters, which though not mentioned in the range specific portion of the spec is well defined in the more general header definition.

Could you give a more concrete example around partial caching to help me ensure I understand the issues and/or options?

You also are not really able to know how many results you will get with offset (at least in the case of the final "page", which could be any size). I don't think that is inherently problematic, in our usage you are allowed to specify a maximum that you would accept, but the implication is that you will take as many as you can get as long as it does not exceed this. So if you ask for max of billions of records, we would instead just return whatever we felt was a reasonable/acceptable max (say 1000 for example) and not explicitly error. I think this is not that disimilar from the case with offset/page based mechanisms in which you for instance ask for 1000 in a place where there is only a single record (or a small number of records).

There are certainly issues around this usage that we have found, but there are also issues with offset based. So although offset has more widespread usage, I'm not convinced one is inherently better or worse (yet anyway). They both have strengths and weaknesses depending on use case and dataset.

@olsonpm
Copy link

olsonpm commented May 20, 2016

Thanks much for pointing me to the general header section about the spaces. You're right. About not following the spec in other ways, would you mind explaining why you opted for that route as opposed to using a custom header?

Some context: I'm not a heroku consumer and only joined in because I was deciding whether and how to use the range header to get sorted rows from a database. It mapped fairly well to your tackling of pagination. My partial caching comment was only in reference to the line in rfc7233 section-3.1: Range supports efficient recovery from partially failed transfers and partial retrieval of large representations.

Regarding the number of results, surely with the offset, limit, and total size you can estimate the request size for any page including the last. I must not be understanding your point/intent, and it's also likely not important.

the implication is that you will take as many as you can get as long as it does not exceed this

This helps my understanding your design, however I'm now confused since my interpretation of the range header was that the client can expect to retrieve the entire range requested - unless the entire range is shorter than what was requested. Looking around the web I'm not finding whether a 206 with a subset of the range requested is valid. It just seems weird to declare a range satisfiable, yet not satisfy that range. Regardless, if you're already choosing to not follow the spec via the range header then my confusion is misplaced. Your model is working well in a widely used production environment.

I appreciate your effort to go a different route and find the practical limitations of it - definitely moves everyone forward.

@geemus
Copy link
Member

geemus commented May 20, 2016

I think it could best be described by saying that it originally started as something that could be said to be quite faithful to the RFC, but as we encountered some practical limitations we made some small pragmatic changes to address them which drifted away from the original definition and usage. But we were far enough along at that point that totally moving to a different header seemed like it would cause it's own issue, so we simply continued on the path we had begun upon (for better or worse). If I had it all to do again, there would be some similarities to the current approach, but some differences as well (and I think it would probably end up being somewhat more like Range than it actually ended up being). Live and learn I suppose.

Yes, you should be able to estimate the request size for any page but the last, but at least in my experience you often don't know if you are on the last page or not until AFTER you request the page, which would lend me to think that all pages are somewhat unpredictable. Does that make more sense?

The final place we landed is certainly a bit weird/awkward, but it is that way because it succeeds in one of the original design tenants, which was that a range has a stable semantic meaning (even if it has unstable contents).

ie if you were using this for a literal dictionary, pagination/offset would give you stable results for the 3rd grouping of 100 words (something like the 3rd page in a real life dictionary), but this doesn't really tell you anything about what appears on that page. The way we have things here, on the other hand, is intended to more closely map to the direct/human experience of thinking about these pages. ie that the page can be represented by aardvark..antelope tells you something about the page which does not change (though the contents might). The question you are asking remains the same either way, so the idea is that 206 is accurate because you are getting everything you asked for. You are not specifying that you want precisely 100 records, you are simply saying you would accept anywhere in the range of 1-100 records. So I think the status code usage is still accurate, at least in spirit (or that is my hope).

Does that help? I've thought a little about how I might redo this given the chance, but haven't really worked through the exercise deeply enough to be confident presenting it.

@olsonpm
Copy link

olsonpm commented May 20, 2016

Ha - all that makes a lot of sense. And I know what you mean about thinking through such a large design decision, a ton of variables to comb through. Thanks for your time!

@geemus
Copy link
Member

geemus commented May 23, 2016

Here is a proposal I wrote a little while back that would be much closer to the original Range stuff: https://gist.github.com/geemus/8dfc01bd99367346c25891c915327220

@olsonpm
Copy link

olsonpm commented May 23, 2016

I appreciate the write-up. Would you mind expanding on:

[existing implementation]... does not support the desired level of transactionality

I just haven't thought of transactional issues and am very interested in the problems you came across.

As for your the rest - your simplified design differs slightly from mine mostly due to requirements. We also share the same unresolved question surrounding conditional requests haha. I'm considering that out of scope for the sake of moving forward. I'm currently implementing support for multiple ranges and sorting. I chose to support multiple ranges as motivation to learn how they work (very new to rest). Sorting will be done by a custom header and throws a 400 if incorrect sorting columns were requested. Nothing complicated there.

After I have an initial draft of my my sqlite-to-rest library, I'll post back with a gist link detailing my implementation for future interested readers and the code will be open source. I will keep that gist related to the range header handling.

Edit: Reworded/moved stuff around

@olsonpm
Copy link

olsonpm commented May 24, 2016

Gosh, on the train of multiple/compound ranges - I haven't really dug into http2 prior to today. I forgot that it removes the overhead associated with many connections, defeating any purpose behind multi-range requests. Not only that, but I was looking around to see if there exists a browser-compatible multipart response parser and didn't find any that were light-weight much less browser-compatible. It seems that in practice, multipart communication is consumed by servers (e.g. multipart/form-data).

There are quite a few reasons not to support multi-range requests after reading a lot about the subject today, the above two being the most compelling for my use-case. Not that you were ever seriously considering support for heroku's api, but in case you were, supporting http2 (should happen eventually) and advising your users to move toward it is a better alternative.

@geemus
Copy link
Member

geemus commented May 24, 2016

Very true, in general I look forward very much to http2 becoming more broadly available/usable/used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants