Support for output="raw" #280
Conversation
Thanks a lot for this pull request. It is a useful enhancement towards opening the package to power users without making it less complex for most use cases. The only issue I have is that […]. Therefore I would suggest that the […]. I do not think that the power user (whom that feature is aimed at) would see any problem in having individual files instead of an object.
Discussion from #276 continues here. I see your point - but why not offer power users the option, if they want the object with the JSON strings returned or saved to disk? The additional effort would not be that big (saving the JSONs after each page is downloaded by […]).
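(As a rough sketch of the individual-files idea, written entirely outside the package - the file names and example query are hypothetical, and the PR itself never writes to disk:)

```r
library(openalexR)

# Fetch the raw JSON pages, then write each page to its own file.
# This still holds all pages in memory first, so it illustrates the
# one-file-per-page output, not streaming to disk during download.
raw <- oa_fetch(entity = "works", search = "bibliometrics", output = "raw")

for (i in seq_along(raw)) {
  writeLines(raw[[i]], sprintf("works_page_%03d.json", i))
}
```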
Or introduce another output format: […].
I see your point about memory, but to be honest I'm not really sure whether […]. I think the problem is that for a power user, whatever we support in openalexR will be too limiting. For example, as a power user myself (though I don't make large queries), if I were to regularly make such large queries to the database, I would rather do one of two things: […]
I would also like to clarify that this PR isn't really "aiming" to win over power users - I just think […].
On point 1, the entirety of Works objects in OpenAlex is only 380GB :) (the complete snapshot is 420GB). If you post-process the file fragments into a single database yourself it'll get much smaller.
Thanks for your elaboration. I want to describe my usage scenario: I want to do a full-text search on title and abstract for a complex search term, which I can do very easily via […]. Also, a data snapshot would be perfect if I could do full-text search. I have no idea how to set up Elasticsearch, and duckdb simply gives up on full-text search over 250 million records. I agree that the main audience of the […].
June, incredible work! 🚀 Thank you so much for the thoughtful implementation and clear examples + tests.
Thanks @trangdata - I'll plan to merge this! And thanks @rkrug for the discussion here. For now, my one last thought on this: setting aside the question of whether the ability to download JSON in real time should be within the scope of […].

On your specific use case: from my experience working with such data snapshots, once you have one set up for yourself, regex matching on title and abstract is extremely simple and fast as far as database queries go. FWIW, duckdb is an analytical database and not the best tool for the job - you'd want to work with transactional databases like PostgreSQL or BigQuery (as OpenAlex also recommends). I'm sure OpenAlex would appreciate your input if you, as an experienced user, come across any trouble in the setup and can give feedback on their documentation for the data snapshot. I get that convenience is important, but I'd hate for […].
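As a sketch of what such a query could look like from R (table and column names are hypothetical placeholders for a snapshot loaded into PostgreSQL; the OpenAlex snapshot actually stores abstracts as inverted indexes, so a real setup needs a post-processing step):

```r
library(DBI)

# Connect to a local PostgreSQL database holding the snapshot
# (connection details are placeholders).
con <- dbConnect(RPostgres::Postgres(), dbname = "openalex")

# Case-insensitive regex match (~*) over title and abstract.
hits <- dbGetQuery(con, "
  SELECT id, title
  FROM works
  WHERE title ~* 'machine learning|bibliometrics'
     OR abstract ~* 'machine learning|bibliometrics'
")

dbDisconnect(con)
```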
Hi Rainer, we have not been supportive of including file-writing functionality in openalexR (recall our discussions on csv, rds, and now json), and there are many reasons for this decision. One key rationale is that the core purpose of the package is to fetch data from OpenAlex, not to handle file storage. Mixing data retrieval with file-system operations violates the principle of separation of concerns, making the function less focused, more complex, and much less maintainable. I want to reiterate June's point (which he has made many times) that, like most open-source projects, openalexR focuses on serving the average user. We appreciate your feedback, but ultimately we have to decide whether to incorporate a change while considering multiple factors. You are free to fork the project and modify it to work with your niche use case, which you have done. I know you tried […].
Great point, June. Working with data snapshots would put a lot less stress on the API, given the very large number of queries in Rainer's use case.
Data snapshot: I agree that it makes sense when regular requests and regular downloads of the data are involved - but setting it up to run one query, which is then analysed for a few months and possibly never updated again, is pure overkill. Cheers,
Sorry - this seems like overkill to me and adds too many complications.
I agree, and I truly empathize with you that that's a genuinely difficult problem to solve! I really do! But at the end of the day I feel like […].
This PR supports the option of `oa_fetch(output = "raw")` to return the raw JSON string(s) of the body of the API response, as-is. This is done in two parts:

- `api_request()` is now toggle-able. The default is to parse the JSON into an R list, as before, but parsing can be turned off with `api_request(parse = FALSE)`. When called from `oa_fetch(output = "raw")`, parsing is turned off.
- With `output = "raw"`, the result as returned by `api_request()` gets the same treatment as `output = "list"`: returning early from `oa_fetch()`, avoiding further processing which assumes some structure in the result object.

The result of `output = "raw"` is a character vector of raw JSON strings. The length depends on the implementation details of the query, namely the number of pages in the result.

Some tests of equivalence to other output formats:
Equivalence to `output = "list"`:
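A minimal sketch of such a check (not the PR's actual test code, and the query is arbitrary). Each element of the raw output is one page of the API response, whose `results` field should pool to the records in the list output:

```r
library(openalexR)

raw <- oa_fetch(entity = "works", search = "bibliometrics", output = "raw")
lst <- oa_fetch(entity = "works", search = "bibliometrics", output = "list")

# Parse each raw page without simplification and pool the records.
from_raw <- unlist(
  lapply(raw, function(p) jsonlite::fromJSON(p, simplifyVector = FALSE)$results),
  recursive = FALSE
)

length(from_raw) == length(lst)
```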
Equivalence to `output = "tibble"`:
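And similarly for the tibble output (again a sketch, reusing `from_raw` from above and converting with openalexR's `oa2df()`):

```r
tib <- oa_fetch(entity = "works", search = "bibliometrics", output = "tibble")

# Rebuild the data frame from the records parsed out of the raw pages.
tib_from_raw <- oa2df(from_raw, entity = "works")

all.equal(dim(tib), dim(tib_from_raw))
```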
A power user now has more flexibility to intervene in the processing pipeline. For example, if they prefer to use another JSON parser:
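For instance (a sketch - yyjsonr is just one example of an alternative parser, and the query is arbitrary):

```r
library(openalexR)

raw <- oa_fetch(entity = "works", search = "bibliometrics", output = "raw")

# Swap in a different JSON parser for each page of raw text.
pages <- lapply(raw, yyjsonr::read_json_str)

# Each page carries the usual OpenAlex envelope, with records under $results.
names(pages[[1]])
```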