
Support for output="raw" #280

Merged
yjunechoe merged 5 commits into ropensci:main on Sep 25, 2024

Conversation

@yjunechoe (Collaborator)

This PR supports the option oa_fetch(output = "raw"), which returns the raw JSON string(s) of the body of the API response, as-is. This is done in two parts:

  1. The JSON parsing step in api_request() is now toggleable. The default is to parse the JSON into an R list, as before, but this can be turned off with api_request(parse = FALSE). When called from oa_fetch(output = "raw"), parsing is turned off. (A minimal sketch of the idea follows this list.)
  2. When output = "raw", the result returned by api_request() gets the same treatment as output = "list": oa_fetch() returns early, skipping further processing that assumes a particular structure in the result object.
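
For intuition, here is a minimal sketch of the two-part idea above; the function name and arguments are illustrative, not the actual openalexR internals:

api_request_sketch <- function(query_url, parse = TRUE) {
  resp <- httr::GET(query_url)
  body <- httr::content(resp, as = "text", encoding = "UTF-8")
  if (!parse) {
    return(body)  # output = "raw": hand back the JSON string as-is
  }
  jsonlite::fromJSON(body, simplifyVector = FALSE)  # default: parse into an R list
}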

The result of output = "raw" is a character vector of raw JSON strings. Its length depends on the implementation details of the query, namely the number of pages in the result.

output_raw <- oa_fetch(
  entity = "works",
  search = "language",
  per_page = 2,
  options = list(sample = 5, seed = 1),
  output = "raw"
)
sapply(output_raw, substr, 1, 60, USE.NAMES = FALSE)
#> [1] "{\"meta\":{\"count\":5,\"db_response_time_ms\":201,\"page\":1,\"per_p"
#> [2] "{\"meta\":{\"count\":5,\"db_response_time_ms\":103,\"page\":2,\"per_p"
#> [3] "{\"meta\":{\"count\":5,\"db_response_time_ms\":66,\"page\":3,\"per_pa"

Some tests of equivalence to other output formats:

  1. Equivalence to output = "list"

    output_list <- oa_fetch(
      entity = "works",
      search = "language",
      per_page = 2,
      options = list(sample = 5, seed = 1),
      output = "list"
    )
    
    raw2list <- function(x) {
      parsed <- lapply(x, jsonlite::fromJSON, simplifyVector = FALSE)
      results <- sapply(parsed, `[[`, "results")
      unlist(results, recursive = FALSE, use.names = FALSE)
    }
    
    waldo::compare(
      raw2list(output_raw), output_list,
      list_as_map = TRUE
    )
    #> ✔ No differences
  2. Equivalence to output = "tibble"

    output_tibble <- oa_fetch(
      entity = "works",
      search = "language",
      per_page = 2,
      options = list(sample = 5, seed = 1),
      output = "tibble"
    )
    identical(
      output_raw %>% raw2list() %>% works2df(),
      output_tibble
    )
    #> [1] TRUE

A power user now has more flexibility to intervene in the processing pipeline. For example, if they prefer to use another JSON parser:

raw2list_fast <- function(x) {
  parsed <- lapply(x, RcppSimdJson::fparse, max_simplify_lvl = 3L, empty_array = list())
  results <- lapply(parsed, `[[`, "results")
  unlist(results, recursive = FALSE, use.names = FALSE)
}
bench::mark(
  jsonlite = raw2list(output_raw),
  RcppSimdJson = raw2list_fast(output_raw)
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 jsonlite       1.71ms   1.95ms      502.    12.5KB     2.03
#> 2 RcppSimdJson  498.3µs  632.2µs     1421.    12.5KB    11.4

@rkrug commented Sep 25, 2024

Thanks a lot for this pull request. It is a useful enhancement towards opening the package to power users without making it more complex for most use cases.

The only issue I have is that oa_fetch(output = "raw", ...) still needs to hold all results in memory, which causes problems with larger numbers of works (millions) - which is not that uncommon, I would guess.

Therefore I would suggest that the output = "raw" option save the individual JSONs from each call/page into individual files in a directory. A power user would then have even more possibilities for working with these results (e.g. using duckdb for extraction, conversion, ...).

I do not think that the power user (whom that feature is aimed at) would see any problem in having individual files instead of an object.
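
Purely as an illustration of this suggestion (not something openalexR does; the directory and file names are arbitrary), a user could already persist each page returned by output = "raw" themselves:

dir.create("openalex_json", showWarnings = FALSE)
for (i in seq_along(output_raw)) {
  writeLines(output_raw[[i]], file.path("openalex_json", sprintf("page_%03d.json", i)))
}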

@rkrug commented Sep 25, 2024

Discussion from #276 continues here.

I see your point - but why not offer power users the option of getting the object with the JSON strings back or having them saved to disk? The additional effort would not be that big: save the JSONs after each page downloaded by api_request() when an argument raw_in_dir_only is set, i.e. not NULL. In that case, the return value could be the path of the JSON directory?

@rkrug commented Sep 25, 2024

Or introduce another output format - raw_in_file - which does exactly that: if no directory is specified, a temporary directory is used and its path returned.

@yjunechoe (Collaborator, Author) commented Sep 25, 2024

I see your point about memory, but to be honest I'm not really sure whether {openalexR} is the best place for such tail-end power user features. As an "R interface to OpenAlex", the package intends to cater to the audience of R users who want to get results from OpenAlex directly into their R session for the usual kinds of in-memory data wrangling. These are the features we advertise, and this scope is typical of most R API packages.

I think the problem is that, for a power user, whatever we support in openalexR will be too limiting. For example, as a power user myself (though I don't make large queries), if I were to regularly make such large queries to the database, I would rather do one of two things:

  1. Download snapshots of their entire database, updated at a monthly-or-so interval, store them in an S3 bucket, and query them directly with SQL. (This is what I did in the past with Microsoft Academic Graph, and it is the approach OpenAlex prefers, so as not to overload their servers.)

  2. Implement pagination myself with a callback such that, for each page, the JSON response is first read remotely into memory and the results are then immediately inserted into a local duckdb database. I'm not sure if you're already aware, but duckdb can read remote files with the httpfs extension, which avoids the need to download the JSON first (a rough sketch of such a loop follows the example below).

    library(duckdb)
    con <- dbConnect(duckdb())
    dbExecute(con, "INSTALL json; LOAD json;")
    dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
    tbl <- dbGetQuery(con, "
      SELECT
        *
      FROM
        read_ndjson('https://api.openalex.org/works/W2755950973', ignore_errors=true)
    ")
    tibble::as_tibble(tbl)
    #> # A tibble: 1 × 50
    #>   id     doi   title display_name publication_year publication_date ids$openalex
    #>   <chr>  <chr> <chr> <chr>                   <dbl> <date>           <chr>       
    #> 1 https… http… bibl… bibliometri…             2017 2017-09-12       https://ope…
    #> # ℹ 45 more variables: ids$doi <chr>, $mag <chr>, language <chr>,
    #> #   primary_location <df[,9]>, type <chr>, type_crossref <chr>,
    #> #   indexed_in <list>, open_access <df[,4]>, authorships <list>,
    #> #   institution_assertions <list>, countries_distinct_count <dbl>,
    #> #   institutions_distinct_count <dbl>, corresponding_author_ids <list>,
    #> #   corresponding_institution_ids <list>, apc_list <df[,4]>, apc_paid <chr>,
    #> #   fwci <dbl>, has_fulltext <lgl>, fulltext_origin <chr>, …
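
For concreteness, here is a rough sketch of the paginate-and-insert loop from point 2, written directly against the public API (this is not openalexR functionality; the database file name, the columns kept, and the page cap are arbitrary choices for illustration):

library(duckdb)   # attaches DBI
library(jsonlite)

con <- dbConnect(duckdb("works.duckdb"))
base <- "https://api.openalex.org/works?search=language&per-page=200&cursor="
cursor <- "*"
pages <- 0L

while (!is.null(cursor) && pages < 5L) {      # cap at 5 pages for the sketch
  page <- fromJSON(paste0(base, cursor), simplifyVector = TRUE)
  if (length(page$results) > 0) {
    chunk <- page$results[, c("id", "doi", "title")]  # keep a few flat columns
    if (dbExistsTable(con, "works")) {
      dbAppendTable(con, "works", chunk)
    } else {
      dbWriteTable(con, "works", chunk)
    }
  }
  cursor <- page$meta$next_cursor             # NULL after the last page
  pages <- pages + 1L
  Sys.sleep(0.2)                              # be polite to the API
}

dbDisconnect(con, shutdown = TRUE)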

I would also like to clarify that this PR isn't really "aiming" to win over power users - I just think output="raw" is a logical extension of output="list" that will incidentally be particularly useful for the power users who want to do more at that level of the data. But I don't think it's helpful to understand the development of openalexR, specifically, as driven by the needs of power users. I don't have any problem with the approach you outline (in fact I think it's great, and I'm glad that it works for your use case), but on principle I'm hesitant to bake each power user's preferred workflow into the package, especially when we're not really pressed for this (which may change as the package evolves and the users mature! - but I don't feel the need quite yet).

@yjunechoe (Collaborator, Author)

On point 1, the entirety of the Works objects in OpenAlex is only 380 GB :) (the complete snapshot is 420 GB). If you post-process the file fragments into a single database yourself, it'll get much smaller.

> aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/data/works"
...
Total Objects: 735
   Total Size: 380.2 GiB

@rkrug commented Sep 25, 2024

Thanks for your elaboration. I want to describe my usage scenario:

I want to do a full-text search on title and abstract for a complex search term, which I can do very easily via openalexR. But it returns between 4.5 and 5 million works. This does not work with openalexR, and it will (I assume) also not work with this pull request. But I really like the functionality and syntax of openalexR for fetching the data. OK - I could write my own add-on package, which takes the query returned from oa_query(), implements paging, i.e. duplicates what is already in oa_request(), and only adds that each page is saved and not processed - but I see this as unnecessary duplication of functionality.

Also, a data snapshot would be perfect if I could do full-text search. I have no idea how to set up Elasticsearch, and duckdb simply gives up on full-text search over 250 million records.

I agree that the main audience of openalexR is not somebody who downloads millions of records, but I think it would be nice if one could just change an argument and use the same, easy-to-use syntax of openalexR for that task. The result, a directory of JSON files, could be processed further by a package that builds on this functionality, i.e. outside your maintenance, while improving the usefulness for possibly even normal users. Such a directory can also be used as direct input to VOSviewer.

@trangdata (Collaborator) left a comment

June, incredible work! 🚀 Thank you so much for the thoughtful implementation and clear examples + tests.

@yjunechoe (Collaborator, Author) commented Sep 25, 2024

Thanks @trangdata - I'll plan to merge this!

And thanks @rkrug for the discussion here. For now, my one last thought on this:

Setting aside the question of whether the ability to download json in real time should be within the scope of openalexR, I feel obligated to sincerely urge power users to pursue the option of working with their own data snapshot. I sympathize that setting this up is complicated, but it seems worth the investment and it's simply the correct and polite approach if you're in the business of making large queries.

On your specific use case: from my experience working with such data snapshots, once you have one set up for yourself, regex matching on title and abstract is extremely simple and fast as far as database queries go. FWIW, duckdb is an analytical database and not the best tool for the job - you'd want to work with transactional databases like PostgreSQL or BigQuery (as OpenAlex also recommends). I'm sure OpenAlex would appreciate your input if you, as an experienced user, come across any trouble in the setup and can give feedback on their documentation for the data snapshot.
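
Purely as an illustration of that kind of query (the connection details, table, and column names are assumed; in the actual snapshot the abstract is stored as an inverted index, so a plain abstract column would need some preprocessing first):

library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "openalex")
hits <- dbGetQuery(con, "
  SELECT id, title
  FROM works
  WHERE title ~* 'language' OR abstract ~* 'language'  -- case-insensitive regex
")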

I get that convenience is important, but I'd hate for {openalexR} to overstep the bounds here as a simple wrapper package.

@trangdata (Collaborator)

Hi Rainer, we have not been supportive of including file-writing functionality in openalexR (recall our discussions on csv, rds, and now json), and there are many reasons for this decision. One key rationale is that the core purpose of the package is to fetch data from OpenAlex, not to handle file storage. Mixing data retrieval with file system operations violates the principle of separation of concerns, making the function less focused, more complex, and much less maintainable.

I want to reiterate June's point (which he has made many times) that, like most open-source projects, openalexR focuses on serving the average user. We appreciate your feedback, but ultimately we have to decide whether to incorporate a change while considering multiple factors. You are free to make a fork of the project and modify it to work with your niche use case, which you have done.

I know you tried oa_generate once and gave me some valuable feedback. Perhaps you can try it again now that we have resolved its issue?

@trangdata (Collaborator)

I feel obligated to sincerely urge power users to pursue the option of working with their own data snapshot. I sympathize that setting this up is complicated, but it seems worth the investment and it's simply the correct and polite approach if you're in the business of making large queries.

Great point, June. Working with data snapshots would put a lot less stress on the API given the very large number of queries in Rainer's use case.

@rkrug commented Sep 25, 2024

Data snapshot: I agree that it should be done when regular requests and regular downloads of the data are involved - but setting it up to run one query, which is then analysed for a few months and possibly never updated again, is pure overkill.

Cheers,
Rainer

@rkrug commented Sep 25, 2024

I know you tried oa_generate once and gave me some valuable feedback. Perhaps you can try it again now that we have resolved its issue?

Sorry - this seems like overkill to me and adds too many complications.

@yjunechoe (Collaborator, Author)

Data snapshot: I agree that it should be done when regular requests and regular downloads of the data is done - but setting it up for running one query which is then analysed for a few months and possibly not updated anymore is pure overkill.

I agree, and I truly empathize with you that that's a genuinely difficult problem to solve! I really do! But at the end of the day I feel like {openalexR} is not the right place to come to find an answer for this. To be honest, I think that use case is also "pure overkill" for {openalexR}. We are merely a third-party package for general-purpose use cases supporting iterative, in-memory, tabular data analysis workflows designed for R users: simply, the wrong tool for the job. It seems to me that the real challenge here is about downloading sizeable chunks of the database through the API (and not about getting the raw, unparsed JSON back, which was the starting point of this discussion). That's no longer an implementation problem (I agree that it's easy) but a problem where we, both maintainers and users, have to tread carefully with respect to the boundaries of third-party packages, the responsibilities of open-source maintainers, the role of OpenAlex as the provider of this free service, etc. You're always free to do anything as an individual, of course, but I hope you understand our cautiousness and hesitance here as maintainers.

@yjunechoe merged commit c3f64e5 into ropensci:main on Sep 25, 2024
7 checks passed
@yjunechoe deleted the output-raw branch on Sep 25, 2024 15:04