
Support for output="raw" #280

Merged
yjunechoe merged 5 commits into ropensci:main on Sep 25, 2024

Conversation

@yjunechoe (Collaborator)

This PR supports the option oa_fetch(output = "raw"), which returns the raw JSON string(s) of the body of the API response, as-is. This is done in two parts:

  1. The JSON parsing step in api_request() is now toggleable. The default is to parse the JSON into an R list, as before, but this can be turned off with api_request(parse = FALSE). When called from oa_fetch(output = "raw"), parsing is turned off. (A minimal sketch of the idea follows this list.)
  2. When output = "raw", the result returned by api_request() gets the same treatment as output = "list": oa_fetch() returns early, skipping further processing that assumes a particular structure in the result object.
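
For intuition, here is a minimal sketch of the two-part idea above; the function name and arguments are illustrative, not the actual openalexR internals:

api_request_sketch <- function(query_url, parse = TRUE) {
  resp <- httr::GET(query_url)
  body <- httr::content(resp, as = "text", encoding = "UTF-8")
  if (!parse) {
    return(body)  # output = "raw": hand back the JSON string as-is
  }
  jsonlite::fromJSON(body, simplifyVector = FALSE)  # default: parse into an R list
}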

The result of output = "raw" is a character vector of raw JSON strings. Its length depends on the implementation details of the query, namely the number of pages in the result.

output_raw <- oa_fetch(
  entity = "works",
  search = "language",
  per_page = 2,
  options = list(sample = 5, seed = 1),
  output = "raw"
)
sapply(output_raw, substr, 1, 60, USE.NAMES = FALSE)
#> [1] "{\"meta\":{\"count\":5,\"db_response_time_ms\":201,\"page\":1,\"per_p"
#> [2] "{\"meta\":{\"count\":5,\"db_response_time_ms\":103,\"page\":2,\"per_p"
#> [3] "{\"meta\":{\"count\":5,\"db_response_time_ms\":66,\"page\":3,\"per_pa"

Some tests of equivalence to other output formats:

  1. Equivalence to output = "list"

    output_list <- oa_fetch(
      entity = "works",
      search = "language",
      per_page = 2,
      options = list(sample = 5, seed = 1),
      output = "list"
    )
    
    raw2list <- function(x) {
      parsed <- lapply(x, jsonlite::fromJSON, simplifyVector = FALSE)
      results <- sapply(parsed, `[[`, "results")
      unlist(results, recursive = FALSE, use.names = FALSE)
    }
    
    waldo::compare(
      raw2list(output_raw), output_list,
      list_as_map = TRUE
    )
    #> ✔ No differences
  2. Equivalence to output = "tibble"

    output_tibble <- oa_fetch(
      entity = "works",
      search = "language",
      per_page = 2,
      options = list(sample = 5, seed = 1),
      output = "tibble"
    )
    identical(
      output_raw %>% raw2list() %>% works2df(),
      output_tibble
    )
    #> [1] TRUE

A power user now has more flexibility to intervene in the processing pipeline. For example, if they prefer to use another JSON parser:

raw2list_fast <- function(x) {
  parsed <- lapply(x, RcppSimdJson::fparse, max_simplify_lvl = 3L, empty_array = list())
  results <- lapply(parsed, `[[`, "results")
  unlist(results, recursive = FALSE, use.names = FALSE)
}
bench::mark(
  jsonlite = raw2list(output_raw),
  RcppSimdJson = raw2list_fast(output_raw)
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 jsonlite       1.71ms   1.95ms      502.    12.5KB     2.03
#> 2 RcppSimdJson  498.3µs  632.2µs     1421.    12.5KB    11.4

@rkrug commented Sep 25, 2024

Thanks a lot for this pull request. It is a useful enhancement towards opening the package to power users without making it more complex for most use cases.

The only issue I have is that oa_fetch(output = "raw", ...) still needs to hold all results in memory, which causes problems with larger numbers of works (millions) - which is not that uncommon, I would guess.

Therefore I would suggest that the output = "raw" option save the individual JSONs from each call/page into individual files in a directory. A power user would then have even more possibilities for working with these results (e.g. using duckdb for extraction, conversion, ...).

I do not think that the power user (whom that feature is aimed at) would see any problem in having individual files instead of an object.
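
Purely as an illustration of this suggestion (not something openalexR does; the directory and file names are arbitrary), a user could already persist each page returned by output = "raw" themselves:

dir.create("openalex_json", showWarnings = FALSE)
for (i in seq_along(output_raw)) {
  writeLines(output_raw[[i]], file.path("openalex_json", sprintf("page_%03d.json", i)))
}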

@rkrug commented Sep 25, 2024

Discussion from #276 continues here.

I see your point - but why not offer power users the option of getting the object with the JSON strings back or having them saved to disk? The additional effort would not be that big: save the JSONs after each page downloaded by api_request() when an argument raw_in_dir_only is set, i.e. not NULL. In that case, the return value could be the path of the JSON directory?

@rkrug commented Sep 25, 2024

Or introduce another output format - raw_in_file - which does exactly that: if no directory is specified, a temporary directory is used and its path returned.

@yjunechoe (Collaborator, Author) commented Sep 25, 2024

I see your point about memory, but to be honest I'm not really sure whether {openalexR} is the best place for such tail-end power user features. As an "R interface to OpenAlex", the package intends to cater to the audience of R users who want to get results from OpenAlex directly into their R session for the usual kinds of in-memory data wrangling. These are the features we advertise, and this scope is typical of most R API packages.

I think the problem is that, for a power user, whatever we support in openalexR will be too limiting. For example, as a power user myself (though I don't make large queries), if I were to regularly make such large queries to the database, I would rather do one of two things:

  1. Download snapshots of their entire database, updated at a monthly-or-so interval, store them in an S3 bucket, and query them directly with SQL. (This is what I did in the past with Microsoft Academic Graph, and it is the approach OpenAlex prefers, so as not to overload their servers.)

  2. Implement pagination myself with a callback such that, for each page, the JSON response is first read remotely into memory and the results are then immediately inserted into a local duckdb database. I'm not sure if you're already aware, but duckdb can read remote files with the httpfs extension, which avoids the need to download the JSON first (a rough sketch of such a loop follows the example below).

    library(duckdb)
    con <- dbConnect(duckdb())
    dbExecute(con, "INSTALL json; LOAD json;")
    dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
    tbl <- dbGetQuery(con, "
      SELECT
        *
      FROM
        read_ndjson('https://api.openalex.org/works/W2755950973', ignore_errors=true)
    ")
    tibble::as_tibble(tbl)
    #> # A tibble: 1 × 50
    #>   id     doi   title display_name publication_year publication_date ids$openalex
    #>   <chr>  <chr> <chr> <chr>                   <dbl> <date>           <chr>       
    #> 1 https… http… bibl… bibliometri…             2017 2017-09-12       https://ope…
    #> # ℹ 45 more variables: ids$doi <chr>, $mag <chr>, language <chr>,
    #> #   primary_location <df[,9]>, type <chr>, type_crossref <chr>,
    #> #   indexed_in <list>, open_access <df[,4]>, authorships <list>,
    #> #   institution_assertions <list>, countries_distinct_count <dbl>,
    #> #   institutions_distinct_count <dbl>, corresponding_author_ids <list>,
    #> #   corresponding_institution_ids <list>, apc_list <df[,4]>, apc_paid <chr>,
    #> #   fwci <dbl>, has_fulltext <lgl>, fulltext_origin <chr>, …
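
For concreteness, here is a rough sketch of the paginate-and-insert loop from point 2, written directly against the public API (this is not openalexR functionality; the database file name, the columns kept, and the page cap are arbitrary choices for illustration):

library(duckdb)   # attaches DBI
library(jsonlite)

con <- dbConnect(duckdb("works.duckdb"))
base <- "https://api.openalex.org/works?search=language&per-page=200&cursor="
cursor <- "*"
pages <- 0L

while (!is.null(cursor) && pages < 5L) {      # cap at 5 pages for the sketch
  page <- fromJSON(paste0(base, cursor), simplifyVector = TRUE)
  if (length(page$results) > 0) {
    chunk <- page$results[, c("id", "doi", "title")]  # keep a few flat columns
    if (dbExistsTable(con, "works")) {
      dbAppendTable(con, "works", chunk)
    } else {
      dbWriteTable(con, "works", chunk)
    }
  }
  cursor <- page$meta$next_cursor             # NULL after the last page
  pages <- pages + 1L
  Sys.sleep(0.2)                              # be polite to the API
}

dbDisconnect(con, shutdown = TRUE)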

I would also like to clarify that this PR isn't really "aiming" to win over power users - I just think output="raw" is a logical extension of output="list" that will incidentally be particularly useful for the power users who want to do more at that level of the data. But I don't think it's helpful to understand the development of openalexR, specifically, as driven by the needs of power users. I don't have any problem with the approach you outline (in fact I think it's great, and I'm glad that it works for your use case), but on principle I'm hesitant to bake each power user's preferred workflow into the package, especially when we're not really pressed for this (which may change as the package evolves and the users mature! - but I don't feel the need quite yet).

@yjunechoe (Collaborator, Author)

On point 1, the entirety of the Works objects in OpenAlex is only 380 GB :) (the complete snapshot is 420 GB). If you post-process the file fragments into a single database yourself, it'll get much smaller.

> aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/data/works"
...
Total Objects: 735
   Total Size: 380.2 GiB

@rkrug commented Sep 25, 2024

Thanks for your elaboration. I want to describe my usage scenario:

I want to do a full-text search on title and abstract for a complex search term, which I can do very easily via openalexR. But it returns between 4.5 and 5 million works. This does not work with openalexR, and it will (I assume) also not work with this pull request. But I really like the functionality and syntax of openalexR for fetching the data. OK - I could write my own add-on package, which takes the query returned from oa_query(), implements paging, i.e. duplicates what is already in oa_request(), and only adds that each page is saved and not processed - but I see this as unnecessary duplication of functionality.

Also, a data snapshot would be perfect if I could do full-text search. I have no idea how to set up Elasticsearch, and duckdb simply gives up on full-text search over 250 million records.

I agree that the main audience of openalexR is not somebody who downloads millions of records, but I think it would be nice if one could just change an argument and use the same, easy-to-use syntax of openalexR for that task. The result, a directory of JSON files, could be processed further by a package that builds on this functionality, i.e. outside your maintenance, while improving the usefulness for possibly even normal users. Such a directory can also be used as direct input to VOSviewer.

@trangdata (Collaborator) left a comment

June, incredible work! 🚀 Thank you so much for the thoughtful implementation and clear examples + tests.

@yjunechoe (Collaborator, Author) commented Sep 25, 2024

Thanks @trangdata - I'll plan to merge this!

And thanks @rkrug for the discussion here. For now, my one last thought on this:

Setting aside the question of whether the ability to download json in real time should be within the scope of openalexR, I feel obligated to sincerely urge power users to pursue the option of working with their own data snapshot. I sympathize that setting this up is complicated, but it seems worth the investment and it's simply the correct and polite approach if you're in the business of making large queries.

On your specific use case: from my experience working with such data snapshots, once you have one set up for yourself, regex matching on title and abstract is extremely simple and fast as far as database queries go. FWIW, duckdb is an analytical database and not the best tool for the job - you'd want to work with transactional databases like PostgreSQL or BigQuery (as OpenAlex also recommends). I'm sure OpenAlex would appreciate your input if you, as an experienced user, come across any trouble in the setup and can give feedback on their documentation for the data snapshot.
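
Purely as an illustration of that kind of query (the connection details, table, and column names are assumed; in the actual snapshot the abstract is stored as an inverted index, so a plain abstract column would need some preprocessing first):

library(DBI)
con <- dbConnect(RPostgres::Postgres(), dbname = "openalex")
hits <- dbGetQuery(con, "
  SELECT id, title
  FROM works
  WHERE title ~* 'language' OR abstract ~* 'language'  -- case-insensitive regex
")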

I get that convenience is important, but I'd hate for {openalexR} to overstep the bounds here as a simple wrapper package.

@trangdata (Collaborator)

Hi Rainer, we have not been supportive of including file-writing functionality in openalexR (recall our discussions on csv, rds, and now json), and there are many reasons for this decision. One key rationale is that the core purpose of the package is to fetch data from OpenAlex, not to handle file storage. Mixing data retrieval with file system operations violates the principle of separation of concerns, making the function less focused, more complex, and much less maintainable.

I want to reiterate June's point (which he has made many times) that, like most open-source projects, openalexR focuses on serving the average user. We appreciate your feedback, but ultimately we have to decide whether to incorporate a change while considering multiple factors. You are free to make a fork of the project and modify it to work with your niche use case, which you have done.

I know you tried oa_generate once and gave me some valuable feedback. Perhaps you can try it again now that we have resolved its issue?

@trangdata (Collaborator)

I feel obligated to sincerely urge power users to pursue the option of working with their own data snapshot. I sympathize that setting this up is complicated, but it seems worth the investment and it's simply the correct and polite approach if you're in the business of making large queries.

Great point, June. Working with data snapshots would put a lot less stress on the API given the very large number of queries in Rainer's use case.

@rkrug commented Sep 25, 2024

Data snapshot: I agree that it should be done when regular requests and regular downloads of the data are involved - but setting it up to run one query, which is then analysed for a few months and possibly never updated again, is pure overkill.

Cheers,
Rainer

@rkrug commented Sep 25, 2024

I know you tried oa_generate once and gave me some valuable feedback. Perhaps you can try it again now that we have resolved its issue?

Sorry - this seems like overkill to me and adds too many complications.

@yjunechoe (Collaborator, Author)

Data snapshot: I agree that it should be done when regular requests and regular downloads of the data is done - but setting it up for running one query which is then analysed for a few months and possibly not updated anymore is pure overkill.

I agree, and I truly empathize with you that that's a genuinely difficult problem to solve! I really do! But at the end of the day I feel like {openalexR} is not the right place to come to find an answer for this. To be honest, I think that use case is also "pure overkill" for {openalexR}. We are merely a third-party package for general-purpose use cases supporting iterative, in-memory, tabular data analysis workflows designed for R users: simply, the wrong tool for the job. It seems to me that the real challenge here is about downloading sizeable chunks of the database through the API (and not about getting the raw, unparsed JSON back, which was the starting point of this discussion). That's no longer an implementation problem (I agree that it's easy) but a problem where we, both maintainers and users, have to tread carefully with respect to the boundaries of third-party packages, the responsibilities of open-source maintainers, the role of OpenAlex as the provider of this free service, etc. You're always free to do anything as an individual, of course, but I hope you understand our cautiousness and hesitance here as maintainers.

@yjunechoe merged commit c3f64e5 into ropensci:main on Sep 25, 2024
7 checks passed
@yjunechoe deleted the output-raw branch on Sep 25, 2024 15:04