Analyze the use of Solr for file searches in the context of the Files API extension and define an action plan for its use #9813

GPortas · 2023-08-22T08:05:38Z

Overview of the Feature Request

For the analysis we must consider certain aspects mentioned during the frontend weekly meeting for devs:

Solr grammar and other features that we would lose by not using Solr search in the extended Files API
Current complexity of the DatasetPage logic for Solr and database queries. *
Lack of Solr indexing for past Dataset versions
The use of Solr must be opaque to the consumer applications. It should be treated as an implementation detail. The API will receive a searchText regardless of the underlying implementation. Then the backend will decide whether to use Solr or not.
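As a rough illustration of the last point (a sketch only, not Dataverse code), the consumer passes a search text and the backend picks the implementation. Every name below (`make_file_search`, `db_search`, `solr_search`, `is_indexed`) is hypothetical and introduced only for this example.

```python
from typing import Callable, List, Optional

# Hypothetical signature: (dataset_version_id, search_text) -> matching file ids.
SearchFn = Callable[[int, str], List[int]]


def make_file_search(
    db_search: SearchFn,
    solr_search: Optional[SearchFn] = None,
    is_indexed: Callable[[int], bool] = lambda _version_id: False,
) -> SearchFn:
    """Build a search function that hides the backend choice from the caller."""

    def search(version_id: int, search_text: str) -> List[int]:
        # Use Solr only when it is configured and this version is indexed;
        # otherwise fall back to the database.
        if solr_search is not None and is_indexed(version_id):
            return solr_search(version_id, search_text)
        return db_search(version_id, search_text)

    return search
```

The caller only ever sees `search(version_id, search_text)`, so switching between DB and Solr would not change the API contract.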

* Current behavior (JSF - DatasetPage.java): Solr is queried for the ids that meet the search criteria; the DB is then queried for all the file metadata of the dataset, and the code iterates over it to build the list of entries whose ids were returned by Solr.
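The pattern described in this note can be sketched as follows. This is a simplified illustration, not the actual DatasetPage.java logic; the function name and the dictionary shape of the file metadata are assumptions made for the example.

```python
from typing import Dict, Iterable, List


def filter_by_solr_ids(
    solr_matching_ids: Iterable[int],
    all_file_metadata: List[Dict],
) -> List[Dict]:
    """Keep only the file metadata entries whose id was returned by Solr.

    Note that every entry of the dataset is loaded and visited, even when
    Solr matched only a handful of files.
    """
    matching = set(solr_matching_ids)
    return [fm for fm in all_file_metadata if fm["id"] in matching]


# Hypothetical data standing in for a Solr response and a DB result set.
files = [
    {"id": 1, "label": "README.txt"},
    {"id": 2, "label": "data.csv"},
    {"id": 3, "label": "codebook.pdf"},
]
matches = filter_by_solr_ids([3, 1], files)
```

The result keeps the order of the database list, not the order in which Solr returned the ids.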

We should evaluate different implementation options:

Hybrid: Solr (When available) + DB. Current implementation but exposed through the API.
Optimized Hybrid: New DB queries have been created in recent API extension issues, where database results are directly filtered through DB queries without additional Java code. The new queries can be combined with a Solr search query in a new, more efficient hybrid model.
Just DB (We lose Solr features, but gain code simplicity and possibly performance - This should be evaluated from a product / UX perspective)
Just Solr (Read issue comments below)
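To make the difference between the "Optimized Hybrid" and "Just DB" options concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names, `find_files`, and `make_demo_db` are assumptions for illustration only; they are not the queries added in the API extension issues.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table standing in for file metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filemetadata (id INTEGER PRIMARY KEY, "
        "datasetversion_id INTEGER, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO filemetadata VALUES (?, ?, ?)",
        [(1, 1, "README.txt"), (2, 1, "data.csv"),
         (3, 1, "readme_old.md"), (4, 2, "README.txt")],
    )
    return conn


def find_files(
    conn: sqlite3.Connection,
    version_id: int,
    search_text: Optional[str] = None,
    solr_ids: Optional[Sequence[int]] = None,
) -> List[Tuple[int, str]]:
    """Filter inside the query instead of iterating in application code."""
    sql = "SELECT id, label FROM filemetadata WHERE datasetversion_id = ?"
    params: list = [version_id]
    if solr_ids is not None:
        # Optimized hybrid: Solr supplies the ids, the DB does the filtering.
        if not solr_ids:
            return []
        sql += " AND id IN (" + ",".join("?" * len(solr_ids)) + ")"
        params.extend(solr_ids)
    elif search_text:
        # Just DB: plain substring match (LIKE wildcards are not escaped here).
        sql += " AND LOWER(label) LIKE ?"
        params.append("%" + search_text.lower() + "%")
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

In both branches only the matching rows leave the database, and the same function works for any version, whether or not Solr has indexed it.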

The action plan should allow making a decision on how to extend the API for files when filtering by search text, which as an initial implementation (#9785) will be done by DB in all cases (No Solr).

What kind of user is the feature intended for?

Devs, UX

What inspired the request?

[Spike - API] Extend the Files API to support search #9785
Weekly frontend meeting for devs

What existing behavior do you want changed?

None

Any brand new behavior do you want to add to Dataverse?

N/A

Any open or closed issues related to this feature request?

[Spike - API] Extend the Files API to support search #9785

qqmyers · 2023-08-22T10:17:28Z

What about a Just Solr option as well?

GPortas · 2023-08-22T10:59:52Z

Since Solr does not index past versions, we would lose the file filtering feature for these versions. That's why I haven't considered the Just Solr option. But we can add such an option if that downside is not considered relevant enough. @qqmyers

qqmyers · 2023-08-22T13:21:41Z

I think Solr doesn't have docs for old versions just because we delete them when a new version is published. I think we also delete and recreate the latest published version's Solr doc whenever we reindex the draft version. With a redesign, I don't think we'd have to do that. That would make the Solr indexes bigger, but not deleting and rewriting so often could help Solr performance and not touching the db to render the dataset page could help there. Whether it's better than the other alternatives or not, I don't know, but I think it makes sense to consider it.

GPortas · 2023-08-23T14:07:00Z

@qqmyers I'm wondering why older versions are not indexed and if the reason is to optimize resources. I also don't know how big the indexes could get and if having big ones could be problematic.

Maybe someone who knows the implementation from its early days can provide some information on this. @scolapasta, @pdurbin, @landreev.

On the other hand, if we started keeping all old versions indexed, new versions would be indexed while the old versions published before this hypothetical redesign would not. There may be solutions for this, but it is something to take into account.

rtreacy · 2023-08-24T14:11:11Z

The original purpose of search in the app was to allow for ad-hoc queries that were not viable as SQL queries. I was not around Dataverse as much when Solr was introduced, but my understanding is that an important feature it added was faceted search. Both functions are important conveniences but are also fault tolerant, that is to say, if some data didn't get into the index for any number of reasons, it's not a disaster. My main point is that Solr cannot be relied on as a source of truth as to what is in Dataverse. Postgres, as a relational database, is much more reliable as a data store with integrity built in. There are well thought out standards for mapping the data to the application objects in Dataverse. In short, I'm for using the search engine for what it is designed for, and continuing to rely on Postgres for driving the application.

pdurbin · 2023-08-24T14:19:40Z

> I'm wondering why older versions are not indexed and if the reason is to optimize resources

Yes, it's mostly because indexing has always been slow:

Solr: Index all performance is too slow with full production data. #50

GPortas · 2023-08-25T10:09:00Z

After gathering all the collected information and feedback and evaluating the different options in yesterday's frontend meeting, we have reached a consensus.

Initially we are going to base all file tab searches and filters on database queries, without Solr for the moment, as developed in the recent API extension PR: #9820. Filters and search will be available for all versions in the same way, which makes the behavior homogeneous, although we lose the possibility of using the Solr grammar.

This decision is made on the assumption that Solr may not be required in the context of files tab search, whose search facets are reduced compared to other in-application searches. Therefore, if we find evidence that the assumption is incorrect (potentially when users test the SPA), we will work on extending the search capabilities to support Solr.

To keep track of these decisions, and so that they can be useful for future UX work and application evolution, a new section has been added to the frontend README, which will include functionality behavior changes like the one we are addressing. Pull request: IQSS/dataverse-frontend#166.

At the moment, and in the absence of more sophisticated doc tools, we are using the README, although this type of information may be moved to more elaborate documentation in the future (as requested in IQSS/dataverse-frontend#26).

pdurbin · 2023-09-06T18:07:39Z

Yes, IQSS/dataverse-frontend#166 seems like a good writeup of the decisions explained above (no Solr for file listing page on dataset page, at least for now). Closing.

GPortas added the labels User Role: API User (Makes use of APIs), pm.GREI-d-2.7.1 (NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows), and pm.GREI-d-2.7.2 (NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows) Aug 22, 2023

GPortas added this to IQSS Dataverse Project Aug 22, 2023

GPortas moved this to Re-arch: SPA MVP (Guillermo) in IQSS Dataverse Project Aug 22, 2023

GPortas mentioned this issue Aug 23, 2023

Dataset files API extension for search text filtering #9820

Merged

GPortas added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Aug 23, 2023

GPortas self-assigned this Aug 23, 2023

GPortas mentioned this issue Aug 25, 2023

Functionality behavior changes added to README, waiting on 9783 IQSS/dataverse-frontend#166

Merged

GPortas removed their assignment Aug 25, 2023

pdurbin self-assigned this Sep 6, 2023

pdurbin closed this as completed Sep 6, 2023

github-project-automation bot moved this from Re-arch: SPA MVP (Guillermo) to Clear of the Backlog in IQSS Dataverse Project Sep 6, 2023

pdurbin removed their assignment Sep 6, 2023
