Analyze the use of Solr for file searches in the context of the Files API extension and define an action plan for its use #9813

Closed
GPortas opened this issue Aug 22, 2023 · 8 comments
Labels
pm.GREI-d-2.7.1 NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows pm.GREI-d-2.7.2 NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) User Role: API User Makes use of APIs

Comments

@GPortas
Contributor

GPortas commented Aug 22, 2023

Overview of the Feature Request

For the analysis we must consider certain aspects mentioned during the frontend weekly meeting for devs:

  • Solr grammar and other features that we would lose by not using Solr search in the extended Files API
  • Current complexity of the DatasetPage logic for Solr and database queries. *
  • Lack of Solr indexing for past Dataset versions
  • The use of Solr must be opaque to the consumer applications. It should be treated as an implementation detail. The API will receive a searchText regardless of the underlying implementation. Then the backend will decide whether to use Solr or not.

* Current behavior (JSF - DatasetPage.java): Solr is queried for the ids of the files that match the search criteria. The DB is then queried for all the file metadata of the dataset, and Java code iterates over that list to keep only the entries whose ids were returned by Solr.
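
The current behavior described in the footnote above can be sketched roughly as follows. This is an illustration only, written in Python for brevity (the real logic lives in DatasetPage.java). `FileMetadata`, `solr_matching_ids`, `db_all_file_metadata` and `search_files_current` are hypothetical names, and both data sources are replaced by hard-coded stand-ins.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class FileMetadata:
    """Hypothetical stand-in for one file entry of a dataset version."""
    file_id: int
    label: str


def solr_matching_ids(search_text: str) -> Set[int]:
    """Stand-in for the Solr query: returns the ids of matching files."""
    # Assumption: a tiny hard-coded "index" replaces the real Solr call.
    fake_index = {"survey": {1, 3}}
    return fake_index.get(search_text, set())


def db_all_file_metadata(dataset_version_id: int) -> List[FileMetadata]:
    """Stand-in for the DB query that loads every file of the version."""
    # Assumption: the same fixed rows are returned for any version id.
    return [
        FileMetadata(1, "survey_2020.csv"),
        FileMetadata(2, "readme.txt"),
        FileMetadata(3, "survey_codebook.pdf"),
    ]


def search_files_current(dataset_version_id: int, search_text: str) -> List[FileMetadata]:
    """Solr supplies ids; application code then filters the full DB list."""
    matching_ids = solr_matching_ids(search_text)
    all_files = db_all_file_metadata(dataset_version_id)
    # Every file is loaded and checked in application code, which is the
    # extra work the other options try to avoid.
    return [fm for fm in all_files if fm.file_id in matching_ids]
```

The point of the sketch is the last step: the full file list is fetched even when only a few files match.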

We should evaluate different implementation options:

  • Hybrid: Solr (When available) + DB. Current implementation but exposed through the API.
  • Optimized Hybrid: New DB queries have been created in recent API extension issues, where database results are directly filtered through DB queries without additional Java code. The new queries can be combined with a Solr search query in a new, more efficient hybrid model.
  • Just DB (We lose Solr features, but gain code simplicity and possibly performance - This should be evaluated from a product / UX perspective)
  • Just Solr (Read issue comments below)
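
To make the Optimized Hybrid option more concrete, here is a minimal sketch in which the ids returned by the search engine are passed into a single DB query, so the database does the filtering instead of application code. It is not the actual Dataverse query: the `filemetadata` table layout, `make_demo_db` and `search_files_hybrid` are assumptions for illustration, and an in-memory SQLite database stands in for Postgres.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filemetadata ("
        "id INTEGER PRIMARY KEY, datasetversion_id INTEGER, "
        "label TEXT, category TEXT)"
    )
    conn.executemany(
        "INSERT INTO filemetadata VALUES (?, ?, ?, ?)",
        [
            (1, 1, "survey_2020.csv", "Data"),
            (2, 1, "readme.txt", "Documentation"),
            (3, 1, "survey_codebook.pdf", "Documentation"),
            (4, 2, "survey_2021.csv", "Data"),
        ],
    )
    return conn


def search_files_hybrid(
    conn: sqlite3.Connection,
    dataset_version_id: int,
    solr_ids: Iterable[int],
    category: Optional[str] = None,
) -> List[Tuple[int, str]]:
    """Combine ids from the search engine with DB-side filters in one query."""
    ids = sorted(set(solr_ids))
    if not ids:
        return []  # nothing matched in the search engine
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, label FROM filemetadata "
        "WHERE datasetversion_id = ? "
        f"AND id IN ({placeholders})"
    )
    params: list = [dataset_version_id, *ids]
    if category is not None:
        # Additional criteria are applied by the database, not in a loop.
        sql += " AND category = ?"
        params.append(category)
    sql += " ORDER BY label"
    return conn.execute(sql, params).fetchall()
```

With very large id sets the `IN (...)` list would probably need batching, since databases limit the number of bound parameters.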

The action plan should allow making a decision on how to extend the API for files when filtering by search text, which as an initial implementation (#9785) will be done by DB in all cases (No Solr).
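
As a rough illustration of the requirement that Solr stays an implementation detail, the sketch below hides the backend choice behind a single entry point that only receives the search text. Everything here is hypothetical (`FileSearchService`, the `SearchFn` signature, the `is_indexed` check); it only shows the shape of the idea, not an agreed design.

```python
from typing import Callable, List, Optional

# Hypothetical backend signature: (dataset_version_id, search_text) -> labels
SearchFn = Callable[[int, str], List[str]]


class FileSearchService:
    """Single entry point; callers never learn which backend answered."""

    def __init__(
        self,
        db_search: SearchFn,
        solr_search: Optional[SearchFn] = None,
        is_indexed: Callable[[int], bool] = lambda version_id: False,
    ) -> None:
        self._db_search = db_search
        self._solr_search = solr_search
        self._is_indexed = is_indexed

    def search(self, dataset_version_id: int, search_text: str) -> List[str]:
        # Use Solr only when it is configured and the version is indexed;
        # otherwise fall back to the database.
        if self._solr_search is not None and self._is_indexed(dataset_version_id):
            return self._solr_search(dataset_version_id, search_text)
        return self._db_search(dataset_version_id, search_text)
```

Constructing the service without a `solr_search` function corresponds to the initial DB-only implementation, and a Solr backend could be plugged in later without changing the callers.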

What kind of user is the feature intended for?

Devs, UX

What inspired the request?

What existing behavior do you want changed?

None

Any brand new behavior do you want to add to Dataverse?

N/A

Any open or closed issues related to this feature request?

@GPortas GPortas added User Role: API User Makes use of APIs pm.GREI-d-2.7.1 NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows pm.GREI-d-2.7.2 NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows labels Aug 22, 2023
@GPortas GPortas moved this to Re-arch: SPA MVP (Guillermo) in IQSS Dataverse Project Aug 22, 2023
@qqmyers
Member

qqmyers commented Aug 22, 2023

What about a Just Solr option as well?

@GPortas
Contributor Author

GPortas commented Aug 22, 2023

Since Solr does not index past versions, we would lose file filtering for those versions. That's why I hadn't considered the Just Solr option. But we can add that option if the downside is not considered significant. @qqmyers

@qqmyers
Member

qqmyers commented Aug 22, 2023

I think Solr doesn't have docs for old versions just because we delete them when a new version is published. I think we also delete and recreate the latest published version's Solr doc whenever we reindex the draft version. With a redesign, I don't think we'd have to do that. That would make the Solr indexes bigger, but not deleting and rewriting so often could help Solr performance and not touching the db to render the dataset page could help there. Whether it's better than the other alternatives or not, I don't know, but I think it makes sense to consider it.

@GPortas GPortas added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Aug 23, 2023
@GPortas
Contributor Author

GPortas commented Aug 23, 2023

@qqmyers I'm wondering why older versions are not indexed, and whether the reason is to optimize resources. I also don't know how big the indexes could get, or whether large indexes could be problematic.

Maybe someone who was involved in the implementation in the early days can provide some information on this. @scolapasta, @pdurbin, @landreev.

On the other hand, if we started keeping old versions indexed, only versions published after this hypothetical redesign would be indexed, while older ones would not. There may be solutions for this, but it is something to take into account.

@GPortas GPortas self-assigned this Aug 23, 2023
@rtreacy
Contributor

rtreacy commented Aug 24, 2023

The original purpose of search in the app was to allow for ad-hoc queries that were not viable as SQL queries. I was not around Dataverse as much when Solr was introduced, but my understanding is that an important feature it added was faceted search. Both functions are important conveniences, but they are also fault tolerant: if some data didn't get into the index, for any number of reasons, it's not a disaster. My main point is that Solr cannot be relied on as a source of truth as to what is in Dataverse. Postgres, as a relational database, is much more reliable as a data store, with integrity built in. There are well-thought-out standards for mapping the data to the application objects in Dataverse. In short, I'm for using the search engine for what it is designed for, and continuing to rely on Postgres for driving the application.

@pdurbin
Member

pdurbin commented Aug 24, 2023

> I'm wondering why older versions are not indexed and if the reason is to optimize resources

Yes, it's mostly because indexing has always been slow:

@GPortas
Contributor Author

GPortas commented Aug 25, 2023

After gathering the information and feedback above and evaluating the different options in yesterday's frontend meeting, we have reached a consensus.

Initially, we are going to base all files tab searches and filters on database queries, without Solr for the moment, as developed in the recent API extension PR: #9820. Filters and search will be available for all versions in the same way, which makes the behavior uniform, although we lose the possibility of using the Solr grammar.
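
A minimal sketch of what this trade-off looks like in practice, assuming a plain substring match on the file label (the actual queries are the ones in the PR, not this code). The `filemetadata` table layout, `make_versions_db`, `escape_like` and `search_files_db` are hypothetical, and in-memory SQLite stands in for the real database.

```python
import sqlite3
from typing import List


def make_versions_db() -> sqlite3.Connection:
    """Throwaway in-memory table holding files of two dataset versions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE filemetadata ("
        "id INTEGER PRIMARY KEY, datasetversion_id INTEGER, label TEXT)"
    )
    conn.executemany(
        "INSERT INTO filemetadata VALUES (?, ?, ?)",
        [
            (1, 1, "survey_2020.csv"),
            (2, 1, "readme.txt"),
            (3, 1, "survey_codebook.pdf"),
            (4, 2, "survey_2021.csv"),
        ],
    )
    return conn


def escape_like(text: str) -> str:
    """Escape LIKE wildcards so the search text is matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_files_db(
    conn: sqlite3.Connection, dataset_version_id: int, search_text: str
) -> List[str]:
    """Case-insensitive substring search on labels, for any version."""
    pattern = "%" + escape_like(search_text.lower()) + "%"
    rows = conn.execute(
        "SELECT label FROM filemetadata "
        "WHERE datasetversion_id = ? "
        "AND lower(label) LIKE ? ESCAPE '\\' "
        "ORDER BY label",
        (dataset_version_id, pattern),
    ).fetchall()
    return [label for (label,) in rows]
```

The same function works for every version, old or new. The cost is that Solr-style input such as `label:survey` is treated as literal text rather than as a field query.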

This decision is made on the assumption that Solr may not be required in the context of files tab search, whose search facets are reduced compared to other in-application searches. Therefore, if we find evidence that the assumption is incorrect (potentially when users test the SPA), we will work on extending the search capabilities to support Solr.

To keep track of decisions like this one, and so that they remain useful for future UX work and the evolution of the application, a new section has been added to the frontend README. It will record changes in functional behavior such as the one we are addressing here. Pull request: IQSS/dataverse-frontend#166.

For now, in the absence of more sophisticated documentation tools, we are using the README, although this type of information may later be moved to more elaborate documentation (as requested in IQSS/dataverse-frontend#26).

@GPortas GPortas removed their assignment Aug 25, 2023
@pdurbin pdurbin self-assigned this Sep 6, 2023
@pdurbin
Member

pdurbin commented Sep 6, 2023

Yes, IQSS/dataverse-frontend#166 seems like a good writeup of the decisions explained above (no Solr for file listing page on dataset page, at least for now). Closing.

@pdurbin pdurbin closed this as completed Sep 6, 2023
@github-project-automation github-project-automation bot moved this from Re-arch: SPA MVP (Guillermo) to Clear of the Backlog in IQSS Dataverse Project Sep 6, 2023
@pdurbin pdurbin removed their assignment Sep 6, 2023
4 participants