Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240

Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,9 @@
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldCollector;
import org.apache.lucene.search.TopFieldCollectorManager;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.TopScoreDocCollectorManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Bits;

Expand Down Expand Up @@ -108,17 +109,21 @@ public int doLogic() throws Exception {
// the IndexSearcher search methods that take
// Weight public again, we can go back to
// pulling the Weight ourselves:
TopFieldCollector collector =
TopFieldCollector.create(sort, numHits, withTotalHits() ? Integer.MAX_VALUE : 1);
searcher.search(q, collector);
hits = collector.topDocs();
int totalHitsThreshold = withTotalHits() ? Integer.MAX_VALUE : 1;
TopFieldCollectorManager collectorManager =
new TopFieldCollectorManager(
sort, numHits, null, totalHitsThreshold, searcher.getExecutor() != null);
hits = searcher.search(q, collectorManager);
} else {
hits = searcher.search(q, numHits);
}
} else {
Collector collector = createCollector();
searcher.search(q, collector);
// hits = collector.topDocs();
int totalHitsThreshold = withTotalHits() ? Integer.MAX_VALUE : 1;
TopScoreDocCollectorManager collectorManager =
new TopScoreDocCollectorManager(
numHits(), null, totalHitsThreshold, searcher.getExecutor() != null);
searcher.search(q, collectorManager);
// hits = collectorManager.topDocs();
}

if (hits != null) {
Expand Down Expand Up @@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
return res;
}

@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right? Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.

Also, if this change is going onto main, I might kill this method entirely and then add it back in as deprecated if you choose to backport to 8x.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right?

Hmm you are right. I reverted some changes in this class to still use the ReadTask#createCollector method, in case there's overriding from users.

Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.

I think this createCollector method was originally added (11 years ago) to support benchmarking, and not sure how widely this is used actually? So hopefully marking it as deprecated for release 9.0 and removing it entirely once we release 10.0 should be good enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more thinking along the lines of keeping the change you had but providing a protected method that would allow sub-classes to provide their own CollectorManager if they wanted to do so (which would give users a migration path if they were previously providing their own Collector). So in 9.0, I think you'd move completely to the CollectorManager approach like you had. What do you think of that approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you are coming from. However, upon looking at this method more closely, I'm afraid this method is effectively not useful, since the result of using this collector was commented out :D :

Collector collector = createCollector();
searcher.search(q, collector);
// hits = collector.topDocs();

So I'm now leaning more towards just removing this method altogether if no users ever noticed / complaint about this. What do you think @gsmiller @jpountz ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.

I did a quick GitHub search to see how much code I could find that extends this code:

Given that the benchmark module is quite expert, I'd be ok with breaking hard here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've removed this method, and also fixed the commented out code.

protected Collector createCollector() throws Exception {
return TopScoreDocCollector.create(numHits(), withTotalHits() ? Integer.MAX_VALUE : 1);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.search.TotalHitCountCollectorManager;
import org.apache.lucene.util.BytesRef;

/**
Expand Down Expand Up @@ -179,10 +179,8 @@ private Map<BytesRef, Integer> getWordFreqForClassess(String word) throws IOExce
if (query != null) {
booleanQuery.add(query, BooleanClause.Occur.MUST);
}
TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();
indexSearcher.search(booleanQuery.build(), totalHitCountCollector);

int ret = totalHitCountCollector.getTotalHits();
int ret = indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make it even better by using IndexSearcher#count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (ret != 0) {
searched.put(cclass, ret);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.search.TotalHitCountCollectorManager;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.BytesRef;

Expand Down Expand Up @@ -169,7 +169,6 @@ protected int countDocsWithClass() throws IOException {
Terms terms = MultiTerms.getTerms(this.indexReader, this.classFieldName);
int docCount;
if (terms == null || terms.getDocCount() == -1) { // in case codec doesn't support getDocCount
TotalHitCountCollector classQueryCountCollector = new TotalHitCountCollector();
BooleanQuery.Builder q = new BooleanQuery.Builder();
q.add(
new BooleanClause(
Expand All @@ -179,8 +178,7 @@ protected int countDocsWithClass() throws IOException {
if (query != null) {
q.add(query, BooleanClause.Occur.MUST);
}
indexSearcher.search(q.build(), classQueryCountCollector);
docCount = classQueryCountCollector.getTotalHits();
docCount = indexSearcher.search(q.build(), new TotalHitCountCollectorManager());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's simplify by using IndexSearcher#count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also updated the implementation of IndexSearcher#count to use TotalHitCountCollectorManager.

} else {
docCount = terms.getDocCount();
}
Expand Down Expand Up @@ -276,9 +274,7 @@ private int getWordFreqForClass(String word, Term term) throws IOException {
if (query != null) {
booleanQuery.add(query, BooleanClause.Occur.MUST);
}
TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();
indexSearcher.search(booleanQuery.build(), totalHitCountCollector);
return totalHitCountCollector.getTotalHits();
return indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

private double calculateLogPrior(Term term, int docsWithClassSize) throws IOException {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.search.TotalHitCountCollectorManager;
import org.apache.lucene.util.BytesRef;

/**
Expand Down Expand Up @@ -263,9 +263,7 @@ private int getWordFreqForClass(String word, String fieldName, Term term) throws
if (query != null) {
booleanQuery.add(query, BooleanClause.Occur.MUST);
}
TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();
indexSearcher.search(booleanQuery.build(), totalHitCountCollector);
return totalHitCountCollector.getTotalHits();
return indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());
}

private double calculateLogPrior(Term term, int docsWithClassSize) throws IOException {
Expand Down
78 changes: 18 additions & 60 deletions lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
Original file line number Diff line number Diff line change
Expand Up @@ -66,9 +66,9 @@
* match lots of documents, counting the number of hits may take much longer than computing the top
* hits so this trade-off allows to get some minimal information about the hit count without slowing
* down search too much. The {@link TopDocs#scoreDocs} array is always accurate however. If this
* behavior doesn't suit your needs, you should create collectors manually with either {@link
* TopScoreDocCollector#create} or {@link TopFieldCollector#create} and call {@link #search(Query,
* Collector)}.
* behavior doesn't suit your needs, you should create collectorManagers manually with either {@link
* TopScoreDocCollectorManager} or {@link TopFieldCollectorManager} and call {@link #search(Query,
* CollectorManager)}.
*
* <p><a id="thread-safety"></a>
*
Expand Down Expand Up @@ -478,34 +478,10 @@ public TopDocs searchAfter(ScoreDoc after, Query query, int numHits) throws IOEx
}

final int cappedNumHits = Math.min(numHits, limit);

final CollectorManager<TopScoreDocCollector, TopDocs> manager =
new CollectorManager<TopScoreDocCollector, TopDocs>() {

private final HitsThresholdChecker hitsThresholdChecker =
(executor == null || leafSlices.length <= 1)
? HitsThresholdChecker.create(Math.max(TOTAL_HITS_THRESHOLD, numHits))
: HitsThresholdChecker.createShared(Math.max(TOTAL_HITS_THRESHOLD, numHits));

private final MaxScoreAccumulator minScoreAcc =
(executor == null || leafSlices.length <= 1) ? null : new MaxScoreAccumulator();

@Override
public TopScoreDocCollector newCollector() throws IOException {
return TopScoreDocCollector.create(
cappedNumHits, after, hitsThresholdChecker, minScoreAcc);
}

@Override
public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
final TopDocs[] topDocs = new TopDocs[collectors.size()];
int i = 0;
for (TopScoreDocCollector collector : collectors) {
topDocs[i++] = collector.topDocs();
}
return TopDocs.merge(0, cappedNumHits, topDocs);
}
};
final boolean supportsConcurrency = executor != null && leafSlices.length > 1;
CollectorManager<TopScoreDocCollector, TopDocs> manager =
new TopScoreDocCollectorManager(
cappedNumHits, after, TOTAL_HITS_THRESHOLD, supportsConcurrency);

return search(query, manager);
}
Expand All @@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
*
* @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()}
* clauses.
* @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we like the @lucene.deprecated tag?

Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?

Copy link
Contributor Author

@zacharymorn zacharymorn Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we like the @lucene.deprecated tag?

Do you mean @lucene.deprecated should be created and used instead of using @deprecated? I can't seems to find that tag is being used in lucene actually...I think @lucene.deprecated most likely won't be recognized by IDE or build tool to signal / warn the use of deprecated methods though.

Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?

As explained in https://issues.apache.org/jira/browse/LUCENE-10002?focusedCommentId=17397827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17397827, I use deprecation instead of direct removal in this PR exactly for the reasons you mentioned:

  1. Not all internal uses have been migrated in this PR (which may require another few thousand lines of changes).
  2. I personally feel that we should give users more time for migration and testing out the changes, so 10.0 release might be a good timing for complete removal.

I would be interested in seeing how folks feel about the timing of removal, but after #1 is completed, complete removal of IndexSearcher#search(Query, Collector) in lucene codebase should require only straightforward deletion changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to keep IndexSearcher#search(Query, Collector) and the TopXXXCollector#create factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 makes sense to keep it in 9.0. As for my comment about @lucene.deprecated, I was getting this mixed up with something else. Apologies for any confusion!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for my comment about @lucene.deprecated, I was getting this mixed up with something else. Apologies for any confusion!

Though, I think there are both @deprecated (javadoc tag) and @Deprecated (Java class annotation) -- we do both of them typically (looks like you did here @zacharymorn, great!).

* CollectorManager)} due to its support for concurrency in IndexSearcher
*/
@Deprecated
public void search(Query query, Collector results) throws IOException {
query = rewrite(query);
search(leafContexts, createWeight(query, results.scoreMode(), 1), results);
Expand Down Expand Up @@ -613,34 +592,10 @@ private TopFieldDocs searchAfter(
final int cappedNumHits = Math.min(numHits, limit);
final Sort rewrittenSort = sort.rewrite(this);

final boolean supportsConcurrency = executor != null && leafSlices.length > 1;
final CollectorManager<TopFieldCollector, TopFieldDocs> manager =
new CollectorManager<>() {

private final HitsThresholdChecker hitsThresholdChecker =
(executor == null || leafSlices.length <= 1)
? HitsThresholdChecker.create(Math.max(TOTAL_HITS_THRESHOLD, numHits))
: HitsThresholdChecker.createShared(Math.max(TOTAL_HITS_THRESHOLD, numHits));

private final MaxScoreAccumulator minScoreAcc =
(executor == null || leafSlices.length <= 1) ? null : new MaxScoreAccumulator();

@Override
public TopFieldCollector newCollector() throws IOException {
// TODO: don't pay the price for accurate hit counts by default
return TopFieldCollector.create(
rewrittenSort, cappedNumHits, after, hitsThresholdChecker, minScoreAcc);
}

@Override
public TopFieldDocs reduce(Collection<TopFieldCollector> collectors) throws IOException {
final TopFieldDocs[] topDocs = new TopFieldDocs[collectors.size()];
int i = 0;
for (TopFieldCollector collector : collectors) {
topDocs[i++] = collector.topDocs();
}
return TopDocs.merge(rewrittenSort, 0, cappedNumHits, topDocs);
}
};
new TopFieldCollectorManager(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful to see all this redundant code go away!

rewrittenSort, cappedNumHits, after, TOTAL_HITS_THRESHOLD, supportsConcurrency);

TopFieldDocs topDocs = search(query, manager);
if (doDocScores) {
Expand All @@ -659,9 +614,12 @@ public TopFieldDocs reduce(Collection<TopFieldCollector> collectors) throws IOEx
*/
public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager)
throws IOException {
if (executor == null || leafSlices.length <= 1) {
if (executor == null || leafSlices.length == 0) {
final C collector = collectorManager.newCollector();
search(query, collector);
final Query rewrittenQuery = rewrite(query);

search(leafContexts, createWeight(rewrittenQuery, collector.scoreMode(), 1), collector);

return collectorManager.reduce(Collections.singletonList(collector));
} else {
final List<C> collectors = new ArrayList<>(leafSlices.length);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,10 @@ public TopDocs rescore(IndexSearcher searcher, TopDocs firstPassTopDocs, int top

List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();

TopFieldCollector collector = TopFieldCollector.create(sort, topN, Integer.MAX_VALUE);
TopFieldCollector collector =
new TopFieldCollectorManager(
sort, topN, null, Integer.MAX_VALUE, searcher.getExecutor() != null)
.newCollector();

// Now merge sort docIDs from hits, with reader's leaves:
int hitUpto = 0;
Expand Down
4 changes: 2 additions & 2 deletions lucene/core/src/java/org/apache/lucene/search/TopDocs.java
Original file line number Diff line number Diff line change
Expand Up @@ -232,8 +232,8 @@ public static TopDocs merge(
/**
* Returns a new TopFieldDocs, containing topN results across the provided TopFieldDocs, sorting
* by the specified {@link Sort}. Each of the TopDocs must have been sorted by the same Sort, and
* sort field values must have been filled (ie, <code>fillFields=true</code> must be passed to
* {@link TopFieldCollector#create}).
* sort field values must have been filled (ie, <code>fillFields=true</code> must be passed to the
* constructor of {@link TopFieldCollectorManager}).
*
* @see #merge(Sort, int, int, TopFieldDocs[])
* @lucene.experimental
Expand Down
Loading