
Optimize Catalog-Endpoint Performance #4499

Closed
efiege opened this issue Sep 26, 2024 · 14 comments · Fixed by #4513
Labels: question (Further information is requested)

Comments

efiege commented Sep 26, 2024

Feature Request

The performance of the Catalog-Endpoint degrades as the number of ContractDefinitions increases. This appears to be due to an n+1 query issue within the ContractDefinitionResolverImpl. Currently, for each ContractDefinition, the corresponding access policy is resolved by its ID, resulting in a separate database query for each policy.

This issue seems to persist in the main branch. The relevant code can be found here:

/**
 * Determines the applicability of a definition to an agent by evaluating its access policy.
 */
private boolean evaluateAccessPolicy(ContractDefinition definition, ParticipantAgent agent) {
    var policyContext = PolicyContextImpl.Builder.newInstance().additional(ParticipantAgent.class, agent).build();
    var accessResult = Optional.of(definition.getAccessPolicyId())
            .map(policyStore::findById) // one policyStore.findById call per contract definition (the N+1 described above)
            .map(PolicyDefinition::getPolicy)
            .map(policy -> policyEngine.evaluate(CATALOGING_SCOPE, policy, policyContext))
            .orElse(Result.failure(format("Policy %s not found", definition.getAccessPolicyId())));
    if (accessResult.failed()) {
        monitor.debug(format("Access not granted for %s: \n%s", definition.getId(), String.join("\n", accessResult.getFailureMessages())));
        return false;
    }
    return true;
}

Which Areas Would Be Affected?

  • Control-Plane
  • PolicyDefinitionStore

Why Is the Feature Desired?

As the number of ContractDefinitions grows, the Catalog-Endpoint experiences significant performance degradation due to the increasing number of database queries.

Solution Proposal

Implement a database join to retrieve all relevant policies in a single query, reducing the number of queries executed.
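
For illustration, such a join could look roughly like the JDBC sketch below. The class, table, and column names (contract_definitions, policy_definitions, policy_json) are made up for this example and do not reflect the actual EDC schema or store implementations; the point is only that the definitions and their access policies arrive in a single round trip instead of one query per definition.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

class CatalogPolicyJoinSketch {

    // Hypothetical schema: one row per contract definition, joined to its access policy.
    private static final String QUERY = """
            SELECT cd.id AS definition_id, pd.policy_json
            FROM contract_definitions cd
            JOIN policy_definitions pd ON pd.id = cd.access_policy_id
            """;

    // Returns contract-definition id -> serialized access policy, loaded in one query.
    static Map<String, String> loadAccessPolicies(Connection connection) throws SQLException {
        var result = new HashMap<String, String>();
        try (var statement = connection.prepareStatement(QUERY);
             var resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                result.put(resultSet.getString("definition_id"), resultSet.getString("policy_json"));
            }
        }
        return result;
    }
}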

github-actions bot added the "triage" (all new issues awaiting classification) label on Sep 26, 2024
ndr-brt (Member) commented Sep 26, 2024

Could you please provide a detailed benchmark highlighting after how many contract definitions the performance starts to degrade, and what the time factor is?

EDIT: it was a typo, I was referring to contract definitions

paullatzelsperger (Member) commented Sep 26, 2024

I 100% agree with what @ndr-brt said - provide speeds and feeds, otherwise it's all conjecture.

Querying with a JOIN will have a widespread impact, because we'd have to introduce a new ContractDefinition object, i.e. one that contains Policy objects instead of just their IDs. This will also have an impact on all the stores. And even then, I'm very sceptical that it would bring much of an advantage, since connections are pooled and establishing them is what takes time. Also, databases typically have a cache.

If anything, I would think about a local policy cache, i.e. a HashMap that gets populated, and gets evicted after the request.
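
A minimal sketch of that idea, assuming the PolicyDefinitionStore#findById lookup shown in the snippet above (the class itself and its name are illustrative, not an existing EDC API; imports for the EDC types are omitted):

import java.util.HashMap;
import java.util.Map;

// Request-scoped cache: each access policy id is resolved from the store at most
// once per catalog request; the map is simply discarded once the request completes.
class RequestScopedPolicyCache {

    private final Map<String, PolicyDefinition> cache = new HashMap<>();
    private final PolicyDefinitionStore policyStore;

    RequestScopedPolicyCache(PolicyDefinitionStore policyStore) {
        this.policyStore = policyStore;
    }

    PolicyDefinition findById(String policyId) {
        // Only hits the database on a cache miss; an unknown id (null result) is not cached.
        return cache.computeIfAbsent(policyId, policyStore::findById);
    }
}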

But again, we can't do anything without concrete numbers.

efiege (Author) commented Sep 26, 2024

Thank you for your prompt responses.

The performance degradation is not tied to any contract negotiation process, but rather scales with the number of ContractDefinitions present in the EDC.

To illustrate this, consider the following example: resolving a policy by its ID from the database takes approximately 1ms. With 100,000 ContractDefinitions, a catalog request requires at least 100,000 * 1ms, which translates to a minimum of 100 seconds.

paullatzelsperger (Member) commented Sep 26, 2024

first, is that an actual measurement? if so, can you share the results?
second, why are there 100_000 contract definitions? again, is this an actual use case?

like I said, a local policy cache would be a good option, but unless you provide hard facts (not fictitious numbers) there is nothing we can do.

edit: database queries do not scale linearly, and are influenced by many factors, so I think your calculation is a bit too simplistic. we should base our implementation on facts, not conjecture.

ndr-brt (Member) commented Sep 26, 2024

> Thank you for your prompt responses.
>
> The performance degradation is not tied to any contract negotiation process, but rather scales with the number of ContractDefinitions present in the EDC.

it was a typo 🤭

> To illustrate this, consider the following example: resolving a policy by its ID from the database takes approximately 1ms. With 100,000 ContractDefinitions, a catalog request requires at least 100,000 * 1ms, which translates to a minimum of 100 seconds.

as we already stated, please provide more details.

(btw: if you have 100,000 definitions, it is very likely that you have modeled them in the wrong way. read: the pattern 1 asset/1 policy/1 contract definition is definitely not correct and needs to be avoided)

efiege (Author) commented Sep 26, 2024

I just wanted to give an example ☺️

If I understand correctly, you suggest that the response time for the catalog is acceptable when dealing with a reasonable number of ContractDefinitions for typical use cases.

I don't know what you think is a reasonable number, but even if there are just ~100 ContractDefinitions this will reduce the response time by ~100ms.

Wouldn't you agree that there is still potential for optimization here?

jimmarino (Contributor) commented Sep 26, 2024

> I just wanted to give an example ☺️
>
> If I understand correctly, you suggest that the response time for the catalog is acceptable when dealing with a reasonable number of ContractDefinitions for typical use cases.
>
> I don't know what you think is a reasonable number, but even if there are just ~100 ContractDefinitions this will reduce the response time by ~100ms.
>
> Wouldn't you agree that there is still potential for optimization here?

What we are asking for is a benchmark. How do you know the response time will be reduced by ~100ms? It certainly may be the case, but on what basis is this asserted?

What is the test case, how was the database and other infrastructure configured, and what were the actual numbers? Did you profile the behavior? What bottlenecks were observed?

richardtreier (Contributor) commented Sep 26, 2024

Guys, I believe we do not need a benchmark to understand that any N+1 select problem on a highly used endpoint such as the catalog, one that scales with any of the EDC entities, is an issue.

I understand the challenges stemming from the EDC's design decision to split the stores even when accessing its own DB. I believe we can still find a good solution by using batched DB queries. The following is a common pattern I have encountered in many JPA code bases, which also often face N+1 select issues:

// Static imports assumed for this snippet:
// import static java.util.stream.Collectors.toMap;
// import static java.util.stream.Collectors.toSet;
// import java.util.function.Function;

var policyIds = contractDefinitions.stream()
    .map(it -> it.getPolicyId())
    .collect(toSet());

// One batched lookup instead of one query per contract definition
// (findByIdIn stands for whatever batched query the policy store offers).
var policiesById = policyService.findByIdIn(policyIds).stream()
    .collect(toMap(it -> it.getId(), Function.identity()));

return contractDefinitions.stream()
    .map(contractDefinition -> {
        var policy = policiesById.get(contractDefinition.getPolicyId());
        return buildContractOffer(contractDefinition, policy);
    })
    .toList();

Trying to change how the EDC is used also does not help here:

  • In cloud environments even the smallest DB query might take 24ms - 32ms.
  • In terms of quantity structure, even with optimized EDC usage, we have seen:
    • Transfers easily number in the hundreds of thousands.
    • Hundreds of individual data offers with individual policies on a brokering EDC that connects a data marketplace platform with an EDC dataspace.

paullatzelsperger (Member) commented Sep 26, 2024

> Guys, I believe we do not need a benchmark to understand that any N+1 select problem on a highly used endpoint such as the catalog, one that scales with any of the EDC entities, is an issue.

Three committers disagree with you and are in agreement that we do need a benchmark and hard facts.

> I believe we can still find a good solution

one has been offered, in the form of a local policy cache. Fetching the policies in batch for all the contract defs (that are materialized in memory anyway) is just a way of populating the cache.

> The following is a common pattern I have encountered in many JPA code bases,

This is not JPA.

> In cloud environments even the smallest DB query might take 24ms - 32ms.

measurements? benchmarks? timing results? what does the deployment situation look like? is this geo-colocated? is database caching enabled? also, it is likely the connection establishment that takes time, and connections are pooled.

> Transfers easily number in the hundreds of thousands.

this has nothing to do with offers

To be honest, this is starting to annoy me. We have asked for hard data, which you seem incapable of or unwilling to provide. We even offered a solution that would alleviate this supposed problem while keeping our entity data model and object model intact.

It seems I have to be blunt: unless you clearly demonstrate a problem, supported by hard facts (benchmarks, profiling data, timings, etc.), we will not invest any more time in this.

paullatzelsperger added the "question" (Further information is requested) label and removed the "triage" (all new issues awaiting classification) label on Sep 26, 2024
ndr-brt (Member) commented Sep 27, 2024

Besides agreeing with @paullatzelsperger: taking a look at the code from a "simplicity" point of view, I see that the ContractDefinitionResolver could be inlined into DatasetResolverImpl. Other than simplifying the design, it would also permit adding a thread-local cache that avoids fetching the same policy multiple times when multiple contract definitions use the same policies for the access/contract scopes (which, I guess, is very likely what happens).
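
For illustration, such a thread-local cache might look roughly like the sketch below; the class and method names are made up, and PolicyDefinition/PolicyDefinitionStore are the types already used in the snippets above (their imports are omitted):

import java.util.HashMap;
import java.util.Map;

// Thread-local variant: each resolving thread gets its own policy map, which must
// be cleared explicitly once the catalog for the request has been assembled.
class ThreadLocalPolicyCache {

    private static final ThreadLocal<Map<String, PolicyDefinition>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    static PolicyDefinition resolve(String policyId, PolicyDefinitionStore policyStore) {
        // Each distinct policy id triggers at most one store lookup per thread.
        return CACHE.get().computeIfAbsent(policyId, policyStore::findById);
    }

    static void clear() {
        CACHE.remove();
    }
}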

I like the ids idea brought up by @richardtreier, but we really need data to make a decision that will impact countless adopters.

paullatzelsperger (Member)

> it would also permit adding a thread-local cache that avoids fetching the same policy multiple times when multiple contract definitions use the same policies for the access/contract scopes (which, I guess, is very likely what happens).

That was exactly my point.

richardtreier (Contributor) commented Sep 27, 2024

Please note, the facts we have regarding this issue are not related purely to this issue alone, hence the hesitation.

  • We are using the current Catena-X Prod Version (0.2.1)
  • We are aware of https://github.com/eclipse-edc/Connector/pull/2753/files:
    • We cannot provide exact numbers for 0.6/0.7 just yet.
    • Am I wrong to assume that this performance test runs purely against a latency-free in-memory DB?
    • Please note that the 25ms per DB query mentioned above are numbers cited by our ops team from measurements; they are not just last-percentile numbers but slowdowns that can happen from time to time. N+1 select issues get amplified when these timings occur.

We have a connector running in production that was running into reverse-proxy timeouts, and this is where our investigation began.

  • Quantity Structure:
    • 300 Contract Offers
  • Timings:
    • Reverse Proxy has timeouts at 15s
    • Query Full Catalog: 15s-30s
    • Query Catalog by Asset ID: 1s-5s
      • This contains a policy definition N+1 select problem, which is still present in the current version.
      • This does not include an asset N+1 select problem, which appears to have been fixed in the current version.
  • Measured by:
    • Traces of the EDC
    • Traces of the PostgreSQL
    • We also take the results with a grain of salt due to the version differences, and have looked into the current code itself.
  • Findings:
    • We had several findings, and the current issue only addresses one of them.
    • N+1 Selects for assets
      • Seem to be fixed in 0.6/0.7
    • N+1 Selects for policies
      • this issue
    • Some Connection Pool / unknown issues we are still analyzing
      • Some SELECT ? queries take 130ms:
        • "This command has been terminated by an administrator"
        • The EDC proceeds to take 3s to re-start the connection
      • This might have been fixed in 0.6/0.7 or be unrelated
      • As I said, we are still analyzing this.
      • Should anything come from this it would be a separate issue.

We wanted to only raise the N+1 select issue here, since we need time to fully test things in 0.7.

From my perspective, the fact that the catalog query by asset ID, which excludes the asset N+1 select issue, still takes 1-5s was a sign that this issue was worth writing up.


Regarding N+1 Select problems generally:

  • A request as large as a catalog returning 300 assets plus policies would, by payload size alone, already need at least 200ms in Java backends (a ballpark figure from experience).
  • If we have an N+1 select issue, that time simply gets added on top.
  • The data amount itself is nowhere near large enough to spike the duration of a single DB query.
  • Network latency on these sequential requests is what really drives the time spent.
  • If we assume 100 * 10ms, we are already at 1s of additional time; 200 * 10ms ~ 2s, 300 * 10ms ~ 3s, and so forth.
  • This is already what a user perceives as very slow.

Ending up with a hundred of something happens all too easily. And even when it happens by accident, you would really notice.

Thus N+1 selects are nice low-hanging fruit to take care of, with the rule of thumb being not to call the DB in any loop.

Whether it is a Map<String, Policy> or an object wrapped around it that gives a nicer API surface for both testing and readability, I would leave entirely up to you. I had also misunderstood the suggestion, assuming it meant a global cache of sorts.

paullatzelsperger (Member) commented Sep 27, 2024

@richardtreier thanks for your efforts, but it goes without saying that we will require those numbers based on 0.9.1, ideally 0.10.0-SNAPSHOT. I'm sure you are not expecting us to do anything on an ancient and thus unsupported version.

The fix, if one is created, will only land on the main branch, i.e. 0.10.0-SNAPSHOT and onward at the time of writing.

efiege (Author) commented Oct 2, 2024

Awesome, thanks for incorporating those changes!
