
Optimize Catalog-Endpoint Performance #4499

Closed
efiege opened this issue Sep 26, 2024 · 14 comments · Fixed by #4513
Labels: question (Further information is requested)

Comments

efiege commented Sep 26, 2024

Feature Request

The performance of the Catalog-Endpoint degrades as the number of ContractDefinitions increases. This appears to be due to an n+1 query issue within the ContractDefinitionResolverImpl. Currently, for each ContractDefinition, the corresponding access policy is resolved by its ID, resulting in a separate database query for each policy.

This issue seems to persist in the main branch. The relevant code can be found here:

/**
 * Determines the applicability of a definition to an agent by evaluating its access policy.
 */
private boolean evaluateAccessPolicy(ContractDefinition definition, ParticipantAgent agent) {
    var policyContext = PolicyContextImpl.Builder.newInstance().additional(ParticipantAgent.class, agent).build();
    var accessResult = Optional.of(definition.getAccessPolicyId())
            .map(policyStore::findById) // one policyStore.findById call per contract definition (the N+1 described above)
            .map(PolicyDefinition::getPolicy)
            .map(policy -> policyEngine.evaluate(CATALOGING_SCOPE, policy, policyContext))
            .orElse(Result.failure(format("Policy %s not found", definition.getAccessPolicyId())));
    if (accessResult.failed()) {
        monitor.debug(format("Access not granted for %s: \n%s", definition.getId(), String.join("\n", accessResult.getFailureMessages())));
        return false;
    }
    return true;
}

Which Areas Would Be Affected?

  • Control-Plane
  • PolicyDefinitionStore

Why Is the Feature Desired?

As the number of ContractDefinitions grows, the Catalog-Endpoint experiences significant performance degradation due to the increasing number of database queries.

Solution Proposal

Implement a database join to retrieve all relevant policies in a single query, reducing the number of queries executed.
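
For illustration, such a join could look roughly like the JDBC sketch below. The class, table, and column names (contract_definitions, policy_definitions, policy_json) are made up for this example and do not reflect the actual EDC schema or store implementations; the point is only that the definitions and their access policies arrive in a single round trip instead of one query per definition.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

class CatalogPolicyJoinSketch {

    // Hypothetical schema: one row per contract definition, joined to its access policy.
    private static final String QUERY = """
            SELECT cd.id AS definition_id, pd.policy_json
            FROM contract_definitions cd
            JOIN policy_definitions pd ON pd.id = cd.access_policy_id
            """;

    // Returns contract-definition id -> serialized access policy, loaded in one query.
    static Map<String, String> loadAccessPolicies(Connection connection) throws SQLException {
        var result = new HashMap<String, String>();
        try (var statement = connection.prepareStatement(QUERY);
             var resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                result.put(resultSet.getString("definition_id"), resultSet.getString("policy_json"));
            }
        }
        return result;
    }
}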

github-actions bot added the "triage" (all new issues awaiting classification) label on Sep 26, 2024
ndr-brt (Member) commented Sep 26, 2024

Could you please provide a detailed benchmark highlighting after how many contract definitions the performance starts to degrade, and what the time factor is?

EDIT: it was a typo, I was referring to contract definitions

paullatzelsperger (Member) commented Sep 26, 2024

I 100% agree with what @ndr-brt said - provide speeds and feeds, otherwise it's all conjecture.

Querying with a JOIN will have a widespread impact, because we'd have to introduce a new ContractDefinition object, i.e. one that contains Policy objects instead of just their IDs. This will also have an impact on all the stores. And even then, I'm very sceptical that it would bring much of an advantage, since connections are pooled and establishing them is what takes time. Also, databases typically have a cache.

If anything, I would think about a local policy cache, i.e. a HashMap that gets populated, and gets evicted after the request.
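
A minimal sketch of that idea, assuming the PolicyDefinitionStore#findById lookup shown in the snippet above (the class itself and its name are illustrative, not an existing EDC API; imports for the EDC types are omitted):

import java.util.HashMap;
import java.util.Map;

// Request-scoped cache: each access policy id is resolved from the store at most
// once per catalog request; the map is simply discarded once the request completes.
class RequestScopedPolicyCache {

    private final Map<String, PolicyDefinition> cache = new HashMap<>();
    private final PolicyDefinitionStore policyStore;

    RequestScopedPolicyCache(PolicyDefinitionStore policyStore) {
        this.policyStore = policyStore;
    }

    PolicyDefinition findById(String policyId) {
        // Only hits the database on a cache miss; an unknown id (null result) is not cached.
        return cache.computeIfAbsent(policyId, policyStore::findById);
    }
}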

But again, we can't do anything without concrete numbers.

efiege (Author) commented Sep 26, 2024

Thank you for your prompt responses.

The performance degradation is not tied to any contract negotiation process, but rather scales with the number of ContractDefinitions present in the EDC.

To illustrate this, consider the following example: resolving a policy by its ID from the database takes approximately 1ms. With 100,000 ContractDefinitions, a catalog request requires at least 100,000 * 1ms, which translates to a minimum of 100 seconds.

paullatzelsperger (Member) commented Sep 26, 2024

first, is that an actual measurement? if so, can you share the results?
second, why are there 100_000 contract definitions? again, is this an actual use case?

like I said, a local policy cache would be a good option, but unless you provide hard facts (not fictitious numbers) there is nothing we can do.

edit: database queries do not scale linearly, and are influenced by many factors, so I think your calculation is a bit too simplistic. we should base our implementation on facts, not conjecture.

ndr-brt (Member) commented Sep 26, 2024

> Thank you for your prompt responses.
>
> The performance degradation is not tied to any contract negotiation process, but rather scales with the number of ContractDefinitions present in the EDC.

it was a typo 🤭

> To illustrate this, consider the following example: resolving a policy by its ID from the database takes approximately 1ms. With 100,000 ContractDefinitions, a catalog request requires at least 100,000 * 1ms, which translates to a minimum of 100 seconds.

as we already stated, please provide more details.

(btw: if you have 100,000 definitions, it is very likely that you have modeled them in the wrong way. read: the pattern 1 asset/1 policy/1 contract definition is definitely not correct and needs to be avoided)

efiege (Author) commented Sep 26, 2024

I just wanted to give an example ☺️

If I understand correctly, you suggest that the response time for the catalog is acceptable when dealing with a reasonable number of ContractDefinitions for typical use cases.

I don't know what you think is a reasonable number, but even if there are just ~100 ContractDefinitions this will reduce the response time by ~100ms.

Wouldn't you agree that there is still potential for optimization here?

jimmarino (Contributor) commented Sep 26, 2024

> I just wanted to give an example ☺️
>
> If I understand correctly, you suggest that the response time for the catalog is acceptable when dealing with a reasonable number of ContractDefinitions for typical use cases.
>
> I don't know what you think is a reasonable number, but even if there are just ~100 ContractDefinitions this will reduce the response time by ~100ms.
>
> Wouldn't you agree that there is still potential for optimization here?

What we are asking for is a benchmark. How do you know the response time will be reduced by ~100ms? It certainly may be the case, but on what basis is this asserted?

What is the test case, how was the database and other infrastructure configured, and what were the actual numbers? Did you profile the behavior? What bottlenecks were observed?

richardtreier (Contributor) commented Sep 26, 2024

Guys, I believe we do not need a benchmark to understand that any N+1 select problem on a highly used endpoint such as the catalog, one that scales with any of the EDC entities, is an issue.

I understand the challenges stemming from the EDC's design decision to split the stores even when accessing its own DB. I believe we can still find a good solution by using batched DB queries. The following is a common pattern I have encountered in many JPA code bases, which also often face N+1 select issues:

// Static imports assumed for this snippet:
// import static java.util.stream.Collectors.toMap;
// import static java.util.stream.Collectors.toSet;
// import java.util.function.Function;

var policyIds = contractDefinitions.stream()
    .map(it -> it.getPolicyId())
    .collect(toSet());

// One batched lookup instead of one query per contract definition
// (findByIdIn stands for whatever batched query the policy store offers).
var policiesById = policyService.findByIdIn(policyIds).stream()
    .collect(toMap(it -> it.getId(), Function.identity()));

return contractDefinitions.stream()
    .map(contractDefinition -> {
        var policy = policiesById.get(contractDefinition.getPolicyId());
        return buildContractOffer(contractDefinition, policy);
    })
    .toList();

Trying to change how the EDC is used also does not help here:

  • In cloud environments even the smallest DB query might take 24ms - 32ms.
  • In terms of quantity structure, even with optimized EDC usage, we have seen:
    • Transfers easily number in the hundreds of thousands.
    • Hundreds of individual data offers with individual policies on a brokering EDC that connects a data marketplace platform with an EDC dataspace.

paullatzelsperger (Member) commented Sep 26, 2024

> Guys, I believe we do not need a benchmark to understand that any N+1 select problem on a highly used endpoint such as the catalog, one that scales with any of the EDC entities, is an issue.

Three committers disagree with you and are in agreement that we do need a benchmark and hard facts.

> I believe we can still find a good solution

one has been offered, in the form of a local policy cache. Fetching the policies in batch for all the contract defs (that are materialized in memory anyway) is just a way of populating the cache.

> The following is a common pattern I have encountered in many JPA code bases,

This is not JPA.

> In cloud environments even the smallest DB query might take 24ms - 32ms.

measurements? benchmarks? timing results? what does the deployment situation look like? is this geo-colocated? is database caching enabled? also, it is likely the connection establishment that takes time, and connections are pooled.

> Transfers easily number in the hundreds of thousands.

this has nothing to do with offers

To be honest, this is starting to annoy me. We have asked for hard data, which you seem incapable of or unwilling to provide. We even offered a solution that would alleviate this supposed problem while keeping our entity data model and object model intact.

It seems I have to be blunt: unless you clearly demonstrate a problem, supported by hard facts (benchmarks, profiling data, timings, etc.), we will not invest any more time in this.

paullatzelsperger added the "question" (Further information is requested) label and removed the "triage" (all new issues awaiting classification) label on Sep 26, 2024
ndr-brt (Member) commented Sep 27, 2024

Besides agreeing with @paullatzelsperger: taking a look at the code from a "simplicity" point of view, I see that the ContractDefinitionResolver could be inlined into DatasetResolverImpl. Other than simplifying the design, it would also permit adding a thread-local cache that avoids fetching the same policy multiple times when multiple contract definitions use the same policies for the access/contract scopes (which, I guess, is very likely what happens).
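
For illustration, such a thread-local cache might look roughly like the sketch below; the class and method names are made up, and PolicyDefinition/PolicyDefinitionStore are the types already used in the snippets above (their imports are omitted):

import java.util.HashMap;
import java.util.Map;

// Thread-local variant: each resolving thread gets its own policy map, which must
// be cleared explicitly once the catalog for the request has been assembled.
class ThreadLocalPolicyCache {

    private static final ThreadLocal<Map<String, PolicyDefinition>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    static PolicyDefinition resolve(String policyId, PolicyDefinitionStore policyStore) {
        // Each distinct policy id triggers at most one store lookup per thread.
        return CACHE.get().computeIfAbsent(policyId, policyStore::findById);
    }

    static void clear() {
        CACHE.remove();
    }
}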

I like the ids idea brought up by @richardtreier, but we really need data to make a decision that will impact countless adopters.

paullatzelsperger (Member)

> it would also permit adding a thread-local cache that avoids fetching the same policy multiple times when multiple contract definitions use the same policies for the access/contract scopes (which, I guess, is very likely what happens).

That was exactly my point.

richardtreier (Contributor) commented Sep 27, 2024

Please note, the facts we have regarding this issue are not related purely to this issue alone, hence the hesitation.

  • We are using the current Catena-X Prod Version (0.2.1)
  • We are aware of https://github.com/eclipse-edc/Connector/pull/2753/files:
    • We cannot provide exact numbers for 0.6/0.7 just yet.
    • Am I wrong to assume that this performance test runs purely against a latency-free in-memory DB?
    • Please note that the 25ms per DB query mentioned above are numbers cited by our ops team from measurements; they are not just last-percentile numbers but slowdowns that can happen from time to time. N+1 select issues get amplified when these timings occur.

We have a connector running in production that was running into reverse-proxy timeouts, and this is where our investigation began.

  • Quantity Structure:
    • 300 Contract Offers
  • Timings:
    • Reverse Proxy has timeouts at 15s
    • Query Full Catalog: 15s-30s
    • Query Catalog by Asset ID: 1s-5s
      • This contains a policy definition N+1 select problem, which is still present in the current version.
      • This does not include an asset N+1 select problem, which appears to have been fixed in the current version.
  • Measured by:
    • Traces of the EDC
    • Traces of the PostgreSQL
    • We also take the results with a grain of salt due to the version differences, and have looked into the current code itself.
  • Findings:
    • We had several findings, and the current issue only addresses one of them.
    • N+1 Selects for assets
      • Seem to be fixed in 0.6/0.7
    • N+1 Selects for policies
      • this issue
    • Some Connection Pool / unknown issues we are still analyzing
      • Some SELECT ? queries take 130ms:
        • "This command has been terminated by an administrator"
        • The EDC proceeds to take 3s to re-start the connection
      • This might have been fixed in 0.6/0.7 or be unrelated
      • As I said, we are still analyzing this.
      • Should anything come from this it would be a separate issue.

We wanted to only raise the N+1 select issue here, since we need time to fully test things in 0.7.

From my perspective, the fact that the catalog query by asset ID, which excludes the asset N+1 select issue, still takes 1-5s was a sign that this issue was worth writing up.


Regarding N+1 Select problems generally:

  • A request as large as a catalog returning 300 assets plus policies would, by payload size alone, already need at least 200ms in Java backends (a ballpark figure from experience).
  • If we have an N+1 select issue, that time simply gets added on top.
  • The data amount itself is nowhere near large enough to spike the duration of a single DB query.
  • Network latency on these sequential requests is what really drives the time spent.
  • If we assume 100 * 10ms, we are already at 1s of additional time; 200 * 10ms ~ 2s, 300 * 10ms ~ 3s, and so forth.
  • This is already what a user perceives as very slow.

Ending up with a hundred of something happens all too easily. And even when it happens by accident, you would really notice.

Thus N+1 selects are nice low-hanging fruit to take care of, with the rule of thumb being not to call the DB in any loop.

Whether it is a Map<String, Policy> or an object wrapped around it that gives a nicer API surface for both testing and readability, I would leave entirely up to you. I had also misunderstood the suggestion, assuming it meant a global cache of sorts.

paullatzelsperger (Member) commented Sep 27, 2024

@richardtreier thanks for your efforts, but it goes without saying that we will require those numbers based on 0.9.1, ideally 0.10.0-SNAPSHOT. I'm sure you are not expecting us to do anything on an ancient and thus unsupported version.

The fix, if one is created, will only land on the main branch, i.e. 0.10.0-SNAPSHOT and onward at the time of writing.

efiege (Author) commented Oct 2, 2024

Awesome, thanks for incorporating those changes!
