
API limit restricts output to the user, but doesn't restrict the amount of data fetched from MongoDB #549

Open
nuclearcat opened this issue Oct 1, 2024 · 8 comments
Labels
bug Something isn't working critical techdebt Something that works for now, but should be done better

Comments

@nuclearcat
Member

If we do two queries:
curl "https://staging.kernelci.org:9000/latest/nodes?kind=test&limit=1"
real 0m5.994s
user 0m0.012s
sys 0m0.000s

curl "https://staging.kernelci.org:9000/latest/nodes?kind=checkout&limit=1"
real 0m0.023s
user 0m0.012s
sys 0m0.000s

We can notice a significant difference in response time, while both should be fast (kind is indexed).

On preliminary investigation, the reason is that the search query is submitted to MongoDB without the limit parameter, which produces a very large result set for some queries.
We need to resolve that, as it also causes excessive load on the database and RAM consumption.
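A minimal sketch of the problem described above, using a plain Python list as an in-memory stand-in for the node collection (the function names and document counts here are illustrative, not taken from the actual API code):

```python
# Illustrative stand-in for a MongoDB collection. The point: a limit applied
# *at query time* bounds the work done, while a limit applied only when
# rendering the response does not.

def find_no_limit(collection, query):
    """What the API effectively does today: materialize every matching
    document, then let the presentation layer trim the output."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

def find_with_limit(collection, query, limit):
    """What it should do: stop scanning once `limit` matches are found."""
    out = []
    for doc in collection:
        if all(doc.get(k) == v for k, v in query.items()):
            out.append(doc)
            if len(out) >= limit:
                break
    return out

# Simulated skew: many 'test' nodes, few 'checkout' nodes.
nodes = [{"kind": "test", "seq": i} for i in range(100_000)] + \
        [{"kind": "checkout", "seq": i} for i in range(10)]

all_tests = find_no_limit(nodes, {"kind": "test"})      # materializes 100,000 docs
one_test = find_with_limit(nodes, {"kind": "test"}, 1)  # stops after the first match
print(len(all_tests), len(one_test))  # 100000 1
```

With the real database this is the difference between streaming one document and shipping millions of test nodes through the driver before the response is trimmed.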

@nuclearcat nuclearcat added bug Something isn't working critical techdebt Something that works for now, but should be done better labels Oct 1, 2024
@JenySadadia
Collaborator

Based on an initial investigation, the API uses fastapi-pagination to query the database, and it does use limit and offset parameters while querying the DB.
Please see https://github.com/kernelci/kernelci-api/blob/main/api/db.py#L149.
I think the reason could be that kind=checkout was introduced later in development, whereas kind=test nodes have been in maestro from the very start. That's why there would be a huge difference in the amount of data of these two kinds in the DB.
Hence the response time difference between the queries.

@nuclearcat
Member Author

It does not pass the limit parameter, and if we talk to MongoDB directly it is way faster. For example, over the mongo CLI on staging, a similar query:

kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'test'}).limit(1000);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 360 ms, count: 1000

kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'checkout'}).limit(1000);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 402 ms, count: 1000

kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'test'}).limit(1);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 9 ms, count: 1


kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'checkout'}).limit(1);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 4 ms, count: 1

As you can see, it is nowhere near 5 seconds; both kinds take about the same time (because the field is indexed), and a smaller limit also means a shorter execution time (approximately proportionally).

@r-c-n

r-c-n commented Oct 1, 2024

If you have a way to find out how the FastAPI queries are translated into MongoDB query calls, that'd be the fastest way to figure out for sure what's happening.

fastapi-pagination ends up doing this, but it's not clear to me how the skip and limit parameters are passed and where they're taken from. The docs say that you can paginate by limit and offset (obviously), and the API is used in the kernelci models, but it's not clear whether this is actually translated into the DB query or just used to present the results returned by the DB.
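For intuition on the two possibilities, here is a hedged sketch (pure Python, a list standing in for the cursor; `paginate_page` is a hypothetical name, not the library's real internals) of what limit/offset pagination has to become at the DB layer — in a real driver the slice would be `cursor.skip(offset).limit(limit)`:

```python
# Hedged emulation of limit/offset pagination applied at the DB layer.
# `matching_docs` stands in for the result of db.node.find(query); if the
# slicing happens here (i.e. in MongoDB), only one page is ever fetched.

def paginate_page(matching_docs, limit, offset):
    """Return one page plus the metadata the API response exposes."""
    items = matching_docs[offset:offset + limit]  # models skip(offset).limit(limit)
    return {
        "items": items,
        "total": len(matching_docs),  # models the count_documents() call
        "limit": limit,
        "offset": offset,
    }

docs = [{"kind": "job", "seq": i} for i in range(10)]
page = paginate_page(docs, limit=1, offset=2)
print(page["items"], page["total"])  # [{'kind': 'job', 'seq': 2}] 10
```

The open question in this thread is exactly whether the real code does the slice inside MongoDB or only after fetching everything.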

@JenySadadia
Collaborator

Here is a summary of my further investigation:

  • The API uses the paginate method from the package to handle the /nodes endpoint
  • paginate uses resolve_params to get the limit and offset attributes provided as request query parameters.
    To confirm this, I updated the db.find_by_attributes function with some debug prints, as below:
async def find_by_attributes(self, model, attributes):
    """Find objects with matching attributes

    Find all objects with attributes matching the key/value pairs in the
    attributes dictionary using pagination and return the paginated
    response.
    The response dictionary will include 'items', 'total', 'limit',
    and 'offset' keys.
    """
    # Debug prints added for this investigation:
    print("attributes", attributes)
    col = self._get_collection(model)
    query = self._prepare_query(attributes)
    print("query", query)
    print("Params value:", api.params_value)
    print("Resolve params", resolve_params(None))
    # paginate is called twice here only so the page metadata can be printed
    page = await paginate(collection=col, query_filter=query)
    print("Limit:", page.limit, "offset:", page.offset)
    return await paginate(collection=col, query_filter=query)

For the request http://localhost:8001/nodes?kind=job&limit=1&offset=2, I got the following output:

kernelci-api          | attributes {'kind': 'job'}
kernelci-api          | query {'kind': 'job'}
kernelci-api          | Params value: <ContextVar name='params_value' at 0x783e88a26bb0>
kernelci-api          | Resolve params limit=1 offset=2
kernelci-api          | Limit: 1 offset: 2

kernelci-api          | await collection.count_documents(query_filter): 1755
kernelci-api          | await cursor.to_list(): [{'_id': ObjectId('665dc5b4170801c7478589da'), 'kind': 'job', 'name': 'kunit', 'path': ['checkout', 'kunit'], 'group': 'kunit', 'parent': ObjectId('665dc587170801c7478589d9'), 'state': 'done', 'result': 'incomplete', 'artifacts': {}, 'data': {'kernel_revision': {'tree': 'kernelci', 'url': 'https://github.com/kernelci/linux.git', 'branch': 'staging-mainline', 'commit': '8685538943a0000f7e97c4b516a4f66f0613aa0f', 'describe': 'staging-mainline-20240603.0', 'version': {'version': 6, 'patchlevel': 10, 'extra': '-rc2-1-g8685538943a00'}}, 'kernel_type': None, 'arch': None, 'defconfig': None, 'config_full': None, 'compiler': None, 'runtime': 'docker', 'platform': 'docker', 'error_code': 'submit_error', 'error_msg': '400 Client Error for http+docker://localhost/v1.43/containers/create: Bad Request ("create data/ssh/: "data/ssh/" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path")'}, 'debug': None, 'jobfilter': None, 'created': datetime.datetime(2024, 6, 3, 13, 31, 32, 334000), 'updated': datetime.datetime(2024, 6, 3, 13, 31, 36, 268000), 'timeout': datetime.datetime(2024, 6, 3, 19, 31, 32, 334000), 'holdoff': None, 'owner': 'test', 'submitter': '4dcc0f8e75710a735cf097c493edb5a0', 'user_groups': []}]

To summarize, fastapi-pagination does use skip and limit params at the DB layer, so we get paginated data directly.

@r-c-n

r-c-n commented Oct 2, 2024

Ooc, isn't it counterproductive that fastapi-pagination:paginate() calls count_documents on the query without the limit parameter to retrieve the total number of results before running the actual find query? What's the point? Maybe it won't be noticeable, but I doubt that that count is necessarily cached.
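To make the concern concrete, here is a sketch (an in-memory stand-in with a touch counter, not fastapi-pagination's actual code) contrasting a paginate that computes the exact total with one that skips it; with ~19M matching test nodes, the count is effectively an extra full scan of the matches on every request:

```python
class CountingCollection:
    """Hypothetical stand-in collection that records how many documents
    were touched, to compare the cost of the two strategies."""

    def __init__(self, docs):
        self.docs = docs
        self.touched = 0

    def count(self):
        """Models count_documents(query_filter): touches every match."""
        self.touched += len(self.docs)
        return len(self.docs)

    def page(self, limit, offset):
        """Models skip(offset).limit(limit): touches only one page."""
        items = self.docs[offset:offset + limit]
        self.touched += len(items)
        return items

col = CountingCollection([{"seq": i} for i in range(1_000_000)])
total = col.count()   # exact total: scans all 1,000,000 matches
items = col.page(1, 0)  # the page itself: 1 document
print(col.touched)    # 1000001 -- almost all the work went into the count
```

If the total is not actually needed by the caller, dropping (or estimating) the count would remove nearly all of the per-request cost in this model.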

@nuclearcat
Member Author

I did some experiments, and I increasingly suspect this pagination library is not efficient at all. I'm not sure what the best solution is; I guess we should discuss the test outcomes and results with @JenySadadia and decide on a solution.
Additional tests:
#552

Non-paginated API call:
time curl "https://staging.kernelci.org:9000/latest/rawnodes?kind=test&limit=1"
[{"_id":"65ae71cf1a0918beda7ad735","kind":"test","name":"kunit","path":["checkout","kunit"],"group":"kunit","parent":"65ae71691a0918beda7ad733","state":"done","result":null,"artifacts":{"tarball":"https://kciapistagingstorage1.file.core.windows.net/staging/linux-kernelci-staging-stable-staging-stable-20240122.5.tar.gz?sv=2022-11-02&ss=f&srt=sco&sp=r&se=2024-10-17T19:19:12Z&st=2023-10-17T11:19:12Z&spr=https&sig=sLmFlvZHXRrZsSGubsDUIvTiv%2BtzgDq6vALfkrtWnv8%3D"},"data":{"kernel_revision":{"tree":"kernelci","url":"https://github.com/kernelci/linux.git","branch":"staging-stable","commit":"800b835378ea1df495b8211c427f80a2ffcb1e6c","describe":"staging-stable-20240122.5","version":{"version":"5","patchlevel":"15","extra":"-20805-g800b835378ea"}},"runtime":"docker","platform":"docker","asdf":"asdf"},"debug":null,"jobfilter":null,"created":"2024-01-22T13:46:55.543000","updated":"2024-01-22T19:47:02.304000","timeout":"2024-01-22T19:46:55.543000","holdoff":null,"owner":"admin","submitter":null,"treeid":null,"user_groups":[]}]
real    0m0.029s
user    0m0.008s
sys    0m0.004s

And paginated:

curl "https://staging.kernelci.org:9000/latest/nodes?kind=test&limit=1"
{"items":[{"id":"65ae71cf1a0918beda7ad735","kind":"test","name":"kunit","path":["checkout","kunit"],"group":"kunit","parent":"65ae71691a0918beda7ad733","state":"done","result":null,"artifacts":{"tarball":"https://kciapistagingstorage1.file.core.windows.net/staging/linux-kernelci-staging-stable-staging-stable-20240122.5.tar.gz?sv=2022-11-02&ss=f&srt=sco&sp=r&se=2024-10-17T19:19:12Z&st=2023-10-17T11:19:12Z&spr=https&sig=sLmFlvZHXRrZsSGubsDUIvTiv%2BtzgDq6vALfkrtWnv8%3D"},"data":{"kernel_revision":{"tree":"kernelci","url":"https://github.com/kernelci/linux.git","branch":"staging-stable","commit":"800b835378ea1df495b8211c427f80a2ffcb1e6c","describe":"staging-stable-20240122.5","version":{"version":"5","patchlevel":"15","extra":"-20805-g800b835378ea"}},"runtime":"docker","platform":"docker","asdf":"asdf"},"debug":null,"jobfilter":null,"created":"2024-01-22T13:46:55.543000","updated":"2024-01-22T19:47:02.304000","timeout":"2024-01-22T19:46:55.543000","holdoff":null,"owner":"admin","submitter":null,"treeid":null,"user_groups":[]}],"total":19070890,"limit":1,"offset":0}
real    0m8.603s
user    0m0.017s
sys    0m0.000s

@JenySadadia
Collaborator

Ooc, isn't it counterproductive that fastapi-pagination:paginate() calls count_documents on the query without the limit parameter to retrieve the total number of results before running the actual find query? What's the point? Maybe it won't be noticeable, but I doubt that that count is necessarily cached.

The implementation of paginate and count documents has changed in the latest version.
Maybe we can try out the latest version in the API and check whether it improves performance.

@JenySadadia
Collaborator

This issue states that the latest version uses pydantic v2, which increased performance.
Other performance-related issues I found:
