
API limit restricts output to the user, but doesn't restrict the amount of data fetched from MongoDB #549

Open
nuclearcat opened this issue Oct 1, 2024 · 8 comments
Labels
bug Something isn't working critical techdebt Something that works for now, but should be done better

Comments

@nuclearcat
Member

If we do two queries:
curl "https://staging.kernelci.org:9000/latest/nodes?kind=test&limit=1"
real 0m5.994s
user 0m0.012s
sys 0m0.000s

curl "https://staging.kernelci.org:9000/latest/nodes?kind=checkout&limit=1"
real 0m0.023s
user 0m0.012s
sys 0m0.000s

We can notice a significant difference in response time, while both should be fast (kind is indexed).

On preliminary investigation, the reason is that the search query is submitted to MongoDB without the limit parameter, which produces a very large result set for some queries.
We need to resolve that, as it also causes excessive load on the database and RAM consumption.
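A minimal sketch of the problem described above, using a plain Python list as an in-memory stand-in for the node collection (the function names and document counts here are illustrative, not taken from the actual API code):

```python
# Illustrative stand-in for a MongoDB collection. The point: a limit applied
# *at query time* bounds the work done, while a limit applied only when
# rendering the response does not.

def find_no_limit(collection, query):
    """What the API effectively does today: materialize every matching
    document, then let the presentation layer trim the output."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

def find_with_limit(collection, query, limit):
    """What it should do: stop scanning once `limit` matches are found."""
    out = []
    for doc in collection:
        if all(doc.get(k) == v for k, v in query.items()):
            out.append(doc)
            if len(out) >= limit:
                break
    return out

# Simulated skew: many 'test' nodes, few 'checkout' nodes.
nodes = [{"kind": "test", "seq": i} for i in range(100_000)] + \
        [{"kind": "checkout", "seq": i} for i in range(10)]

all_tests = find_no_limit(nodes, {"kind": "test"})      # materializes 100,000 docs
one_test = find_with_limit(nodes, {"kind": "test"}, 1)  # stops after the first match
print(len(all_tests), len(one_test))  # 100000 1
```

With the real database this is the difference between streaming one document and shipping millions of test nodes through the driver before the response is trimmed.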

@nuclearcat nuclearcat added bug Something isn't working critical techdebt Something that works for now, but should be done better labels Oct 1, 2024
@JenySadadia
Collaborator

Based on an initial investigation, the API uses fastapi-pagination to query the database, and it does use limit and offset parameters while querying the DB.
Please see https://github.com/kernelci/kernelci-api/blob/main/api/db.py#L149.
I think the reason could be that kind=checkout was introduced later in development, whereas kind=test nodes have been in maestro from the very start. That's why there would be a huge difference in the amount of data of these two kinds in the DB.
Hence the response time difference between the queries.

@nuclearcat
Member Author

It does not pass the limit parameter, and if we talk to MongoDB directly it is way faster. For example, over the mongo CLI on staging, a similar query:

kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'test'}).limit(1000);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 360 ms, count: 1000

kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'checkout'}).limit(1000);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 402 ms, count: 1000

kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'test'}).limit(1);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 9 ms, count: 1


kernelci> {
...     var startTime = new Date();
...     cursor = db.node.find({ 'kind':'checkout'}).limit(1);
...     count = 0
...     while (cursor.hasNext()) {
...         count++;
...         cursor.next();
...     }
...     var endTime = new Date();
...     var timeTaken = endTime - startTime; // time taken in milliseconds
...     print("Time taken: " + timeTaken + " ms, count: " + count);
... }
Time taken: 4 ms, count: 1

As you can see, it is nowhere near 5 seconds; both kinds take about the same time (because the field is indexed), and a smaller limit also means a shorter execution time (approximately proportionally).

@r-c-n

r-c-n commented Oct 1, 2024

If you have a way to find out how the FastAPI queries are translated into MongoDB query calls, that'd be the fastest way to figure out for sure what's happening.

fastapi-pagination ends up doing this, but it's not clear to me how the skip and limit parameters are passed and where they're taken from. The docs say that you can paginate by limit and offset (obviously), and the API is used in the kernelci models, but it's not clear whether this is actually translated into the DB query or just used to present the results returned by the DB.
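For intuition on the two possibilities, here is a hedged sketch (pure Python, a list standing in for the cursor; `paginate_page` is a hypothetical name, not the library's real internals) of what limit/offset pagination has to become at the DB layer — in a real driver the slice would be `cursor.skip(offset).limit(limit)`:

```python
# Hedged emulation of limit/offset pagination applied at the DB layer.
# `matching_docs` stands in for the result of db.node.find(query); if the
# slicing happens here (i.e. in MongoDB), only one page is ever fetched.

def paginate_page(matching_docs, limit, offset):
    """Return one page plus the metadata the API response exposes."""
    items = matching_docs[offset:offset + limit]  # models skip(offset).limit(limit)
    return {
        "items": items,
        "total": len(matching_docs),  # models the count_documents() call
        "limit": limit,
        "offset": offset,
    }

docs = [{"kind": "job", "seq": i} for i in range(10)]
page = paginate_page(docs, limit=1, offset=2)
print(page["items"], page["total"])  # [{'kind': 'job', 'seq': 2}] 10
```

The open question in this thread is exactly whether the real code does the slice inside MongoDB or only after fetching everything.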

@JenySadadia
Collaborator

Here is a summary of my further investigation:

  • The API uses the paginate method from the package to handle the /nodes endpoint
  • paginate uses resolve_params to get the limit and offset attributes provided as request query parameters.
    To confirm this, I updated the db.find_by_attributes function with some debug prints, as below:
async def find_by_attributes(self, model, attributes):
    """Find objects with matching attributes

    Find all objects with attributes matching the key/value pairs in the
    attributes dictionary using pagination and return the paginated
    response.
    The response dictionary will include 'items', 'total', 'limit',
    and 'offset' keys.
    """
    # Debug prints added for this investigation:
    print("attributes", attributes)
    col = self._get_collection(model)
    query = self._prepare_query(attributes)
    print("query", query)
    print("Params value:", api.params_value)
    print("Resolve params", resolve_params(None))
    # paginate is called twice here only so the page metadata can be printed
    page = await paginate(collection=col, query_filter=query)
    print("Limit:", page.limit, "offset:", page.offset)
    return await paginate(collection=col, query_filter=query)

For the request http://localhost:8001/nodes?kind=job&limit=1&offset=2, I got the following output:

kernelci-api          | attributes {'kind': 'job'}
kernelci-api          | query {'kind': 'job'}
kernelci-api          | Params value: <ContextVar name='params_value' at 0x783e88a26bb0>
kernelci-api          | Resolve params limit=1 offset=2
kernelci-api          | Limit: 1 offset: 2

kernelci-api          | await collection.count_documents(query_filter): 1755
kernelci-api          | await cursor.to_list(): [{'_id': ObjectId('665dc5b4170801c7478589da'), 'kind': 'job', 'name': 'kunit', 'path': ['checkout', 'kunit'], 'group': 'kunit', 'parent': ObjectId('665dc587170801c7478589d9'), 'state': 'done', 'result': 'incomplete', 'artifacts': {}, 'data': {'kernel_revision': {'tree': 'kernelci', 'url': 'https://github.com/kernelci/linux.git', 'branch': 'staging-mainline', 'commit': '8685538943a0000f7e97c4b516a4f66f0613aa0f', 'describe': 'staging-mainline-20240603.0', 'version': {'version': 6, 'patchlevel': 10, 'extra': '-rc2-1-g8685538943a00'}}, 'kernel_type': None, 'arch': None, 'defconfig': None, 'config_full': None, 'compiler': None, 'runtime': 'docker', 'platform': 'docker', 'error_code': 'submit_error', 'error_msg': '400 Client Error for http+docker://localhost/v1.43/containers/create: Bad Request ("create data/ssh/: "data/ssh/" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path")'}, 'debug': None, 'jobfilter': None, 'created': datetime.datetime(2024, 6, 3, 13, 31, 32, 334000), 'updated': datetime.datetime(2024, 6, 3, 13, 31, 36, 268000), 'timeout': datetime.datetime(2024, 6, 3, 19, 31, 32, 334000), 'holdoff': None, 'owner': 'test', 'submitter': '4dcc0f8e75710a735cf097c493edb5a0', 'user_groups': []}]

To summarize, fastapi-pagination does use skip and limit params at the DB layer, so we get paginated data directly.

@r-c-n

r-c-n commented Oct 2, 2024

Ooc, isn't it counterproductive that fastapi-pagination:paginate() calls count_documents on the query without the limit parameter to retrieve the total number of results before running the actual find query? What's the point? Maybe it won't be noticeable, but I doubt that that count is necessarily cached.
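To make the concern concrete, here is a sketch (an in-memory stand-in with a touch counter, not fastapi-pagination's actual code) contrasting a paginate that computes the exact total with one that skips it; with ~19M matching test nodes, the count is effectively an extra full scan of the matches on every request:

```python
class CountingCollection:
    """Hypothetical stand-in collection that records how many documents
    were touched, to compare the cost of the two strategies."""

    def __init__(self, docs):
        self.docs = docs
        self.touched = 0

    def count(self):
        """Models count_documents(query_filter): touches every match."""
        self.touched += len(self.docs)
        return len(self.docs)

    def page(self, limit, offset):
        """Models skip(offset).limit(limit): touches only one page."""
        items = self.docs[offset:offset + limit]
        self.touched += len(items)
        return items

col = CountingCollection([{"seq": i} for i in range(1_000_000)])
total = col.count()   # exact total: scans all 1,000,000 matches
items = col.page(1, 0)  # the page itself: 1 document
print(col.touched)    # 1000001 -- almost all the work went into the count
```

If the total is not actually needed by the caller, dropping (or estimating) the count would remove nearly all of the per-request cost in this model.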

@nuclearcat
Member Author

I did some experiments, and I increasingly suspect this pagination library is not efficient at all. I'm not sure what the best solution is; I guess we should discuss the test outcomes and results with @JenySadadia and decide on a solution.
Additional tests:
#552

Non-paginated API call:
time curl "https://staging.kernelci.org:9000/latest/rawnodes?kind=test&limit=1"
[{"_id":"65ae71cf1a0918beda7ad735","kind":"test","name":"kunit","path":["checkout","kunit"],"group":"kunit","parent":"65ae71691a0918beda7ad733","state":"done","result":null,"artifacts":{"tarball":"https://kciapistagingstorage1.file.core.windows.net/staging/linux-kernelci-staging-stable-staging-stable-20240122.5.tar.gz?sv=2022-11-02&ss=f&srt=sco&sp=r&se=2024-10-17T19:19:12Z&st=2023-10-17T11:19:12Z&spr=https&sig=sLmFlvZHXRrZsSGubsDUIvTiv%2BtzgDq6vALfkrtWnv8%3D"},"data":{"kernel_revision":{"tree":"kernelci","url":"https://github.com/kernelci/linux.git","branch":"staging-stable","commit":"800b835378ea1df495b8211c427f80a2ffcb1e6c","describe":"staging-stable-20240122.5","version":{"version":"5","patchlevel":"15","extra":"-20805-g800b835378ea"}},"runtime":"docker","platform":"docker","asdf":"asdf"},"debug":null,"jobfilter":null,"created":"2024-01-22T13:46:55.543000","updated":"2024-01-22T19:47:02.304000","timeout":"2024-01-22T19:46:55.543000","holdoff":null,"owner":"admin","submitter":null,"treeid":null,"user_groups":[]}]
real    0m0.029s
user    0m0.008s
sys    0m0.004s

And paginated:

curl "https://staging.kernelci.org:9000/latest/nodes?kind=test&limit=1"
{"items":[{"id":"65ae71cf1a0918beda7ad735","kind":"test","name":"kunit","path":["checkout","kunit"],"group":"kunit","parent":"65ae71691a0918beda7ad733","state":"done","result":null,"artifacts":{"tarball":"https://kciapistagingstorage1.file.core.windows.net/staging/linux-kernelci-staging-stable-staging-stable-20240122.5.tar.gz?sv=2022-11-02&ss=f&srt=sco&sp=r&se=2024-10-17T19:19:12Z&st=2023-10-17T11:19:12Z&spr=https&sig=sLmFlvZHXRrZsSGubsDUIvTiv%2BtzgDq6vALfkrtWnv8%3D"},"data":{"kernel_revision":{"tree":"kernelci","url":"https://github.com/kernelci/linux.git","branch":"staging-stable","commit":"800b835378ea1df495b8211c427f80a2ffcb1e6c","describe":"staging-stable-20240122.5","version":{"version":"5","patchlevel":"15","extra":"-20805-g800b835378ea"}},"runtime":"docker","platform":"docker","asdf":"asdf"},"debug":null,"jobfilter":null,"created":"2024-01-22T13:46:55.543000","updated":"2024-01-22T19:47:02.304000","timeout":"2024-01-22T19:46:55.543000","holdoff":null,"owner":"admin","submitter":null,"treeid":null,"user_groups":[]}],"total":19070890,"limit":1,"offset":0}
real    0m8.603s
user    0m0.017s
sys    0m0.000s

@JenySadadia
Collaborator

Ooc, isn't it counterproductive that fastapi-pagination:paginate() calls count_documents on the query without the limit parameter to retrieve the total number of results before running the actual find query? What's the point? Maybe it won't be noticeable, but I doubt that that count is necessarily cached.

The implementation of paginate and count documents has changed in the latest version.
Maybe we can try out the latest version in the API and check whether it improves performance.

@JenySadadia
Collaborator

This issue states that the latest version uses pydantic v2, which increased performance.
Other performance-related issues I found:
