Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) #3706

Open
CodeToFreedom opened this issue Sep 13, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@CodeToFreedom
Copy link

During our migration tests to dragonflydb we came across a quite dangerous silent error which is quite hard to spot as incomplete result.
In short: The lua-script output of FT.SEARCH is cut off and therefore different from the normal FT.SEARCH output.
This could lead to potentially dangerous scenarios where users rely on the output to be correct which it is not.

Steps to reproduce:

-Create test data index:
FT.CREATE idx:testindex ON JSON PREFIX 1 testindex: SCHEMA $.pk AS pk TAG $.color AS color TAG

-Insert test data:

JSON.SET testindex:abc1 . "{\"pk\":\"abc1\",\"color\":\"blue\"}" 
JSON.SET testindex:abc2 . "{\"pk\":\"abc2\",\"color\":\"blue\"}"

Without EVAL/lua the FT.SEARCH command gives the correct result in dragonflydb + redis:

'FT.SEARCH' 'idx:testindex' '((@color:{blue}))' 'LIMIT' '0' '20'
1) "2"
2) "testindex:abc1"
3) 1) "$"
   2) "{\"color\":\"blue\",\"pk\":\"abc1\"}"
4) "testindex:abc2"
5) 1) "$"
   2) "{\"color\":\"blue\",\"pk\":\"abc2\"}"

Wrong output format and incomplete result:

The same command via EVAL/lua script is incorrect in dragonflydb though:

SCRIPT FLAGS f90904580109ee522dcbd414dd3103e5ecdccd6b disable-atomicity
EVAL "return redis.call('FT.SEARCH', 'idx:testindex', '((@color:{blue}))', 'LIMIT', '0', '20')" 0`

Wrong Result (also different output format):
"{\"color\":\"blue\",\"pk\":\"abc2\"}"

Expected result (equal to redis and regular df FT.SEARCH command):

The lua script result should be the exact same output that the FT.SEARCH command above gives.
Here the correct output of redis instance:

EVAL "return redis.call('FT.SEARCH', 'idx:testindex', '((@color:{blue}))', 'LIMIT', '0', '20')" 0
1) "2"
2) "testindex:abc1"
3) 1) "$"
   2) "{\"pk\":\"abc1\",\"color\":\"blue\"}"
4) "testindex:abc2"
5) 1) "$"
   2) "{\"pk\":\"abc2\",\"color\":\"blue\"}"

Dragonfly version used: Most recent version [v1.22.1]

Please let me know in case something is unclear and I try to help as good as possible.
Many thanks for your efforts.

@CodeToFreedom CodeToFreedom added the bug Something isn't working label Sep 13, 2024
@CodeToFreedom CodeToFreedom changed the title FT.SEARCH gives incomplete result when used with EVAL (Silent error + potentially dangerous) FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) Sep 13, 2024
@CodeToFreedom
Copy link
Author

@BagritsevichStepan Have you already found the reason why the FT.SEARCH call result is so unexpectedly different depending on whether it is called from lua or sent as direct command?

Many thanks for your efforts

@CodeToFreedom
Copy link
Author

Hi @BagritsevichStepan and @romange,
I am not very familiar with the codebase yet but have tried to have narrowed the reason down by likelihood of seeing the current output.

Looking through the code (didn't set it up in IDE to test it yet) I found that in doc_index.cc the search function uses the correct response class "SearchResult" .
I assume the regular FT.SEARCH command uses this implementation as it returns the correct type there when called directly.

doc_index.cc

(Click here to expand relevant code part of doc_index.cc)
SearchResult ShardDocIndex::Search(const OpArgs& op_args, const SearchParams& params,
                                   search::SearchAlgorithm* search_algo) const {
  auto& db_slice = op_args.GetDbSlice();
  auto search_results = search_algo->Search(&*indices_, params.limit_offset + params.limit_total);

  if (!search_results.error.empty())
    return SearchResult{facade::ErrorReply{std::move(search_results.error)}};

  SearchFieldsList fields_to_load =
      ToSV(params.ShouldReturnAllFields() ? params.load_fields : params.return_fields);

  vector<SerializedSearchDoc> out;
  out.reserve(search_results.ids.size());

  size_t expired_count = 0;
  for (size_t i = 0; i < search_results.ids.size(); i++) {
    auto key = key_index_.Get(search_results.ids[i]);
    auto it = db_slice.FindReadOnly(op_args.db_cntx, key, base_->GetObjCode());

    if (!it || !IsValid(*it)) {  // Item must have expired
      expired_count++;
      continue;
    }

    auto accessor = GetAccessor(op_args.db_cntx, (*it)->second);

    SearchDocData doc_data;
    if (params.ShouldReturnAllFields()) {
      /*
      In this case we need to load the whole document or loaded fields.
      For JSON indexes it would be {"$", <the whole document as string>}
      */
      doc_data = accessor->SerializeDocument(base_->schema);

      SearchDocData loaded_fields = accessor->Serialize(base_->schema, fields_to_load);
      doc_data.insert(std::make_move_iterator(loaded_fields.begin()),
                      std::make_move_iterator(loaded_fields.end()));
    } else {
      /* Load only specific fields */
      doc_data = accessor->Serialize(base_->schema, fields_to_load);
    }

    auto score = search_results.scores.empty() ? monostate{} : std::move(search_results.scores[i]);
    out.push_back(SerializedSearchDoc{string{key}, std::move(doc_data), std::move(score)});
  }

  return SearchResult{search_results.total - expired_count, std::move(out),
                      std::move(search_results.profile)};
}

This suggests the problem is likely in the data layer between the FtSearch implementation and lua.

The Problem suggests => FT.SEARCH is called correctly:

When we execute FT.SEARCH via a Lua script in DragonflyDB, the results aren't formatted correctly. Here's what we expect:

  • The first element should be the total count of results.
  • Following elements should be alternating pairs of document keys (IDs) and their corresponding data.

However, when we run the command through Lua, we're not getting this structure:

  • Expected Result: [2, "key1", {"data1": "value1"}, "key2", {"data2": "value2"}]
  • Actual Result: {"data2": "value2"}

This discrepancy suggests that:

  1. The count of documents is missing at the start of the result array.
  2. The structure where document keys and data alternate is not being preserved.
  3. =>Conclusion: The search command itself is executed correctly, just not returned or stored correctly.

Suggested Changes => Fix for data layer RedisTranslator:

Assuming the class RedisTranslator is the responsible layer to make the data correctly accessible from lua, I think this might be where it can be fixed.

1. Handling the Total Count:

  • Issue: The original RedisTranslator::OnArrayStart method wasn't setting the total count of documents at the beginning of the array.

  • Change:

    if (array_index_.empty()) {  // For top-level array
        lua_pushinteger(lua_, len / 2); // Push the count of documents (len/2 because of key-value pairs)
        lua_rawseti(lua_, -2, 1); // Set count as first element
        array_index_.push_back(2); // Start from 2 for document ID and data
    }
    

Explanation: This change ensures that when the FT.SEARCH command returns, the total number of documents (which is half the length of the result array due to key-value pairs) is set as the first element of the result table in Lua. This matches the expected format where the count is at the start.

2. Preserving the Structure:

  • Issue: The original code was setting all array elements sequentially, which didn't respect the alternating structure of keys and data.

  • Change:

    array_index_.push_back(2); // For the top-level array, start from 2 to leave room for the count
    

This COULD be be the root cause:
At least if RedisTranslator wasn't properly formatting the array to include the total count at the beginning and to maintain the alternating structure, it would result in the incorrect format we're seeing.

Disclaimer: I haven't tested the code and didn't set it up for local debugging yet.

Please let me know if my assumptions are correct or let me know a more likely reason for the output difference.

Thanks a lot

@BagritsevichStepan
Copy link
Contributor

BagritsevichStepan commented Nov 10, 2024

Hi @CodeToFreedom, thank you for reporting this. I was unable to reproduce the error. Does this error occur consistently?

@CodeToFreedom
Copy link
Author

CodeToFreedom commented Nov 11, 2024

Hi @BagritsevichStepan,
many thanks for looking into it. I checked this evening in different environments and saw the following behaviour:

DF Cloud ->Wrong behaviour

-On a fresh test df cloud instance the error (wrong output) occurs consistently.

DF local docker instance ->Wrong behaviour

On a fresh local docker instance (using the currently newest docker image 1.24.0) the error occurs consistently.

Local Dragonfly Source Build->CORRECT BEHAVIOUR (!!)

-When building from source and running it locally (newest version 1.25.0) it works correctly.
./dragonfly --alsologtostderr
(I think this changed with the new version 1.25.0. I tested it also locally with 1.22.0 but with this older version it didn't work at the time if I remember correctly.)

DB commands to reproduce:

FT.CREATE idx:testindex ON JSON PREFIX 1 testindex: SCHEMA $.pk AS pk TAG $.color AS color TAG
JSON.SET testindex:abc1 . "{\"pk\":\"abc1\",\"color\":\"blue\"}"
JSON.SET testindex:abc2 . "{\"pk\":\"abc2\",\"color\":\"blue\"}"

Direct FT.SEARCH command works correctly (see details above):
'FT.SEARCH' 'idx:testindex' '((@color:{blue}))' 'LIMIT' '0' '20'

Check the same via lua and it shows the bug:

SCRIPT FLAGS f90904580109ee522dcbd414dd3103e5ecdccd6b disable-atomicity
EVAL "return redis.call('FT.SEARCH', 'idx:testindex', '((@color:{blue}))', 'LIMIT', '0', '20')" 0

Conclusion

This leads to the conclusion that either something was fixed in v.1.25.0 or the internal state is handled differently on the local source. When docker v1.25.0 is released and/or deployed to DF cloud instances we'll know for sure whether it got resolved.

Btw, is the source build using a debug mode despite using --release flag during or otherwhise handling the state variables differently?
Side note: I found comments like this in the code which may indicate that states could be handled differently depending on the environment:
/interpreter.h

// Note: We leak the state for now.
  // Production code should not access this method.
  lua_State* lua() {
    return lua_;
  }

Please let me know if you continue to have problems reproducing the error and I send a video. (Hopefully the bug doesn't occur anymore in all environments when all environments are testable with 1.25.0)

@BagritsevichStepan
Copy link
Contributor

BagritsevichStepan commented Nov 11, 2024

I ran version 1.24 locally and caught a crash. With version 1.25, everything works fine. So I believe the issue was fixed in 1.25.

Please try version 1.25, and if the error still persists, please reopen the issue

@CodeToFreedom
Copy link
Author

Hi @BagritsevichStepan,
unfortunately the error still persists in the new docker release v.1.25.0 as you can see in the video below:
docker run --name dragonflydb --network=host --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly

df-lua-bug.mp4

Only when building from source the error is not present.
Therefore I suggest to reopen the issue.

Many thanks for your efforts.

@BagritsevichStepan
Copy link
Contributor

Thank you! Have you tried running EVAL in the Redis CLI instead of RedisInsight?

@CodeToFreedom
Copy link
Author

CodeToFreedom commented Nov 11, 2024

@BagritsevichStepan Thanks for reopening it.
As you can see here when built from source it works perfectly with the same test setup. I also see the same behaviour when running our tests that use redis client. So the RedisInsight is not the reason.

Video: The same tests run on locally built df-instance v1.25.0

df-lua-from-source.mp4

@BagritsevichStepan
Copy link
Contributor

BagritsevichStepan commented Nov 11, 2024

Yes, I can reproduce this bug

@CodeToFreedom
Copy link
Author

@adiholden @BagritsevichStepan Can you share the status of what you found out so far? It seems strange that the bug exists in the docker version and dragonflydb cloud instance but not in the locally built version.
Many thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants