FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) #3706

CodeToFreedom · 2024-09-13T21:02:50Z

During our migration tests to dragonflydb we came across a quite dangerous silent error which is quite hard to spot as incomplete result.
In short: The lua-script output of FT.SEARCH is cut off and therefore different from the normal FT.SEARCH output.
This could lead to potentially dangerous scenarios where users rely on the output to be correct which it is not.

Steps to reproduce:

-Create test data index:
FT.CREATE idx:testindex ON JSON PREFIX 1 testindex: SCHEMA $.pk AS pk TAG $.color AS color TAG

-Insert test data:

JSON.SET testindex:abc1 . "{\"pk\":\"abc1\",\"color\":\"blue\"}" 
JSON.SET testindex:abc2 . "{\"pk\":\"abc2\",\"color\":\"blue\"}"

Without EVAL/lua the FT.SEARCH command gives the correct result in dragonflydb + redis:

'FT.SEARCH' 'idx:testindex' '((@color:{blue}))' 'LIMIT' '0' '20'
1) "2"
2) "testindex:abc1"
3) 1) "$"
   2) "{\"color\":\"blue\",\"pk\":\"abc1\"}"
4) "testindex:abc2"
5) 1) "$"
   2) "{\"color\":\"blue\",\"pk\":\"abc2\"}"

Wrong output format and incomplete result:

The same command via EVAL/lua script is incorrect in dragonflydb though:

SCRIPT FLAGS f90904580109ee522dcbd414dd3103e5ecdccd6b disable-atomicity
EVAL "return redis.call('FT.SEARCH', 'idx:testindex', '((@color:{blue}))', 'LIMIT', '0', '20')" 0`

Wrong Result (also different output format):
"{\"color\":\"blue\",\"pk\":\"abc2\"}"

Expected result (equal to redis and regular df FT.SEARCH command):

The lua script result should be the exact same output that the FT.SEARCH command above gives.
Here the correct output of redis instance:

EVAL "return redis.call('FT.SEARCH', 'idx:testindex', '((@color:{blue}))', 'LIMIT', '0', '20')" 0
1) "2"
2) "testindex:abc1"
3) 1) "$"
   2) "{\"pk\":\"abc1\",\"color\":\"blue\"}"
4) "testindex:abc2"
5) 1) "$"
   2) "{\"pk\":\"abc2\",\"color\":\"blue\"}"

Dragonfly version used: Most recent version [v1.22.1]

Please let me know in case something is unclear and I try to help as good as possible.
Many thanks for your efforts.

The text was updated successfully, but these errors were encountered:

CodeToFreedom · 2024-10-14T20:06:06Z

@BagritsevichStepan Have you already found the reason why the FT.SEARCH call result is so unexpectedly different depending on whether it is called from lua or sent as direct command?

Many thanks for your efforts

CodeToFreedom · 2024-11-07T22:40:42Z

Hi @BagritsevichStepan and @romange,
I am not very familiar with the codebase yet but have tried to have narrowed the reason down by likelihood of seeing the current output.

Looking through the code (didn't set it up in IDE to test it yet) I found that in doc_index.cc the search function uses the correct response class "SearchResult" .
I assume the regular FT.SEARCH command uses this implementation as it returns the correct type there when called directly.

doc_index.cc

(Click here to expand relevant code part of doc_index.cc)

SearchResult ShardDocIndex::Search(const OpArgs& op_args, const SearchParams& params,
                                   search::SearchAlgorithm* search_algo) const {
  auto& db_slice = op_args.GetDbSlice();
  auto search_results = search_algo->Search(&*indices_, params.limit_offset + params.limit_total);

  if (!search_results.error.empty())
    return SearchResult{facade::ErrorReply{std::move(search_results.error)}};

  SearchFieldsList fields_to_load =
      ToSV(params.ShouldReturnAllFields() ? params.load_fields : params.return_fields);

  vector<SerializedSearchDoc> out;
  out.reserve(search_results.ids.size());

  size_t expired_count = 0;
  for (size_t i = 0; i < search_results.ids.size(); i++) {
    auto key = key_index_.Get(search_results.ids[i]);
    auto it = db_slice.FindReadOnly(op_args.db_cntx, key, base_->GetObjCode());

    if (!it || !IsValid(*it)) {  // Item must have expired
      expired_count++;
      continue;
    }

    auto accessor = GetAccessor(op_args.db_cntx, (*it)->second);

    SearchDocData doc_data;
    if (params.ShouldReturnAllFields()) {
      /*
      In this case we need to load the whole document or loaded fields.
      For JSON indexes it would be {"$", <the whole document as string>}
      */
      doc_data = accessor->SerializeDocument(base_->schema);

      SearchDocData loaded_fields = accessor->Serialize(base_->schema, fields_to_load);
      doc_data.insert(std::make_move_iterator(loaded_fields.begin()),
                      std::make_move_iterator(loaded_fields.end()));
    } else {
      /* Load only specific fields */
      doc_data = accessor->Serialize(base_->schema, fields_to_load);
    }

    auto score = search_results.scores.empty() ? monostate{} : std::move(search_results.scores[i]);
    out.push_back(SerializedSearchDoc{string{key}, std::move(doc_data), std::move(score)});
  }

  return SearchResult{search_results.total - expired_count, std::move(out),
                      std::move(search_results.profile)};
}

This suggests the problem is likely in the data layer between the FtSearch implementation and lua.

The Problem suggests => FT.SEARCH is called correctly:

When we execute FT.SEARCH via a Lua script in DragonflyDB, the results aren't formatted correctly. Here's what we expect:

The first element should be the total count of results.
Following elements should be alternating pairs of document keys (IDs) and their corresponding data.

However, when we run the command through Lua, we're not getting this structure:

Expected Result: [2, "key1", {"data1": "value1"}, "key2", {"data2": "value2"}]
Actual Result: {"data2": "value2"}

This discrepancy suggests that:

The count of documents is missing at the start of the result array.
The structure where document keys and data alternate is not being preserved.
=>Conclusion: The search command itself is executed correctly, just not returned or stored correctly.

Suggested Changes => Fix for data layer RedisTranslator:

Assuming the class RedisTranslator is the responsible layer to make the data correctly accessible from lua, I think this might be where it can be fixed.

1. Handling the Total Count:

Issue: The original RedisTranslator::OnArrayStart method wasn't setting the total count of documents at the beginning of the array.

Change:

if (array_index_.empty()) {  // For top-level array
    lua_pushinteger(lua_, len / 2); // Push the count of documents (len/2 because of key-value pairs)
    lua_rawseti(lua_, -2, 1); // Set count as first element
    array_index_.push_back(2); // Start from 2 for document ID and data
}

Explanation: This change ensures that when the FT.SEARCH command returns, the total number of documents (which is half the length of the result array due to key-value pairs) is set as the first element of the result table in Lua. This matches the expected format where the count is at the start.

2. Preserving the Structure:

Issue: The original code was setting all array elements sequentially, which didn't respect the alternating structure of keys and data.

Change:

array_index_.push_back(2); // For the top-level array, start from 2 to leave room for the count

This COULD be be the root cause:
At least if RedisTranslator wasn't properly formatting the array to include the total count at the beginning and to maintain the alternating structure, it would result in the incorrect format we're seeing.

Disclaimer: I haven't tested the code and didn't set it up for local debugging yet.

Please let me know if my assumptions are correct or let me know a more likely reason for the output difference.

Thanks a lot

BagritsevichStepan · 2024-11-10T14:17:14Z

Hi @CodeToFreedom, thank you for reporting this. I was unable to reproduce the error. Does this error occur consistently?

CodeToFreedom · 2024-11-11T00:26:44Z

Hi @BagritsevichStepan,
many thanks for looking into it. I checked this evening in different environments and saw the following behaviour:

DF Cloud ->Wrong behaviour

-On a fresh test df cloud instance the error (wrong output) occurs consistently.

DF local docker instance ->Wrong behaviour

On a fresh local docker instance (using the currently newest docker image 1.24.0) the error occurs consistently.

Local Dragonfly Source Build->CORRECT BEHAVIOUR (!!)

-When building from source and running it locally (newest version 1.25.0) it works correctly.
./dragonfly --alsologtostderr
(I think this changed with the new version 1.25.0. I tested it also locally with 1.22.0 but with this older version it didn't work at the time if I remember correctly.)

DB commands to reproduce:

FT.CREATE idx:testindex ON JSON PREFIX 1 testindex: SCHEMA $.pk AS pk TAG $.color AS color TAG
JSON.SET testindex:abc1 . "{\"pk\":\"abc1\",\"color\":\"blue\"}"
JSON.SET testindex:abc2 . "{\"pk\":\"abc2\",\"color\":\"blue\"}"

Direct FT.SEARCH command works correctly (see details above):
'FT.SEARCH' 'idx:testindex' '((@color:{blue}))' 'LIMIT' '0' '20'

Check the same via lua and it shows the bug:

SCRIPT FLAGS f90904580109ee522dcbd414dd3103e5ecdccd6b disable-atomicity
EVAL "return redis.call('FT.SEARCH', 'idx:testindex', '((@color:{blue}))', 'LIMIT', '0', '20')" 0

Conclusion

This leads to the conclusion that either something was fixed in v.1.25.0 or the internal state is handled differently on the local source. When docker v1.25.0 is released and/or deployed to DF cloud instances we'll know for sure whether it got resolved.

Btw, is the source build using a debug mode despite using --release flag during or otherwhise handling the state variables differently?
Side note: I found comments like this in the code which may indicate that states could be handled differently depending on the environment:
/interpreter.h

// Note: We leak the state for now.
  // Production code should not access this method.
  lua_State* lua() {
    return lua_;
  }

Please let me know if you continue to have problems reproducing the error and I send a video. (Hopefully the bug doesn't occur anymore in all environments when all environments are testable with 1.25.0)

BagritsevichStepan · 2024-11-11T13:24:46Z

I ran version 1.24 locally and caught a crash. With version 1.25, everything works fine. So I believe the issue was fixed in 1.25.

Please try version 1.25, and if the error still persists, please reopen the issue

CodeToFreedom · 2024-11-11T14:03:18Z

Hi @BagritsevichStepan,
unfortunately the error still persists in the new docker release v.1.25.0 as you can see in the video below:
docker run --name dragonflydb --network=host --ulimit memlock=-1 docker.dragonflydb.io/dragonflydb/dragonfly

df-lua-bug.mp4

Only when building from source the error is not present.
Therefore I suggest to reopen the issue.

Many thanks for your efforts.

BagritsevichStepan · 2024-11-11T14:07:23Z

Thank you! Have you tried running EVAL in the Redis CLI instead of RedisInsight?

CodeToFreedom · 2024-11-11T14:14:43Z

@BagritsevichStepan Thanks for reopening it.
As you can see here when built from source it works perfectly with the same test setup. I also see the same behaviour when running our tests that use redis client. So the RedisInsight is not the reason.

Video: The same tests run on locally built df-instance v1.25.0

df-lua-from-source.mp4

BagritsevichStepan · 2024-11-11T14:40:33Z

Yes, I can reproduce this bug

CodeToFreedom · 2024-11-26T20:40:37Z

@adiholden @BagritsevichStepan Can you share the status of what you found out so far? It seems strange that the bug exists in the docker version and dragonflydb cloud instance but not in the locally built version.
Many thanks

CodeToFreedom added the bug Something isn't working label Sep 13, 2024

CodeToFreedom changed the title ~~FT.SEARCH gives incomplete result when used with EVAL (Silent error + potentially dangerous)~~ FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) Sep 13, 2024

romange assigned BagritsevichStepan Sep 14, 2024

BagritsevichStepan closed this as completed Nov 11, 2024

BagritsevichStepan reopened this Nov 11, 2024

adiholden unassigned BagritsevichStepan Nov 14, 2024

romange assigned BagritsevichStepan and unassigned BagritsevichStepan Nov 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) #3706

FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) #3706

CodeToFreedom commented Sep 13, 2024

CodeToFreedom commented Oct 14, 2024

CodeToFreedom commented Nov 7, 2024

BagritsevichStepan commented Nov 10, 2024 •

edited

Loading

CodeToFreedom commented Nov 11, 2024 •

edited

Loading

BagritsevichStepan commented Nov 11, 2024 •

edited

Loading

CodeToFreedom commented Nov 11, 2024

BagritsevichStepan commented Nov 11, 2024

CodeToFreedom commented Nov 11, 2024 •

edited

Loading

BagritsevichStepan commented Nov 11, 2024 •

edited

Loading

CodeToFreedom commented Nov 26, 2024

FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) #3706

FT.SEARCH gives incomplete result when used with EVAL (Silent error, potentially hard to spot, but hopefully easy to fix) #3706

Comments

CodeToFreedom commented Sep 13, 2024

Steps to reproduce:

Wrong output format and incomplete result:

Expected result (equal to redis and regular df FT.SEARCH command):

CodeToFreedom commented Oct 14, 2024

CodeToFreedom commented Nov 7, 2024

The Problem suggests => FT.SEARCH is called correctly:

Suggested Changes => Fix for data layer RedisTranslator:

BagritsevichStepan commented Nov 10, 2024 • edited Loading

CodeToFreedom commented Nov 11, 2024 • edited Loading

DF Cloud ->Wrong behaviour

DF local docker instance ->Wrong behaviour

Local Dragonfly Source Build->CORRECT BEHAVIOUR (!!)

DB commands to reproduce:

Check the same via lua and it shows the bug:

Conclusion

BagritsevichStepan commented Nov 11, 2024 • edited Loading

CodeToFreedom commented Nov 11, 2024

BagritsevichStepan commented Nov 11, 2024

CodeToFreedom commented Nov 11, 2024 • edited Loading

BagritsevichStepan commented Nov 11, 2024 • edited Loading

CodeToFreedom commented Nov 26, 2024

BagritsevichStepan commented Nov 10, 2024 •

edited

Loading

CodeToFreedom commented Nov 11, 2024 •

edited

Loading

BagritsevichStepan commented Nov 11, 2024 •

edited

Loading

CodeToFreedom commented Nov 11, 2024 •

edited

Loading

BagritsevichStepan commented Nov 11, 2024 •

edited

Loading