
Different Caikit server instances return different protos #237

Open
kpouget opened this issue Oct 17, 2023 · 8 comments
Comments

kpouget commented Oct 17, 2023

Describe the bug

As part of my automated testing, I observed that some requests were returning very quickly.

Investigations highlighted that min_new_tokens: 0 was the reason for the quick return:

  • max_new_tokens: 25, min_new_tokens: 0
  • Request generated 5 tokens before EosToken

even though the query was sent with max_new_tokens == min_new_tokens.

Further investigations led to this reproducer:

HOST_1=flan-t5-small-cpu-1-predictor-watsonx-e2e-0.apps.20231017-06h53-watsonx-ci-kpouget.psap.aws.rhperfscale.org:443
HOST_2=flan-t5-small-cpu-2-predictor-watsonx-e2e-1.apps.20231017-06h53-watsonx-ci-kpouget.psap.aws.rhperfscale.org:443

get_proto() {
    grpcurl -insecure "$1" describe caikit.runtime.Nlp.TextGenerationTaskRequest
}

get_proto "$HOST_1" > 1.log
get_proto "$HOST_2" > 2.log

echo diff
diff --side-by-side <(get_proto "$HOST_1") <(get_proto "$HOST_2")

(attached: 1.log and 2.log; screenshot of the side-by-side diff)

which highlights that the protos returned by two endpoints running the same image are different.

Image is quay.io/opendatahub/caikit-tgis-serving@sha256:794adc22d52cb3ac4b5aadfb286e8431cca829acdc4909719329cf8c4fabb4ec

Platform

Caikit packages in this image have these versions:

caikit                  0.19.3
caikit-nlp              0.0.1                   /caikit/src/caikit-nlp
caikit-tgis-backend     0.1.18

Python 3.9

Sample Code

See above.

The invalid launch happens ~50% of the time, from what I observed.

Expected behavior

The protos are always the same.

Observed behavior

The protos do not have the same field ordering.
No error is printed anywhere.

Additional info

The location of this block (+ the field numbering) is the key difference between the "different versions" of the protos:

  oneof _preserve_input_text {
    bool preserve_input_text = 15;
  }
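For context, here is a minimal sketch (not caikit's actual code) of how this kind of nondeterminism can arise: if proto field numbers are assigned while iterating an unordered collection such as a set, the resulting order depends on Python's per-process string-hash randomization, so two server instances can number the same fields differently. Sorting the names first removes the dependence on hash order. The field names below are taken from this request type; everything else is illustrative.

```python
# Hypothetical illustration (not caikit's code): set iteration order can vary
# across interpreter processes because string hashing is randomized per
# process (PYTHONHASHSEED); a sorted view is stable everywhere.
import os
import subprocess
import sys

SNIPPET = """\
fields = {'text', 'max_new_tokens', 'min_new_tokens', 'preserve_input_text'}
print(','.join(fields))          # unordered: may differ per process
print(','.join(sorted(fields)))  # sorted: identical in every process
"""

def run_with_seed(seed):
    # Run the snippet in a fresh interpreter with a fixed hash seed.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run([sys.executable, "-c", SNIPPET],
                         env=env, capture_output=True, text=True, check=True)
    unordered, ordered = out.stdout.strip().splitlines()
    return unordered, ordered

results = [run_with_seed(seed) for seed in range(8)]
# The unordered line often differs between seeds; the sorted line never does.
print("distinct unordered orders:", len({u for u, _ in results}))
print("distinct sorted orders:", len({o for _, o in results}))  # 1
```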

kpouget commented Oct 18, 2023

I have a simpler reproducer!

#! /bin/bash

TARGET_PROTO=serverstreamingtextgenerationtaskrequest.proto
TRIES=10

mkdir -p /tmp/proto.ref /tmp/proto.cmp

echo "Generating the reference protos ..."
RUNTIME_LIBRARY=caikit_nlp python -m caikit.runtime.dump_services /tmp/proto.ref 2>/dev/null

target_proto_sum=$(sha256sum < "/tmp/proto.ref/$TARGET_PROTO")

found=0
for i in $(seq "$TRIES"); do
    echo "Generating some new protos (try #${i}) ..."
    RUNTIME_LIBRARY=caikit_nlp python -m caikit.runtime.dump_services /tmp/proto.cmp 2>/dev/null
    echo "Comparing ..."
    current_proto_sum=$(sha256sum < "/tmp/proto.cmp/$TARGET_PROTO")
    if [[ "$current_proto_sum" == "$target_proto_sum" ]]; then
        echo "Identical ..."
        continue
    fi
    found=1
    cat <<EOF
Found different $TARGET_PROTO protos at try #${i}:
/tmp/proto.ref/$TARGET_PROTO --> $target_proto_sum
/tmp/proto.cmp/$TARGET_PROTO --> $current_proto_sum
EOF
    if which diff >/dev/null 2>&1; then # diff isn't available in the container image ...
        diff "/tmp/proto.ref/$TARGET_PROTO" "/tmp/proto.cmp/$TARGET_PROTO"
    fi
    break
done

if [[ "$found" == 0 ]]; then
    echo "Didn't find any different proto :/"
    exit 1
fi
sh-5.1$ bash try.sh
Generating the reference protos ...
Generating some new protos (try #{i}) ...
Comparing ...
Generating some new protos (try #{i}) ...
Comparing ...
Generating some new protos (try #{i}) ...
Comparing ...
Generating some new protos (try #{i}) ...
Comparing ...
Generating some new protos (try #{i}) ...
Comparing ...
Found different serverstreamingtextgenerationtaskrequest.proto protos at try 5:
/tmp/proto.ref/serverstreamingtextgenerationtaskrequest.proto --> 55befeea6899aefdc0af416eab2fc8e4a522917540827a2648ae5a83d00fbe11  -
/tmp/proto.cmp/serverstreamingtextgenerationtaskrequest.proto --> 0a843c0e74fb8a91fb432b060a726ca6afa464b9dcb4b460c65c4fe545aa4a35  -
sh-5.1$ bash try.sh
Generating the reference protos ...
Generating some new protos (try #1) ...
Comparing ...
Found different serverstreamingtextgenerationtaskrequest.proto protos at try #1:
/tmp/proto.ref/serverstreamingtextgenerationtaskrequest.proto --> 0a843c0e74fb8a91fb432b060a726ca6afa464b9dcb4b460c65c4fe545aa4a35  -
/tmp/proto.cmp/serverstreamingtextgenerationtaskrequest.proto --> 55befeea6899aefdc0af416eab2fc8e4a522917540827a2648ae5a83d00fbe11  -

(screenshot of the proto diff)

Again, the location of this field is the issue:

  optional bool preserve_input_text = 2; // or 15
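A minimal sketch of the kind of fix this points to (hypothetical names, not the actual caikit implementation): assigning field numbers from a sorted view of the field names makes the numbering independent of the order in which the fields happen to be iterated.

```python
# Hypothetical sketch (not caikit's code): number fields from a sorted view
# so every process emits the same numbering, whatever the input order.
def assign_field_numbers(field_names):
    return {name: number
            for number, name in enumerate(sorted(field_names), start=1)}

# The same fields, seen in two different orders, get identical numbers:
a = assign_field_numbers({"preserve_input_text", "text", "max_new_tokens"})
b = assign_field_numbers(["text", "max_new_tokens", "preserve_input_text"])
print(a == b)  # True
print(a["preserve_input_text"])  # 2
```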

kpouget added a commit to kpouget/topsail that referenced this issue Oct 18, 2023
…roto'

workaround for caikit/caikit-nlp#237

so that llm-load-test always runs with the right protos
@kpouget changed the title from "Same Caikit server returns different protos" to "Different Caikit server instances return different protos" Oct 18, 2023
@gkumbhat (Collaborator)

@kpouget is this happening on current main branch as well? This was fixed with this PR


kpouget commented Oct 18, 2023

@kpouget is this happening on current main branch as well? This was fixed with this PR

I don't really know, I've tested it only in my OpenShift cluster with the stable image of
https://github.com/opendatahub-io/caikit-tgis-serving/tree/main

It should be easy to check with the reproducer above.

@gabe-l-hart (Contributor)

Ok, looks like the version of caikit-nlp is set here as 6f099312d65a15dc20952abfe77d3990a83669a3 which is 0.2.1. This is the release directly before the fix was merged, so to fix the current problem, we should simply update the version in the RH build to a newer release.


kpouget commented Oct 23, 2023

Thanks for the updates, I can't find any problematic difference now :)

The only thing I noticed is that the Services still have a random order. Two examples:
(two screenshots of the dumped service definitions)

I think this last bit will be solved by caikit/caikit#535, @evaline-ju could you confirm?

@evaline-ju (Collaborator)

@aluu317 worked on caikit/caikit#535 - could you provide any insight on the above? I think I was tagged by mistake


kpouget commented Oct 23, 2023

@aluu317 worked on caikit/caikit#535 - could you provide any insight on the above? I think I was tagged by mistake

Oops, I got it wrong with the auto-completion, sorry Evaline!


aluu317 commented Oct 23, 2023

@kpouget Ah, unfortunately, the PR caikit/caikit#535 will not fix the ordering of RPCs; it was meant to fix the ordering of the fields in your original bug.
But the ordering of RPCs should also be a quick fix, which I can do next.
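A minimal sketch of what such an RPC-ordering fix could look like (hypothetical, not the actual caikit code; the request/response types are placeholders): sorting the RPC names before rendering the service makes the generated definition byte-for-byte stable across runs.

```python
# Hypothetical sketch (not caikit's code): emit service RPCs in sorted order
# so the rendered service definition does not depend on input iteration order.
def render_service(name, rpc_names):
    lines = [f"service {name} {{"]
    for rpc in sorted(rpc_names):  # deterministic RPC ordering
        lines.append(f"  rpc {rpc} (Request) returns (Response);")
    lines.append("}")
    return "\n".join(lines)

one = render_service("NlpService", {"TextGenerationTaskPredict",
                                    "ServerStreamingTextGenerationTaskPredict"})
two = render_service("NlpService", ["ServerStreamingTextGenerationTaskPredict",
                                    "TextGenerationTaskPredict"])
print(one == two)  # True: input order no longer leaks into the output
```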
