
Adds tokenization task for embeddings #365

Merged

2 commits merged into caikit:main from the tokenize branch on Jul 24, 2024
Conversation

@flaviabeo (Contributor) commented on Jun 27, 2024:

This implements the TokenizeTask for the Embeddings module. It works for all models of type SentenceTransformer.

cc: @mynhardtburger @markstur
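
For orientation, here is a rough sketch of the whole method, assembled from the fragments quoted in the review threads below. The method name `tokenize` and the zero-width-span filter that builds `mapping` are assumptions; `Token` and `TokenizationResults` are the caikit data model types this thread references.

```python
def tokenize(self, text: str) -> TokenizationResults:
    """Tokenize the input text.

    Returns:
        TokenizationResults
            The token count
    """
    # Fast tokenizers can return (start, end) character offsets per token
    result = self.model.tokenizer.encode_plus(text, return_offsets_mapping=True)

    # Assumption: drop zero-width spans (special tokens such as [CLS] map to (0, 0))
    mapping = [span for span in result.offset_mapping if span[1] - span[0] > 0]

    tokens = [Token(start=i[0], end=i[1], text=text[i[0] : i[1]]) for i in mapping]
    return TokenizationResults(token_count=len(result.input_ids), results=tokens)
```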

Review thread on caikit_nlp/modules/text_embedding/embedding.py:

```python
            TokenizationResults
                The token count
        """
        result = self.model.tokenizer.encode_plus(text, return_offsets_mapping=True)
```
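
As a reference point, this is roughly what `encode_plus` returns with `return_offsets_mapping=True` (the model name below is illustrative; any fast Hugging Face tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Illustrative model; any SentenceTransformer checkpoint with a fast tokenizer works
tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = tok.encode_plus("hello world", return_offsets_mapping=True)
print(enc.input_ids)       # e.g. [101, 7592, 2088, 102]
print(enc.offset_mapping)  # e.g. [(0, 0), (0, 5), (6, 11), (0, 0)]; specials get (0, 0)
```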

@evaline-ju (Collaborator) left a comment:

Changes themselves mainly LGTM, but let's add a small unit test to make sure we get expected results and prevent possible future breakages.
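
A minimal sketch of such a test; the fixture name `loaded_model` and the method name `tokenize` are hypothetical, and the assertions only use fields shown in this thread:

```python
def test_tokenize_has_offsets(loaded_model):
    # "loaded_model" and "tokenize" are hypothetical names for the embedding
    # module under test and its new tokenization entry point
    text = "Hello world"
    result = loaded_model.tokenize(text)
    assert result.token_count > 0
    # Each token's (start, end) span should slice back to that token's text
    assert all(t.text == text[t.start : t.end] for t in result.results)
```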

caikit_nlp/modules/text_embedding/embedding.py (outdated; resolved)
@flaviabeo force-pushed the tokenize branch 4 times, most recently from 5fd8f44 to 0822f3f on July 24, 2024 19:25
@markstur (Contributor) left a comment:

Merge caused an indexing miss. See inline.

```python
        ]
        tokens = [Token(start=i[0], end=i[1], text=text[i[0] : i[1]]) for i in mapping]

        return TokenizationResults(token_count=len(result.input_ids), results=tokens)
```
@markstur (Contributor) commented inline:

This needs index zero because we wrapped the text in a list for _get_tokenized().

Suggested change:

```diff
-        return TokenizationResults(token_count=len(result.input_ids), results=tokens)
+        return TokenizationResults(token_count=len(result.input_ids[0]), results=tokens)
```
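
To illustrate the shape difference (model name again illustrative): encoding a bare string gives a flat `input_ids` list, while encoding a one-element batch gives a list of lists, so `len()` without the index returns the batch size instead of the token count.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

flat = tok.encode_plus("hello world")
print(len(flat.input_ids))        # token count, e.g. 4

batched = tok.batch_encode_plus(["hello world"])
print(len(batched.input_ids))     # 1 -- the batch size, not the token count
print(len(batched.input_ids[0]))  # token count again, matching the flat case
```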

tests/modules/text_embedding/test_embedding.py (outdated; resolved)
flaviabeo and others added 2 commits on July 24, 2024 17:58:

Signed-off-by: Flavia Beo <[email protected]>
Co-authored-by: Evaline Ju <[email protected]>
Signed-off-by: Flávia Béo <[email protected]>
@evaline-ju (Collaborator) left a comment:

LGTM!

@evaline-ju dismissed gkumbhat's stale review on July 24, 2024 21:07:

Dismissing on maintainer availability; the requested change appears addressed.

@evaline-ju merged commit 1499cab into caikit:main on Jul 24, 2024. 5 checks passed.