Custom Tokenizer for c-tf-idf representations #2

Open
eichinflo opened this issue Jan 8, 2025 · 1 comment

Comments

@eichinflo
Collaborator

The current version does not allow custom tokenizers to be passed to the SCA class. Like semantic_components.representation.CTFIDFRepresenter, it should take a tokenizer argument (and, when one is given, ignore the language argument, which is currently used to infer the tokenizer).

The tokenizer passed in should behave like semantic_components.representation.GenericTokenizer and implement at least a tokenize method and a __call__ method.
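To make the requested interface concrete, here is a minimal sketch of such a tokenizer. The class name `RegexTokenizer`, its constructor parameters, and the assumption that both methods take a single string and return a list of string tokens are illustrative only; the exact signatures expected by GenericTokenizer are not spelled out in this issue.

```python
import re
from typing import List


class RegexTokenizer:
    """Hypothetical custom tokenizer exposing tokenize() and __call__().

    Assumption: both methods take one string and return a list of tokens.
    """

    def __init__(self, pattern: str = r"\w+", lowercase: bool = True):
        # Compile once so repeated calls over a corpus stay cheap.
        self._pattern = re.compile(pattern)
        self._lowercase = lowercase

    def tokenize(self, text: str) -> List[str]:
        # Optionally normalize case, then collect every regex match as a token.
        if self._lowercase:
            text = text.lower()
        return self._pattern.findall(text)

    def __call__(self, text: str) -> List[str]:
        # Calling the object is equivalent to calling tokenize().
        return self.tokenize(text)


tokens = RegexTokenizer()("Custom tokenizers for c-TF-IDF!")
print(tokens)  # ['custom', 'tokenizers', 'for', 'c', 'tf', 'idf']
```

An instance of a class like this is what would be handed over through the tokenizer argument, for example `SCA(tokenizer=RegexTokenizer())`, with any other constructor arguments left as they are.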

eichinflo added a commit that referenced this issue Jan 8, 2025
decomposition.ClusterDecomposer.get_component_repr
method to avoid errors when no components are found.

Addressing issue #2: We've added functionality to the
`SCA` initialization method to allow for custom tokenizers.
We've also added corresponding test cases and notes in the README.
@eichinflo
Collaborator Author

You can now pass a tokenizer argument to the initialization method of SCA. Let me know if you run into any problems with it. It works for the test case I created.
