Document segmenter options #183

Open
anne17 opened this issue Dec 13, 2023 · 1 comment
Labels: documentation (Improvements or additions to documentation), fixed-unreleased (Issue that has been fixed and is waiting to be released)

Comments

@anne17
Member

anne17 commented Dec 13, 2023

Right now the different options for the available segmenters can only be found in the code:

whitespace=nltk.WhitespaceTokenizer
linebreaks=LinebreakTokenizer
blanklines=nltk.BlanklineTokenizer
punkt_sentence=nltk.PunktSentenceTokenizer
punctuation=PunctuationTokenizer
better_word=BetterWordTokenizer
crf_tokenizer=CRFTokenizer
simple_word_punkt=nltk.WordPunctTokenizer
fsv_paragraph=FSVParagraphSplitter

They should be visible to the user somehow!
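
One possible way to do this is to generate the user-facing listing from the same mapping that the code already uses, so the documentation cannot drift from the implementation. The sketch below is only an illustration under stated assumptions: the three classes are simplified stand-ins, not the real tokenizers, and `SEGMENTERS` and `describe_segmenters` are hypothetical names, not confirmed parts of the project's API.

```python
"""Sketch: derive a user-facing option listing from a segmenter registry."""
import re


# Simplified stand-in classes (assumptions), not the project's real tokenizers.
class WhitespaceSegmenter:
    """Split on runs of whitespace."""

    def segment(self, text):
        return text.split()


class LinebreakSegmenter:
    """Split on line breaks, dropping empty lines."""

    def segment(self, text):
        return [line for line in text.splitlines() if line.strip()]


class BlanklineSegmenter:
    """Split on blank lines."""

    def segment(self, text):
        return [part for part in re.split(r"\n\s*\n", text) if part.strip()]


# Hypothetical registry; the keys mirror three of the option names listed above.
SEGMENTERS = {
    "whitespace": WhitespaceSegmenter,
    "linebreaks": LinebreakSegmenter,
    "blanklines": BlanklineSegmenter,
}


def describe_segmenters(registry):
    """Return one 'name: description' line per registered segmenter."""
    lines = []
    for name, cls in sorted(registry.items()):
        # Use the first docstring line as the short description.
        doc = (cls.__doc__ or "No description.").strip().splitlines()[0]
        lines.append(f"{name}: {doc}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_segmenters(SEGMENTERS))
```

Because the descriptions come from the class docstrings, adding a new entry to the mapping would make it show up in the listing without a separate documentation step.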

@anne17 anne17 added the documentation Improvements or additions to documentation label Dec 13, 2023
@MartinHammarstedt MartinHammarstedt self-assigned this Feb 7, 2024
@MartinHammarstedt
Member

They are now listed when running sparv modules segment (and included in the schema).

@MartinHammarstedt MartinHammarstedt added the fixed-unreleased Issue that has been fixed and is waiting to be released label Feb 7, 2024