Sindhi tokenization data from ISRA
A collection of text files, with token and sentence boundaries marked in the tkns_ and stns_ files respectively.
A tool in Stanza, convert_text_files.py, processes this data into a CoNLL-style format suitable for training a tokenizer.
(The other annotations are left blank.)
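
To illustrate what a tokenizer-only CoNLL-style file with blank annotations can look like, here is a minimal sketch. It is not the code in convert_text_files.py and it does not read the tkns_ or stns_ files, whose exact layout is not described here. It assumes the tokens have already been loaded as a list of sentences, that the target is the ten-column CoNLL-U layout, and that unannotated columns are written as underscores. The function name sentences_to_conllu and the placeholder tokens are hypothetical.

```python
def sentences_to_conllu(sentences):
    """Render tokenized sentences as CoNLL-U text with blank annotations.

    `sentences` is a list of sentences, each a list of token strings.
    This is an illustrative helper, not part of Stanza.
    """
    blocks = []
    for sentence in sentences:
        rows = []
        for index, token in enumerate(sentence, start=1):
            # CoNLL-U has 10 tab-separated columns: ID, FORM, and 8 more.
            # Only ID and FORM are filled; the rest are "_" (unannotated).
            columns = [str(index), token] + ["_"] * 8
            rows.append("\t".join(columns))
        blocks.append("\n".join(rows))
    if not blocks:
        return ""
    # Sentences are separated by a blank line; the file ends with one too.
    return "\n\n".join(blocks) + "\n\n"


if __name__ == "__main__":
    # Placeholder tokens stand in for real data.
    example = [["tok1", "tok2", "tok3"], ["tok4", "tok5"]]
    print(sentences_to_conllu(example), end="")
```

Each token becomes one row, token IDs restart at 1 for every sentence, and the blank line between blocks is what carries the sentence boundary.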