
Body123/Arabic-Punctuation-Prediction

 
 


AraPunc dataset:

AraPunc is built by pre-processing Tashkeela, an Arabic diacritization corpus. We keep six classes: space ‘0’ (no punctuation mark), full stop ‘.’, comma ‘,’, colon ‘:’, semicolon ‘;’, and question mark ‘?’.

The following table shows the distribution of punctuation classes in the AraPunc dataset:
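
As a rough illustration of the six-class scheme, the sketch below turns punctuated text into (word, label) pairs, where each label is the mark that follows the word, or ‘0’ when none does. The function name `label_tokens`, the whitespace tokenization, and the rule of keeping only the first trailing mark are assumptions made for this example; the actual AraPunc pre-processing may differ.

```python
# Hypothetical labeling helper; not taken from the AraPunc pre-processing code.
PUNCT_CLASSES = ".,:;?"   # the five punctuation classes listed above
NO_PUNCT = "0"            # label for "no punctuation after this word"


def label_tokens(text):
    """Return (word, label) pairs; label is the mark following the word."""
    pairs = []
    for token in text.split():
        # Separate the word from any punctuation attached to its end.
        word = token.rstrip(PUNCT_CLASSES)
        trailing = token[len(word):]
        if not word:
            # Token is only punctuation: attach it to the previous word
            # if that word has no label yet.
            if pairs and pairs[-1][1] == NO_PUNCT:
                pairs[-1] = (pairs[-1][0], trailing[0])
            continue
        # Assumption: keep only the first trailing mark.
        pairs.append((word, trailing[0] if trailing else NO_PUNCT))
    return pairs
```

Because the helper only splits on whitespace and strips trailing marks, it works the same way on Arabic or Latin-script tokens.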

Label     Train      Dev     Test
,       1756058   309118   514741
.        638133   112409   187367
?         51798     9193    15448
0      33639104  5923672  9888211
:        939876   165549   275918
;        233479    40846    67756
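
The counts above are heavily skewed toward the ‘0’ class, which accounts for about 90% of the training labels. A minimal sketch for computing each class's share of a split is shown below; the names `COUNTS` and `class_shares` are illustrative and not part of the repository.

```python
# Label counts copied from the distribution table: (train, dev, test).
COUNTS = {
    ",": (1756058, 309118, 514741),
    ".": (638133, 112409, 187367),
    "?": (51798, 9193, 15448),
    "0": (33639104, 5923672, 9888211),
    ":": (939876, 165549, 275918),
    ";": (233479, 40846, 67756),
}
SPLITS = ("train", "dev", "test")


def class_shares(split):
    """Return {label: fraction of all labels} for the given split name."""
    idx = SPLITS.index(split)
    total = sum(row[idx] for row in COUNTS.values())
    return {label: row[idx] / total for label, row in COUNTS.items()}


if __name__ == "__main__":
    for split in SPLITS:
        shares = class_shares(split)
        line = ", ".join(f"{label!r}: {share:.2%}" for label, share in shares.items())
        print(f"{split}: {line}")
```

Shares like these are commonly used to judge whether a classifier needs class weighting or a metric other than plain accuracy.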

You can download the dataset from HERE

About

A model that predicts the punctuation of English, Italian, French and German texts.


Languages

  • Python 93.3%
  • Shell 6.7%