
Tokenization problem #3

Open
mcavdar opened this issue Mar 1, 2018 · 4 comments

mcavdar commented Mar 1, 2018

The spaCy models should be adapted to the medical corpus: the default tokenizer over-splits regulatory codes. For example:
tokens['train'][0:10]: [['EMEA', '/', 'H', '/', 'C', '/', '551', 'PRIALT']...
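
One possible direction, sketched below on the assumption that a stock spaCy model is in use, is to customize the tokenizer through spaCy's documented hooks rather than post-processing the tokens. The model name `fr_core_news_sm`, the slash-filtering heuristic, and the expected output are assumptions, not the fix actually used in this repo:

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

nlp = spacy.load("fr_core_news_sm")  # assumed model; replace with the one used here

# Option 1: register an exact special case so this particular code is never split.
nlp.tokenizer.add_special_case("EMEA/H/C/551", [{ORTH: "EMEA/H/C/551"}])

# Option 2 (coarser): drop every default infix pattern that mentions "/",
# so slashes between letters/digits no longer trigger a split at all.
infixes = [p for p in nlp.Defaults.infixes if "/" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("EMEA/H/C/551 PRIALT")])
# expected (under these assumptions): ['EMEA/H/C/551', 'PRIALT']
```

Option 1 only covers codes that are listed explicitly; option 2 changes the infix rules globally, so it should be checked against the rest of the corpus (e.g. dates or fractions written with slashes) before being adopted.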

@mcavdar mcavdar added the bug Something isn't working label Mar 1, 2018
@mcavdar mcavdar self-assigned this Mar 1, 2018

mcavdar commented Mar 10, 2018

The Quaero French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization:
"Note that due to time constraints, the documents were supplied to the human annotators without prior tokenization."

Etiquetage morpho-syntaxique en domaine de spécialité: le domaine médical:
"Le tokenizer utilisé lors de la première phase d’annotation est un outil maison suivant une segmentation proche de celui du FTB. Il n’est pas adapté au domaine médical, il a été important de modifier la segmentation manuellement afin d’obtenir une annotation morpho-syntaxique de qualité optimale." "Ces observations indiquent qu’avec un tokeniseur adapté, l’utilisation d’une pré-annotation permet un gain de temps significatif"


mcavdar commented Mar 10, 2018


mcavdar commented May 10, 2018
