Consider adding extractFromDocument() to TurkishSentenceExtractor #102

Open
ahmetaa opened this issue Feb 5, 2017 · 6 comments

Comments

@ahmetaa
Owner

ahmetaa commented Feb 5, 2017

Currently there are two methods:

List<String> extract(String paragraph)
List<String> extract(List<String> paragraphs)

Both methods ignore line breaks as sentence boundaries.
A common use case is extracting sentences from a complete document represented as a single String that contains line breaks. The new method would do the following:

List<String> extractFromDocument(String document) {
  List<String> paragraphs = split document from line breaks;
  return extract(paragraphs); 
}

Also, the existing extract(String paragraph) method could be renamed to extractFromParagraph() if this is added.
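
A minimal, self-contained sketch of the proposal, for illustration only. The class name `DocumentExtractionSketch` and the naive punctuation-based `extract` are hypothetical stand-ins, not the real TurkishSentenceExtractor logic. Splitting on `\R` (any line break, Java 8+) and skipping blank lines are assumptions as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in class; it is NOT the real TurkishSentenceExtractor.
class DocumentExtractionSketch {

  // Naive placeholder for extract(String paragraph): splits after '.', '!' or '?'
  // when followed by whitespace. Like the real method, it does not treat
  // a line break as a sentence boundary.
  static List<String> extract(String paragraph) {
    List<String> sentences = new ArrayList<>();
    for (String s : paragraph.trim().split("(?<=[.!?])\\s+")) {
      if (!s.isEmpty()) {
        sentences.add(s);
      }
    }
    return sentences;
  }

  // Placeholder for extract(List<String> paragraphs).
  static List<String> extract(List<String> paragraphs) {
    List<String> sentences = new ArrayList<>();
    for (String paragraph : paragraphs) {
      sentences.addAll(extract(paragraph));
    }
    return sentences;
  }

  // Proposed method: every line break ends a paragraph.
  // Assumption: blank lines are skipped rather than kept as empty paragraphs.
  static List<String> extractFromDocument(String document) {
    List<String> paragraphs = new ArrayList<>();
    for (String line : document.split("\\R")) {
      if (!line.trim().isEmpty()) {
        paragraphs.add(line);
      }
    }
    return extract(paragraphs);
  }
}
```

With this sketch, a document such as "Title\nFirst sentence. Second sentence." yields three sentences from extractFromDocument, while passing the same string to extract(String) keeps the title glued to the first sentence.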

@mdakin wdyt?

@mdakin
Collaborator

mdakin commented Feb 5, 2017

Maybe for now providing only this would be enough?

List<String> paragraphs = split document from line breaks;

@ahmetaa
Owner Author

ahmetaa commented Feb 5, 2017

Ok, like this?

List<String> splitFromLineBreaks(String foo)

Either in the same class or in a utility class.

@mdakin
Collaborator

mdakin commented Feb 5, 2017

So the usage will be like:

extractor = ...;
for (String paragraph : extractor.splitFromLineBreaks(doc)) {
  extractor.extract(paragraph);
}

or directly:

extractor.extract(extractor.splitFromLineBreaks(doc));

So I am not sure now. Maybe your initial suggestion was not bad: adding a new method such as extractFromDocument, whose documentation explains that it uses line breaks as paragraph endings. Your call, bro.
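
For comparison, a rough sketch of the helper-only option. `splitFromLineBreaks` is the name floated in this thread. The class name `LineBreakSplitSketch`, the `extractPerParagraph` wrapper, and the choice to trim lines and drop blank ones are assumptions. The extractor is passed in as a plain function so that the sketch does not depend on the real class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical utility class illustrating the helper-only option.
class LineBreakSplitSketch {

  // Splits a document into paragraphs at line breaks ("\R" matches \n, \r\n, \r, ...).
  // Assumption: lines are trimmed and blank lines are dropped.
  static List<String> splitFromLineBreaks(String document) {
    List<String> paragraphs = new ArrayList<>();
    for (String line : document.split("\\R")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        paragraphs.add(trimmed);
      }
    }
    return paragraphs;
  }

  // The loop-style usage: the caller applies a per-paragraph extractor
  // (any function from paragraph to sentences) and collects the results.
  static List<String> extractPerParagraph(
      String document, Function<String, List<String>> extractor) {
    List<String> sentences = new ArrayList<>();
    for (String paragraph : splitFromLineBreaks(document)) {
      sentences.addAll(extractor.apply(paragraph));
    }
    return sentences;
  }
}
```

The trade-off is visible here: with only the helper, every caller writes this loop (or chains the two calls), whereas a dedicated extractFromDocument method hides it behind one documented entry point.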

@ahmetaa
Owner Author

ahmetaa commented Feb 5, 2017

Thanks. This is not a pressing issue anyway. I will implement one of these and see if it sticks.

@mdakin
Collaborator

mdakin commented Feb 6, 2017

There is a possibility of using objects like Document, Paragraph, etc., but that is another issue. So I think your initial suggestion is fine. Separate method names such as extractWords, extractSentences, and extractParagraphs might be better than overloading, though.

@ahmetaa
Owner Author

ahmetaa commented Feb 6, 2017

Agreed, soon there will be a need for such structures. But when they come, overloading those methods may suffice. I will go with my initial suggestion then.

ahmetaa added a commit that referenced this issue Feb 14, 2017
…ractor #102

Remove SentenceExtractor interface
Make Span class public (perhaps later move it to core).
Add some documentation.