Consider adding extractFromDocument() to TurkishSentenceExtractor #102

ahmetaa · 2017-02-05T11:12:03Z

Currently there are two methods:

List<String> extract(String paragraph)
List<String> extract(List<String> paragraphs)

These methods do not treat line breaks as sentence boundaries.
A common use case is extracting sentences from a complete document represented as a single String that contains line breaks. The new method would do the following:

List<String> extractFromDocument(String document) {
  List<String> paragraphs = split document from line breaks;
  return extract(paragraphs); 
}

Also, the other method could be renamed to extractFromParagraph() if this is added.
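
A minimal sketch of the proposal, for illustration only. The class name `DocumentExtractionSketch` and both `extract` methods here are hypothetical stand-ins (a naive split after `.`, `!` or `?`), not the actual TurkishSentenceExtractor logic. Dropping blank lines and using the `\R` line-break matcher (Java 8+) are also assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class DocumentExtractionSketch {

  // Hypothetical stand-in for extract(String paragraph):
  // naively splits after '.', '!' or '?' followed by whitespace.
  static List<String> extract(String paragraph) {
    List<String> sentences = new ArrayList<>();
    for (String s : paragraph.trim().split("(?<=[.!?])\\s+")) {
      if (!s.isEmpty()) {
        sentences.add(s);
      }
    }
    return sentences;
  }

  // Hypothetical stand-in for extract(List<String> paragraphs).
  static List<String> extract(List<String> paragraphs) {
    List<String> sentences = new ArrayList<>();
    for (String paragraph : paragraphs) {
      sentences.addAll(extract(paragraph));
    }
    return sentences;
  }

  // The proposed method: every line break ends a paragraph,
  // so no sentence can span two lines.
  static List<String> extractFromDocument(String document) {
    List<String> paragraphs = new ArrayList<>();
    for (String line : document.split("\\R")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) { // assumption: blank lines are skipped
        paragraphs.add(trimmed);
      }
    }
    return extract(paragraphs);
  }

  public static void main(String[] args) {
    String doc = "Ankara\nBen geldim. Sen de gel.";
    // prints [Ankara, Ben geldim., Sen de gel.]
    System.out.println(extractFromDocument(doc));
  }
}
```

Note that the first line has no end punctuation but still becomes its own sentence, because the line break is treated as a boundary.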

@mdakin wdyt?

mdakin · 2017-02-05T12:02:30Z

Maybe providing only this would be enough for now?

List<String> paragraphs = split document from line breaks;

ahmetaa · 2017-02-05T12:10:39Z

Ok, like this?

List<String> splitFromLineBreaks(String foo)

Either in the same class or in a utility class.

mdakin · 2017-02-05T12:24:14Z

So the usage will be like:

extractor = ...;
for (String paragraph : extractor.splitFromLineBreaks(doc)) {
  extractor.extract(paragraph);
}

or directly:

extractor.extract(extractor.splitFromLineBreaks(doc));

So I am not sure now. Maybe your initial suggestion was not bad: adding a new method such as extractFromDocument, documented as treating line breaks as paragraph endings. Your call, bro.
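
The two usage patterns above can be compared in a small sketch. `splitFromLineBreaks` is written here as a hypothetical static helper, and the `extract` methods are naive stand-ins rather than the real extractor. Skipping blank lines is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class LineBreakSplitSketch {

  // Hypothetical helper: one paragraph per non-blank line.
  static List<String> splitFromLineBreaks(String document) {
    List<String> paragraphs = new ArrayList<>();
    for (String line : document.split("\\R")) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        paragraphs.add(trimmed);
      }
    }
    return paragraphs;
  }

  // Stand-in for extract(String paragraph): naive punctuation split.
  static List<String> extract(String paragraph) {
    List<String> sentences = new ArrayList<>();
    for (String s : paragraph.trim().split("(?<=[.!?])\\s+")) {
      if (!s.isEmpty()) {
        sentences.add(s);
      }
    }
    return sentences;
  }

  // Stand-in for extract(List<String> paragraphs).
  static List<String> extract(List<String> paragraphs) {
    List<String> sentences = new ArrayList<>();
    for (String paragraph : paragraphs) {
      sentences.addAll(extract(paragraph));
    }
    return sentences;
  }

  public static void main(String[] args) {
    String doc = "Bir.\n\nIki. Uc.";

    // Pattern 1: the caller loops over the paragraphs.
    List<String> viaLoop = new ArrayList<>();
    for (String paragraph : splitFromLineBreaks(doc)) {
      viaLoop.addAll(extract(paragraph));
    }

    // Pattern 2: the paragraph list is passed in directly.
    List<String> direct = extract(splitFromLineBreaks(doc));

    System.out.println(viaLoop.equals(direct)); // prints true
  }
}
```

Both patterns leave the splitting step to the caller, which is the extra step a single extractFromDocument method would hide.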

ahmetaa · 2017-02-05T12:25:59Z

Thanks. This is not a pressing issue anyway. I will implement one of these and see if it sticks.

mdakin · 2017-02-06T09:38:49Z

There is a possibility of using objects like Document, Paragraph, etc., but that is another issue. So I think your initial suggestion is fine. Maybe separate method names like extractWords, extractSentences and extractParagraphs would be better than overloading.

ahmetaa · 2017-02-06T10:31:22Z

Agreed, soon there will be a need for such structures. But when they come, overloading those methods may suffice. I will go with my initial suggestion then.

ahmetaa added a commit that referenced this issue Feb 14, 2017

Fix for: Consider adding extractFromDocument() to TurkishSentenceExtractor #102 Remove SentenceExtractor interface, make Span class public (perhaps later move it to core). Add some documentation.

9d9bda0
