-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add configuration option to disable full text extract #1205
Conversation
There appears to be an existing DRY issue here that was likely before code climate (I suppose) that I'm testing the cleanup for now. |
a7b8bea
to
275e92d
Compare
@@ -83,5 +81,11 @@ def create_image_derivatives(filename) | |||
def derivative_path_factory | |||
Hyrax::DerivativePath | |||
end | |||
|
|||
def extract_full_text(filename, uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you please add some yard doc about the expected types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -74,6 +74,10 @@ | |||
# Path to the file derivatives creation tool | |||
# config.libreoffice_path = "soffice" | |||
|
|||
# Option to enable/disable full text extraction from PDFs | |||
# Default is false, set to true to skip extraction | |||
config.extract_full_text = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the template typically we comment out the default options just to show what they are. The default should be set in Configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
lib/hyrax/configuration.rb
Outdated
@@ -373,6 +373,11 @@ def subject_prefix | |||
@subject_prefix ||= "Contact form:" | |||
end | |||
|
|||
attr_writer :extract_full_text | |||
def extract_full_text? | |||
@extract_full_text ||= false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backwards compatibility the default should be true, right? I don't think you can use the standard memorization idiom with booleans (consider what happens if @extract_full_text = false
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - that's what I thought. I was having issues testing until this setting. I'll work it the other way and see what was up with my test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at @enable_noids
for ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that resolved the odd testing issues I had.
@@ -74,6 +74,10 @@ | |||
# Path to the file derivatives creation tool | |||
# config.libreoffice_path = "soffice" | |||
|
|||
# Option to enable/disable full text extraction from PDFs | |||
# Default is false, set to true to skip extraction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems backwards to me 🤔 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. See comment above. Was having weird testing and felt odd about the naming vs. the true/false value. I'll update this and rework it in the AM. ;-)
5a4fd21
to
2956281
Compare
I believe all comments were addressed. |
# Calls the Hydra::Derivates::FulltextExtraction unless the extract_full_text | ||
# configuration option is set to false | ||
# @param [String] filename of the object to be used for full text extraction | ||
# @param [String] uri to the files set (deligated to file_set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"uri to the file set (delegated to file_set)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, typo. Fixed.
def extract_full_text? | ||
return @extract_full_text unless @extract_full_text.nil? | ||
@extract_full_text = true | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative pattern for this method body could use a ternary expression, i.e.
x.nil? ? x = true : x
The assignment in the ternary expression is valid and it returns true
.
PS, this is a suggestion, not a requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a reasonable change, but the above follows the pattern used throughout this particular source. I would suggest a specific change/issue to fix this seperately IMHO.
@@ -55,4 +55,5 @@ | |||
it { is_expected.to respond_to(:translate_uri_to_id) } | |||
it { is_expected.to respond_to(:upload_path) } | |||
it { is_expected.to respond_to(:work_requires_files?) } | |||
it { is_expected.to respond_to(:extract_full_text?) } | |||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also check that the default is actually true
, both when the config value is explicitly set and when it is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree - and happy to work in a change like this. However, should it be throughout? As it's not currently setup that way?
@@ -373,6 +373,12 @@ def subject_prefix | |||
@subject_prefix ||= "Contact form:" | |||
end | |||
|
|||
attr_writer :extract_full_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the writer is necessary and whether the config value should be the only way to initialize this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Following convention here, so I haven't tested otherwise but can. This would have wider implications on the file as it's the pattern used throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally seems OK to me, a couple minor changes requested and a suggestion or two.
2956281
to
a9b309d
Compare
Agree with @aaron-collier that the changes he's made are consistent with past boolean config options, so changes to that pattern should be in a separate PR. Thx for the review, @darrenleeweber. |
This is related to issue #1089.
This provides an option that can be set to allow the skipping of full text generation when desired.
Note: This does not address issue #1089, but provides a work around for instances where full text extraction is causing encoding issues and full text is not required.