Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configuration option to disable full text extract #1205

Merged
merged 1 commit into from
Jun 14, 2017

Conversation

aaron-collier
Copy link
Contributor

This is related to issue #1089.

This provides an option that can be set to allow the skipping of full text generation when desired.

Note: This does not address issue #1089, but provides a work around for instances where full text extraction is causing encoding issues and full text is not required.

@aaron-collier
Copy link
Contributor Author

There appears to be an existing DRY issue here that was likely before code climate (I suppose) that I'm testing the cleanup for now.

@@ -83,5 +81,11 @@ def create_image_derivatives(filename)
def derivative_path_factory
Hyrax::DerivativePath
end

def extract_full_text(filename, uri)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you please add some yard doc about the expected types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -74,6 +74,10 @@
# Path to the file derivatives creation tool
# config.libreoffice_path = "soffice"

# Option to enable/disable full text extraction from PDFs
# Default is false, set to true to skip extraction
config.extract_full_text = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the template typically we comment out the default options just to show what they are. The default should be set in Configuration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -373,6 +373,11 @@ def subject_prefix
@subject_prefix ||= "Contact form:"
end

attr_writer :extract_full_text
def extract_full_text?
@extract_full_text ||= false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backwards compatibility the default should be true, right? I don't think you can use the standard memorization idiom with booleans (consider what happens if @extract_full_text = false)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - that's what I thought. I was having issues testing until this setting. I'll work it the other way and see what was up with my test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at @enable_noids for ideas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that resolved the odd testing issues I had.

@@ -74,6 +74,10 @@
# Path to the file derivatives creation tool
# config.libreoffice_path = "soffice"

# Option to enable/disable full text extraction from PDFs
# Default is false, set to true to skip extraction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems backwards to me 🤔 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. See comment above. Was having weird testing and felt odd about the naming vs. the true/false value. I'll update this and rework it in the AM. ;-)

@aaron-collier
Copy link
Contributor Author

I believe all comments were addressed.

# Calls the Hydra::Derivates::FulltextExtraction unless the extract_full_text
# configuration option is set to false
# @param [String] filename of the object to be used for full text extraction
# @param [String] uri to the files set (deligated to file_set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"uri to the file set (delegated to file_set)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, typo. Fixed.

def extract_full_text?
return @extract_full_text unless @extract_full_text.nil?
@extract_full_text = true
end
Copy link
Contributor

@dazza-codes dazza-codes Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative pattern for this method body could use a ternary expression, i.e.

x.nil? ? x = true : x

The assignment in the ternary expression is valid and it returns true.

PS, this is a suggestion, not a requirement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a reasonable change, but the above follows the pattern used throughout this particular source. I would suggest a specific change/issue to fix this seperately IMHO.

@@ -55,4 +55,5 @@
it { is_expected.to respond_to(:translate_uri_to_id) }
it { is_expected.to respond_to(:upload_path) }
it { is_expected.to respond_to(:work_requires_files?) }
it { is_expected.to respond_to(:extract_full_text?) }
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check that the default is actually true, both when the config value is explicitly set and when it is not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - and happy to work in a change like this. However, should it be throughout? As it's not currently setup that way?

@@ -373,6 +373,12 @@ def subject_prefix
@subject_prefix ||= "Contact form:"
end

attr_writer :extract_full_text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the writer is necessary and whether the config value should be the only way to initialize this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Following convention here, so I haven't tested otherwise but can. This would have wider implications on the file as it's the pattern used throughout.

Copy link
Contributor

@dazza-codes dazza-codes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally seems OK to me, a couple minor changes requested and a suggestion or two.

@mjgiarlo
Copy link
Member

Agree with @aaron-collier that the changes he's made are consistent with past boolean config options, so changes to that pattern should be in a separate PR. Thx for the review, @darrenleeweber.

@jcoyne jcoyne merged commit 88f56d4 into master Jun 14, 2017
@jcoyne jcoyne deleted the skip_full_text_option branch June 14, 2017 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants