Add Groovy crawl configs #632

ato · 2024-11-30T15:20:15Z

This enables crawl configuration files to use Spring's Groovy Bean Definition DSL as an optional alternative to Spring XML. It uses the same bean configuration model, but the syntax is terser and more human-readable. No more need for &amp; in seed URLs. :-)

    checkpointService(CheckpointService) {
        checkpointIntervalMinutes = 15
        checkpointsDir = 'checkpoints'
        forgetAllButLatest = true
    }

It also enables some powerful scripting capabilities. For example, you can define a custom DecideRule directly in the crawl scope without needing ScriptedDecideRule:

    scope(DecideRuleSequence) {
        rules = [
            new RejectDecideRule(),
            // ACCEPT everything linked from a .pdf file
            new PredicatedDecideRule() {
                boolean evaluate(CrawlURI uri) {
                    return uri.via?.path?.endsWith(".pdf")
                }
            },
            // ...
        ]
    }

The main downsides are that defining nested inner beans can be a bit awkward, some of the errors can be cryptic, and you can no longer manipulate the config files with an XML parser.
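To make the inner-bean awkwardness concrete, here is a minimal sketch of a seeds bean with a nested anonymous bean. It is not taken from this commit. It assumes the file is loaded by Spring's standard GroovyBeanDefinitionReader (which expects a top-level `beans { }` block) and that `TextSeedModule` and `ConfigString` are used the same way as in the default XML profile. The seed URL is a made-up example.

```groovy
// Sketch only: class names and properties are assumed from the default XML
// profile; the seed URL is hypothetical.
import org.archive.modules.seeds.TextSeedModule
import org.archive.spring.ConfigString

beans {
    seeds(TextSeedModule) {
        // Anonymous inner bean: the closure's typed parameter names the bean
        // class, and assignments inside it set properties on that inner bean.
        textSource = { ConfigString source ->
            // A plain Groovy string, so the ampersand is written as-is rather
            // than escaped the way XML requires.
            value = 'https://example.org/search?q=heritrix&page=2'
        }
    }
}
```

The typed-closure form is the part that tends to feel awkward: the parameter (`source` here) exists only to declare the inner bean's class and is never used in the body.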

This commit includes a Groovy version of the default crawl profile for reference, but doesn't expose a way to use it in the UI yet. For now, you need to manually create a crawler-beans.groovy file in your job directory.

Groovy Bean Definition DSL: https://docs.spring.io/spring-framework/reference/core/beans/basics.html#beans-factory-groovy

ato merged commit 4c4510a into master Dec 24, 2024
7 checks passed

ato deleted the groovy-config branch December 24, 2024 03:11
