
LogStash configuration is compiled for every sample #33

Open
ggrossetie opened this issue Sep 8, 2015 · 18 comments
Comments

@ggrossetie

We have quite a complex configuration file (approximately 1500 lines) and I believe that this line is costly: https://github.com/elastic/logstash/blob/master/lib/logstash/pipeline.rb#L26

If I understand this code correctly, a new pipeline is created for every sample() method, meaning that the configuration is compiled every time.

As a result, the following test will load and compile the same configuration three times:

require "test_utils"

describe "apache common log format" do
  extend LogStash::RSpec

  config %q{
    # Your logstash filter config here
  }

  sample "apache log in debug" do
    insist { subject["level"] } == "DEBUG"
  end
  sample "apache log in info" do
    insist { subject["level"] } == "INFO"
  end
  sample "apache log with stack trace" do
    insist { subject["stacktrace"] } == "oups!"
  end
end

Would it be possible to configure this behavior? We have 300+ samples and running our test suite is really slow.
Having said that, I think this behavior should be the default. If I do want to use a different configuration, I can call config again with the new configuration, no?
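A minimal sketch of the caching idea being asked for here (the `PipelineCache` class and its `compile!` stand-in are hypothetical, not LogStash code): memoize the compiled result by configuration string, so repeated samples against the same config pay the expensive compile only once.

```ruby
# Hypothetical sketch: cache "compiled pipelines" keyed by the configuration
# string, so repeated lookups with an identical config skip recompilation.
# compile! stands in for the expensive LogStash::Pipeline.new(config) step.
class PipelineCache
  def initialize
    @cache = {}    # config string => compiled object
    @compiles = 0  # counts how often the expensive step actually runs
  end

  attr_reader :compiles

  def fetch(config_str)
    @cache[config_str] ||= compile!(config_str)
  end

  private

  def compile!(config_str)
    @compiles += 1
    { filters: config_str.lines.map(&:strip).reject(&:empty?) } # stand-in
  end
end

cache = PipelineCache.new
3.times { cache.fetch("filter { grok { } }") }
puts cache.compiles  # => 1: three samples, one compile
```

With the current helper, the equivalent counter would read 3, one compile per sample block.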

@ggrossetie
Author

Each time pipeline.instance_eval is called, LogStash::Pipeline.new(config) is invoked.
I've modified logstash_helpers.rb to add timings:

let(:results) do
  puts "do results"
  results = []
  start = Time.now
  pipeline.instance_eval { @filters.each(&:register) }
  puts "pipeline.instance_eval in #{Time.now - start} s"

  start = Time.now
  event.each do |e|
    # filter calls the block on all filtered events, including new events added by the filter
    pipeline.filter(e) { |filtered_event| results << filtered_event }
  end
  puts "pipeline.filter in #{Time.now - start} s"

  start = Time.now
  # flush makes sure to empty any buffered events in the filter
  pipeline.flush_filters(:final => true) { |flushed_event| results << flushed_event }
  puts "pipeline.flush_filters in #{Time.now - start} s"

  start = Time.now
  selected = results.select { |e| !e.cancelled? }
  puts "results.select in #{Time.now - start} s"
  selected # return the filtered events, not the value of puts
end

And here is the output:

do results
pipeline.instance_eval in 19.147 s
pipeline.filter in 0.018 s
pipeline.flush_filters in 0.001 s
results.select in 0.0 s
do results
pipeline.instance_eval in 17.688 s
pipeline.filter in 0.003 s
pipeline.flush_filters in 0.0 s
results.select in 0.0 s
do results
pipeline.instance_eval in 17.342 s
pipeline.filter in 0.002 s
pipeline.flush_filters in 0.0 s
results.select in 0.0 s
do results

@ggrossetie ggrossetie changed the title from "LogStash configuration is compiled for every sample" to "Sample run slowly if the configuration file is big" Sep 8, 2015
@ggrossetie
Author

Actually the value is cached across multiple calls in the same sample but not across samples.
See: https://www.relishapp.com/rspec/rspec-core/v/3-3/docs/helper-methods/let-and-let
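In other words, `let` memoizes within a single example but re-runs its block for every example. A minimal simulation of that semantics (plain Ruby, no RSpec):

```ruby
# Minimal simulation of RSpec's `let` semantics: the block's value is
# memoized within one example, but the memo is cleared between examples,
# so the (expensive) pipeline block re-runs for every sample/it block.
builds = 0
make_pipeline = -> { builds += 1; :pipeline }

2.times do                                   # two examples (two sample blocks)
  memo = nil                                 # fresh memoization state per example
  3.times { memo ||= make_pipeline.call }    # several references inside one example
end

puts builds  # => 2: built once per example, despite three references each
```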

@ggrossetie ggrossetie changed the title from "Sample run slowly if the configuration file is big" back to "LogStash configuration is compiled for every sample" Sep 8, 2015
ggrossetie added a commit to ggrossetie/logstash-devutils that referenced this issue Sep 8, 2015
@guyboertje

@Mogztter - would it help if the sample method allowed you to pass in a pipeline object?

  def sample(sample_event, user_pipeline = nil, &block)
    name = sample_event.is_a?(String) ? sample_event : LogStash::Json.dump(sample_event)
    name = name[0..50] + "..." if name.length > 50

    describe "\"#{name}\"" do
      let(:pipeline) { user_pipeline || LogStash::Pipeline.new(config) }
...
....
    end
end

However I would be worried about the pipeline having some in-flight events internally. The pipeline object does not have a reset method.

Obviously you want a clean setup for RSpec.

@ggrossetie
Author

@guyboertje Yes, this is one way of solving the problem, but I think the "end user" should not have to be aware of LogStash internals (the pipeline).

It would be nice if we could execute multiple samples against the same configuration using an RSpec context.
I think it makes sense to state: in this context (with this configuration) I want to execute samples A and B. Behind the scenes, a pipeline would be instantiated and configured only once to execute samples A and B.

If I want to execute sample A with another configuration, I can declare another context with another configuration.

@guyboertje

@Mogztter - in this case the "end user" has the role of test case developer, and with this role comes some degree of responsibility for the objects under test. Presently, the sample method hides the fact that the 'object under test' is really a (new) pipeline instance in an RSpec let block (preserving the clean-state-for-each-it-block contract). As soon as that is deviated from, the test case developer must take responsibility for the 'object under test'.

I know what you want to achieve, but I think there is a strong chance of 'not clean slate' tests.

Also, FWIW, this issue is the correct place to generally discuss this (not your PR, as it is only one possible solution to this issue).

@ph
Contributor

ph commented Sep 8, 2015

I am okay with having a way to execute multiple tests with a single pipeline instance; I understand the cost of bringing one up on every test. But the problem with the #sample method is the context: it is used by plugin authors, core developers, and config developers, and they don't have the same requirements.

By requirements, I mean they expect a clean state on every call of sample, which is the current behavior. This is usually what you want when you are developing the system: you want to make sure that every call is independent of the others, and one way to do that is to have a new pipeline under evaluation.

On the other hand, you, as the devops/user, want fast feedback against the same configuration, and the sample method doesn't provide it out of the box.

@ph
Contributor

ph commented Sep 8, 2015

@guyboertje @Mogztter maybe having a subset of specialized helpers targeted at config developers would be better?

@jsvd
Member

jsvd commented Sep 8, 2015

@Mogztter everything you're fighting here is an RSpec mechanism to ensure isolation in tests. The reason you're fighting it is that logstash is slow when setting up the filters :(

I wonder if we can profile the filter register calls and see which ones are making this delay so big.

@ph
Contributor

ph commented Sep 8, 2015

@Mogztter the problem is that changing sample's behavior can impact other plugin or core tests.

@guyboertje

@ph - Note: @Mogztter is also (in the PR) proposing to change the config helper method behaviour.

I was thinking of this:

let(:config) { }
samples do |p|
  # or let(:config) { }
  sample("apache log in debug", p) do
    #expectation
  end
  sample("other", p) do
    # expectation
  end
end

where the samples method provides the 'clean' pipeline. Also, by injecting it into sample here and not providing it elsewhere, we retain backwards compatibility.

There is a way to hide the (reused) pipeline provided by samples, but it means more work in sample.

All of the above depends on a way to reset or lightweight-clone a pipeline instance when receiving it in sample.
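A sketch of the "reused pipeline with a reset" idea discussed above. The `SharedPipeline` class and its `reset!` method are purely illustrative (the real `LogStash::Pipeline` has no reset method, which is exactly the gap being pointed out): compile once, then restore a clean slate between samples.

```ruby
# Hypothetical sketch: one expensive compile, plus a cheap reset! between
# samples to avoid in-flight events leaking from one sample into the next.
# SharedPipeline is illustrative only; it is not the LogStash API.
class SharedPipeline
  def initialize(config)
    @config = config   # compiled once, up front
    @buffered = []     # stand-in for internal in-flight event state
  end

  def filter(event)
    @buffered << event
  end

  # The missing piece: LogStash::Pipeline has no such method today,
  # so a reused instance can carry buffered events between samples.
  def reset!
    @buffered.clear
  end

  def buffered_count
    @buffered.size
  end
end

p = SharedPipeline.new("filter { }")
p.filter(:event_from_sample_a)
p.reset!                          # run between samples for a clean slate
p.filter(:event_from_sample_b)
puts p.buffered_count  # => 1: sample B sees no leftovers from sample A
```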

@guyboertje

@ph - yes, perhaps tailored helpers are a good idea.

@ggrossetie
Author

in this case the "end user" has the role of test case developer and with this role comes some degree of responsibility for the objects under test. Presently, the sample method hides the fact that 'object
under test' is really a (new) pipeline instance in a RSpec let block (preserving the clean state for each it block contract). As soon as that is deviated from, the test case developer must take responsibility for the 'object under test'.

I agree, but I would prefer to use a utility method rather than manually instantiating and configuring a Pipeline object.

Also, FWIW, this issue is the correct place to generally discuss this (not your PR, as it is only one possible solution to this issue).

The idea is not to merge this pull request but to highlight which part of the code needs to be enhanced.

I am okay with having a way to execute multiples test with a single pipeline instance, I understand the cost of bringing one up on every tests. But the problem with the #sample method is the context, it is used by plugin author, core developers and config developers and they don't have the same requirements.

I understand

maybe having a subset of specialized helpers targeted at config developers would be better?

Sounds great.

The reason you're fighting it is because logstash is slow when setting up the filters

The most costly part of the code is LogStash::Pipeline.new(configstr). In my output, pipeline.instance_eval took 19 seconds because the Pipeline was lazily instantiated before the instance_eval invocation.

I wonder if we can profile the filter register calls and see which ones are making this delay so big.

That would be nice and would also make LogStash start faster.


A simple enhancement could be to create a new method, static_config. This method would instantiate and configure a Pipeline once and use that pipeline for the subsequent sample calls.

The behavior of sample with config would remain the same.
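A sketch of how that proposed static_config helper could behave next to the existing config helper. All names here (`SpecHelpers`, `expensive_compile`, the factory lambda) are hypothetical; this is not the logstash-devutils API, just an illustration of the compile-once semantics.

```ruby
# Hypothetical sketch of the proposed static_config helper: compile the
# configuration eagerly, once, and hand the same instance to every
# subsequent sample; config keeps today's one-pipeline-per-sample behavior.
module SpecHelpers
  attr_reader :compiles

  def static_config(config_str)
    pipeline = expensive_compile(config_str)  # runs once, eagerly
    @pipeline_factory = -> { pipeline }       # every sample reuses it
  end

  def config(config_str)
    @pipeline_factory = -> { expensive_compile(config_str) }  # per sample
  end

  def sample(_name)
    @pipeline_factory.call
  end

  def expensive_compile(config_str)           # stand-in for Pipeline.new
    @compiles = (@compiles || 0) + 1
    config_str
  end
end

suite = Object.new.extend(SpecHelpers)
suite.static_config("filter { grok { } }")
3.times { suite.sample("a log line") }
puts suite.compiles  # => 1 with static_config; plain config would give 3
```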

@purbon
Contributor

purbon commented Sep 9, 2015

This is actually a pain: the way sample works and is used is very slow, but for now it is used in tons of places in logstash. I see several open questions here:

  • Is sample something to be used for all plugins? Definitely not, but these days it is used everywhere. For inputs it is currently kind of necessary to spin up the pipeline, as it makes the shutdown easier and uniform; for this you can use an input helper [1].
  • In situations where the config is the same, I agree with the comments here: it is not necessary to compile the config again. With the input helper you could just generate some events and then do the check; does this make sense to you? Keep in mind this helper is RSpec friendly, unlike sample, which ends up messing with the RSpec DSL.
  • There is also the discussion of whether to spin up the pipeline at all when testing inputs. It should not be necessary in all cases, although it makes managing the input life cycle more complicated for now.
  • Then we have filters; they don't really have anything to do with the pipeline. Here I would go for specific helpers that simplify the test process and semantics.
  • The same applies to outputs.

And at the end we have full pipeline tests, which I think is what people who aim to test a configuration really need. This would certainly also benefit from a sample-like method, although I would most probably change its implementation in a direction closer to how the input helper works.

What do you think?

I wonder, @Mogztter, what is your main use case for writing tests? Are you testing plugins? A given configuration for logstash?

[1] input helper usage example: https://github.com/logstash-plugins/logstash-input-relp/blob/master/spec/inputs/relp_spec.rb#L28-L56

@ggrossetie
Author

I wonder @Mogztter, what is the main use case why do you write test? are you willing to test plugins? a given configuration for logstash?

Our main use case is to catch regressions in our custom patterns and/or in our LogStash filter configuration. We also use our test suite to add new messages in order to validate that LogStash will properly extract and transform the data.
Finally, since we have a fairly complex configuration, we also catch regressions in LogStash itself (plugins) 😄

So basically we are doing integration tests to make sure that everything works.

Our filter configuration is split by message type, but we also have common filters. So we could create smaller configuration files for each message type, but it's faster to load/compile the full configuration and start JRuby only once.

@purbon
Contributor

purbon commented Sep 9, 2015

Hi,
I started today, well, restarted, poking around these kinds of issues. I wanted to share today's experiments.

In https://github.com/purbon/logstash-helpers I started creating a gem that basically aims to be a collection of helpful helpers and matchers for logstash configuration; you can see for example an input helper or a grok/patterns matcher. This aims to ease and organize your testing experience in the project.

In https://github.com/purbon/logstash-input-syslog/blob/feature/use-logstash-helpers/spec/inputs/syslog_spec.rb, I did a small example, an experiment in having a pipeline helper compile the configuration only once (it could still be improved a bit). This is useful until the configuration changes, at which point it is necessary to redo the process.

Some numbers from a run on my laptop with the syslog input:

  • With the actual set of helpers it took around 1.31 seconds to run the tests.
  • With the new (one-time compile) input helper, it took around 0.68 seconds.

That is a speedup of approximately 48%; worth pursuing!

I will keep you up to date with the evolution of this; it would be super nice to keep in touch to know in more detail how you are using the tests. From your last comment I see:

  • Validating custom patterns. Do you know about the patterns spec helpers? You will also find similar matchers in the project I shared with you earlier.
  • Configuration validation? Do you test the whole pipeline here? Do you start testing with stdin as input, run the set of filters, and output to stdout?

Thanks for your feedback @Mogztter !

@purbon
Contributor

purbon commented Sep 10, 2015

@Mogztter I was wondering if you would be able to share your config, so I might do some work on meeting the expectations of people like you doing this kind of testing. Does that make sense to you?

/cheers

@ggrossetie
Author

Configuration validation? Do you test the whole pipeline here? Do you start testing with stdin as input, run the set of filters, and output to stdout?

I really just want to assert that when a message is received, the data is correctly extracted/transformed. I'm testing the filter part of the configuration, not the input or the output.

From an API point of view, the config and sample methods are perfect because I'm just providing a configuration file (containing filters) and messages (samples), and then I can write assertions.

@ggrossetie
Author

Validate custom patterns. Do you know about the patterns spec helpers? You will also find similar matchers in the project I shared with you earlier.

Yes, but I can validate both the custom patterns and the filter configuration at once. I do agree that this is not unit testing, though.

Configuration validation? Do you test the whole pipeline here? Do you start testing with stdin as input, run the set of filters, and output to stdout?

I really just want to test the filter part of the configuration; I don't care about the input or the output. When a message (sample) is received, I want to assert that the data is extracted/transformed correctly.

From an API point of view, the config and sample methods are perfect. I'm providing a configuration file and a sample, and then I can write my assertions.
