Siembol provides parsing services for normalising logs into messages with one layer of key/value pairs. Clean normalised data is very important for further processing such as alerting.
- Parser - a siembol configuration that defines how to normalise a log
- Parsing app - a stream application (storm topology) that combines one or multiple parsers, reads logs from kafka topics and produces normalised logs to output kafka topics
These common fields are included in all siembol messages after parsing:
- `original_string` - the original log before normalisation
- `timestamp` - timestamp extracted from the log in milliseconds since the UNIX epoch
- `source_type` - the data source, i.e., the siembol parser that was used for parsing the log
- `guid` - a unique identification of the message
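For illustration only, a minimal normalised message with these common fields (all values below are hypothetical) could look like:

```json
{
  "original_string": "<134>Jan 12 13:00:01 gateway01 sshd[4321]: Accepted password for alice",
  "timestamp": 1641992401000,
  "source_type": "example_syslog",
  "guid": "2c9a1b7e-5f3d-4e8a-9c11-0b6d2f4a7e90"
}
```

Parsers and extractors add further key/value pairs to this flat structure.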
A parser configuration defines how a log is normalised. The common attributes are:

- `parser_name` - the name of the parser
- `parser_version` - the version of the parser
- `parser_author` - the author of the parser
- `parser_description` - the description of the parser
- `parser_type` - the type of the parser
  - Netflow v9 parser - parses a netflow payload and produces a list of normalised messages. Netflow v9 parsing is based on templates and the parser learns templates while parsing messages
  - Generic parser - creates two fields:
    - `original_string` - the log copied from the input
    - `timestamp` - the current epoch time of parsing in milliseconds. This timestamp can be overwritten in further parsing
  - Syslog parser
    - `syslog_version` - the expected version of the syslog message: `RFC_3164`, `RFC_5424` or `RFC_3164, RFC_5424`
    - `merge_sd_elements` - merge the SD elements of the syslog message into one parsed object
    - `time_formats` - time formats used for time formatting; the syslog default time formats are used if not provided
    - `timezone` - the time zone used in the syslog default time formats
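As a rough sketch of a parser configuration using the attributes above (the nesting under `parser_attributes` and `syslog_config`, as well as all values, are assumptions for illustration; the authoritative schema is provided by the config editor):

```json
{
  "parser_name": "example_syslog",
  "parser_version": 1,
  "parser_author": "analyst",
  "parser_description": "Example syslog parser for gateway logs",
  "parser_attributes": {
    "parser_type": "syslog",
    "syslog_config": {
      "syslog_version": "RFC_3164, RFC_5424",
      "merge_sd_elements": false,
      "timezone": "UTC"
    }
  }
}
```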
Extractors are used for further extracting and normalising parts of the message.
An extractor reads an input field and produces a set of key-value pairs extracted from that field. Extractors are called in a chain; the pairs each extractor produces are merged into the parsed message once its extraction finishes, so the next extractor in the chain can use the outputs of the previous ones. If the input field of an extractor is not part of the parsed message, its execution is skipped and the next extractor in the chain is called. A pre-processing function of the extractor is called before the extraction in order to normalise and clean the input field, and post-processing functions are called on the extractor outputs in order to normalise them.
- `is_enabled` - the extractor is enabled
- `description` - the description of the extractor
- `name` - the name of the extractor
- `field` - the field on which the extractor is applied
- `pre_processing_function` - the pre-processing function applied before the extraction
  - `string_replace` - replace the first occurrence of `string_replace_target` by `string_replace_replacement`
  - `string_replace_all` - replace all occurrences of `string_replace_target` by `string_replace_replacement`. You can use a regular expression in `string_replace_target`
- `post_processing_functions` - the list of post-processing functions applied after the extraction
  - `convert_unix_timestamp` - convert `timestamp_field` from a unix epoch timestamp in seconds to milliseconds
  - `format_timestamp` - convert `timestamp_field` using `time_formats`
    - `validation_regex` - a validation regular expression for checking the format of the timestamp; if there is no match, the next formatter from the list is tried
    - `time_format` - using syntax from https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html
    - `timezone` - the time zone used by the time formatter
  - `convert_to_string` - convert all extracted fields to strings, except fields from the list `conversion_exclusions`
- `extractor_type` - the extractor type, one of `pattern_extractor`, `key_value_extractor`, `csv_extractor`, `json_extractor`
- flags
  - `should_overwrite_fields` - the extractor overwrites an existing field with the same name, otherwise it creates a new field with the prefix `duplicate`
  - `should_remove_field` - the extractor removes the input field after the extraction
  - `remove_quotes` - the extractor removes quotes in the extracted values
  - `skip_empty_values` - the extractor removes empty strings after the extraction
  - `thrown_exception_on_error` - the extractor throws an exception on error (recommended for testing), otherwise it skips further processing
The pattern extractor extracts key value pairs by matching a list of regular expressions with named-capturing groups, where the names of the groups are used for naming the fields. Siembol supports the syntax from https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html, except that it also allows underscores in the names of captured groups.

- `regular_expressions` - the list of regular expressions
- `dot_all_regex_flag` - the regular expression `.` matches any character, including a line terminator
- `should_match_pattern` - at least one pattern should match, otherwise the extractor throws an exception
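A sketch of a pattern extractor that pulls a user name and source address out of `original_string` (the placement of attributes and flags under a single `attributes` object is an assumption, and the regular expression and names are hypothetical); note the underscores in the named-capturing groups, which siembol permits:

```json
{
  "name": "ssh_login_extractor",
  "field": "original_string",
  "extractor_type": "pattern_extractor",
  "attributes": {
    "regular_expressions": [
      "Accepted password for (?<user_name>\\w+) from (?<source_ip>[0-9.]+)"
    ],
    "should_match_pattern": false,
    "should_remove_field": false,
    "thrown_exception_on_error": true
  }
}
```

A `format_timestamp` or `convert_unix_timestamp` post-processing function could then be chained after such an extractor to normalise an extracted time value into epoch milliseconds.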
The key value extractor extracts values from a field of the form `key1=value1 ... keyN=valueN`.

- `word_delimiter` - the word delimiter used for splitting words, by default a space
- `key_value_delimiter` - the key-value delimiter used for splitting key value pairs, by default `=`
- `escaped_character` - the character for escaping quotes, delimiters and brackets, by default `\\`
- `quota_value_handling` - handling of quotes during the parsing
- `next_key_strategy` - a strategy for key value extraction where the key-value delimiter is found first and then the word delimiter is searched backwards
- `escaping_handling` - handling of escaping during the parsing
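For a hypothetical field `message_body` containing `src=10.0.0.1 dst=10.0.0.2 action=allow`, a key value extractor sketch (attribute placement assumed as in the previous example) could be:

```json
{
  "name": "kv_extractor",
  "field": "message_body",
  "extractor_type": "key_value_extractor",
  "attributes": {
    "word_delimiter": " ",
    "key_value_delimiter": "=",
    "skip_empty_values": true
  }
}
```

It would add the pairs `src`, `dst` and `action` to the parsed message.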
The CSV extractor extracts values from a field with delimiter-separated columns.

- `column_names` - the specification for selecting column names, where `skipping_column_name` is a name that can be used to not include a column with this name in the parsed message
- `word_delimiter` - the word delimiter used for splitting words
The json extractor extracts a valid json message and unfolds the json into flat json key value pairs.

- `path_prefix` - the prefix added to the extracted field names after json parsing
- `nested_separator` - the separator added during the unfolding of nested json objects
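A json extractor sketch for a hypothetical `payload` field (attribute placement assumed as in the previous examples):

```json
{
  "name": "payload_json_extractor",
  "field": "payload",
  "extractor_type": "json_extractor",
  "attributes": {
    "path_prefix": "payload",
    "nested_separator": "_"
  }
}
```

With such settings, a nested value like `{"user": {"name": "alice"}}` would be unfolded into a flat pair along the lines of `payload_user_name`.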
The json path extractor evaluates json path queries on the input field and stores their results. Siembol supports dot and bracket notation using the syntax from https://github.com/json-path/JsonPath#readme.

- `at_least_one_query_result` - at least one query should store its result, otherwise the extractor throws an exception
- `json_path_queries` - the list of json path queries, where a json path `query` is evaluated and stored in `output_field` on success
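A json path extractor sketch (the `extractor_type` value and the attribute placement are assumptions; the queries and field names are hypothetical):

```json
{
  "name": "alert_extractor",
  "field": "payload",
  "extractor_type": "json_path_extractor",
  "attributes": {
    "at_least_one_query_result": true,
    "json_path_queries": [
      { "query": "$.alert.severity", "output_field": "alert_severity" },
      { "query": "$['alert']['category']", "output_field": "alert_category" }
    ]
  }
}
```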
All key value pairs generated by parsers and extractors can be modified by a chain of transformations. This stage allows the parser to clean data by renaming fields, removing fields or even filtering the whole message.
Common transformation fields:
- `is_enabled` - the transformation is enabled
- `description` - the description of the transformation

The available transformations are:

- Replace the first occurrence of `string_replace_target` in field names by `string_replace_replacement`
- Replace all occurrences of `string_replace_target` by `string_replace_replacement`. You can use a regular expression in `string_replace_target`
- Delete all occurrences of `string_replace_target`. You can use a regular expression in `string_replace_target`
- Change the case of all field names to `case_type`
- Rename fields according to the mapping in `field_rename_map`, where you specify pairs of `field_to_rename` and `new_name`
- Delete fields according to the filter in `fields_filter`, where you specify the lists of patterns for `including_fields` and `excluding_fields`
- Trim values in the fields according to the filter in `fields_filter`, where you specify the lists of patterns for `including_fields` and `excluding_fields`
- Lowercase values in the fields according to the filter in `fields_filter`, where you specify the lists of patterns for `including_fields` and `excluding_fields`
- Uppercase values in the fields according to the filter in `fields_filter`, where you specify the lists of patterns for `including_fields` and `excluding_fields`
- Remove the new line ending from values in the fields according to the filter in `fields_filter`, where you specify the lists of patterns for `including_fields` and `excluding_fields`
- Filter logs that match the `message_filter`, where `matchers` are specified for filtering.
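For example, a transformation that renames fields could be sketched as follows (the `transformation_type` value and the `attributes` wrapper are assumptions; `field_rename_map`, `field_to_rename` and `new_name` are the fields described above, and the field names are hypothetical):

```json
{
  "is_enabled": true,
  "description": "Rename short field names to descriptive ones",
  "transformation_type": "rename_fields",
  "attributes": {
    "field_rename_map": [
      { "field_to_rename": "src", "new_name": "source_ip" },
      { "field_to_rename": "dst", "new_name": "destination_ip" }
    ]
  }
}
```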
Parsers are integrated in a stream application (storm topology) that combines one or multiple parsers, reads logs from input kafka topics and produces normalised logs to output kafka topics when parsing is successful, or to an error topic on error.
- `parsing_app_name` - the name of the parsing application
- `parsing_app_version` - the version of the parsing application
- `parsing_app_author` - the author of the parsing application
- `parsing_app_description` - the description of the parsing application
- `parsing_app_settings` - the parsing application settings
  - `parsing_app_type` - the type of the parsing application: `single_parser`, `router_parsing`, `topic_routing_parsing` or `header_routing_parsing`
  - `input_topics` - the kafka topics for reading messages for parsing
  - `error_topic` - the kafka topic for publishing error messages
  - `num_workers` - the number of workers for the parsing application
  - `input_parallelism` - the number of parallel executors per worker for reading messages from the input kafka topics
  - `parsing_parallelism` - the number of parallel executors per worker for parsing messages
  - `output_parallelism` - the number of parallel executors per worker for publishing parsed messages to kafka
  - `parse_metadata` - parse json metadata from input key records, using `metadata_prefix` added to metadata field names, by default `metadata_`
  - `max_num_fields` - the maximum number of fields after parsing the message
  - `max_field_size` - the maximum field size after parsing the message, in bytes
  - `original_string_topic` - the kafka topic for messages with a truncated `original_string` field; the raw input log for a message with a truncated `original_string` will be sent to this topic
- `parsing_settings` - the parsing settings, which depend on the parsing application type
The application integrates a single parser.
- `parser_name` - the name of the parser from the parser configurations
- `output_topic` - the kafka topic for publishing parsed messages
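A sketch of a single parser application (topic names, parser names and the exact nesting under `parsing_settings` are illustrative assumptions):

```json
{
  "parsing_app_name": "gateway-logs",
  "parsing_app_version": 1,
  "parsing_app_author": "analyst",
  "parsing_app_description": "Parses raw gateway logs",
  "parsing_app_settings": {
    "parsing_app_type": "single_parser",
    "input_topics": ["gateway.raw"],
    "error_topic": "parsing.error",
    "num_workers": 1,
    "input_parallelism": 1,
    "parsing_parallelism": 2,
    "output_parallelism": 1
  },
  "parsing_settings": {
    "single_parser": {
      "parser_name": "example_syslog",
      "output_topic": "gateway.parsed"
    }
  }
}
```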
The router parsing application integrates multiple parsers. First, the router parser parses the input message; the routing field is then extracted from its output and used to select the next parser from the list of parsers by pattern matching. The parsers are evaluated in order and only one is selected per log.

- `router_parser_name` - the name of the parser that will be used for routing
- `routing_field` - the field of the message parsed by the router that will be used for selecting the next parser
- `routing_message` - the field of the message parsed by the router that will be routed to the next parser
- `merged_fields` - the fields from the message parsed by the router that will be merged into the message parsed by the next parser
- `default_parser` - the parser that should be used if no other parser is selected, with `parser_name` and `output_topic`
- `parsers` - the list of parsers for further parsing
  - `routing_field_pattern` - the pattern for selecting the parser
  - `parser_properties` - the properties of the selected parser, with `parser_name` and `output_topic`
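A sketch of the router-specific part of `parsing_settings` (parser names, topics and patterns are hypothetical; the exact enclosing key is defined by the schema in the config editor):

```json
{
  "router_parser_name": "gateway_router",
  "routing_field": "device_type",
  "routing_message": "original_string",
  "merged_fields": ["timestamp", "guid"],
  "default_parser": {
    "parser_name": "generic",
    "output_topic": "parsed.unknown"
  },
  "parsers": [
    {
      "routing_field_pattern": "firewall.*",
      "parser_properties": {
        "parser_name": "firewall_parser",
        "output_topic": "parsed.firewall"
      }
    },
    {
      "routing_field_pattern": "proxy.*",
      "parser_properties": {
        "parser_name": "proxy_parser",
        "output_topic": "parsed.proxy"
      }
    }
  ]
}
```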
The topic routing parsing application integrates multiple parsers and reads logs from multiple topics. The parser is selected based on the name of the topic from which the log was received.

- `default_parser` - the parser that should be used if no other parser is selected, with `parser_name` and `output_topic`
- `parsers` - the list of parsers for further parsing
  - `topic_name` - the name of the topic for selecting the parser
  - `parser_properties` - the properties of the selected parser, with `parser_name` and `output_topic`
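A sketch of the topic-routing-specific part of `parsing_settings` (topic and parser names are hypothetical):

```json
{
  "default_parser": {
    "parser_name": "generic",
    "output_topic": "parsed.unknown"
  },
  "parsers": [
    {
      "topic_name": "raw.firewall",
      "parser_properties": {
        "parser_name": "firewall_parser",
        "output_topic": "parsed.firewall"
      }
    },
    {
      "topic_name": "raw.proxy",
      "parser_properties": {
        "parser_name": "proxy_parser",
        "output_topic": "parsed.proxy"
      }
    }
  ]
}
```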
The header routing parsing application integrates multiple parsers and uses a kafka message header for routing. The parser is selected based on the value of a dedicated header.

- `default_parser` - the parser that should be used if no other parser is selected, with `parser_name` and `output_topic`
- `header_name` - the name of the header used for routing
- `parsers` - the list of parsers for further parsing
  - `source_header_value` - the value in the header for selecting the parser
  - `parser_properties` - the properties of the selected parser, with `parser_name` and `output_topic`
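A sketch of the header-routing-specific part of `parsing_settings` (the header name and values are hypothetical):

```json
{
  "default_parser": {
    "parser_name": "generic",
    "output_topic": "parsed.unknown"
  },
  "header_name": "source_type_header",
  "parsers": [
    {
      "source_header_value": "firewall",
      "parser_properties": {
        "parser_name": "firewall_parser",
        "output_topic": "parsed.firewall"
      }
    },
    {
      "source_header_value": "proxy",
      "parser_properties": {
        "parser_name": "proxy_parser",
        "output_topic": "parsed.proxy"
      }
    }
  ]
}
```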
The parsing applications also share the following settings:

- `topology.name.prefix` - the prefix that will be used to create a topology name using the application name, by default `parsing`
- `client.id.prefix` - the prefix that will be used to create a kafka producer client id using the application name
- `group.id.prefix` - the prefix that will be used to create a kafka group id for the reader using the application name
- `zookeeper.attributes` - zookeeper attributes for updating parser configurations
  - `zk.url` - the zookeeper servers url; multiple servers are separated by a comma
  - `zk.path` - the path to a zookeeper node
- `kafka.batch.writer.attributes` - global settings for the kafka batch writer, used if they are not overridden
  - `producer.properties` - defines kafka producer properties, see https://kafka.apache.org/0102/documentation.html#producerconfigs
- `storm.attributes` - global settings for storm attributes, used if they are not overridden
  - `bootstrap.servers` - the kafka brokers url; multiple servers are separated by a comma
  - `first.pool.offset.strategy` - defines how the kafka spout seeks the offset to be used in the first poll to kafka
  - `kafka.spout.properties` - defines the kafka consumer attributes for the kafka spout, such as `group.id` and protocol, see https://kafka.apache.org/0102/documentation.html#consumerconfigs
  - `poll.timeout.ms` - the kafka consumer parameter `poll.timeout.ms` used in the kafka spout
  - `offset.commit.period.ms` - specifies the period of time (in milliseconds) after which the spout commits to kafka, see https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/storm-moving-data/content/tuning_kafkaspout_performance.html
  - `max.uncommitted.offsets` - defines the maximum number of polled offsets (records) that can be pending commit before another poll can take place
  - `storm.config` - defines storm attributes for a topology, see https://storm.apache.org/releases/current/Configuration.html
- `overridden_applications` - the list of overridden settings for individual parsing applications; the overridden application is selected by `application.name`, and the overrides are specified in `kafka.batch.writer.attributes` and `storm.attributes`
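A sketch of these settings as a single configuration object (all hosts, paths and property values are hypothetical; only the attribute names above come from the documentation):

```json
{
  "topology.name.prefix": "parsing",
  "client.id.prefix": "siembol.writer",
  "group.id.prefix": "siembol.reader",
  "zookeeper.attributes": {
    "zk.url": "zookeeper1:2181,zookeeper2:2181",
    "zk.path": "/siembol/parserconfigs"
  },
  "kafka.batch.writer.attributes": {
    "producer.properties": {
      "bootstrap.servers": "kafka1:9092,kafka2:9092",
      "security.protocol": "SASL_SSL"
    }
  },
  "storm.attributes": {
    "bootstrap.servers": "kafka1:9092,kafka2:9092",
    "first.pool.offset.strategy": "UNCOMMITTED_EARLIEST",
    "kafka.spout.properties": {
      "group.id": "siembol.parsing.reader",
      "security.protocol": "SASL_SSL"
    },
    "poll.timeout.ms": 200,
    "offset.commit.period.ms": 30000,
    "max.uncommitted.offsets": 10000,
    "storm.config": {
      "topology.max.spout.pending": 500
    }
  },
  "overridden_applications": []
}
```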