Skip to content

Splunk Admins application to assist with troubleshooting Splunk enterprise installations

License

Notifications You must be signed in to change notification settings

iraging/SplunkAdmins

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

85 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

** Now available on SplunkBase https://splunkbase.splunk.com/app/3796/ **

****************
Introduction
****************
This application accompanies the Splunk conf 2017 presentation "How did you get so big? Tips and tricks for growing your Splunk installation from 50GB/day to 1TB/day"

The overall idea behind this application is to provide a variety of alerts that detect issues or potential issues within the splunk log files and then advise via an alert that this has occurred
This application was built as there were a variety of messages in the Splunk console and logs in Splunk that if acted upon could have prevented an issue within the environment.

The original presentation is available as a recording http://conf.splunk.com/files/2017/recordings/howd-you-get-so-big-tips-n-tricks-for-growing-your-splunk-deployment-from-50-gb-per-day-to-1-tb-per-day.mp4 or PDF http://conf.splunk.com/files/2017/slides/howd-you-get-so-big-tips-tricks-for-growing-your-splunk-deployment-from-50-gb-day-to-1-tb-day.pdf
The powerpoint should it be required is available on https://github.com/gjanders/splunkconf2017

There are many potential alerts that might cause an issue so this application has all alerts disabled by default, post-installation once the required macros are configured you can enable the alerts you wish to use and add the required actions

There are also a few dashboards for investigating indexer performance, heavy forwarder queue usage and data model acceleration issues

Please note that the all alerts & dashboards were tested on Linux-based Splunk infrastructure, with AIX, Linux and Windows forwarders

If you are running your Splunk enterprise installation on Windows or have customised your installation directory you will need to customise some of the macros such as splunkadmins_splunkd_source to point to the correct splunkd log file location

****************
Macros - required configuration
****************
The various saved searches and dashboards use macros within their searches, you will need to update the macros to ensure the searches/dashboards work as expected
To check the contents of the macros in Splunk 7 or newer, use CTRL-SHFT-E within the search window

The macros are listed below, many expect a host=A OR host=B item to assist in narrowing down a search while others expect only a single value...

indexerhosts - a host=... list of your indexers (for example host=indexer1 OR host=indexer2)

heavyforwarderhosts - a host=... list of your heavy forwarders (for example host=heavyforwarder1 OR host=heavyforwarder2)

searchheadhosts - a host=... list of your search head(s) (for example host=searchhead1 OR host=searchhead2)

localsearchheadhosts - a host=... list of your search head(s) within the cluster that these alerts are running on

splunkenterprisehosts - a host=... list of any Splunk enterprise instance (for example host=indexer1 OR host=searchhead1 OR ...)

deploymentserverhosts - a host=... list of deployment server(s) (for example host=splunkdeploymentserver)

licensemasterhost - a host=... entry for the license master server (for example host=splunklicensemaster)

searchheadsplunkservers - a splunk_server=... list of any Splunk search head hosts (for example splunk_server=searchhead*)

splunkindexerhostsvalue - a splunk_server=... list of any Splunk indexer hosts (for example splunk_server=indexer*)

splunkadmins_splunkd_source - this defaults to source=/opt/splunk/var/log/splunk/splunkd.log, if your splunkd log location is not in this directory please override this macro as appropriate

splunkadmins_splunkuf_source - this defaults to source=*splunkd.log, you may wish to narrow down this location if your splunkd logs on universal forwarders have consistent installation directories 

splunkadmins_mongo_source - this defaults to source=/opt/splunk/var/log/splunk/mongod.log, if your mongodb log file is not in this directory please override this macro as appropriate

splunkadmins_clustermaster_oshost - a host=... entry for the cluster master server (for example host=splunkclustermaster)

The macros are used in various alerts which you can optionally enable, the alerts will raise a triggered alert only as emails are not allowed for Splunk app certification purposes 

The vast majority of the alerts also have a macro(s) which you can customise to tweak the search results, for example the macro `splunkadmins_weekly_truncated` allows the alert, "IndexerLevel - Weekly Truncated Logs Report", to be customised without changing the alert itself. This will make upgrading to a new version of this app more straightforward
I have attempted to provide an appropriate macro in any alert where I deemed it appropriate, feedback is welcome for any alert that you believe should have a macro or requires further improvement

****************
Installation
****************
The application is designed to work on a search head or search head cluster instance, installation on the indexing tier is not required
There are a few searches that use REST API calls which are specific to the search head cluster they run on. These alerts will have to be placed on each search head or search head cluster, alternatively any server with the required search peers will also work, the relevant alerts are:
SearchHeadLevel - Accelerated DataModels with All Time Searching Enabled
SearchHeadLevel - Realtime Scheduled Searches are in use
SearchHeadLevel - Realtime Search Queries in dashboards
SearchHeadLevel - Scheduled Searches without a configured earliest and latest time
SearchHeadLevel - Scheduled searches not specifying an index
SearchHeadLevel - Scheduled searches not specifying an index macro version
SearchHeadLevel - Scheduled Searches Configured with incorrect sharing
SearchHeadLevel - Saved Searches with privileged owners and excessive write perms
SearchHeadLevel - User - Dashboards searching all indexes
SearchHeadLevel - User - Dashboards searching all indexes macro version
SearchHeadLevel - Users exceeding the disk quota (recent jobs list uses a REST call so you may need to adjust the search)

The following reports also are specific to a search head or search head cluster:
SearchHeadLevel - Alerts that have not fired an action in X days
SearchHeadLevel - Data Model Acceleration Completion Status
SearchHeadLevel - Macro report
What Access Do I Have?

The following dashboards are search head or search head cluster specific:
Data Model Rebuild Monitor
Data Model Status

The following reports / alert must either run on the cluster master or a server where the cluster master is a peer:
ClusterMasterLevel - Per index status
ClusterMasterLevel - Primary bucket count per peer

****************
Using the application
****************
Once the application is installed, *all* alerts are disabled by default and you can enable those you require or want to test in your local environment
If you choose not to customise the macros then many searches will search for all hosts, which will make the alerts and dashboards inaccurate!

****************
Which alerts should be enabled?
****************
The alerts are all useful for detecting a variety of different scenarios which may or may not be applicable within your Splunk environment
The description field has an (extremely) simple way of determining if an alert will require action, there are three levels:
  Low - the alert is informational and likely relates to a potential issue, these alerts may produce false alarms
  Moderate - the alert is a warning, most likely further action will need to be taken, a moderate chance of false alarms
  High - the alert is likely relating to something that requires action and there is a very low chance that this will create false alarms

I do not have a nice way to auto-enable various alerts excluding editing the local/savedsearches.conf or via the GUI, any contribution of a setup file would be welcome here!

****************
How is this application used?
****************
In the current environment the vast majority of the alerts are enabled to detect issues, they raise automated tickets or email depending on the urgency of the specific alert.
There are a few environment characteristics that may require changes to the way the app is used, and feedback is welcome if there is a nicer way to structure the alerts/application
The overall assumption is that the admin(s) are not carefully watching the splunkd logs or the messages in the console of the monitoring server/Splunk servers

****************
How is this application tested?
****************
The universal forwarders in use are installed on a mix of Windows, Linux & AIX servers
All heavy forwarders, and Splunk enterprise installations are Linux based, while I expect the alerts will work with only changes to the macros.conf for a Windows based environment this remains untested
The test environment for this application has a single indexer cluster and two search head clusters

****************
Why was this application and associated conf talk created?
****************
Inspired by articles such as "Things I wish I knew then" and knowledge collected from various conference replays, SplunkAnswers, 200+ support tickets & nearly four years of working on a Splunk environment I decided that I would attempt to share what I have learned in an attempt to prevent others from repeating the same mistakes
There are many Splunk conf talks available on this subject in various conference replays, however my goal was to provide practical steps to implement the ideas. That is why this application exists

****************
Which alerts are best suited to automation?
****************
SearchHeadLevel - Scheduled searches not specifying an index
SearchHeadLevel - Scheduled Searches Configured with incorrect sharing
SearchHeadLevel - Splunk login attempts from users that do not have any LDAP roles
SearchHeadLevel - Scheduled Searches That Cannot Run
SearchHeadLevel - Scheduled Searches without a configured earliest and latest time
SearchHeadLevel - Users exceeding the disk quota
SearchHeadLevel - User - Dashboards searching all indexes

Are all well suited to an automated email using the sendresults command or a similar function as they involve end user configuration which the individual can change/fix

****************
Feedback?
****************
Feel free to open an issue on github or use the contact author on the SplunkBase link and I will try to get back to you when possible, thanks!

****************
Release Notes
****************
2.3.4
Update summary:
Updated SearchHeadLevel - Scheduled searches not specifying an index to exclude 1 additional type of search
Updated SearchHeadLevel - KVStore Or Conf Replication Issues Are Occurring to detect a disconnected member scenario
Updated Troubleshooting indexer CPU & drilldown dashboards to include commmas and the search head field (to make it easier to update to search head instead of indexer hosts) 

2.3.3
Update summary:
New alert SearchHeadLevel - Disabled modular inputs are running
Updated SearchHeadLevel - Detect MongoDB errors to timechart to have no limit on the number of hosts involved
Updated the shutdown macros to find one additional scenarios

2.3.2
Due to resourcing issues on the search heads this includes a few warnings/errors related to performance issues

Update summary:
New alert AllSplunkEnterpriseLevel - Splunk Servers with resource starvation
New alert IndexerLevel - S2SFileReceiver Error
New alert SearchHeadLevel - Captain Switchover Occurring
Updated ForwarderLevel - Splunk Universal Forwarders that are time shifting to include "System time went backwards by..."
Updated IndexerLevel - Failures To Parse Timestamp Correctly (excluding breaking issues) to show when the failure related to been outside the acceptable time window
Updated SearchHeadLevel - User - Dashboards searching all indexes to simplify regex (ignore anything starting with a pipe symbol)
Updated SearchHeadLevel - User - Dashboards searching all indexes macro version to simplify regex (ignore anything starting with a pipe symbol)
Corrected AllSplunkEnterpriseLevel - sendmodalert errors to not show random savedsearch_names when no match is found
Corrected SearchHeadLevel - Alerts that have not fired an action in X days to only show alerts relevant to the current search head/cluster

2.3.1
Update summary:
New alert AllSplunkEnterpriseLevel - Non-existent roles are assigned to users
New alert IndexerLevel - Index not defined
New alert IndexerLevel - Search Failures
New alert SearchHeadLevel - Scheduled searches not specifying an index macro version (detect lack of index= with 1 level of macro expansion)
New alert SearchHeadLevel - Saved Searches with privileged owners and excessive write perms (detect 1 way of accessing data outside your level of access)
New alert SearchHeadLevel - User - Dashboards searching all indexes macro version
New report SearchHeadLevel - Macro report (required by "macro version" alerts)
Updated AllSplunkEnterpriseLevel - TCP or SSL Config Issue to include an additional scenario as reported by a customer
Updated SearchHeadLevel - Scheduled searches not specifying an index to not find searches with macros and to include example query
Updated SearchHeadLevel - Scheduled Searches That Cannot Run to make the message field accurate in all situations
Updated SearchHeadLevel - User - Dashboards searching all indexes to include example query to find indexes, and to not find macro-based queries
Corrected AllSplunkLevel - Unable To Distribute to Peer
Corrected IndexerLevel - Failures To Parse Timestamp Correctly (excluding breaking issues) to correctly exclude broken events & to handle newer 7.0.2 errors

2.3.0
Minor updates to a few alerts and a new alert

Update summary:
New alert AllSplunkEnterpriseLevel - Detect LDAP groups that no longer exist
New alert ClusterMasterLevel - Per index status
New report ClusterMasterLevel - Primary bucket count per peer
Updated AllSplunkEnterpriseLevel - TCP or SSL Config Issue to find most recent (not oldest example)
Updated AllSplunkEnterpriseLevel - Splunk Scheduler skipped searches and the reason to exclude the timewindow upto 10 minutes post-shutdown of an indexer
Updated AllSplunkLevel - TCP Output Processor has paused the data flow to use a stats command instead of raw/host information
Updated DeploymentServer - Unsupported attribute within DS config - to find most recent (not oldest example)
Updated IndexerLevel - Failures To Parse Timestamp Correctly (excluding breaking issues) - to find most recent (not oldest example)
Updated SearchHeadLevel - Detect MongoDB errors - mild tweak to output data, added customisation macros
Updated SearchHeadLevel - Scheduled Searches That Cannot Run - to detect errors in splunkd related to saved searches
Corrected SearchHeadLevel - User - Dashboards searching all indexes - a newline resulted in it working in search but not via the scheduler!

2.2
Not released, combined with 2.3.0
Attempt to reduce false alarms and improve investigationQuery searches
Created macros for shutdown events for indexers/search heads/enterprise servers for excluding false alarms related to restarts

Update summary:
New macro splunkadmins_shutdown_list 
New macro splunkadmins_shutdown_time
Updated AllSplunkEnterpriseLevel - Splunk Scheduler skipped searches and the reason - to use shutdown macro
Updated AllSplunkEnterpriseLevel - Splunk Scheduler excessive delays in executing search - to use shutdown macro
Updated AllSplunkLevel - TCP Output Processor has paused the data flow - to use shutdown macro
Updated AllSplunkLevel - Unable To Distribute to Peer - to use shutdown macro
Updated ForwarderLevel - Splunk forwarders are having issues with sending data to indexers - to use shutdown macro
Updated ForwarderLevel - SplunkStream Errors - to use shutdown macro
Updated ForwarderLevel - Unusual number of duplication alerts - to use shutdown macro & changed the alert to fire on >10 results per host
Updated IndexerLevel - Weekly Truncated Logs Report - hostnames wildcarded to deal with short names (for syslog for example)
Updated SearchHeadLevel - Detect MongoDB errors - timespan increased to 10 minutes and 5 minutes produces false alarms, and added shutdown macro
Updated SearchHeadLevel - KVStore Or Conf Replication Issues Are Occurring - to use shutdown macro

2.1
Added macros which can be customised to the majority of alerts, this reduces the need to customise the alert itself and should make upgrading to new versions of the application easier...

Update summary:
New alert AllSplunkEnterpriseLevel - Unable to dispatch searches due to disk space
New alert IndexerLevel - Unclean Shutdown - Fsck
New macros - various macros introduced due to customer feedback about the requirement to customise the alerts
Updated SearchHeadLevel - Users exceeding the disk quota to list top 10 consumers of disk
Updated SearchHeadLevel - Scheduled Searches That Cannot Run (to ignore the above)
Updated IndexerLevel - Failures To Parse Timestamp Correctly (excluding breaking issues) to list sources per sourcetype/host
Updated ForwarderLevel - Splunk Insufficient Permissions to Read Files to include new macro, hint, invesQuery & to improve the accuracy
Updated SearchHeadLevel - Detect MongoDB errors (customer feedback, includes F/fatal errors now)
README and description updates for searches

2.0
Multiple searches now have an "investigationQuery" in them, the idea is that you can copy and paste the output into a search window and see results relevant to the particular alert
The last few releases have been attempting to reduce false alarms from alerts related to server restarts

Update summary:
New alert IndexerLevel - Cold data location approaching size limits
New/renamed alert AllSplunkEnterpriseLevel - Splunk Scheduler excessive delays in executing search
New/renamed alert AllSplunkEnterpriseLevel - Splunk Scheduler skipped searches and the reason
Updated AllSplunkLevel - Time skew on Splunk Servers
Updated IndexerLevel - Future Dated Events that appeared in the last week
Updated IndexerLevel - Time format has changed multiple log types in one sourcetype
Updated IndexerLevel - Failures To Parse Timestamp Correctly (excluding breaking issues)
Updated IndexerLevel - Weekly Broken Events Report
Updated IndexerLevel - Weekly Truncated Logs Report
Updated IndexerLevel - Old data appearing in Splunk indexes
Updated IndexerLevel - Valid Timestamp Invalid Parsed Time
Updated IndexerLevel - Too many events with the same timestamp
Updated IndexerLevel - Large multiline events using SHOULD_LINEMERGE setting 
Updated SearchHeadLevel - Users exceeding the disk quota
Updated IndexerLevel - Indexer Queues May Have Issues (to be less sensitive to indexqueue issues)
Corrected AllSplunkLevel - TCP Output Processor has paused the data flow
Corrected ForwarderLevel - Splunk Universal Forwarders that are time shifting
Removed AllSplunkEnterpriseLevel - Splunk Servers with time skew (replaced by Time skew on Splunk Servers)
Removed SearchHead Level - Splunk Scheduler excessive delays in executing search (renamed)
Removed SearchHeadLevel - Splunk Scheduler Skipped Searches and the reason (renamed)

1.9
New macro splunkadmins_mongo_source 
New alert IndexerLevel - Too many events with the same timestamp
New alert SearchHeadLevel - Detect MongoDB errors
New dashboard Data Model Status
New dashboard Data Model Rebuild Monitor
Updated AllSplunkEnterpriseLevel - Email Sending Failures to include to the toaddress
Updated SearchHeadLevel - Splunk Users Violating the Search Quota to detect an alternative log message
Updated Scheduled searches not specifying an index
Updated SearchHeadLevel - Scheduled Searches without a configured earliest and latest time
Updated ForwarderLevel - File Too Small to checkCRC occurring multiple times, to handle spaces in filename
Updated ForwarderLevel - crcSalt or initCrcLength change may be required
Updated IndexerLevel - Indexer Queues May Have Issues to be less sensitive
Updated IndexerLevel - Splunk Indexers Losing Contact With Master for an additional scenario
Updated AllSplunkEnterpriseLevel - ulimit on Splunk enterprise servers is below 8192 to improve emails
Corrected AllSplunkLevel - Splunk forwarders that are not talking to the deployment server

1.8
New alert IndexerLevel - Peer will not return results due to outdated generation
New alert SearchHeadLevel - Scheduled searches failing in cluster with 404 error
Updated ForwarderLevel - File Too Small to checkCRC occurring multiple times to have the correct dispatch application
Updated AllSplunkEnterpriseLevel - Email Sending Failures with saved search name
Updated IndexerLevel - Indexer replication queue issues to some peers to be less sensitive as this cannot be tuned in 7.0.0
Updated AllSplunkLevel - Unable To Distribute to Peer to include the peer name
Updated IndexerLevel - Indexer Queues May Have Issues to ensure it fires when neccesary but is not too noisy (this may require tuning)
Corrected ForwarderLevel - Bandwidth Throttling Occurring, this alert was not working as expected 

1.7
New macro splunkadmins_splunkuf_source
New alert "AllSplunkEnterpriseLevel - TCP or SSL Config Issue" for detecting when the listener ports fail to start on a HF/Indexer
Updated macro splunkindexerhostsvalue to include splunk_server=
Updated searches to use (`splunkadmins_splunkd_source`) in brackets so it looks valid when expanded (and allows a future OR/NOT statement to be added before or after with no unexpected side effects)
Updated a few comments and improved some searches to narrow down the required hosts/sources/sourcetypes
Removed unused macro splunkenterprisehostsvalue
Removed hardcoded references to the location of splunkd.log file and replaced with splunkadmins_splunkd_source macro
Removed a few unnessary fields/fixed some other minor issues within the file

1.6
Removed "Splunk Alert failures" and updated "AllSplunkEnterpriseLevel - sendmodalert errors", also updated "Time format has changed" alert to have more clear output via email

1.5
Updated Splunk Alert Failures alert and the Time format has changed alerts to have more clear output via email
Simplified "Scheduled Searches without a configured earliest and latest time", and "Scheduled searches not specifying an index"

1.4
Two new alerts LicenseMaster - Duplicated License Situation, DeploymentServer - Unsupported attribute within DS config
Simplified Scheduled Searches without a configured earliest and latest time, and Scheduled searches not specifying an index
Created a macro splunkadmins_splunkd_source for Windows users or others using non-standard Splunk installation directories

1.0 to 1.3
Creation of app, addition of icons and removal of email functionality from the app for Splunk certification purposes

****************
Other
****************
Icons made by Freepik (http://www.freepik.com) from www.flaticon.com is licensed by Creative Commons BY 3.0 (http://creativecommons.org/licenses/by/3.0)

About

Splunk Admins application to assist with troubleshooting Splunk enterprise installations

Resources

License

Stars

Watchers

Forks

Packages

No packages published