This chapter will help you decide the best method for getting data into your Splunk app, as well as provide tips on how to work with your data.
Getting data into your Splunk Enterprise app is the essential first step toward understanding your data. The app infrastructure provides a number of mechanisms for ingesting data—from reading simple, well-known log file formats like Apache logs, to invoking programs to handle custom data formats. We provide suggestions for how to acquire the data for particular app scenarios, and for how to normalize your data to make it available to other apps, ensuring your data is robust and flexible. Deciding how you're going to get the data into your app often occurs in parallel with understanding how you're going to search your data. You might even need to augment your data at index time to make your data more meaningful.
• Configuring input sources
• Specifying your input source
• Input type classification
• Files and directories
• Network events
• Windows sources
• Other input sources
• Logging: Proven operational practices
• Use timestamps
• Format log data efficiently
• Choose modular input over scripted input
• Working with modular input
• Working with scripted input
• Managing initial data collection
• Distributed deployment considerations
• Choose the SDK over UDP/TCP or REST
• Use CIM to normalize log data
• Use HTTP Event Collector
• Testing and staging data inputs
• Simulate your event stream
To realize the full potential of Splunk Enterprise, you first have to give it data. When you give Splunk Enterprise data, it indexes the data by transforming it into a series of events that contain searchable fields. To feed a new set of data to Splunk Enterprise, you configure a data input by pointing Splunk Enterprise to the source of the data and then describing it.
Splunk Enterprise can index many different kinds of data, as illustrated by the following diagram.
Splunk Enterprise works best with time-series data (data with timestamps).
Here are some of the ways you can specify your input source.
Apps/add-ons. Splunk Enterprise has a large and growing number of apps and add-ons that support preconfigured inputs for particular types of data sources.
Splunk Enterprise is an extensible platform. In addition to a number of preconfigured inputs, the developer community makes it easy to get your data into Splunk Enterprise by building and sharing add-ons and tools for you to reuse. Check out splunkbase.splunk.com to see what's available.
For information about using apps and add-ons to input data to Splunk Enterprise, see "Use apps to get data in" in the "Getting Data In Manual."
Splunk Web. You can use a GUI-based approach to add and configure most inputs. To add data in Splunk Web, click Add data from either Splunk Home or the Settings menu. To configure existing data inputs, go to Settings > Data inputs.
For more information about how to add data using Splunk Web, see "How do you want to add data?" in the "Getting Data In Manual."
Command line interface (CLI). You can use the CLI to add and configure most input types. To add a data input, navigate to $SPLUNK_HOME/bin/ and use the ./splunk add command, followed by the monitor object and the path you want Splunk Enterprise to monitor. For example, the following adds /var/log/ as a data input:
./splunk add monitor /var/log/
To learn more about CLI syntax, see "Administrative CLI commands" in the "Admin Manual."
The inputs.conf file. Input sources are not automatically discovered. You need to specify your inputs in the inputs.conf file. You can use Splunk Web or the CLI to declare and configure your input as a stanza in the inputs.conf file that you can then edit directly. You configure your data input by adding attribute/value pairs to your inputs.conf data input stanza. You can specify multiple attribute/value pairs. When not specified, a default value is assigned for the attribute.
Here's a simple example of adding a network input:
[tcp://:9995]
connection_host = dns
sourcetype = log4j
source = tcp:9995
All data indexed into Splunk Enterprise is assigned a source type that helps identify the data format of the event and where it came from.
This input configuration stanza listens on TCP port 9995 for raw data from any remote server. The data source host is defined as the DNS name of the remote server and all data is assigned the "log4j" source type and "tcp:9995" source identifier.
Include both Windows and Unix style slashes in stanzas for portability.
For a complete list of input types and attributes, see the inputs.conf spec file documentation. Additionally, if you use forwarders to send data from remote servers, you can specify some inputs during forwarder installation.
Be aware that compressed data sent by a forwarder is automatically unzipped by the receiving indexer.
Splunk Enterprise supports the input types described below.
Much of the data you might be interested in comes directly from files and directories; practically every system produces a log file. For the most part, you can use file and directory monitoring input processors to get data from files and directories. File and directory monitoring simply involves setting up your inputs.conf file. No coding is involved. We recommend using time-based log files whenever possible.
A good alternative for file-based inputs is to use forwarders. The advantages of using forwarders on production servers include load balancing and buffering, better scalability and performance, and stronger security. Optionally, you can encrypt or compress the data. For more information, see the "Forwarding Data Manual."
When monitoring a directory, Splunk Enterprise allows you to whitelist files that you want to index and blacklist files that you don't want to index. Check out the inputs.conf specification for other configuration options.
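For illustration, here is a sketch of a monitor stanza in inputs.conf (the path, source type, index, and whitelist/blacklist patterns are hypothetical and should be adapted to your data):

```ini
# Monitor a directory of web server logs.
[monitor:///var/log/httpd]
sourcetype = access_combined
index = web
# Index only files ending in .log; ignore rotated .gz archives.
whitelist = \.log$
blacklist = \.gz$
```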
For more information about monitoring files and directories, see "Get data from files and directories" in the "Getting Data In Manual."
Splunk Enterprise can index data received from a network port, including SNMP events and alert notifications from remote servers. For example, you can index remote data from syslog-ng or any other app that transmits data using TCP. You can also receive and index UDP data, but we recommend using TCP when possible for greater reliability.
Sending log data with syslog is still the primary method to capture network device data from firewalls, routers, switches, and so on.
A good alternative for network inputs is to use an agent. In this case you send data to an output file and let an agent (a forwarder) send data to an indexer. This provides load balancing, greater resiliency, and better scalability.
For more information about capturing data from network sources, see "Get network events" in the "Getting Data In Manual."
Splunk Enterprise includes a wide range of Windows-specific inputs. It also provides pages in Settings for defining the following Windows-specific input types:
Important: You can index and search Windows data on an instance of Splunk Enterprise that is not running on Windows, but you must first use an instance running on Windows to acquire the data. You can do this using an agent (a Splunk forwarder) running on Windows. Configure the forwarder to gather Windows inputs then forward the data to your non-Windows instance.
For more information about acquiring and indexing Windows data with Splunk Enterprise, see "Get Windows data" in the "Getting Data In Manual."
Splunk Enterprise also provides support for scripted and modular inputs for getting data from APIs and other remote data interfaces. These kinds of inputs typically involve programming, and we discuss the capabilities and tradeoffs later in this chapter.
Scripted and modular inputs are bundled as Splunk add-ons, and once installed, contain all the necessary code and configuration to add themselves to the Data inputs section of Splunk Enterprise for further customization and management.
For more information about all the other ways to get data into Splunk Enterprise, see "Other ways to get stuff in" in the "Getting Data In Manual."
When it comes to data collection, remember:
The NIST Guide to Computer Security Log Management contains many useful recommendations on log generation, storage, and effectively balancing a limited quantity of log management resources with a continuous, ever growing supply of log data.
One of the most powerful Splunk Enterprise features is the ability to extract fields from events when you search, creating structure out of unstructured data. An accurate timestamp is critical to understanding the proper sequence of events, to aid in debugging and analysis, and for deriving transactions. Splunk Enterprise automatically timestamps events that don't include timestamps. However, when you have the option to define the format of your data, be sure to include timestamps. Some timestamp-related tips for adding value to your data include:
The typical format of time-stamped data is a key-value pair. (JSON-formatted data is becoming more common, although time extraction is more difficult for JSON data.) Splunk Enterprise classifies source data by source type, such as "syslog," "access_combined," or "apache_error," and extracts timestamps, dividing the data into individual events, which can be single- or multiple-line events. Each timed event is then written to the index for later retrieval using a search command.
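To make the key-value pattern concrete, here is a minimal Python sketch (the log line and field names are invented) of splitting one time-stamped event into a timestamp and attribute/value pairs, similar to what Splunk Enterprise does for you automatically:

```python
import re
from datetime import datetime

SAMPLE_EVENT = '2024-05-04 10:15:30 action=login user=alice status=success'

def parse_event(raw):
    """Split a key-value event into a timestamp and a field dict."""
    # The first 19 characters hold the timestamp in this sample format.
    timestamp = datetime.strptime(raw[:19], '%Y-%m-%d %H:%M:%S')
    # The remainder is a sequence of key=value pairs.
    fields = dict(re.findall(r'(\w+)=(\S+)', raw[20:]))
    return timestamp, fields

timestamp, fields = parse_event(SAMPLE_EVENT)
print(fields['user'])  # alice
```

In practice you rarely write this yourself; Splunk Enterprise extracts these fields at search time once the source type is configured correctly.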
When you define your data format, there are a number of policies and configuration settings that can improve performance and make attribute/value pair extraction from your input data more reliable.
Employ the following policies to ensure that events and fields can be easily parsed:
As a minimum, set the following props.conf file attributes to improve event breaking and timestamp recognition performance:
• TIME_PREFIX: Provides the exact location to start looking for a timestamp pattern. The more precise this is, the faster timestamp processing is.
• MAX_TIMESTAMP_LOOKAHEAD: Indicates how far after TIME_PREFIX the timestamp pattern extends. MAX_TIMESTAMP_LOOKAHEAD has to include the length of the timestamp itself.
• TIME_FORMAT: Indicates the exact timestamp format.
• LINE_BREAKER: Indicates how to break the stream into events. This setting requires a capture group followed immediately by the breaking point. Capture group content is discarded. By defining LINE_BREAKER, you're specifying a break on a definite pattern.
• SHOULD_LINEMERGE: Set to false to disable line merging (breaking on newlines and merging on timestamps), which is known to consume resources, especially for multiline events.
• TRUNCATE: The maximum line/event length, in bytes. The default is 10KB. It's prudent to set a non-default value depending on expected event length.
• KV_MODE: If your input events are not KV pairs, or any other similarly structured format, disable KV_MODE. It is helpful to indicate exactly what should be looked for.
• ANNOTATE_PUNCT: Unless you expect punctuation to be used in your searches, disable its extraction.
Punctuation is useful in security use cases, especially when classifying and grouping similar events and looking for anomalous events. For more information, see the "Classify and group similar events" section of the Knowledge Manager Manual.
Following those simple guidelines can significantly improve your data ingestion experience.
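Putting these settings together, a props.conf stanza for the log4j source type used earlier might look like the following sketch (the timestamp format and line-breaking regex are assumptions about the data, not a general recipe):

```ini
[log4j]
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 25
TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
# Break before each new timestamp; the newline capture group is discarded.
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}
SHOULD_LINEMERGE = false
TRUNCATE = 20000
KV_MODE = none
ANNOTATE_PUNCT = false
```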
For input data that cannot be loaded using standard Splunk Enterprise inputs, you need to acquire your data using custom data ingestion techniques. This usually involves writing code that pulls data using an API, makes network requests, reads from multiple files, or does custom filtering.
Splunk Enterprise offers two mechanisms for defining and handling custom input through custom programming: scripted inputs and modular inputs. Scripted inputs are a legacy Splunk Enterprise mechanism that have generally been superseded by modular inputs, which offer greater flexibility and capability. Both have SDK support for popular languages. If your programming language is not supported by the SDK, you might need to write a low-level modular input. Modular inputs support custom input types, each with its own parameters and a graphical UI for validation. A scripted input is a single script that has no parameterization capability unless you've created a custom UI, and it must be shell-executable with specific STDIN and STDOUT formats. Scripted inputs tend to be more restrictive than modular inputs but are not without their unique capabilities as the following table shows.
|Feature|Scripted Inputs|Modular Inputs|
|Configuration|Parameters defined in separate, non-Splunk configuration|Splunk Web fields treated as native inputs in Settings|
|Specify event boundaries|Yes, but with additional complexity in your script|Yes, XML streaming simplifies specifying event boundaries|
|Single instance mode|Yes, requires manual implementation|Yes|
|Multi-platform support|Yes, you can package your script to include versions for separate platforms|Yes, requires manual implementation|
|Run as Splunk Enterprise user|Yes, you can specify which Splunk Enterprise user can run the script|No, all modular input scripts are run as the Splunk Enterprise system user|
|Custom REST endpoints|No|Yes, modular inputs can be accessed using REST|
|Endpoint permissions|N/A|Access implemented using Splunk Enterprise capabilities|
Modular inputs can be used almost anywhere a scripted input is used. Scripted inputs are quick and easy to implement, but may not be the easiest for an end user; modular inputs require more upfront development work, but are easier for end users to interact with.
Your custom input requires that you provide an input handling function and, as with other input mechanisms, register the input source through the inputs.conf file. Your input function parses events in your data and sends them to your app using the write_event() method. Review the Auth0 app to see how we used modular inputs with Node.js to add validation to the continuation token implementation.
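Under the hood, a modular input script communicates with splunkd by writing an XML stream of events to STDOUT; the SDKs' write_event() method generates this XML for you. As a rough, SDK-free sketch (the stanza name and event data are invented):

```python
from xml.sax.saxutils import escape

def format_event(data, stanza='my_input://default'):
    """Wrap one raw event in the modular input XML stream format."""
    return ('<event stanza="%s"><data>%s</data></event>'
            % (stanza, escape(data)))

def stream_events(events):
    """Return the full <stream> document for a batch of events."""
    return '<stream>%s</stream>' % ''.join(format_event(e) for e in events)

print(stream_events(['user=alice action=login']))
```

In practice you would use the SDK for your language rather than hand-building this XML, but the sketch shows why event boundaries are easy to express with modular inputs: each event is explicitly delimited.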
It is also possible to create custom UI setup pages for adding and editing your modular input configurations.
To summarize the main features of a modular input:
If you are new to scripted inputs, it might be helpful to read the Splunk documentation:
As with modular inputs, you register your scripted input in the inputs.conf file. Any script that the operating system can run can be used as a scripted input, and any output from the script (to STDOUT by default) ends up in your index.
Scripted inputs are often much easier to implement than modular inputs if all you are doing is retrieving data from a source and adding it to Splunk Enterprise.
A session key will be passed to the script if passAuth=true, so you can interact with the Splunk REST API.
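As a small sketch (the helper name is ours, and the exact STDIN format can vary by input type), the first line of STDIN can be parsed like this before calling the REST API:

```python
import sys

def parse_session_key(first_line):
    """Extract the session key from the first line of STDIN.

    Handles both a bare key and the "sessionKey=<key>" form.
    """
    line = first_line.strip()
    prefix = 'sessionKey='
    return line[len(prefix):] if line.startswith(prefix) else line

# In a real scripted input you would call:
#   session_key = parse_session_key(sys.stdin.readline())
# and send it as "Authorization: Splunk <session_key>" on REST requests.
```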
Additionally, there are some guidelines that can make developing and debugging scripted input easier.
Do not hard code paths in scripts. When referencing file paths in the Splunk Enterprise folder, use the $SPLUNK_HOME environment variable. This environment variable will be automatically expanded to the correct path based on the operating system Splunk Enterprise is running on.
Use the Splunk Enterprise Entity class or position files as a placeholder. Often, you may be calling an API with a scripted or modular input. To query only a specific range of values, use either the Splunk Enterprise Entity class or a position file to keep track of where the last run left off so the next run will pick up at that position. Where your input runs will dictate whether you should use the Entity class or a position file. For position files, avoid using file names that start with a dot, because operating systems usually treat these as special files; for example, prefer a name like position.txt over .position.txt.
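A position file can be as simple as a single line recording the last value processed. The following Python sketch (the file name and checkpoint format are placeholders) reads the checkpoint at startup and writes it back after a successful run:

```python
import os

POSITION_FILE = 'position.txt'  # note: no leading dot in the name

def read_position(path=POSITION_FILE, default='0'):
    """Return the checkpoint left by the previous run, or a default."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip() or default
    return default

def write_position(value, path=POSITION_FILE):
    """Record the checkpoint so the next run resumes where this one stopped."""
    with open(path, 'w') as f:
        f.write(str(value))
```

A typical run would call read_position(), query the API for records newer than the checkpoint, emit those events, and finish with write_position() only after the events were successfully output.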
Use error trapping and logging. The following example demonstrates how to use the Python logging facility:
import logging
import time

try:
    # Some code that may fail, like opening a file
    f = open('/path/to/file')
except IOError as err:
    logging.error('%s - ERROR - File may not exist %s\n'
                  % (time.strftime('%Y-%m-%d %H:%M:%S'), str(err)))
Information logged with logging.error() will be written to splunkd.log as well as a special "_internal" index that can be used for troubleshooting. Anything written to STDERR is also written to splunkd.log, so you could use the following statement in place of logging.error():
sys.stderr.write('%s - ERROR - File may not exist %s\n' % (time.strftime("%Y-%m-%d %H:%M:%S"), str(err)))
It's best to use a separate/dedicated file other than splunkd.log to log modular input errors or events of interest.
Test scripts using splunk cmd. To see the output of a script as if it were run by the Splunk Enterprise system, use the following:
On Linux, OS X, or other *nix systems, use:
/Applications/Splunk/bin/splunk cmd python /Applications/Splunk/etc/apps/<your app>/bin/<your script>
On Windows use:
C:\Program Files\Splunk\bin\splunk.exe cmd C:\Program Files\Splunk\etc\apps\<your app>\bin\<your script>
Use configuration files to store user preferences. Configuration files store specific settings that will vary for different environments. Examples include REST endpoints, API levels, or any specific setting. Configuration files are stored in your app's default and local directories, and their settings cascade. For example, if there is a configuration file called acme.conf in both the default and local directories, settings from the local directory override settings in the default directory.
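Conceptually, the cascade behaves like a dictionary merge in which local wins; the setting names below are invented for illustration:

```python
def resolve_settings(default, local):
    """Merge stanza settings: values in local override values in default."""
    merged = dict(default)
    merged.update(local)
    return merged

default_conf = {'endpoint': 'https://api.example.com', 'api_level': '1'}
local_conf = {'api_level': '2'}

settings = resolve_settings(default_conf, local_conf)
# api_level comes from local; endpoint falls back to default.
```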
A developer should not be thinking about the local directory; settings there belong to the user. Never package/ship your app or add-ons with the local directory.
Use Splunk Enterprise methods to read cascaded settings. The Splunk Enterprise cli_common library contains methods for reading combined settings from configuration files. The following Python example shows how to use getConfStanza():
import splunk.clilib.cli_common

def __init__(self, obj):
    self.object = obj
    self.settings = splunk.clilib.cli_common.getConfStanza("acme", "default")
Use script methods to construct file paths. Here is a Python example:
abs_file_path = os.path.join(script_dir, rel_path)

And a PowerShell example:

$positionFile = Join-Path $positionFilePath $positionFileName
Production servers often have large volumes of historical data that may not be worth collecting. When deploying your app, before starting to index, consider archiving, then deleting old data that should not be indexed and placing historical data in a separate location.
To avoid causing license violations, it may also be prudent to batch uploads of historical data and spread it over time.
Many Splunk Enterprise apps can be deployed in a variety of environments—from a stand-alone, single-instance server to a distributed environment with multiple servers. The Deployment topologies web page describes the different levels of distributed deployment. Choosing a topology depends on an organization's requirements. When developing a Splunk Enterprise app, it's necessary to understand the implications of a distributed architecture on app design, setup, management, and performance.
An app is a unit of modularization and, as such, inherently supports distributed deployment. Like server functionality, which can be all-in-one or distributed, an app can also include all the functionality or have its knowledge objects divided into different sets, to be deployed on different Splunk Enterprise instances. Because you might not know ahead of time the context your app will need to run in, it's a good idea to keep a few considerations in mind when designing your app:
A logical app decomposition strategy is to separate your app into those parts that run on the search head and those parts that run on the indexing tier. The parts that go on the indexers let you ingest the data. The parts that go on the search head let you view the data. Dividing your app among dedicated nodes also lets other apps benefit from your implementation if they need to handle the same interface or implement the same logic. Partitioned apps can share things like a generally useful search string or a parser that correctly ingests data from a particular source.
The main issue in handling large amounts of data is running out of disk space. When an indexer runs out of space, it stops indexing, possibly resulting in data loss. When a search head runs out of space, it stops searching. There are a few strategies and mechanisms you can employ in your design to take advantage of the Splunk Enterprise distributed architecture and mitigate data loss.
You can view detailed performance information about your Splunk Enterprise deployment and configure platform alerts using the Distributed Management Console (DMC).
You can use data model acceleration, which moves the TSIDX idea to the indexing tier so disk space is not taken up on the search head. An advantage of data model acceleration over other acceleration methods is that if new data arrives that has not yet been summarized, it will be summarized automatically.
Use the same props.conf and transforms.conf files on each node to provide the field extraction and index time rules. This ensures uniformity of configuration across a large environment. An indexer will only read the part of the file related to indexing and a forwarder will only read the part of the file related to forwarding. In short, make distributed apps easier to create and manage by avoiding duplication.
Because apps are units of modularization, create small apps targeted at specific functionality:
By designing and implementing your app for a distributed architecture from the beginning, you are likely to produce a more useful and maintainable app. Some additional things to keep in mind:
While other input mechanisms are inherently a pull interface, SDK, UDP/TCP, and REST input mechanisms can both pull and push. In case of a pull, a Splunk Enterprise instance requests (pulls) the data. In case of a push, the data transfer is initiated by the external system and not by Splunk Enterprise. The discussion of both approaches and implementation details of the Auth0 reference application is included in the Journey.
Favor pulling data using an SDK over pushing.
While a push-style interface can be responsive to the time-sensitive requirements of the application, it is also possible that the external system providing the data can be blocked if the data rate is too high and there is insufficient buffering. In terms of performance, UDP/TCP transfer ranks as the highest performing method, followed by the SDK, with a basic REST interface as the lowest performing method.
Another disadvantage of these protocols is that authentication and authorization are needed for the data to reach the server but Splunk Enterprise does not currently support third-party authentication providers.
One task you might want your app to perform is to handle conceptually related data that comes from different sources, or from the same source but whose representation has changed over time. While the data might look different, it is the same kind of data, and you want to analyze and report on it as if it were from a single source. The obvious way to solve this problem is to write a separate search for each different data representation. However, that approach is limited, especially when the number of sources or different representations is large.
For example, say you are monitoring antivirus program results produced by a number of different antivirus program vendors. Instead of searching the results produced by each vendor and then somehow associating those results, you prefer to normalize those results, where similar notifications and alarms map to the same event at the conceptual level, before handing the results to your app. Or, suppose the format or representation of the data changes so you have older data in one form and newer data in another form, but both old and new data contain essentially the same information. It would be helpful to have a mechanism that normalizes the results before they get to your app.
The Splunk Enterprise Common Information Model (CIM) is intended as an easy way to provide data normalization, and includes supported data models for common application domains. CIM is implemented as an add-on that normalizes your data and presents the data as knowledge objects for your app to process. See the "Knowledge Manager Manual" for an introduction to CIM and the "Common Information Model Add-on Manual" for a list of data models and more detailed information about how to use CIM. This document shows the tags you need for the model, how they're mapped, what each field and its data type are, and whether the field is required. You will also need to use the Pivot Editor, so read the "Data Model and Pivot Tutorial" to learn how to use the tool with data models. For an example of our experience with CIM in the Journey, review "Working with data: where it comes from and how we manage it."
CIM simplifies the work you and your app need to do by presenting the data at a conceptual level instead of needing to use all the data available to represent an entity. Some of that data might not be important to your app. CIM lets you ask simple, generalized questions about your data and generalizes your query across all data sources. It is easier to write a search against the generalization than against separate but similar data items. However, your data can often align with more than one model, so one of the first things you'll need to do is analyze your data and see which model is the best fit. This manual process requires you to inspect the data coming into your app.
An important fact about CIM is that the conceptual model you define for your data is applied at search time, not index time. So you can always go back to your raw data, if needed. You can still choose to accelerate your data model to get improved performance, trading off indexer load for fast access to search results.
A disadvantage of the CIM implementation is that the models are not structurally related entities, so you can't ingest the results of one data model into another data model. The models are separate entities populated at search time. Also, data is mapped to the least common denominator across all your data sources. CIM does not attempt to resolve all of the disparate data but, because associations are made at search time, the raw events are still available to your application.
A number of .conf files and search language constructs are involved in using CIM to define your data sources, tags, and data transformations. Your raw data sources are defined in the inputs.conf file. You perform extractions and lookups, define regular expressions, and apply tags to your data items using the props.conf, transforms.conf, and tags.conf files. The next step is to use the datamodels.conf file to apply a model schema, optionally using acceleration to improve performance, and using constraints to select data and add meaning to your data. Define your saved searches in the savedsearches.conf file.
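As an illustrative sketch (the source type, eventtype name, and search are made up), tagging antivirus events so the CIM Malware data model can pick them up might involve stanzas like these:

```ini
# eventtypes.conf: identify the relevant events
[acme_av_alert]
search = sourcetype=acme:antivirus signature=*

# tags.conf: tag the eventtype with the tags the Malware model expects
[eventtype=acme_av_alert]
malware = enabled
attack = enabled
```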
Appendix C provides a cheat sheet associating configuration file names with their functions.
Data model configuration is stored in datamodels.conf using a JSON structure. However, it's easier to manage this configuration in the UI.
Common steps for creating a CIM-compliant app include using the UI to:
• Choose the supported model that best matches your application.
• Map your fields in the transforms.conf file, mapping existing field names to applicable model field names. Optionally, provide regex definitions to extract fields. Alternatively, you can extract fields using the interactive field extractor or the props.conf file. A time-saving step is to verify your definitions using the Search app.
• Tag your data by creating an eventtype and defining a tag for it. You might need to create field aliases for your data to match what the model requires.
In summary, CIM is a good choice if you want to design your app for interoperability. Benefits are that you can do searches by calling one macro using tags instead of doing separate searches for all the commands. CIM permits different vendors and data sources to interoperate and creates relevant data that works with apps dedicated to solving specific problems.
The basics of HTTP Event Collector on Splunk Enterprise are relatively simple:
HTTP Event Collector is great for data input scenarios like the following:
For more information about HTTP Event Collector, see "Building in telemetry with high-performance data collection" or see "Introduction to Splunk HTTP Event Collector" on the Splunk Developer Portal.
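To sketch the client side (the host, port, and token below are placeholders), an HEC event is a JSON document POSTed to the /services/collector endpoint with the token in an Authorization header. This helper only builds the request; sending it is left to your HTTP client of choice:

```python
import json

def build_hec_request(host, token, event, sourcetype='my_app:event'):
    """Build the URL, headers, and body for a single HEC event POST."""
    url = 'https://%s:8088/services/collector' % host
    headers = {
        'Authorization': 'Splunk %s' % token,
        'Content-Type': 'application/json',
    }
    body = json.dumps({'event': event, 'sourcetype': sourcetype})
    return url, headers, body

url, headers, body = build_hec_request(
    'localhost', '11111111-1111-1111-1111-111111111111',
    {'action': 'purchase', 'amount': 42})
```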
Inputs are distinctive and some can be quite idiosyncratic. Therefore, you should designate a part of your Splunk Enterprise deployment to testing and staging. Use the sandbox for testing all new inputs (custom built or obtained from Splunk or other providers) before putting them into the production environment.
Make sure that your sample data set is large enough and robust enough to detect edge cases.
If you are unable to procure a sandbox Splunk Enterprise deployment, at least use a staging index for testing to avoid polluting your main indexes. The staging index can be deleted when tests have completed.
You usually want to begin testing the input handling part of your app early in the development cycle, and for that you need data to be available. But sometimes a live feed isn't available, the data is not immediately available in the format or volume you need, or the data is available but, to limit your testing time, you want only the data from a certain time interval or with specific values. In such cases, you can simulate your data. Splunk provides Eventgen, an event generation tool. Eventgen can replay events from a file or series of files, or randomly extract entries from a file, generating events at random intervals and changing particular fields or values according to your specification. You can use Eventgen not only to generate known and random events for your app, but also to generate events that reflect natural usage patterns.
For your sample data, it's often convenient to start with an existing log file and use the eventgen.conf specification to do token replacement. You can also create a file of tokens and let the eventgen.conf replacement settings generate data the way you want it to look.
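A minimal eventgen.conf sketch (the sample file name and token regex are illustrative) that replays a sample file and rewrites its timestamps to the current time:

```ini
[sample_app.log]
interval = 60
earliest = -60s
latest = now

# Replace the original timestamps with freshly generated ones.
token.0.token = \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
token.0.replacementType = timestamp
token.0.replacement = %Y-%m-%d %H:%M:%S
```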
Eventgen is included with the PAS reference app download and install script. We preconfigured it to generate events for the Off-Hours Access and Terminated Employee Access scenarios. Run the install_addons.sh script (or install_addons.ps1 script for Windows PowerShell) at $SPLUNK_HOME/etc/apps/pas_ref_app/bin/ to install the PAS app, and it will install and configure Eventgen at the same time. The script creates a symbolic link in the $SPLUNK_HOME/etc/apps/ folder to the $SPLUNK_HOME/etc/apps/pas_ref_app/appserver/addons/eventgen folder. When you restart Splunk Enterprise, Eventgen starts and immediately begins creating events according to the PAS settings.
Here are the basic steps you'll need to follow to install and set up Eventgen using your own custom settings:
• Download Eventgen, decompress it into the $SPLUNK_HOME/etc/apps/ folder, and then rename the decompressed folder to eventgen.
• Create the folder $SPLUNK_HOME/etc/apps/$MYAPP/samples (where $MYAPP represents your app folder).
• Copy your sample data file into the /samples folder you just created. This is the data that Eventgen uses to replicate data as input to your app.
• In the $SPLUNK_HOME/etc/apps/$MYAPP/local folder, create the eventgen.conf file. You can also copy and modify the $SPLUNK_HOME/etc/apps/eventgen/README/eventgen.conf.example file you downloaded earlier.
• Edit the eventgen.conf file to reference the sample data file in the /samples folder.
If you intend to ship Eventgen configuration with your app, include it in the app's default folder.
The Eventgen tool can be run as an add-on, or as a scripted or modular input inside your app. See the $SPLUNK_HOME/etc/apps/eventgen/README/eventgen.conf.spec file for a complete description of the options available to you for generating sample data. You can also view the eventgen.conf.spec file on GitHub.
In addition to the preconfigured Eventgen install provided by the PAS reference app as a part of the Journey, see a good, example-based description in the Eventgen tutorial on GitHub. Eventgen can also be run as a standalone utility, using eventgen.py in the $SPLUNK_HOME/etc/apps/eventgen/bin folder. The tutorial provides descriptions of the proper settings for the various modes of operation and for getting the data in the /samples folder into your app.
See how we used Eventgen in the "Test and sample data" section in the "Platform and tools: a kitbag for our Journey" chapter of the Journey. You can find other interesting examples in the Splunk Blog's Tips & Tricks category with a basic introduction to Eventgen, how to create random data in events, and how to sample events randomly from the data set.
Appendix C contains tips for troubleshooting Eventgen.