Getting data in

This chapter will help you decide the best method for getting data into your Splunk app, as well as provide tips on how to work with your data.

Getting data into your Splunk Enterprise app is the essential first step toward understanding your data. The app infrastructure provides a number of mechanisms for ingesting data—from reading simple, well-known log file formats like Apache logs, to invoking programs to handle custom data formats. We provide suggestions for how to acquire the data for particular app scenarios, and for how to normalize your data to make it available to other apps, ensuring your data is robust and flexible. Deciding how you're going to get the data into your app often occurs in parallel with understanding how you're going to search your data. You might even need to augment your data at index time to make your data more meaningful.

     •  Configuring input sources
               •  Specifying your input source
               •  Input type classification
                         •  Files and directories
                         •  Network events
                         •  Windows sources
                         •  Other input sources
     •  Logging: Proven operational practices
     •  Use timestamps
     •  Format log data efficiently
     •  Choose modular inputs over scripted inputs
               •  Working with modular inputs
               •  Working with scripted inputs
               •  Managing initial data collection
               •  Distributed deployment considerations
     •  Choose the SDK over UDP/TCP or REST
     •  Use CIM to normalize log data
     •  Use HTTP Event Collector
     •  Testing and staging data inputs
     •  Simulate your event stream

Configuring input sources

To realize the full potential of Splunk Enterprise, you first have to give it data. When you give Splunk Enterprise data, it indexes the data by transforming it into a series of events that contain searchable fields. To feed a new set of data to Splunk Enterprise, you configure a data input by pointing Splunk Enterprise to the source of the data and then describing it.

Splunk Enterprise can index many different kinds of data, as illustrated by the following diagram.

Diagram illustrating what Splunk can index: Customer-Facing Data, Windows, Linux/Unix, Virtualization and Cloud, Applications, Databases, Networking, Outside the Data Center

ARCH

Splunk Enterprise works best with time-series data (data with timestamps).



Specifying your input source

Here are some of the ways you can specify your input source.

Apps/add-ons. Splunk Enterprise has a large and growing number of apps and add-ons that support preconfigured inputs for particular types of data sources. 

BUS

Splunk Enterprise is an extensible platform. In addition to a number of preconfigured inputs, the developer community makes it easy to get your data into Splunk Enterprise by building and sharing add-ons and tools for you to reuse. Check out splunkbase.splunk.com to see what's available. 

For information about using apps and add-ons to input data to Splunk Enterprise, see "Use apps to get data in" in the "Getting Data In Manual."

Splunk Web. You can use a GUI-based approach to add and configure most inputs. To add data in Splunk Web, click Add data from either Splunk Home or the Settings menu. To configure existing data inputs, go to Settings > Data inputs.

For more information about how to add data using Splunk Web, see "How do you want to add data?" in the "Getting Data In Manual."

Command line interface (CLI). You can use the CLI to add and configure most input types. To add a data input, navigate to $SPLUNK_HOME/bin/ and use the ./splunk add command, followed by the monitor object and the path you want Splunk Enterprise to monitor. For example, the following adds /var/log/ as a data input:

./splunk add monitor /var/log/

To learn more about CLI syntax, see "Administrative CLI commands" in the "Admin Manual."

The inputs.conf file. Input sources are not automatically discovered. You need to specify your inputs in the inputs.conf file. You can use Splunk Web or the CLI to declare and configure your input as a stanza in the inputs.conf file that you can then edit directly. You configure your data input by adding attribute/value pairs to your inputs.conf data input stanza. You can specify multiple attribute/value pairs. When not specified, a default value is assigned for the attribute.

Here's a simple example of adding a network input:

[tcp://:9995]
connection_host = dns
sourcetype = log4j
source = tcp:9995

ARCH

All data indexed into Splunk Enterprise is assigned a source type that helps identify the data format of the event and where it came from.

This input configuration stanza listens on TCP port 9995 for raw data from any remote server. The data source host is defined as the DNS name of the remote server and all data is assigned the "log4j" source type and "tcp:9995" source identifier.

DEV

Include both Windows and Unix style slashes in stanzas for portability.


For a complete list of input types and attributes, see the inputs.conf spec file documentation. Additionally, if you use forwarders to send data from remote servers, you can specify some inputs during forwarder installation. 

ARCH

Be aware that compressed data sent by a forwarder is automatically unzipped by the receiving indexer.



Input type classification

Splunk Enterprise supports the input types described below.

Files and directories

Much of the data you might be interested in comes directly from files and directories; practically every system produces a log file. For the most part, you can use file and directory monitoring input processors to get data from files and directories. File and directory monitoring simply involves setting up your inputs.conf file. No coding is involved. We recommend using time-based log files whenever possible.

ADM

A good alternative for file-based inputs is to use forwarders. The advantages of using forwarders on production servers include load balancing and buffering, better scalability and performance, and stronger security. Optionally, you can encrypt or compress the data. For more information, see the "Forwarding Data Manual."


DEV

When monitoring a directory, Splunk Enterprise allows you to whitelist files that you want to index and blacklist files that you don't want to index. Check out the inputs.conf specification for other configuration options.
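For illustration only, a monitor stanza that combines a whitelist and blacklist might look like the following in inputs.conf; the path, patterns, and source type are placeholders to adapt to your own data:

[monitor:///var/log/acme]
whitelist = \.log$
blacklist = \.(gz|zip|bz2)$
sourcetype = acme:app:log
disabled = false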

For more information about monitoring files and directories, see "Get data from files and directories" in the "Getting Data In Manual."

Network events

Splunk Enterprise can index data received from a network port, including SNMP events and alert notifications from remote servers. For example, you can index remote data from syslog-ng or any other app that transmits data using TCP. You can also receive and index UDP data, but we recommend using TCP when possible for greater reliability.

ARCH

Sending log data with syslog is still the primary method to capture network device data from firewalls, routers, switches, and so on.


ADM

A good alternative for network inputs is to use an agent. In this case you send data to an output file and let an agent (a forwarder) send data to an indexer. This provides load balancing, greater resiliency, and better scalability.

For more information about capturing data from network sources, see "Get network events" in the "Getting Data In Manual."

Windows sources

Splunk Enterprise includes a wide range of Windows-specific inputs. It also provides pages in Settings for defining the following Windows-specific input types:

  • Windows event log, registry, perfmon, and WMI data
  • Active Directory data
  • Performance monitoring data

ADM

Important: You can index and search Windows data on an instance of Splunk Enterprise that is not running on Windows, but you must first use an instance running on Windows to acquire the data. You can do this using an agent (a Splunk forwarder) running on Windows. Configure the forwarder to gather Windows inputs then forward the data to your non-Windows instance.

For more information about acquiring and indexing Windows data with Splunk Enterprise, see "Get Windows data" in the "Getting Data In Manual."

Other input sources

Splunk Enterprise also provides support for scripted and modular inputs for getting data from APIs and other remote data interfaces. These kinds of inputs typically involve programming, and we discuss their capabilities and tradeoffs later in this chapter.

ADM

Scripted and modular inputs are bundled as Splunk add-ons, and once installed, contain all the necessary code and configuration to add themselves to the Data inputs section of Splunk Enterprise for further customization and management.

For more information about all the other ways to get data into Splunk Enterprise, see "Other ways to get stuff in" in the "Getting Data In Manual."

Logging: Proven operational practices

When it comes to data collection, remember:

  • Log events from everything, everywhere. This includes apps, web servers, databases, networks, configuration logs, and external system communications, as well as performance data.
  • Monitor at the source type level.
  • Use Splunk Enterprise forwarders.
  • Use rotation and retention policies.
  • Use smaller log files. Base your rolling strategy on log size rather than on time markers.
  • Keep the latest two logs in text format before you compress; otherwise, you may miss delayed events.
  • Use a blacklist to eliminate compressed files from file monitor inputs and to avoid event duplication.

SEC

The NIST Guide to Computer Security Log Management contains many useful recommendations on log generation, storage, and effectively balancing a limited quantity of log management resources with a continuous, ever growing supply of log data.


Use timestamps

One of the most powerful Splunk Enterprise features is the ability to extract fields from events when you search, creating structure out of unstructured data. An accurate timestamp is critical to understanding the proper sequence of events, to aid in debugging and analysis, and for deriving transactions. Splunk Enterprise automatically timestamps events that don't include timestamps. However, when you have the option to define the format of your data, be sure to include timestamps. Some timestamp-related tips for adding value to your data include:

  • Use the most verbose time granularity possible, preferably with microsecond resolution.
  • Put the timestamp at the beginning of the line. The further you place a timestamp from the beginning of the event, the more difficult it is to distinguish the timestamp from other data.
  • Include a four-digit year and time zone, preferably a GMT/UTC offset.
  • Ideally, use epoch time—which is already normalized to UTC so no parsing is needed—for easy ingestion by Splunk Enterprise. But remember that this 10-digit string can also be mistaken for data.

The typical format of time-stamped data is a key-value pair. (JSON-formatted data is becoming more common, although time extraction is more difficult for JSON data.) Splunk Enterprise classifies source data by source type, such as "syslog," "access_combined," or "apache_error," and extracts timestamps, dividing the data into individual events, which can be single- or multiple-line events. Each timed event is then written to the index for later retrieval using a search command.
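For example, a hypothetical event that follows these guidelines puts a full timestamp (four-digit year, sub-second precision, and a UTC offset) first, followed by clear key-value pairs:

2015-09-21 14:03:12.345 -0700 action=purchase status=success user_id=4821 latency_ms=32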

Format log data efficiently

When you define your data format, a number of policies and configuration settings can improve performance and ensure reliable attribute/value pair extraction from your input data.

Employ the following policies to ensure that events and fields can be easily parsed:

  • Begin the log line event with a timestamp.
  • Use clear key-value pairs.
  • Use the Common Information Model (discussed later in this chapter) when using key-value pairs.
  • Create human readable events in text format, including JSON.
  • Use unique identifiers.
  • Use developer-friendly formats.
  • Log more than debugging events.
  • Use categories.
  • Identify the source.
  • Minimize multi-line events.

At a minimum, set the following props.conf file attributes to improve event breaking and timestamp recognition performance (a sample stanza appears after the list):

  • TIME_PREFIX: Provides the exact location to start looking for a timestamp pattern. The more precise this is, the faster timestamp processing is.
  • MAX_TIMESTAMP_LOOKAHEAD: Indicates how far after TIME_PREFIX the timestamp pattern extends. MAX_TIMESTAMP_LOOKAHEAD has to include the length of the timestamp itself.
  • TIME_FORMAT: Indicates the exact timestamp format. TIME_FORMAT uses strptime rules.
  • LINE_BREAKER: Indicates how to break the stream into events. This setting requires a capture group followed immediately by the breaking point. Capture group content is discarded. By defining LINE_BREAKER you're specifying a break on a definite pattern.
  • SHOULD_LINEMERGE: Set this to false to avoid line merging (breaking on newlines and re-merging on timestamps), which is known to consume resources, especially for multiline events.
  • TRUNCATE: The maximum line/event length, in bytes. The default is 10KB. It's prudent to have a non-default value depending on expected event length.
  • KV_MODE: If you do not have input events as KV pairs, or any other similarly structured format, disable KV_MODE. It is helpful to indicate exactly what should be looked for.
  • ANNOTATE_PUNCT: Unless you expect punctuation to be used in your searches, disable its extraction.

SEC

Punctuation is useful in security use cases, especially when classifying and grouping similar events and looking for anomalous events. For more information, see the "Classify and group similar events" section of the Knowledge Manager Manual.
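As an illustration, a hypothetical props.conf stanza for a source type whose events begin with a timestamp like 2015-09-21 14:03:12.345 -0700 might look like the following; the source type name and the values are placeholders to adapt to your own data:

[acme:app:log]
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N %z
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
SHOULD_LINEMERGE = false
TRUNCATE = 8192
KV_MODE = auto
ANNOTATE_PUNCT = false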

Following those simple guidelines can significantly improve your data ingestion experience.

Choose modular inputs over scripted inputs

For input data that cannot be loaded using standard Splunk Enterprise inputs, you need to acquire your data using custom data ingestion techniques. This usually involves writing code that pulls data using an API, makes network requests, reads from multiple files, or does custom filtering.

Splunk Enterprise offers two mechanisms for defining and handling custom input through custom programming: scripted inputs and modular inputs. Scripted inputs are a legacy Splunk Enterprise mechanism that have generally been superseded by modular inputs, which offer greater flexibility and capability. Both have SDK support for popular languages. If your programming language is not supported by the SDK, you might need to write a low-level modular input. Modular inputs support custom input types, each with its own parameters and a graphical UI for validation. A scripted input is a single script that has no parameterization capability unless you've created a custom UI, and it must be shell-executable with specific STDIN and STDOUT formats. Scripted inputs tend to be more restrictive than modular inputs but are not without their unique capabilities as the following table shows.

Feature comparison: scripted inputs versus modular inputs

  • Configuration. Scripted inputs: inline arguments or a separate, non-Splunk configuration. Modular inputs: parameters defined in inputs.conf, Splunk Web fields treated as native inputs in Settings, and validation support.
  • Specify event boundaries. Scripted inputs: yes, but with additional complexity in your script. Modular inputs: yes; XML streaming simplifies specifying event boundaries.
  • Single instance mode. Scripted inputs: yes, but requires manual implementation. Modular inputs: yes.
  • Multi-platform support. Scripted inputs: no. Modular inputs: yes; you can package your script to include versions for separate platforms.
  • Checkpointing. Scripted inputs: yes, but requires manual implementation. Modular inputs: yes.
  • Run as Splunk Enterprise user. Scripted inputs: yes; you can specify which Splunk Enterprise user can run the script. Modular inputs: no; all modular input scripts are run as the Splunk Enterprise system user.
  • Custom REST endpoints. Scripted inputs: no. Modular inputs: yes; modular inputs can be accessed using REST.
  • Endpoint permissions. Scripted inputs: N/A. Modular inputs: access implemented using Splunk Enterprise capabilities.

Modular inputs can be used almost anywhere a scripted input is used. However, while a scripted input might offer a quick and easy implementation compared with the development effort of a modular input, it might not be as easy to use as a modular input.

UX

Scripted inputs are quick and easy, but may not be the easiest for an end user. Modular inputs require more upfront work, but are easier for end user interaction.


Working with modular inputs

SDKs are available, including developer documentation, for working with modular inputs in the Python, JavaScript, Java, and C# languages. You might also find the Implement modular inputs section of the developer documentation useful for detailed information about how to create a modular input. For other languages, you would need to write a low-level modular input through the Splunk REST API, which requires more effort.

DEV

Splunk provides additional tooling for building modular inputs: the Eclipse Plug-in and the Visual Studio Extension.

Your custom input requires that you provide an input handling function and, as with other input mechanisms, register the input source through the inputs.conf file. Your input function parses events in your data and sends them to your app using the write_event() method. Review the Auth0 app to see how we used modular inputs with Node.js to add validation to the continuation token implementation.
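To make this concrete, here is a minimal sketch of a modular input, assuming the Splunk SDK for Python (splunklib) rather than the Node.js SDK used by the Auth0 app; the AcmeInput class, the acme_input scheme name, the api_url parameter, and the placeholder event data are hypothetical:

import sys
from splunklib.modularinput import Script, Scheme, Argument, Event

class AcmeInput(Script):
    def get_scheme(self):
        # Defines the input type and its parameters shown in Settings > Data inputs
        scheme = Scheme("acme_input")
        scheme.description = "Polls a hypothetical Acme API"
        arg = Argument("api_url")
        arg.description = "Base URL of the API to poll"
        arg.required_on_create = True
        scheme.add_argument(arg)
        return scheme

    def validate_input(self, validation_definition):
        # Raise an exception here to reject invalid parameter values
        pass

    def stream_events(self, inputs, ew):
        # One stanza per configured instance of this input
        for input_name, input_item in inputs.inputs.items():
            event = Event()
            event.stanza = input_name
            event.data = "action=poll api_url=%s" % input_item["api_url"]
            ew.write_event(event)

if __name__ == "__main__":
    sys.exit(AcmeInput().run(sys.argv))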

ARCH

It is also possible to create custom UI setup pages for adding and editing your modular input configurations.


To summarize the main features of a modular input:

  • Each instance you create is separately configurable, and supported by a configuration UI.
  • It runs custom code that you write.
  • Parameters are automatically validated and you are alerted in the UI of a parameter violation.
  • It offers a rich programming model using the SDKs, which removes a lot of error-prone and redundant plumbing code.

Working with scripted inputs

If you are new to scripted inputs, it might be helpful to read the Splunk documentation on scripted inputs.

Like modular inputs, you register your scripted input in the inputs.conf file. Any script that the operating system can run can be used for scripted input. And any output from the script, to STDOUT by default, also ends up in your index.
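As a hedged illustration, registering a scripted input might look like the following inputs.conf stanza; the script path, interval, source type, and index are placeholders:

[script://./bin/acme_poll.py]
interval = 300
sourcetype = acme:api
index = main
disabled = false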

ARCH

Scripted inputs are often much easier to implement than modular inputs if all you are doing is retrieving data from a source and adding it to Splunk Enterprise.


DEV

A session key will be passed to the script if passAuth=true, so you can interact with the Splunk REST API.

Additionally, there are some guidelines that can make developing and debugging scripted input easier.

Do not hard code paths in scripts. When referencing file paths in the Splunk Enterprise folder, use the $SPLUNK_HOME environment variable. This environment variable will be automatically expanded to the correct path based on the operating system Splunk Enterprise is running on.

Use the Splunk Enterprise Entity class or position files as a placeholder. Often, you may be calling an API with a scripted or modular input. In order to only query a specific range of values, use either the Splunk Enterprise Entity class or a position file to keep track of where the last run left off so the next run will pick up at that position. Where your input runs will dictate whether you should use the Entity class or a position file. For position files, avoid using files that start with a dot, as operating systems usually treat these types of files as special files. For example, instead of .pos, use acme.pos.

Use error trapping and logging. The following example demonstrates how to use the Python logging facility:

import logging
import time

try:
    # Some code that may fail, such as opening a file
    data_file = open('/path/to/your/data.log')
except IOError as err:
    logging.error('%s - ERROR - File may not exist %s\n' % (time.strftime("%Y-%m-%d %H:%M:%S"), str(err)))

Information logged with logging.error() will be written to splunkd.log as well as a special "_internal" index that can be used for troubleshooting. Anything written to STDERR is also written to splunkd.log, so you could use the following statement in place of logging.error(), above:

sys.stderr.write('%s - ERROR - File may not exist %s\n' % (time.strftime("%Y-%m-%d %H:%M:%S"), str(err)))

DEV

It's best to use a separate/dedicated file other than splunkd.log to log modular input errors or events of interest.

Test scripts using Splunk Enterprise CMD. To see the output of a script as if it were run by the Splunk Enterprise system, use the following:

On *nix, Linux, or OS X use:

/Applications/Splunk/bin/splunk cmd python /Applications/Splunk/etc/apps/<your app>/bin/<your script>

On Windows use:

"C:\Program Files\Splunk\bin\splunk.exe" cmd "C:\Program Files\Splunk\etc\apps\<your app>\bin\<your script>"

Use configuration files to store user preferences. Configuration files store specific settings that will vary for different environments. Examples include REST endpoints, API levels, or any specific setting. Configuration files are stored in either of the following locations and cascade:

$SPLUNK_HOME/etc/apps/<your_app>/default
$SPLUNK_HOME/etc/apps/<your_app>/local

For example, if there is a configuration file called acme.conf in both the default and local directories, settings from the local folder will override settings in the default directory.
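For instance, a hypothetical acme.conf (the file name, stanza, and settings are placeholders) might cascade like this, with the local copy winning for any setting it defines:

# $SPLUNK_HOME/etc/apps/<your_app>/default/acme.conf
[default]
api_url = https://api.example.com
poll_interval = 300

# $SPLUNK_HOME/etc/apps/<your_app>/local/acme.conf
[default]
poll_interval = 60

With these two files in place, the effective settings are api_url from the default directory and poll_interval = 60 from the local directory.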

DEV

A developer should not be thinking about the local directory.


SHIP

Never package/ship your app or add-ons with the local directory.

Use Splunk Enterprise methods to read cascaded settings. The Splunk Enterprise cli_common library contains methods for reading combined settings from configuration files. The following Python example shows how to use cli_common functions:

import splunk.clilib.cli_common

class AcmeInput(object):
    def __init__(self, obj):
        self.object = obj
        self.settings = splunk.clilib.cli_common.getConfStanza("acme", "default")

Use script methods to construct file paths. Here is a Python example:

abs_file_path = os.path.join(script_dir, rel_path)

Example (PowerShell):

$positionFile = Join-Path $positionFilePath $positionFileName

Managing initial data collection

Production servers often have large volumes of historical data that may not be worth collecting. When deploying your app, before starting to index, consider archiving and then deleting old data that should not be indexed, and placing historical data in a separate location.

ADM

To avoid causing license violations, it may also be prudent to batch uploads of historical data and spread it over time.

Distributed deployment considerations

Many Splunk Enterprise apps can be deployed in a variety of environments—from a stand-alone, single-instance server to a distributed environment with multiple servers. The Deployment topologies web page describes the different levels of distributed deployment. Choosing a topology depends on an organization's requirements. When developing a Splunk Enterprise app, it's necessary to understand the implications of a distributed architecture on app design, setup, management, and performance.

An app is a unit of modularization and, as such, inherently supports distributed deployment. Like server functionality, which can be all-in-one or distributed, an app can also include all the functionality or have its knowledge objects divided into different sets, to be deployed on different Splunk Enterprise instances. Because you might not know ahead of time the context your app will need to run in, it's a good idea to keep a few considerations in mind when designing your app:

  • Plan for a distributed architecture. You need to understand the distributed topologies and the roles of each server instance. Typically, this means that data input needs to be one of the roles, on a particular instance, and separate from other types of knowledge objects that can reside on other instances.
  • Consider the scale of deployment. There is likely a part of your app, such as the UI, that resides only on the search head and a part of your app that needs to reside on the entire indexing tier.
  • Consider the rate of data growth. Expect that the amount of data your app needs to process will be significantly greater in the future than it is today.

A logical app decomposition strategy is to separate your app into those parts that run on the search head and those parts that run on the indexing tier. The parts that go on the indexers let you ingest the data. The parts that go on the search head let you view the data. Dividing your app among dedicated nodes also lets other apps benefit from your implementation if they need to handle the same interface or implement the same logic. Partitioned apps can share things like a generally useful search string or a parser that correctly ingests data from a particular source.

The main issue in handling large amounts of data is running out of disk space. When an indexer runs out of space, it stops indexing, possibly resulting in data loss. When a search head runs out of space, it stops searching. There are a few strategies and mechanisms you can employ in your design to take advantage of the Splunk Enterprise distributed architecture and mitigate data loss.

ADM

You can view detailed performance information about your Splunk Enterprise deployment and configure platform alerts using the Distributed Management Console (DMC).

You can use data model acceleration, which moves the TSIDX idea to the indexing tier so disk space is not taken up on the search head. An advantage of data model acceleration over other acceleration methods is that if new data arrives that has not yet been summarized, it will be summarized automatically.

Use the same props.conf and transforms.conf files on each node to provide the field extraction and index time rules. This ensures uniformity of configuration across a large environment. An indexer will only read the part of the file related to indexing and a forwarder will only read the part of the file related to forwarding. In short, make distributed apps easier to create and manage by avoiding duplication.

As units of modularization, create small apps targeted for specific functionality:

  • Indexing rules. Anything search-time related that uses the props.conf and transforms.conf configuration files for field extraction.
  • Data collection. The code that ingests data from a particular source.
  • Dashboards. Saved searches that populate dashboards and the dashboards themselves, along with supporting JavaScript and CSS code.

By designing and implementing your app for a distributed architecture from the beginning, you are likely to produce a more useful and maintainable app. Some additional things to keep in mind:

  • Put all basic configuration in the indexes.conf file.
  • Be agnostic about the configuration management system for deploying and managing your app, which might be Deployment Server, a third-party tool like Chef or Puppet, or others.
  • On the search head, be deliberate about search optimization. In particular, delete as many columns and rows as possible before search pinch points (reports and the stats command).
  • On the indexer, use a focused index but use a macro to make it configurable so the index destination only needs to be changed in one place (see the sketch after this list).
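As a minimal sketch of that last point (the macro and index names are hypothetical), a macros.conf stanza can hold the index destination:

[acme_index]
definition = index=acme

Searches then reference the macro instead of a hard-coded index, so changing the destination only requires editing the macro definition:

`acme_index` sourcetype=acme:app:log | stats count by user_id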

Choose the SDK over UDP/TCP or REST

While other input mechanisms are inherently pull interfaces, the SDK, UDP/TCP, and REST input mechanisms can both pull and push. In the case of a pull, a Splunk Enterprise instance requests (pulls) the data. In the case of a push, the data transfer is initiated by the external system and not by Splunk Enterprise. The discussion of both approaches and implementation details of the Auth0 reference application is included in the Journey.

ARCH

Favor pulling data using an SDK over pushing.

While a push-style interface can be responsive to the time-sensitive requirements of the application, it is also possible that the external system providing the data can be blocked if the data rate is too high and there is insufficient buffering. In terms of performance, UDP/TCP transfer ranks as the highest performing method, followed by using the SDK, and a basic REST interface as the lowest performing method.

Another disadvantage of these protocols is that authentication and authorization are needed for the data to reach the server, but Splunk Enterprise does not currently support third-party authentication providers.
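If an external system must push data, a hedged sketch of the SDK approach, assuming the Splunk SDK for Python (splunklib) and placeholder connection details, looks like the following; it submits an event through the management port instead of using raw UDP/TCP or hand-rolled REST calls:

import splunklib.client as client

# Connection details are placeholders; the management port defaults to 8089
service = client.connect(host="splunk.example.com", port=8089,
                         username="admin", password="changeme")

# Submit a single event to the target index with an explicit source type
index = service.indexes["main"]
index.submit("2015-09-21 14:03:12.345 -0700 action=purchase status=success user_id=4821",
             sourcetype="acme:app:log", host="web01")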

Use CIM to normalize log data

One task your app might need to handle is working with conceptually related data that comes from different sources, or from the same source but whose representation has changed over time. While the data might look different, it is the same kind of data and you want to analyze and report on the data as if it were from a single source. The obvious way to solve this problem is to write a separate search for each different data representation. However, that approach is limited, especially when the number of sources or different representations is large.

For example, say you are monitoring antivirus program results produced by a number of different antivirus program vendors. Instead of searching the results produced by each vendor and then somehow associating those results, you prefer to normalize those results, where similar notifications and alarms map to the same event at the conceptual level, before handing the results to your app. Or, suppose the format or representation of the data changes so you have older data in one form and newer data in another form, but both old and new data contain essentially the same information. It would be helpful to have a mechanism that normalizes the results before they get to your app.

The Splunk Enterprise Common Information Model (CIM) is intended as an easy way to provide data normalization, and includes supported data models for common application domains. CIM is implemented as an add-on that normalizes your data and presents the data as knowledge objects for your app to process. See the "Knowledge Manager Manual" for an introduction to CIM and the "Common Information Model Add-on Manual" for a list of data models and more detailed information about how to use CIM. This document shows the tags you need for the model, how they're mapped, what each field and its data type are, and whether the field is required. You will also need to use the Pivot Editor, so read the "Data Model and Pivot Tutorial" to learn how to use the tool with data models. For an example of our experience with CIM in the Journey, review "Working with data: where it comes from and how we manage it."

CIM simplifies the work you and your app need to do by presenting the data at a conceptual level instead of needing to use all the data available to represent an entity. Some of that data might not be important to your app. CIM lets you ask simple, generalized questions about your data and generalizes your query across all data sources. It is easier to write a search against the generalization than against separate but similar data items. However, your data can often align with more than one model, so one of the first things you'll need to do is analyze your data and see which model is the best fit. This manual process requires you to inspect the data coming into your app.

An important fact about CIM is that the conceptual model you define for your data is applied at search time, not index time. So you can always go back to your raw data, if needed. You can still choose to accelerate your data model to get improved performance, trading off indexer load for fast access to search results.

A disadvantage of the CIM implementation is that the models are not structurally related entities, so you can't ingest the results of one data model into another data model. The models are separate entities populated at search time. Also, data is mapped to the least common denominator across all your data sources. CIM does not attempt to resolve all of the disparate data but, because associations are made at search time, the raw events are still available to your application.

A number of .conf files and search language constructs are involved in using CIM to define your data sources, tags, and data transformations. Your raw data sources are defined in the inputs.conf file. You perform extractions and lookups, define regular expressions, and apply tags to your data items using the props.conf, transforms.conf, eventtypes.conf, and tags.conf files. The next step is to use the datamodels.conf file to apply a model schema, optionally using acceleration to improve performance, and using constraints to select data and add meaning to your data. Define your searches in the savedsearches.conf file.

DEV

Appendix C provides a cheat sheet with the configuration file names associations to their functions.


DEV

Data model configuration is stored in datamodels.conf using a JSON structure. However, it's easier to manage this configuration in the UI.

Common steps for creating a CIM-compliant app include using the UI to:

  1. Create your new app. Be sure to make app configurations globally accessible. Also, it is easier to create CIM mapping for your app if you have the Splunk CIM add-on installed.
  2. Edit your transforms.conf file and map existing field names to applicable model field names. Optionally, provide regex definitions to extract fields. Choose the supported model that best matches your application. Alternatively, you can extract fields using the interactive field extractor or the props.conf file. A time-saving step is to verify your definitions using the Search app.
  3. Tag events by creating an eventtype and defining a tag for it, as shown in the sketch after these steps. You might need to create field aliases for your data to match what the model requires.
  4. In the UI, click Data Models, and then Pivot to view results. If the results were not what you expected, click Missing Extractions to help diagnose the problem. You might have a missing tag that is required for the model.
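For example, a hypothetical mapping for authentication events (the source type, eventtype name, and search are placeholders) pairs an eventtypes.conf stanza with a tags.conf stanza:

# eventtypes.conf
[acme_authentication]
search = sourcetype=acme:auth (action=login OR action=logout)

# tags.conf
[eventtype=acme_authentication]
authentication = enabled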

In summary, CIM is a good choice if you want to design your app for interoperability. Benefits are that you can do searches by calling one macro using tags instead of doing separate searches for all the commands. CIM permits different vendors and data sources to interoperate and creates relevant data that works with apps dedicated to solving specific problems.

Use HTTP Event Collector

A new feature in Splunk Enterprise 6.3, HTTP Event Collector lets you send data directly to Splunk Enterprise over HTTP. HTTP Event Collector can accept data from anywhere, as long as it's sent over HTTP and enclosed in a series of correctly formatted JSON data packets. You don't need to run any extra software on the client side. Though Splunk logging libraries that automate the data transmission process are available for Java, C#, and JavaScript, the client does not require any Splunk software to send data to HTTP Event Collector (or "EC") on Splunk Enterprise. EC uses specialized tokens, so you don't need to hard code your Splunk Enterprise credentials in your client app or supporting files. You can also scale out EC by using a load balancer to distribute incoming EC data evenly to indexers.

The basics of HTTP Event Collector on Splunk Enterprise are relatively simple:

  1. Turn on EC in Splunk Enterprise by enabling the HTTP Event Collector endpoint. It is not enabled by default.
  2. From the Splunk Enterprise instance, generate an EC token.
  3. On the machine that will log data to Splunk Enterprise, create a POST request, and set its authentication header to include the EC token.
  4. POST data in JSON format to the EC token receiver (see the sketch after these steps).
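The following is a minimal sketch of steps 3 and 4, assuming the Python requests library; the host name and token are placeholders, and verify=False is only appropriate for a test instance with a self-signed certificate:

import requests

# Placeholder host and EC token; EC listens on port 8088 by default
url = "https://splunk.example.com:8088/services/collector/event"
headers = {"Authorization": "Splunk 0a1b2c3d-0000-0000-0000-000000000000"}

payload = {
    "event": {"action": "purchase", "status": "success", "user_id": 4821},
    "sourcetype": "acme:app:json",
    "host": "web01",
}

response = requests.post(url, headers=headers, json=payload, verify=False)
response.raise_for_status()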

HTTP Event Collector is great for data input scenarios like the following:

  • Logging application diagnostics during development: Build logging into Java or C# apps that send debug data directly to EC running on Splunk Enterprise.
  • Logging in a distributed indexer configuration: Index huge amounts of EC data by taking advantage of built-in Splunk Enterprise load balancing capabilities.
  • Logging in a secure network: No external or cloud-based services are required to collect data.
  • Logging data from the browser: Build logging into webpages using JavaScript that sends usage data directly to EC running on Splunk Enterprise.
  • Logging data from automation scripts: Add logging to different stages of automated IT processes.
  • Sending data with specific source type: Assign incoming data a source type based on data source, token, and so on, and then define search time extractions and event types based on the source type.

For more information about HTTP Event Collector, see "Building in telemetry with high-performance data collection" or see "Introduction to Splunk HTTP Event Collector" on the Splunk Developer Portal.

Testing and staging data inputs

Inputs are distinctive and some can be quite idiosyncratic. Therefore, you should designate a part of your Splunk Enterprise deployment to testing and staging. Use the sandbox for testing all new inputs (custom built or obtained from Splunk or other providers) before putting them into the production environment.

TEST

Make sure that your sample data set is large enough and robust enough to detect edge cases.


If you are unable to procure a sandbox Splunk Enterprise deployment, at least use a staging index for testing to avoid polluting your main indexes. The staging index can be deleted when tests have completed.

Simulate your event stream

You usually want to begin testing the input handling part of your app early in the development cycle, and for that, you need data to be available. But sometimes a live feed isn't available, the data is not immediately available in the format or volume you need, or the data is available but you only want data from a certain time interval or with specific values to limit your testing time. In such cases, you can simulate your data. Splunk provides Eventgen, an event generation tool for you to use. The tool can replay events from a file or series of files, or randomly extract entries from a file, generating events at random intervals and changing particular fields or values according to your specification. You can use the Eventgen tool not only to generate known and random events for your app, but also to configure it to generate events that reflect natural usage patterns.

For your sample data, it's often convenient to start with an existing log file and use the eventgen.conf specification to do token replacement. You can also create a file of tokens and let the eventgen.conf replacement settings generate data the way you want it to look.
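As a rough sketch (the stanza name and values are placeholders, not the PAS settings), an eventgen.conf entry that samples events from a file and rewrites their timestamps might look like this; additional token.n settings can randomize other field values:

[sample.acme_app.log]
interval = 60
count = 20
outputMode = spool
token.0.token = \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
token.0.replacementType = timestamp
token.0.replacement = %Y-%m-%d %H:%M:%S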

DEV

Eventgen is included with the PAS reference app download and install script. We preconfigured it to generate events for the Off-Hours Access and Terminated Employee Access scenarios. Run the install_addons.sh script (or install_addons.ps1 script for Windows PowerShell) at $SPLUNK_HOME/etc/apps/pas_ref_app/bin/ to install the PAS app, and it will install and configure Eventgen at the same time. The script creates a symbolic link in the $SPLUNK_HOME/etc/apps/ folder to the $SPLUNK_HOME/etc/apps/pas_ref_app/appserver/addons/eventgen folder. When you restart Splunk Enterprise, Eventgen starts and immediately begins creating events according to the PAS settings.

Here are the basic steps you'll need to follow to install and set up Eventgen using your own custom settings:

  1. Download Eventgen from GitHub.
  2. Install the app by extracting the downloaded eventgen-dev.zip file to your $SPLUNK_HOME/etc/apps/ folder, and then rename the decompressed folder to eventgen.
  3. Create a /samples folder in the folder for the app that will use the Eventgen data: $SPLUNK_HOME/etc/apps/$MYAPP/samples (where $MYAPP represents your app folder).
  4. Set permissions for your app so it is accessible by all other apps.
  5. Put your sample data in the /samples folder you just created. This is the data that Eventgen uses to replicate data as input to your app.
  6. In the $SPLUNK_HOME/etc/apps/$MYAPP/local folder, create the eventgen.conf file. You can also copy and modify the $SPLUNK_HOME/etc/apps/eventgen/README/eventgen.conf.example file you downloaded earlier.
  7. Edit the first stanza of the eventgen.conf file to reference the sample data file in the /samples folder.
  8. Restart your Splunk Enterprise instance.

SHIP

If you intend to ship Eventgen configuration with your app, include it in the default folder, not local.


The Eventgen tool can be run as an add-on, or as a scripted or modular input inside your app. See the $SPLUNK_HOME/etc/apps/eventgen/README/eventgen.conf.spec file for a complete description of the options available to you for generating sample data. You can also view the eventgen.conf.spec file on GitHub.

In addition to the preconfigured Eventgen install provided by the PAS reference app as a part of the Journey, see a good, example-based description in the Eventgen tutorial on GitHub. Eventgen can also be run as a standalone utility, using eventgen.py in the $SPLUNK_HOME/etc/apps/eventgen/bin folder. The tutorial provides descriptions of the proper settings for the various modes of operation and for getting the data in the /samples folder into your app.

See how we used Eventgen in the "Test and sample data" section in the "Platform and tools: a kitbag for our Journey" chapter of the Journey. You can find other interesting examples in the Splunk Blog's Tips & Tricks category with a basic introduction to Eventgen, how to create random data in events, and how to sample events randomly from the data set.

Appendix C contains tips for troubleshooting Eventgen.