Searching the data

This chapter introduces you to a search process model that can be applied to almost any application, and suggests optimal search language facilities to use at each step of the process. It covers basic search operations and optimizations, advanced techniques, and basic troubleshooting guidance.

Search is at the heart of any app; it is the fundamental tool for extracting the knowledge you're interested in from the large amount of available data. The Splunk Enterprise search language provides powerful constructs for sifting through your data once you have ingested and indexed it. It provides an extensive set of commands, arguments, and functions that enable you to filter, modify, reorder, and group your search results.

Additional documentation that you might find helpful includes the Search Manual for a comprehensive description of all the things you can do using the search facilities and the Search Reference for a description of search language syntax. A particularly handy cheat sheet is the search language Quick Reference Guide.

Search concepts

Here are key search terms and concepts.

Search

The Search Processing Language (SPL) is the language you use to specify a search of your data. Generally, a search is a series of commands and arguments, chained together with the pipe ("|") character. For example, the following retrieves indexed access_combined events (which correspond to the HTTP web log) that contain the term "error" and, for those events, reports the most common URI values:

error sourcetype=access_combined | top uri
DEV

The way commands are pipe-delimited is analogous to how a Unix shell or Windows PowerShell connects line-processing programs together with pipes.

Syntactically, searches are made up of five basic components:

  • Search terms - what are we looking for?
    Keywords, phrases, Booleans, and so on.
  • Commands - what should we do with the results?
    Create a chart, compute statistics, evaluate, apply conditional logic and format, and so on.
  • Functions - how should we chart, compute, or evaluate?
    Get a sum, get an average, transform the values, and so on.
  • Arguments - are there variables we should apply to this function?
    Calculate average value for a specific field, convert milliseconds to seconds, and so on.
  • Clauses - how should we group the results?
    Get the average of values for the price field grouped by product, and so on.

The following diagram represents a search broken into its syntax components:
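
For example, the following search contains all five components (assuming the access_combined data includes a bytes field):

sourcetype=access_combined error | stats avg(bytes) by host

Here, sourcetype=access_combined and error are the search terms, stats is the command, avg() is the function, bytes is its argument, and the by host clause groups the results.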

Search command

The search command is the simplest and most powerful SPL command. It is invoked implicitly at the beginning of a search.

When it's not the first command in a search, the search command can filter a set of results from the previous search. To do this, use the search command like any other command: a pipe character followed by the explicit command name. For example, if we augment the previous search, it will search for the web log events that have the term "error," find the top URIs, and filter out any URIs that occur only once.

error sourcetype=access_combined | top uri | search count>1
DEV

Keyword arguments to the search command are not case-sensitive, but field names are.

Event

An event is a single entry of data; specifically, it is a set of values associated with a timestamp. For example, here is an event in a Web activity log file:

173.26.34.223 - - [01/Jul/2009:12:05:27 -0700] "GET /trade/app?action=logout HTTP/1.1" 200 2953

While many events are short, a line or two, some can be long: for example, a full-text document, a configuration file, or a Java stack trace. Splunk Enterprise uses line-breaking rules to determine how to delineate events for display in search results.

DEV

Splunk Enterprise is intelligent enough to handle most multiline events correctly by default. For the rare cases where it doesn't, see "Configure event line breaking" for information on how to customize line-breaking behavior.

Field

Fields are searchable name/value pairs in event data. As events are processed at index time and search time, fields are automatically extracted. At index time, a small set of default fields are extracted for each event, including "host," "source," and "sourcetype." At search time, Splunk extracts a wider range of fields from the event data, including obvious name-value pairs, such as "user_id=jdoe," and user-defined patterns.

Host

A host is the name of the device where an event originates. A host provides an easy way to find all data originating from a particular device.

Source/Sourcetype

A source is the name of the file, stream, or other input from which a particular event originates. For example, "/var/log/messages" or "UDP:514." Sources are classified by sourcetype, which can either be well known, such as "access_combined" in HTTP Web server logs, or can be created on the fly when a source is detected with data and formatting not previously seen. Events with the same sourcetype can come from different sources. Events from the "/var/log/messages" file and events from a syslog input on udp:514 can both have "sourcetype=linux_syslog".

Eventtype

Eventtypes are cross-referenced searches that categorize events at search time. Eventtypes are essentially dynamic tags that get attached to an event if it matches the search definition of the eventtype. For example, if you defined an eventtype called "problem" that has a search definition of "error OR warn OR fatal OR fail", any time your search result contains "error," "warn," "fatal," or "fail," the eventtype value is "problem".
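
For example, once you have saved the "problem" eventtype described above, you can search on it like any other field. A minimal sketch:

eventtype=problem | stats count by sourcetype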

Tag

A tag is a field value alias. For example, if two host names refer to the same computer, you could give both host values the same tag, such as "hal9000", and when you search for the "hal9000" tag, events for all hosts having that tag are returned.

ARCH

Tags are useful when normalizing data at search time. See "Tagging our Events" in "Working with data: Where it comes from and how we manage it" for an example of tagging in action.


A process model for search

To better understand the way search works in Splunk Enterprise, consider the following figure, which represents a conceptual search process model.

Any search methodology can apply this model to some degree, depending on the complexity of your data and the particular knowledge you want to extract from it.

When you add raw data, Splunk Enterprise breaks the data into individual events, timestamps the events, and stores them in an index. The index can later be searched and analyzed. By default, your data is stored in the "main" index, but you can create and specify other indexes for different data inputs.

You want to select only the data you care about from the large amount of available data, and you can use a four-stage process model to do that. The stages are displayed along the horizontal axis: Filter (raw), Distill, Transform, and Filter (transformed) again, with each stage refining your search results in a way that's efficient for that stage. Let's consider each stage in detail.

The first Filter stage represents the most basic level of data extraction, which itself has different levels of extraction along the vertical axis. A row is an event and a column is a field within an event. Filtering rows before filtering columns is the most efficient approach, and filtering should be done early to reduce the number of events you have to work with.

Time, or a time range, is usually the most effective filter. After time, select an index or a list of indices appropriate for your organization. Your organization might have data allocated to different indexes according to data sensitivity, retention periods, or some other organizational grouping. Next, filter on the default fields of timestamp, which shows when the event occurred, sourcetype, which identifies the event, and host plus source, which identifies the source location of the event. Finally, select any other desired terms of interest to you, including field values, keywords, and phrases. You can apply Boolean or comparison operations to your terms to extract data of even greater value.
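
For example, a first-stage filter might look like the following sketch (the host value is hypothetical):

index=main sourcetype=access_combined host=www1 error earliest=-24h

This search restricts the time range to the last 24 hours, selects a single index and sourcetype, narrows to one host, and then filters on the keyword "error".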

Use the Distill stage to further refine your search so that it keeps only those fields you want. Use the fields command only when it provides some benefit. For example, use fields to enumerate only those fields you want:

sourcetype=access_combined | fields clientip, action, status, categoryid, product_name

This only applies if your search returns a set of events. Also, remember that field names are case sensitive and must exactly match the field name in your data.

The Transform stage gives you the opportunity to transform search result data into the data structures required for visualizations, or to compute additional custom fields. Example transforms are:

  • The stats and contingency commands.
  • Data manipulation commands, like eval, rename, and others.
  • Statistical visualization commands, like timechart, chart and table.

Optionally, the Transform stage may include a Data Enrichment substage that involves adding or cross-referencing data from an external source (for example, a .csv file) using lookups.

Using the transformed data from the previous stage, the Filter (transformed) stage filters the data again, using the where command, to reduce the results to exactly the data you want, in the format you want. You could append another search command, but the where command is considered the more powerful option.
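
For example, a minimal sketch that transforms web events and then filters the transformed results with where:

sourcetype=access_combined | stats count by status | where count > 100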

The following sections about searching will provide you with more detail about the facilities and language constructs available to you in each stage.

Become familiar with fundamental search patterns

The following are some examples of constructs you will commonly use to search your data. These are only some of the most common features of the search language, so it might be helpful to keep the Quick Reference Guide at hand to see the related facilities available to you.

Search on keywords

Everything in your data is searchable. Using nothing more than keywords and a time range, you can learn a lot about your data. The simplest search is to specify one or more keywords to look for in your data. Search returns all events that include the keywords anywhere in the raw text of the event's data.

Because a keyword-only search returns every event matching your keyword, the next obvious step is to narrow your search, unless you actually want to know the total number of occurrences. A first step in narrowing search results might be to use Boolean operators. The search language supports the following Boolean operators: AND, OR, XOR, and NOT.

The AND operator is always implied between search terms so you don't need to specify it.

DEV

For efficient searches use inclusion (AND, OR, and XOR) rather than exclusion (NOT), which is the more expensive operation.

Another simple way to narrow your search results is to use comparison operators (such as !=, <=, and >=).
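
For example, the following sketch combines a Boolean operator with a comparison operator (the status field is typical of access_combined data):

sourcetype=access_combined (error OR failed) status>=500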

PERF

Use the Search Job Inspector to see timing differences between various ways of formulating your searches and to optimize them.


Use the appropriate search mode

Splunk Enterprise supports several search modes:

Fast mode favors performance over completeness. Conversely, Verbose mode favors completeness over performance. Smart mode (the default) is a combination of the two, designed to give you the best results for your search: field discovery is turned on (as in Verbose mode) to extract all possible fields, while reporting behaves as it does in Fast mode.

Avoid using "All time" searches

By default, the Search app uses "All time" in the time range picker. While that might be fine when exploring your data interactively, it is something you'll want to avoid in your apps (at least as a default option) because it drastically impacts performance.

PERF

Change the default value from "All time" to a specific preset or range in your app. Consider disabling the "All time" option altogether to make sure that an inexperienced user doesn't bring the whole search head to its knees (negatively impacting the experience for all other users) with a very expensive search.


Distill searches

To get an even better understanding of your data, you'll need to use fields. When you run simple searches based on arbitrary keywords, Splunk Enterprise matches the raw text of your data. Often, a field is a value at a fixed position in your event line. A field can also be a name-value pair, where there is a single value assigned to the field. A field can also be multivalued, where it occurs more than once in an event and can have a different value at each occurrence.

Fields are searchable name-value pairs that distinguish one event from another because not all events have the same fields and field values. Searches with fields are more targeted and retrieve more exact matches against your data. Some examples of fields are clientip for IP addresses accessing your Web server, _time for the timestamp of an event, and host for the domain name of a server. Splunk Enterprise extracts some default fields for each event at index time. These are the host, source, and sourcetype fields. At search time, the search extracts other fields that it identifies in your data.

A familiar multivalue field example is the email address. While the From field contains a single email address value, the To and Cc fields can have multiple email address values. Field names are case sensitive but field values are not.

Here's an example of a search to check for errors, which doesn't use fields:

error OR failed OR severe OR (sourcetype=access_* (404 OR 500 OR 503))

And the same search using fields:

error OR failed OR severe OR (sourcetype=access_* (status=404 OR status=500 OR status=503))

The first example is likely to return more results than the second. When you run simple searches based on arbitrary keywords, the search matches the raw text of your data. When you add fields to your search the search looks for events that have those specific field/value pairs. The second example returns only those results that have those specified values for the status field.

DEV

While Splunk Enterprise is pretty intelligent about extracting known fields, there are mechanisms for you to guide it at search time with regular expressions (see rex and erex commands). You can also use the Interactive Field Extractor (IFX), which provides a nifty UI for doing so. See "Build field extractions with the field extractor" for more info.

When considering performance, note that searches that only operate on fields perform faster because a full-text search is no longer needed.

DEV

Apply powerful filtering commands as early in your search as possible. Filtering to one thousand events and then ten events is faster than filtering to one million events and then narrowing to ten.


Transform searches

So far, we've only obtained search results that match keywords we've provided, and applied some simple yet powerful filtering to the results. But, if you want to get even more insight into your data and start to abstract what the data is telling you, you'll want to transform your search results using search language commands.

Two general types of transformation commands available to you are transformation by aggregation and transformation by annotation. The stats command is an example of transformation by aggregation. Given a number of events that you want to summarize, you can use stats functions like avg(), min(), max(), and count() to aggregate the values of a particular field found in all those events.

The eval command lets you create a new field based on a computation that uses other fields. Examples include the isnum(), isstr(), isnull(), and len() functions. The general form is eval <newfield>=<computed expression on existing fields>. You can even apply conditional logic using the if() conditional to reassign the value of a field based on its current value. You can also use the eval command to annotate a field by evaluating an expression and saving the result in a field. When you use eval with stats, you need to rename the resulting field:

stats count, count(eval(action="purchase")) AS Purchases
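
And here's a sketch that uses eval with the if() conditional to derive a new field and then aggregate on it (the outcome field name is arbitrary):

sourcetype=access_combined | eval outcome=if(status>=400, "error", "success") | stats count by outcome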

Enrich data with lookup tables

You can map field values in your events to field values used in external sources using lookup tables. This allows you to enrich your event data by adding more meaningful information and searchable fields to it. For example, say you have a mapping between error codes and textual descriptions of what they mean, or a mapping between email addresses and Active Directory usernames and groups.

Lookups let you reference fields in an external table that match fields in your event data. A lookup table can be a static CSV file, a KV store collection, or the output of a Python script. You can also use the results of a search to populate the CSV file or KV store collection and then set that up as a lookup table. For more information about field lookups, see Configure CSV and external lookups and Configure KV store lookups. Using a script you can connect to a relational database as well (the blog post "Enriching Data with DB Lookups" explains how).

After you configure a fields lookup, you can invoke it from the Search app or with the lookup command. Elements of a dashboard view can also be populated with lookup data.
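
For example, here's a sketch that invokes a hypothetical lookup named http_status, which maps HTTP status codes to descriptions (the lookup name and the status_description output field are assumptions):

sourcetype=access_combined | lookup http_status status OUTPUT status_description | top status_description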

Use subsearches prudently

A subsearch is a search with a search pipeline as an argument. Subsearches are delineated by brackets and evaluated first. The result of the subsearch is used as an argument in the outer search. The main use cases for subsearches include finding events that:

  • Have matching values for a common field in the results of a subsearch (for example, looking for the products that are selling in multiple sales regions, or looking for users who are logged into both the VPN and the corporate network).
  • Do not have matching values for a common field in the results of a subsearch (for example, identifying products that are selling in one region but not another, or looking for tailgaters—those users who logged into the corporate network but didn't badge in).
  • Have matching values for a field with a different name in the results of a subsearch (for example, looking for failed login attempts on the corporate network and the web servers).

Here's a walkthrough of a scenario that uses a subsearch (the bracketed portion of the search below). Here we want to find all tailgaters during the last hour:

sourcetype=winauthentication_security
(EventCode=540 OR EventCode=673)
NOT [
    search sourcetype = "history_access"
    EventDescription=Access
    | dedup User
    | fields User
]

1. Perform the subsearch and find all users who badged into a building.
2. Remove the duplicates.
3. Return a list of only user names—this step is needed for performance. Without it, the outer search will match all fields.
4. Perform the outer search for users who are NOT found in the subsearch.

DEV

To troubleshoot subsearches, run both searches independently to be sure events are being returned. Additionally, you can use the Search Job Inspector.


DEV

The rule of thumb is to use subsearches only as a last resort. In terms of performance, subsearches are expensive, and are limited by both time (by default, only the events found during 60 seconds are returned) and the size of the result set (up to 10,000 entries by default), which limits their use on large datasets.


Check for optimization opportunities

Here are a few things you'll want to keep in mind that are fundamental to making your searches as efficient as possible.

Filter early

Filter early in the pipeline on the fields you already have, before transforming, to remove data you don't want. If you compute new fields, filter on them as soon as they are available.

Eliminate rows before columns

Removing rows is usually faster than removing columns, so remove rows first. When removing columns, use the table or fields commands. Note that the fields command performs better than the table command in a distributed search.

Avoid real-time searches

Real-time searches are expensive and should not be used if your needs are not real time. In performance-sensitive contexts, you can specify a periodic search with a small time interval in the dashboard, to approximate a real-time search.

Avoid expensive commands

Expensive commands include subsearches, append, appendcols, transaction, fillnull, and join. While these are intuitive, handy, and powerful tools, they are computationally expensive. When possible, use the more efficient eval and stats commands instead.

DEV

For a good example of how to replace append and join with more performant commands, see this post.
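
As one common pattern (the sourcetypes and field names here are hypothetical), instead of joining two searches on a shared order_id field, you can often search both sourcetypes at once and let stats group the results by that field:

(sourcetype=web_orders) OR (sourcetype=web_payments) | stats values(order_status) AS order_status, values(payment_status) AS payment_status by order_id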

Favor search-time field extractions over index-time field extractions

Search-time field extractions will yield better performance than index-time field extractions.

Use the TERM() operator

The TERM() operator can give you a significant performance boost by treating the operand as a single term in the index, even if it contains characters that are usually recognized as breaks or delimiters.
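
For example, an IP address contains periods, which are normally treated as breakers; wrapping the address in TERM() matches it as a single indexed term (the address shown is hypothetical):

sourcetype=access_combined TERM(10.9.165.8)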

Minimize scope

Scope your search as narrowly as possible: restrict the time range and search as few indexes as possible.

Take advantage of advanced search techniques

Some of the more advanced search capabilities that we'll discuss here can be powerful mechanisms for getting even more meaningful information from your data.

Accelerate your data

To efficiently report on large volumes of data, you need to create data summaries that are populated by the results of background runs of the search upon which the report is based. When you next run the report against data that has been summarized in this manner, it should complete significantly faster because the summaries are much smaller than the original events from which they were generated.

Splunk Enterprise provides three data summary creation methods:

  • Report acceleration - Uses automatically-created summaries to speed up completion times for certain kinds of reports.
  • Data model acceleration - Uses automatically-created summaries to speed up completion times for pivots.
  • Summary indexing - Enables acceleration of searches and reports by manually creating summary indexes that exist separately from the main indexes.

The primary differences between these acceleration methods are:

  • Report acceleration and summary indexing speed up individual searches on a report-by-report basis, by building collections of precomputed search result aggregates.
  • Data model acceleration speeds up reporting for the specified set of attributes (fields) that you define in a data model.
  • Report acceleration is good for most slow-completing reports that have 100,000 or more hot bucket events that meet the qualifying conditions for report acceleration.

Report acceleration is preferable over summary indexing for the following reasons:

  • Kicking off report acceleration is as easy as clicking a checkbox and selecting a time range.
  • Splunk Enterprise automatically shares report acceleration summaries with similar searches.
  • Report acceleration features automatic backfill.
  • Report acceleration summaries are stored alongside the buckets in your indexes.

In general, however, data model acceleration is faster than report acceleration.

For a complete overview of the acceleration mechanisms, read the "Overview of summary-based search and pivot acceleration."

Report acceleration

Report acceleration is used to accelerate individual reports. It's easy to set up for any transforming search or report that runs over a large dataset.

When you accelerate a report, Splunk Enterprise runs a background process that builds a data summary based on the results returned by the report. When you next run the search, Splunk Enterprise runs it against this summary instead of the full index. Because this summary is smaller than the full index and contains pre-computed summary data relevant to the search, the search should complete much more quickly than it did without report acceleration.

For a report to qualify for acceleration, its search string must use a transforming command, such as chart, timechart, stats, or top. Additionally, if there are any other commands before the first transforming command, they must be streamable, which means that they apply a transformation to each event returned by the search.
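
For example, a report based on the following sketch qualifies for acceleration, because timechart is a transforming command:

sourcetype=access_combined | timechart count by status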

To learn more, read the Accelerate reports documentation.

Data model acceleration

Data model acceleration creates summaries for the specific set of fields you want to report on, accelerating the dataset represented by the collection of fields instead of the dataset represented by a full search. Data model acceleration summaries take the form of time-series index files (TSIDX), which have the .tsidx file extension. These summaries are optimized to accelerate a range of analytical searches involving a specific set of fields, which are the set of fields defined as attributes in the accelerated data model.

Use the tstats command to perform statistical queries on indexed fields in TSIDX files.
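
For example, a minimal tstats sketch that operates only on indexed fields:

| tstats count where index=main by sourcetype

Because tstats reads the indexed summary files rather than the raw events, a search like this typically completes much faster than an equivalent stats search over raw data.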

Data model acceleration makes use of High Performance Analytics Store (HPAS) technology, which is similar to report acceleration in that it builds summaries alongside the buckets in your indexes. Also, like report acceleration, persistent data model acceleration is easy to enable in the UI: select the data model you want to accelerate and select a summary range. Once you do this, Splunk Enterprise starts building a summary that spans the specified range. When the summary is complete, any pivot, report, or dashboard panel that uses an accelerated data model object runs against the summary instead of the full array of raw data whenever possible, and you should see a significant improvement in performance.

To learn more, read the Accelerate data models documentation.

Summary indexing

Use summary indexing on large datasets to efficiently create reports that don't qualify for report acceleration. With summary indexing, you set up a search that extracts the precise information you frequently want. It's similar to report acceleration in that it involves populating a data summary with the results of a search, but with summary indexing the data summary is actually a special summary index that is built and stored on the search head. Each time Splunk Enterprise runs this search it saves the results into a summary index that you designate. You can then run searches and reports on this significantly smaller summary index, resulting in faster reports.

Summary indexing allows the cost of a computationally expensive report to be spread over time.

To read more about summary indexing, see the Summary Indexing documentation.

Use prediction

What if you have data with missing fields, or fields whose values you suspect might not be accurate, such as human-entered or otherwise noisy data? You would like to use a predicted value to validate an actual value or to fill in a missing value. For such cases, you can train search to predict the missing or suspect fields, using the train command.

Your first step is to train search to learn what a field value is expected to be based on known fields. For example, to train search to predict the gender given a name:

index=_internal | fields name, gender | train name2gender from gender

Behind the scenes, this search builds a model, name2gender, which you can use in subsequent searches to predict missing or inaccurate gender from the name field. Once trained, you can use the model in subsequent searches to guess the suspect field:

index=_internal | guess name2gender into gender

SPL also has the predict, trendline, and x11 commands to help with your prediction and trending computations.
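
For example, a sketch that charts hourly web hits and uses predict to forecast the next few data points (the span and field name are arbitrary choices):

sourcetype=access_combined | timechart span=1h count AS hits | predict hits future_timespan=5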

For an example of how to use search to predict fields, download the Predict App from Splunkbase.

Use correlation

Finding associations and correlations between data fields, and operating on multiple search results, can provide powerful insights into your data. The following commands can be used to build correlation searches:

  • append - Appends subsearch results to current results.
  • appendcols - Appends the fields of the subsearch results to current results, first results to first result, second to second, and so on.
  • appendpipe - Appends the result of the subpipeline applied to the current result set to results.
  • arules - Finds association rules between field values.
  • associate - Identifies correlations between fields.
  • contingency - Builds a contingency table for two fields.
  • correlate - Calculates the correlation between different fields.
  • diff - Returns the difference between two search results.
  • join - SQL-like joining of results from the main results pipeline with the results from the subpipeline.
  • lookup - Explicitly invokes field value lookups.
  • selfjoin - Joins results with itself.
  • set - Performs set operations (union, diff, intersect) on subsearches.
  • stats - Provides statistics, grouped optionally by fields. See also, Functions for stats, chart, and timechart.
  • transaction - Groups search results into transactions.
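
For example, here's a sketch that uses transaction to group web events from the same client into sessions and then reports the average session duration:

sourcetype=access_combined | transaction clientip maxpause=10m | stats avg(duration) AS avg_session_duration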

Use custom search commands

You can use the Splunk SDK for Python to extend the Splunk Enterprise search language. The GeneratingCommand class lets you query any data source, such as an external API, to generate events. The custom search command is deployed as a Splunk Enterprise application and invoked by piping events from the application into search. The general search command syntax is:

| <customCommand> [[<parameterName>=<parameterValue>] ...]

If your custom search queries an external API, you'll need credentials to access the API.

You need to implement the generate() function in your command, deriving from the GeneratingCommand class and adding logic that creates events and outputs the events to Splunk Enterprise. Be sure to install and run setup for the SDK. A template is provided with the SDK that you can use as a starting point for creating the new custom command. See the "Building custom search commands in Python part I - A simple Generating command" blog post for an example of how to create a new command starting from the template.

Like all apps, the custom search command must also reside in its own folder in the $SPLUNK_HOME/etc/apps folder. The yield line is where you specify your event:

yield {'_time': <timeValue>, 'event_no': <eventNumber>, '_raw': <eventData> }
  • _time. The event timestamp.
  • event_no. An example of a generated field (a field that can be selected in the field picker); here, it is the message count of the event.
  • _raw (optional). The event data.

Once you've implemented the custom search command, you can test it from the Python command line, passing the parameters your command expects and the data source:

python <customCommand>.py __EXECUTE__ <parameterName>=<parameterValue> < <dataSource>
TEST

A convenient data source for testing is a CSV file.



Troubleshoot your searches

A convenient way to test and troubleshoot your search string is to use the Search app. If you have a complex search with multiple piped segments, try removing pipe segments one at a time until you find the source of the error. Divide and conquer!

Another handy tool to use is the Search Job Inspector. It allows you to:

  • Examine overall statistics of the search: events processed, results returned, and processing time.
  • Determine how the search was processed.
  • Troubleshoot a search's performance.
  • Understand the impact of specific knowledge object processing, such as tags, event types, and lookups, within the search.

Here's an example of the execution costs for a search:

index=pas | dedup customer_name

The search commands component of the total execution costs might look something like this:

Note: You can inspect a search as long as it exists and has not expired even if it hasn't completed.


The command.search component, and everything under it, gives you the performance impact of the search command portion of your search, which is everything before the pipe character. The command.prededup component gives you the performance impact of processing the results of the search command before passing them into the dedup command.

PERF

To evaluate performance, don't use the results count over time. A more telling metric is the scan count over time. In the example above: resultsCount/time = 93 / 4.387 = 21.20 eps (events per second); scanCount/time = 65,507 / 4.387 ≈ 15K eps.

10-20K eps is considered to be good performance.