How to work with data models and pivots in the Splunk SDK for Java

In Splunk Enterprise 6.0 and later, you can use data models to create specialized searches of your datasets. Data models allow you to produce pivot tables, charts, and visualizations on the fly, based on column and row configurations that you select, and without necessarily having to know the Splunk Enterprise search language.

The Splunk SDK for Java version 1.3 and later includes support for data models and pivots. Using the SDK, you can enable your Java application to read and use data models that have been created on your Splunk Enterprise instance. You can also use the SDK to create data models, though support is limited to sending raw JSON containing the data model specification to Splunk Enterprise. The easiest way to create new data models continues to be logging directly into Splunk Enterprise with a browser.
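
The following is a minimal sketch of what sending a raw JSON specification might look like. It assumes the generic create(name, args) pattern that SDK entity collections provide and the data model REST endpoint's "description" field for the JSON payload; neither assumption comes from this topic, so check the REST API documentation before relying on it.

// A minimal sketch, not a complete recipe. The "description" argument and
// the JSON payload are assumptions; substitute a full data model specification.
Service service = ...; // an authenticated Service instance
String rawJson = "..."; // your raw JSON data model specification
Args args = new Args();
args.put("description", rawJson);
DataModel newModel = service.getDataModels().create("my_data_model", args);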

This topic contains the following sections:

  • About data models
  • Retrieve a data model
  • See the contents of a data model
  • Retrieve data model objects
  • Work with data model objects
  • Work with pivots
  • Accelerate data models and pivots
  • Data model example

About data models

Data models map semantic knowledge about one or more datasets. The data model encodes the domain knowledge that is necessary to generate specialized searches of those datasets. Data models are what enable you to use pivots to produce useful reports and dashboards without having to write the searches that generate them. Data models contain data model objects, which are essentially specifications for a dataset. Each data model object represents different datasets within the larger set of data that Splunk Enterprise indexes.

Data model objects inherit from other data model objects. Each object inherits either from one of the three base objects built into Splunk Enterprise (BaseEvent, BaseTransaction, and BaseSearch) or from another object in the same data model. Restricting inheritance to this fixed set of base objects and to objects within the same data model prevents complications involving access control lists (ACLs).

To learn more about data models, see the data model documentation in the Splunk Enterprise Knowledge Manager Manual.

Retrieve a data model

To retrieve an individual data model, you first get the collection of all data models that are accessible to a user with your credentials. After you've connected to Splunk Enterprise, use the getDataModels() method of your Service object to retrieve a DataModelCollection containing the set of data models. Then, use the get("<data_model>") method of the DataModelCollection object to retrieve the data model with the name specified by <data_model>.

Following is an example that retrieves the data model called "internal_audit_logs," which is part of every standard Splunk Enterprise install.

// Connect to splunkd. (See http://dev.splunk.com/view/java-sdk/SP-CAAAECX.)
Service service = ...; 
// Get the collection of data models.
DataModelCollection dataModelCollection = service.getDataModels();
// Get the specified data model.
DataModel dataModel = dataModelCollection.get("internal_audit_logs");

See the contents of a data model

A data model consists of several metadata fields (its internal name, its human-readable name, and a description), a setting that indicates whether acceleration is enabled (for more information, see "Accelerate data models and pivots" later in this topic), and a collection of data model objects.

Retrieve the data model's internal name using the DataModel class' getName() method. Retrieve its display name using the getDisplayName() method. Retrieve its description using the getDescription() method. For example:

System.out.println("Data model named " + dataModel.getDisplayName() + " (internal name: " + dataModel.getName() + ")");
System.out.println("Description:");
System.out.println(dataModel.getDescription());

Retrieve data model objects

To iterate over the data model objects within a data model, call the DataModel class' getObjects() method, as shown here:

for (DataModelObject object : dataModel.getObjects()) {
    System.out.println("Object: " + object.getDisplayName() + " (internal name: " + object.getName() + ")");
}

You can also use the DataModel class' containsObject() and getObject() methods to check for and retrieve individual objects by name. For example:

assert dataModel.containsObject("searches");
DataModelObject searches = dataModel.getObject("searches");
System.out.println("Object: " + searches.getDisplayName() + " (internal name: " + searches.getName() + ")");

Work with data model objects

Data model objects are hierarchical; they are arranged in parent-child relationships. The top-level object in any object hierarchy is referred to as a root object. Any object that descends from a root object is a child object.

Child objects inherit constraints and fields from their parent objects. Constraints narrow down the set of data represented by the object, while fields are name/value pairs associated with the object's dataset. Each child object can add constraints to the ones it inherits, further narrowing its dataset. Fields are what Pivot designers use to define pivot tables and charts. Child objects can also define new fields in addition to the fields they inherit from their parent object.

A data model object is uniquely identified by the full list of ancestors from which it inherits. You can get the lineage of an object with the DataModelObject class' getLineage() method, as demonstrated here:

int offset = 0;
for (String entry : searches.getLineage()) {
    for (int i = 0; i < offset; i++) System.out.print(" ");
    System.out.println(entry);
    offset += 2;
}

Fields are represented in the Splunk SDK for Java by instances of the DataModelField class. You can retrieve a field by name with the getField() method of the DataModelObject class, or retrieve an iterable collection of all the fields defined on the object with the getFields() method. Each field has a type, represented by a value of the FieldType enumeration, which is one of the following:

  • string
  • number
  • IPv4 address
  • timestamp
  • Boolean value
  • one of two internal types: object count and child count

The following example iterates over the fields of a data model object, printing each field's lineage (using the DataModelField.getOwnerLineage() method), its type (DataModelField.getType()), whether it can have multiple values (DataModelField.isMultivalued()), whether it must appear on every event in the object (DataModelField.isRequired()), whether it should be hidden from the user in a UI (DataModelField.isHidden()), and whether it is editable (DataModelField.isEditable()).

// Iterate over all fields.
for (DataModelField field : searches.getFields()) {
     System.out.println("Field " + field.getDisplayName() + " (internal name: " + field.getName() + ")");
     System.out.print("  defined on ");
     boolean first = true;
     for (String entry : field.getOwnerLineage()) {
          if (!first) 
               System.out.print(" -> ");
          System.out.print(entry);
          first = false;
     }
     System.out.println("  of type " + field.getType().toString());
     // Will this field potentially contain multiple values?
     System.out.println("  multivalued: " + field.isMultivalued());
     // Must this field appear on an event in this object?
     System.out.println("  isRequired: " + field.isRequired());
     // Should the field be displayed to the user if you are writing an
     // interface to a data model?
     System.out.println("  isHidden: " + field.isHidden());
     // Can you edit this field? Typically system fields or fields
     // inherited from other objects cannot be edited.
     System.out.println("  isEditable: " + field.isEditable());
}

// Or fetch a single field.
DataModelField time = searches.getField("_time");

A group of people working with sets of similar Splunk Enterprise queries can define a data model object that encapsulates the shared prefix of those queries, and then run the object's query followed by the rest of each particular query. You can think of a data model object as a kind of stored procedure.

For example, the following code uses the getQuery() method of the DataModelObject class to retrieve the five users who have run the most search jobs using the searches data model object we fetched in "Retrieve data model objects."

DataModelObject searches = ...; 

String query = searches.getQuery() + "| stats count by user | sort -count | head 5";
Job mostActiveUsers = service.createSearch(query);

This scenario is common enough that the DataModelObject class provides a convenience method for it, runQuery(java.lang.String querySuffix):

DataModelObject searches = ...;

Job mostActiveUsers = searches.runQuery("| stats count by user | sort -count | head 5");

You can also pass a JobArgs object to runQuery() as you would to JobCollection.create(), or omit the additional query string to fetch only the unmodified events that appear in the data model object.
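
As a minimal sketch of both variants (the exact overloads are assumed from the description above, so check the Javadoc for your SDK version):

// Build job arguments just as you would for JobCollection.create().
JobArgs jobArgs = new JobArgs();
jobArgs.setEarliestTime("-1d@d"); // limit the job to events since yesterday
jobArgs.setExecutionMode(JobArgs.ExecutionMode.NORMAL);

// Run the object's query with a suffix and job arguments (overload assumed).
Job recentActiveUsers = searches.runQuery("| stats count by user | sort -count | head 5", jobArgs);

// Or pass an empty suffix to fetch only the unmodified events in the object.
Job allObjectEvents = searches.runQuery("", jobArgs);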

Work with pivots

The Splunk SDK for Java provides a pivot table interface to the events in data model objects. The SDK gives you the same control over pivots that the pivot tool does within Splunk Enterprise. A pivot is created in several stages.

Note: For a full working example of pivots using the Splunk SDK for Java, see the fluent_pivot example in the examples directory of the SDK.

First, you define a PivotSpecification instance on a data model object (the code below gets one by calling createPivotSpecification() on the object), setting the fields to use to split rows, split columns, and calculate aggregates for each cell in the table. Then, you call the pivot() method on the PivotSpecification object. This sends a request to the Splunk server, which returns a set of SPL queries that represent the pivot; you can use those queries however you want. For example, using the searches data model object we retrieved in "Retrieve data model objects":

// Create a specification of a pivot on the searches data model object
// we retrieved previously.
PivotSpecification pspec = searches.createPivotSpecification();

... // configure pspec

Pivot pivot = pspec.pivot();
System.out.println("The query corresponding to this pivot is: ");
System.out.println("  " + pivot.getPrettyQuery());

Job pivotJob = pivot.run();

Configuring the pivot consists of adding four kinds of entities to it:

  • Filters restrict the events to be calculated on in the pivot.
  • Cell values describe an aggregate calculation to be done.
  • Row splits describe how to split the data along one axis before aggregating it in the cell values.
  • Column splits describe how to split the data along the other axis.

The entities are added to a PivotSpecification by calling the overloaded methods addFilter(), addCellValue(), addRowSplit(), and addColumnSplit() on your PivotSpecification object. The arguments to each depend on the type of field being added. Each of these methods is examined in detail in the following sections.

Filters

Filters restrict the events that will be processed by the pivot. They are added by invoking one of the overloaded PivotSpecification.addFilter() methods. There is one method each for type Boolean, string, IPv4 address, and number, and one that restricts the number of values of an aggregated field that will be allowed into the pivot. Each of these methods is demonstrated here:

PivotSpecification pspec = ...; // Defined previously.

// Examples of adding filters of each of the four types.
pspec.addFilter("??", BooleanComparison.EQUALS, true).
      addFilter("??", StringComparison.STARTS_WITH, "pic_").
      addFilter("??", IPv4Comparison.CONTAINS, "0.0").
      addFilter("??", NumberComparison.AT_MOST, 3);

// Example of limiting the number of distinct values of host
// to allow, sorted by aggregating the number of users from
// each host. This filter counts the distinct users that have
// produced searches from each host, sorts the hosts from
// largest number of distinct users to smallest, and only
// admits events with the top 50 hosts into the pivot.
pspec.addFilter("host", "user", SortDirection.DESCENDING,
                50, StatsFunction.DISTINCT_COUNT);

Cell values

The cells of a pivot table consist of aggregate calculations done on the events that pass through the filters and are assigned to that cell by the row and column splits. There is only one method for adding cell values (PivotSpecification.addCellValue()), but not all of the aggregating functions defined by the StatsFunction enumeration work with every field type. The functions that can be used with each type are:

  • string: LIST, DISTINCT_VALUES, FIRST, LAST, COUNT, DISTINCT_COUNT
  • IPv4 address: (same as string)
  • number: SUM, COUNT, AVERAGE, MAX, MIN, STDEV, LIST, DISTINCT_VALUES
  • timestamp: DURATION, EARLIEST, LATEST, LIST, DISTINCT_VALUES
  • child count or object count: COUNT
  • Boolean: none. Trying to use a Boolean-valued field for a cell value raises an IllegalArgumentException.

The addCellValue() method has the following signature:

PivotSpecification addCellValue(
    String field,                // the field to aggregate
    String label,                // a human-readable label to display for this aggregate
    StatsFunction statsFunction  // the aggregate function to use
)

Here are two examples of adding cell values to the PivotSpecification object pspec, which we retrieved in "Work with pivots."

pspec.addCellValue("host", "Relevant hosts", StatsFunction.DISTINCT_VALUES, false).
      addCellValue("exec_time", "Longest running job", StatsFunction.MAX, true);

Row splits

Row splits divide the data in a pivot table into rows before aggregates are calculated for each cell. A PivotSpecification object with more than one row split produces a separate row for each combination of distinct values across the row splits. So if one row split produces two rows, abcd=0 and abcd=1, and another produces two more, wxyz=a and wxyz=b, the pivot has four rows: (abcd=0, wxyz=a), (abcd=1, wxyz=a), (abcd=0, wxyz=b), and (abcd=1, wxyz=b).

Row splits are added to PivotSpecification objects with one of the four overloaded addRowSplit() methods.

To split into a separate row for each distinct value of a field, for fields of type string or number, use the following method, where field is the name of the field to split and label is a human-readable label to display in a visual representation:

addRowSplit(String field, String label)
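
For example, a sketch of the abcd/wxyz combination described above (the field names are hypothetical string fields standing in for fields on your own object):

// Hypothetical field names; each call adds one row split, and the pivot
// produces one row per combination of distinct values of the two fields.
pspec.addRowSplit("abcd", "Value of abcd").
      addRowSplit("wxyz", "Value of wxyz");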

You can add row splits on timestamp-valued fields using the following method, which splits the field's values into ranges of a precision specified by binning:

addRowSplit(String field, String label, TimestampBinning binning)

The values of the enumeration TimestampBinning are AUTO, YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND.
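
For example, a short sketch that bins the _time field (seen earlier in "Work with data model objects") into hour-wide ranges:

// Split rows by ranges of the _time timestamp, one row per hour.
pspec.addRowSplit("_time", "Hour of event", TimestampBinning.HOUR);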

Boolean-valued fields take the same field and label arguments in their method, but the method also requires two String arguments for the labels to display in each row if the field value is true or false:

addRowSplit(String field, String label, String trueDisplayValue, String falseDisplayValue)
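
For example, a sketch with a hypothetical Boolean field named is_realtime:

// Hypothetical Boolean field; rows display "real-time" when the field is
// true and "historical" when it is false.
pspec.addRowSplit("is_realtime", "Search type", "real-time", "historical");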

Finally, number-valued fields can be split into ranges, similar to timestamp-valued fields. The method for this case takes four optional arguments in addition to the field and label arguments; pass null for any argument you want to omit.

addRowSplit(String field, String label,
    Integer start, // The start of the first range
    Integer end,   // The end of the last range
    Integer step,  // The width to use for ranges
    Integer limit  // The maximum number of ranges to split into
)

For example, if you want to bin only values between 0 and 100 and have no more than ten bins, you would call:

addRowSplit("??", "??", 0, 100, null, 10);

If you want bins that are 15 wide and start at 12, you would call:

addRowSplit("??", "??", 12, null, 15, null);

Column splits

Column splits are the complement to row splits. They divide events that pass through the filters into sets before aggregates are calculated for each cell. The methods for column splits are identical to those for row splits, but they lack a label argument.
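
For example, the following sketch splits columns by each distinct value of user and by ranges of exec_time. The single-argument form is assumed from the description above (the row split signature minus its label); the numeric form matches the one used in "Data model example."

// Split columns by each distinct value of "user" (form assumed), and by
// numeric ranges of "exec_time" with start, end, step, and limit arguments.
pspec.addColumnSplit("user").
      addColumnSplit("exec_time", 0, null, null, 4);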

Accelerate data models and pivots

Data models take advantage of the built-in support for accelerated searches and aggregations in Splunk Enterprise 6. Within Splunk Enterprise, acceleration entails running a search job on a regular schedule and caching its results for use in the data model and any pivots on the data model. Acceleration is enabled with the Splunk SDK for Java by calling setAcceleration(true) on a data model (a DataModel object), and then pushing the changes to the server. Be aware that only public data models (that is, where sharing != "user") can be accelerated.

DataModel dataModel = service.getDataModels().
                              get("internal_audit_logs");

dataModel.setAcceleration(true); // Enable acceleration...
dataModel.update();              // ...and push the change to the server.

Acceleration has two additional settings: the earliest time, relative to now, for which the acceleration cache should be maintained (for example, the last week, the last two months, or the last three hours), and the cron schedule on which the acceleration job should run. The following example keeps the acceleration cache for the last two months and runs the acceleration job every minute:

dataModel.setEarliestTime("-2mon");
dataModel.setAccelerationCronSchedule("* * * * *");
dataModel.update();

Enabling acceleration on a data model will accelerate all objects that inherit from BaseEvent in the data model. Objects that inherit from BaseTransaction or BaseSearch cannot be accelerated, and will be unaffected by enabling acceleration.

You can also do ad hoc acceleration, running and managing the caching job yourself. Call the createLocalAccelerationJob() method on a particular data model object to return an acceleration job, as demonstrated here on the data model object we retrieved in "Retrieve data model objects." Note how the ad hoc job is canceled using Job.cancel() when you are finished querying the results:

DataModelObject searches = ...;

Job accelJob = searches.createLocalAccelerationJob();
// or createLocalAccelerationJob("-2days") or similar to set an earliest time

PivotSpecification pspec = searches.createPivotSpecification();
... // configure pspec
Pivot pivot = pspec.pivot(accelJob); // pivot using accelJob as the cache

... // use the pivot queries here

accelJob.cancel(); // Cancel the job when you have finished using its cached results.

If you want, you can also pass an arbitrary time-series index file (tsidx) namespace to the pivot() method to use as a cache. Enabling acceleration on a whole data model is equivalent to calling pivot() with the name of the data model as the namespace:

pspec.pivot(searches.getAccelerationNamespace());

// is the same on an accelerated data model as

pspec.pivot();

Data model example

The following code demonstrates how to perform a few basic actions with one of a data model's objects. The example:

  1. Connects to Splunk Enterprise. (The full details of this step are not shown, but they are provided in "How to connect to Splunk Enterprise.")
  2. Retrieves a data model (in this case, "internal_audit_logs," which is included with each standard Splunk Enterprise install).
  3. Retrieves the "searches" data model object from the data model.
  4. Runs a query on the data model object, appending "| head 5" to the query to return just the first five events.
  5. Creates an XML results reader using the results of the query, and then prints the results using the reader.
  6. Creates a pivot by using the data model object as input.
  7. Splits the pivot's rows by distinct user and its columns into no more than four execution-time ranges, and specifies a list of the distinct search queries for each cell.
  8. Retrieves the pivot's search queries.
  9. Prints the human-readable Splunk Processing Language (SPL) query that implements the pivot.
  10. Runs the pivot.
  11. Creates an XML results reader using the results of the pivot, and then prints the results using the reader.

// Connect to Splunk Enterprise.
// Complete code: http://dev.splunk.com/view/java-sdk/SP-CAAAECX
Service service = Service.connect(...);

// Retrieve a data model to work with.
DataModel dataModel = service.getDataModels().get("internal_audit_logs");

// Retrieve a data model object to work with.
DataModelObject searches = dataModel.getObject("searches");

// Run a query, appending "| head 5" to get just the first five events.
Job firstFiveEntries = searches.runQuery("| head 5");
while (!firstFiveEntries.isDone()) {
     Thread.sleep(100);
}

// Print results using an XML reader.
ResultsReaderXml results = new ResultsReaderXml(firstFiveEntries.getResults());
for (Event event : results) {
    System.out.println(event.toString());
}

// Create a pivot on the data model object.
PivotSpecification pivotSpecification = searches.createPivotSpecification();

// Split the events in the object into groups with:
//   distinct user
//   ranges of execution time (but no more than four bins)
// Produce a list of the distinct search queries for each cell.
pivotSpecification.addRowSplit("user", "Executing user").
                   addColumnSplit("exec_time", null, null, null, 4).
                   addCellValue("search", "Search Query", StatsFunction.DISTINCT_VALUES, false);

// Retrieve the pivot's corresponding queries.
Pivot pivot = pivotSpecification.pivot();

// Print the pivot's SPL query.
System.out.println("Query for binning search queries by execution time and executing user:");
System.out.println("  " + pivot.getPrettyQuery());

// Run the pivot.
Job pivotJob = pivot.run();
while (!pivotJob.isDone()) {
    Thread.sleep(100);
}

// Print results using an XML reader.
results = new ResultsReaderXml(pivotJob.getResults());
for (Event event : results) {
    System.out.println(event.toString());
}