
lbitincka

February 22nd, 2008

Delimiter based KV extraction - advanced

If you’ve read my previous post on delimiter based KV extraction, you might be wondering whether you could do more with it (Anonymous Coward did). Well, yes you can, and I am going to cover the “advanced” cases here. Before covering the capabilities, as in other posts, I will first go over some observations and examples.

Observations
1. Header-body. Some applications, for various reasons, format their log files using a header section and a body section. The header usually describes how the fields are organized in each logged event, while the body consists of the logged events, usually one per line, with field values delimited as described in the header. W3C and CSV formats come to mind; see the examples below.
2. Single-delimiter. Other applications use a single delimiter both to separate keys from values and to separate one key-value pair from the next. While this is not very common, it has been observed in the field.

Data Examples
The following header-body sample, as you can probably guess, is from an Exchange server. The header section contains, among other things, the list of field names. The names are separated from each other by the same delimiter that separates values in the body section, in this case a tab character (even though our blogging platform chooses to mangle tabs to spaces - gotta love it !!!).

# Message Tracking Log File
# Exchange System Attendant Version 6.5.7638.1
# Fields: time client-ip cs-method sc-status
14:13:11 10.1.1.9 HELO 250
14:13:13 10.1.1.9 MAIL 250
14:13:19 10.1.1.9 RCPT 250
14:13:29 10.1.1.9 DATA 250
14:13:31 10.1.1.9 QUIT 240
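
To make the header-body idea concrete, here is a minimal Python sketch (not Splunk's implementation). It assumes the field names sit on a line starting with "# Fields:" and that the original log uses real tab characters. The function name parse_header_body is made up for illustration.

```python
def parse_header_body(lines, delim="\t"):
    """Yield one dict per body line, keyed by the names in the '# Fields:' header."""
    prefix = "# Fields:"  # assumed header marker, taken from the sample above
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Header line: remember the field names if this is the Fields line.
            if line.startswith(prefix):
                fields = line[len(prefix):].strip().split(delim)
            continue
        if fields is None or not line:
            continue  # no field list seen yet, or a blank line
        # Body line: pair each value with the field name in the same position.
        yield dict(zip(fields, line.split(delim)))


sample = [
    "# Message Tracking Log File",
    "# Exchange System Attendant Version 6.5.7638.1",
    "# Fields: time\tclient-ip\tcs-method\tsc-status",
    "14:13:11\t10.1.1.9\tHELO\t250",
    "14:13:31\t10.1.1.9\tQUIT\t240",
]
events = list(parse_header_body(sample))
print(events[0])
```

Each body line becomes a dictionary such as time=14:13:11, client-ip=10.1.1.9 and so on, with the header doing the naming.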

The following example shows how a single delimiter can be used to list fields. It is pretty easy for us, as humans, to recognize the key-value pairs:

"url http://splunk.com referer http://dev.splunk.com ip 10.10.10.10"

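For the single-delimiter sample above, a rough Python sketch of the idea is to split on the delimiter and treat tokens alternately as key and value. This is only an illustration under the assumption that neither keys nor values contain the delimiter. The helper name parse_single_delim is hypothetical.

```python
def parse_single_delim(event, delim=" "):
    """Pair up alternating tokens: key, value, key, value, ..."""
    # Drop the surrounding double quotes and any empty tokens from repeated delimiters.
    tokens = [t for t in event.strip('"').split(delim) if t]
    # Even positions are keys, odd positions are values; zip ignores a trailing unpaired key.
    return dict(zip(tokens[0::2], tokens[1::2]))


pairs = parse_single_delim('"url http://splunk.com referer http://dev.splunk.com ip 10.10.10.10"')
print(pairs)
```
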
Enabling header-body kv/extract
The delimiter based KV extraction solves the header-body problem by

Read More...

February 12th, 2008

Delimiter based key-value pair extraction

As described in my previous post, key-value pair extraction (or more generally structure extraction) is a crucial first step to further data analysis. While automatic extraction is highly desirable, we believe empowering our users with tools to apply their domain knowledge is equally important. To this end, this post introduces one of the simplest forms of key-value pair extractions (KV-extraction) - delimiter based extraction.

Observation

Most logged events contain a list of key-value pairs (e.g. attribute list, method call values etc) in a context-dependent, well-defined format. An example of a well-defined format: “key-value pairs are separated from each other using ‘;’ while the key is separated from the value using ‘=’”. More generally, well-defined attribute listing formats are not confined to logging; they’re part of every event-driven application with flexible attribute order, e.g. URL GET parameter lists, HTTP request/response headers, email headers etc. In most applications the delimiters are single characters which are least likely to be part of the key or value. Whenever the key or value contains any of the delimiters, it is normally enclosed in literal-defining characters, usually double quotes (").

Definition: delimiter based KV extraction
Let’s first define three character classes:
1. [pairdelim] - non-empty list of characters used to separate key value pairs from each other. (chars after value, before next key)
2. [kvdelim] - non-empty list of characters used to separate the key from the value. (chars after key, before next value)
3. [quoter] - list of characters used to enclose a literal - currently *only* quotes are supported and this variable is not configurable

Thus we can formally define a key-value pair list as follows:

kvlist = <key>[kvdelim]<value>([pairdelim]<key>[kvdelim]<value>)*
key = <string>|<quoter><string><quoter>
value = <string>|<quoter><string><quoter>
quoter = "
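
One way to read the grammar above in code is as a split on [pairdelim] followed by a split on [kvdelim], with delimiters inside quotes left alone. The Python sketch below is an illustration only, not Splunk's code. The function names, the sample string and the choice to skip malformed pairs are my assumptions.

```python
def split_unquoted(text, delims, quoter='"'):
    """Split text on any character in delims, ignoring delimiters inside quotes."""
    parts, buf, in_quote = [], [], False
    for ch in text:
        if ch == quoter:
            in_quote = not in_quote  # toggle quoted state, keep the quote for now
            buf.append(ch)
        elif ch in delims and not in_quote:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return [p for p in parts if p]  # drop empty pieces from runs of delimiters


def extract_kv(text, pairdelim=";", kvdelim="=", quoter='"'):
    """Layer 1: split into candidate pairs. Layer 2: split each pair into key and value."""
    result = {}
    for pair in split_unquoted(text, pairdelim, quoter):
        kv = split_unquoted(pair, kvdelim, quoter)
        if len(kv) != 2:
            continue  # assumption: silently skip anything that is not exactly key + value
        key, value = (s.strip().strip(quoter) for s in kv)
        result[key] = value
    return result


fields = extract_kv('user=alice;"full name"="Alice; the admin";status=ok')
print(fields)
```

Note how the ‘;’ inside the quoted value does not end the pair, which is exactly what the quoter class is for.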

Thus, delimiter KV-extraction can be achieved by a two layer tokenization/splitting process:
1. Split on the pair delimiter to extract candidate

Read More...

January 18th, 2008

Key-value pair extraction definition, examples and solutions….

Most of the time, logs contain data that humans can easily recognize as either completely or semi-structured information. Being able to extract structure from log data is a necessary first step to further, more interesting, analysis. While it would be great to automatically extract the structure from all log data, Splunk cannot rival the brain’s performance at this time. It is, however, able to tap into your brain for help :) Read on ……

Problem definition:
Extract structured information (in key/field=value form) from un/semi-structured log data.
Note: for the purpose of this post, key and field are used interchangeably to denote a variable name.

Problem examples:
Splunk debug message (humans: easy, machine: easy)

12-03-2007 13:51:55.114 DEBUG SearchPipelinePerformance - processor=save queryid=_1196718714_619358 executetime=0.014secs
ideal structured information to extract:
processor=save
queryid=_1196718714_619358
executetime=0.014secs

Splunk tries to make it easy for itself to parse its own log files (in most cases)
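
To show why this case rates as "machine: easy", here is a small Python sketch using a regular expression. The pattern is my own assumption about what counts as a key and a value (word characters, then ‘=’, then a run of non-space characters). It is not how Splunk does it.

```python
import re

# Assumed pattern: key is a run of word characters, value is a run of non-space characters.
KV_RE = re.compile(r"(\w+)=(\S+)")

line = ("12-03-2007 13:51:55.114 DEBUG SearchPipelinePerformance - "
        "processor=save queryid=_1196718714_619358 executetime=0.014secs")

# findall returns (key, value) tuples, which dict() turns into a mapping.
fields = dict(KV_RE.findall(line))
print(fields)
```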

Output of the ping command (humans: easy, machine: medium)

64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=2.522 ms
ideal structured information to extract:
bytes=64
from=192.168.1.1
icmp_seq=0
ttl=64
time=2.522 ms

An interesting pattern to note here is that there is no consistent field-value delimiter, nor field-value order. In the “from” field the authors have chosen to use a space as a delimiter, while for “icmp_seq”, “ttl” and “time” they’ve chosen the equal sign. For the “bytes” field they’ve chosen to place it after the value (yes, they might have also intended for it to mean bytes - the data unit) while for the rest they’ve chosen field-name followed by field-value. Admittedly, some might think the current format is prettier than the following consistent log line which could easily be parsed by machines. (Who thought log files were optimized for prettiness !?)

bytes=64, from=192.168.1.1, icmp_seq=0, ttl=64, time=2.522 ms
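
As a quick illustration of how little work the consistent line above needs, here is a Python sketch. It assumes ‘,’ never appears inside a value and that the first ‘=’ in each pair separates key from value.

```python
line = "bytes=64, from=192.168.1.1, icmp_seq=0, ttl=64, time=2.522 ms"

# Split into pairs on ',', then split each pair once on '=' into key and value.
fields = dict(pair.strip().split("=", 1) for pair in line.split(","))
print(fields)
```

The value of "time" keeps its embedded space ("2.522 ms"), because only the comma ends a pair.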

NetScreen log (humans: medium, machine: hard)

%MD% %DD%

Read More...

