python SDK _raw value part missing

ruivapps · ‎03-19-2013

I'm using the Python SDK to query Splunk.
Here is what the data looks like.

When I run the following query in Splunk Web, _raw is displayed correctly:
index=vgw "Session 25907" source="20130315.log" "end reason"|table _raw

result:
2013-03-15 08:42:41 : Session 25907 VGWSession:: end reason: ep disconnect

However, when I run the same query (without the table command) from the Python SDK, following the "oneshot search" and "normal search" examples at http://dev.splunk.com/view/SP-CAAAEE5#oneshotjob, it returns:

OrderedDict([('_bkt', 'vgw~490~1EF8E9B1-5238-48F9-8B5A-2B768B4DB0E8'), ('_cd', '490:29401397'), ('_indextime', '1363362162'), ('_raw', '2013-03-15 08:42:41 : '), ('_serial', '0'), ('_si', ['splunk4', 'vgw']), ('_sourcetype', 'vgw'), ('_time', '2013-03-15T08:42:41.000-07:00'), ('host', 'vgw5'), ('index', 'vgw'), ('linecount', '1'), ('source', '20130315.log'), ('sourcetype', 'vgw'), ('splunk_server', 'splunk4')])

The _raw field doesn't contain the whole event, only the first part of it.

Does anyone know why, or has anyone seen the same thing? How can I fix it?

hexx · ‎03-19-2013

This is caused by the insertion of special tags in the event raw data to highlight matched search terms in Splunkweb. This is not an appropriate default behavior for the SDK result-fetching method and there is currently a bug opened to fix this (internal reference DVPL-1519).

Fortunately, avoiding this problem is fairly trivial: One simply needs to pass segmentation='none' as an argument to the job.results() method. Here's a code example that will fetch the 5 most recent events from the _internal index matching the term "queue" but won't truncate the _raw field right where the matched term appears:

    #!/usr/bin/python

    import splunklib.client as client
    import splunklib.results as results

    # Connect to splunkd on the default host and port
    service = client.connect(username='admin',password='b33rm3')

    # Blocking search job that returns only the _raw field
    kwargs_blocking = { "field_list": "_raw", "earliest_time": "-461", "exec_mode": "blocking", "max_count": "80" }
    query = "search index=_internal source=*/metrics.log group=queue queue | head 5"

    job = service.jobs.create(query, **kwargs_blocking)

    # segmentation='none' keeps the highlighting tags out of _raw
    rr = results.ResultsReader(job.results(segmentation='none'))
    for result in rr:
        print result

The important part here is job.results(segmentation='none').
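To illustrate why highlighting tags can cut _raw short, here is a minimal sketch using only the Python standard library. The XML shape (a `<v>` element with `<sg h="1">` children around matched terms) is an assumption made for illustration, and this is not the SDK's actual parser. It shows only that a reader which takes the text before the first child element stops right where the first matched term begins.

```python
import xml.etree.ElementTree as ET

# Hypothetical result fragment: matched search terms wrapped in <sg> tags.
# The element and attribute names are assumptions for illustration.
SAMPLE = (
    '<v>2013-03-15 08:42:41 : <sg h="1">Session</sg> <sg h="1">25907</sg>'
    ' VGWSession:: <sg h="1">end</sg> <sg h="1">reason</sg>: ep disconnect</v>'
)

elem = ET.fromstring(SAMPLE)

# .text holds only the text that precedes the first child element.
truncated = elem.text

# itertext() walks the element and all of its children, so nothing is lost.
full = "".join(elem.itertext())

print(repr(truncated))
print(repr(full))
```

With segmentation disabled, no such tags would be inserted, so the leading text and the full text would be identical.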

hexx · ‎03-20-2013

Which version of Splunk are you running this code against? The 'segmentation' argument for the /services/search/jobs/{sid}/events endpoint is only available on Splunk 5.0 and onwards.
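A quick way to rule the version out is to compare the server's version string against 5.0 before relying on the argument. The helper below is a hypothetical sketch, not part of the SDK; obtaining the string from `service.info["version"]` is an assumption about what the connected service reports.

```python
def supports_segmentation(version):
    """Return True if a Splunk version string such as '5.0.2' is 5.0 or later."""
    # Only the major component matters for the 5.0 cutoff.
    major = int(version.split(".")[0])
    return major >= 5


# Usage with a live connection (assumed, not executed here):
#   supports_segmentation(service.info["version"])
print(supports_segmentation("4.3.5"))
print(supports_segmentation("5.0.2"))
```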

ruivapps · ‎03-19-2013

Adding segmentation='none' still gives the same result.
Here is the code I'm testing with:

    #!/opt/python/bin/python
    import sys, ConfigParser
    from splunklib.binding import HTTPError
    import splunklib.client as client
    import splunklib.results as results

    def getconf():
        configuration = {}
        config = ConfigParser.RawConfigParser()
        config.read('splunk.cfg') 
        for session in config.sections(): 
            if not configuration.has_key(session): 
                configuration[session]={} 
            for options in config.options(session): 
                if not configuration[session].has_key(options): 
                    configuration[session][options]=config.get(session, options) 
        return configuration['splunk']

    def format_time(timestamp):
        """
        splunk format: 2013-03-18T12:00:00.000-00:00
        """
        datetime=timestamp.split('/')
        timestamp={'earliest_time' : '%s-%s-%sT00:00:00.000-00:00' %(datetime[2], datetime[0], datetime[1]), 
                   'latest_time'   : '%s-%s-%sT23:59:59.999-00:00' %(datetime[2], datetime[0], int(datetime[1])+1)}
        return timestamp

    def search(configuration, query, timestamp):
        configuration.update(timestamp)
        service = client.connect(**configuration)
        try:
            job=service.jobs.create(query, **configuration)
        except HTTPError, e:
            print ("query '%s' is invalid:\n\t%s" %(query, e.message))
            return
        rr = results.ResultsReader(job.results(segmentation='none'))
        return rr

    if __name__ == "__main__":
        sys.argv.append('42248278')
        sys.argv.append('3/15/2013')
        if len(sys.argv)<3:
            print "%s callid, datetime (1/1/2013)" %sys.argv[0]
            sys.exit()

        query="""search index=vgw "Session 25907" source="/opt/ec/vgw/logs/vgw_g2m_live_vgw5.sjc.expertcity.com_2-20130315.log" "end reason" """
        result = search(getconf(), query, format_time(sys.argv[2]))
        for item in result:
            print item['_raw']
        query="""search index=vgw "Session 25907" source="/opt/ec/vgw/logs/vgw_g2m_live_vgw5.sjc.expertcity.com_2-20130315.log" "end reason" | rex field=_raw "reason(?<reason>.*)"|table * """
        result = search(getconf(), query, format_time(sys.argv[2]))
        print '*'*80
        for item in result:
            print item['reason']

I replaced the original service.jobs.oneshot call with service.jobs.create and added segmentation='none', but the result is still the same.
The second query is the workaround I'm using now to get the data out: I use rex to extract the text into a new field.

Here is the output:

[root@asg1-mpostgres splunk]# ./xx.py 
2013-03-15 08:42:41    : 
********************************************************************************
: ep disconnect
[root@asg1-mpostgres splunk]#

The first line of output is the _raw data (with segmentation='none').
The second is the result of the rex workaround.
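For reference, the rex pattern in the workaround can be reproduced on the full event text in plain Python. This is a sketch only: Python spells the named group as `(?P<reason>...)` where the search language uses `(?<reason>...)`, and the sample string is the event as shown in the web UI.

```python
import re

# Full event text as displayed in the web UI.
raw = "2013-03-15 08:42:41 : Session 25907 VGWSession:: end reason: ep disconnect"

# Capture everything after the literal word "reason" into a named group.
match = re.search(r"reason(?P<reason>.*)", raw)
reason = match.group("reason")

print(reason)
```

The captured value keeps the leading colon and space, which matches the second line of the output above.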
