
Multi-line field extraction in props.conf

phemmer
Path Finder

I have a cluster consisting of a single master and 2 indexers (peers). I am trying to add a field extraction for haproxy's logs. The field extraction is very long, so I would like to split it up over multiple lines for legibility. However, when I do this it doesn't seem to work.

The interesting part is that splunk cmd btool props list syslog shows exactly the same output whether the extraction is broken up over multiple lines or written as a single line. It also shows up the same in the web UI under Manager -> Fields -> Field extractions.

Any ideas why it's not working when split up?


Multi-line (doesn't work)

# props.conf
EXTRACT-haproxy_httplog = haproxy\b.*? (?<client_ip>\d+\.\d+\.\d+\.\d+):(?<client_port>\d+)\
 \[[^\]]+\] (?<frontend_name>\S+) (?<backend_name>[^/]+)/(?<server_name>\S+)\
 (?<request_time>\d+)/(?<queue_time>\d+)/(?<connect_time>\d+)/(?<response_time>\d+)/(?<total_time>\d+)\
 (?<status_code>\d+) (?<response_size>\d+) \S+ \S+ (?<flags>\S{4})\
 (?<process_connections>\d+)/(?<frontend_connections>\d+)/(?<backend_connections>\d+)/(?<server_connections>\d+)/(?<retries>\d+)\
 (?<server_queue_size>\d+)/(?<backend_queue_size>\d+)\
(?: \{(?<request_headers>[^\}]*)\})?(?: \{(?<response_headers>[^\}]*)\})?\
 "(?<method>\S+)\s+(?<uri>[^"]+?)(?: HTTP\S+)?"


# splunk cmd btool props list syslog | grep EXTRACT-haproxy_httplog
EXTRACT-haproxy_httplog = haproxy\b.*? (?<client_ip>\d+\.\d+\.\d+\.\d+):(?<client_port>\d+) \[[^\]]+\] (?<frontend_name>\S+) (?<backend_name>[^/]+)/(?<server_name>\S+) (?<request_time>\d+)/(?<queue_time>\d+)/(?<connect_time>\d+)/(?<response_time>\d+)/(?<total_time>\d+) (?<status_code>\d+) (?<response_size>\d+) \S+ \S+ (?<flags>\S{4}) (?<process_connections>\d+)/(?<frontend_connections>\d+)/(?<backend_connections>\d+)/(?<server_connections>\d+)/(?<retries>\d+) (?<server_queue_size>\d+)/(?<backend_queue_size>\d+)(?: \{(?<request_headers>[^\}]*)\})?(?: \{(?<response_headers>[^\}]*)\})? "(?<method>\S+)\s+(?<uri>[^"]+?)(?: HTTP\S+)?"

Single line (does work)

# props.conf
EXTRACT-haproxy_httplog = haproxy\b.*? (?<client_ip>\d+\.\d+\.\d+\.\d+):(?<client_port>\d+) \[[^\]]+\] (?<frontend_name>\S+) (?<backend_name>[^/]+)/(?<server_name>\S+) (?<request_time>\d+)/(?<queue_time>\d+)/(?<connect_time>\d+)/(?<response_time>\d+)/(?<total_time>\d+) (?<status_code>\d+) (?<response_size>\d+) \S+ \S+ (?<flags>\S{4}) (?<process_connections>\d+)/(?<frontend_connections>\d+)/(?<backend_connections>\d+)/(?<server_connections>\d+)/(?<retries>\d+) (?<server_queue_size>\d+)/(?<backend_queue_size>\d+)(?: \{(?<request_headers>[^\}]*)\})?(?: \{(?<response_headers>[^\}]*)\})? "(?<method>\S+)\s+(?<uri>[^"]+?)(?: HTTP\S+)?"


# splunk cmd btool props list syslog | grep EXTRACT-haproxy_httplog
EXTRACT-haproxy_httplog = haproxy\b.*? (?<client_ip>\d+\.\d+\.\d+\.\d+):(?<client_port>\d+) \[[^\]]+\] (?<frontend_name>\S+) (?<backend_name>[^/]+)/(?<server_name>\S+) (?<request_time>\d+)/(?<queue_time>\d+)/(?<connect_time>\d+)/(?<response_time>\d+)/(?<total_time>\d+) (?<status_code>\d+) (?<response_size>\d+) \S+ \S+ (?<flags>\S{4}) (?<process_connections>\d+)/(?<frontend_connections>\d+)/(?<backend_connections>\d+)/(?<server_connections>\d+)/(?<retries>\d+) (?<server_queue_size>\d+)/(?<backend_queue_size>\d+)(?: \{(?<request_headers>[^\}]*)\})?(?: \{(?<response_headers>[^\}]*)\})? "(?<method>\S+)\s+(?<uri>[^"]+?)(?: HTTP\S+)?"

bmacias84
Champion

@phemmer, OK, I will answer the question of why your regex does not work. A regex matches spaces and line breaks literally, so while you are trying to make it more readable, the regex engine sees that you want to match on those line returns. To get around this, use free-spacing mode by adding (?x) at the beginning of your pattern. The implication is that you now have to match whitespace (spaces, tabs, and line breaks) explicitly with the \s character class.
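For illustration, here is a rough sketch of the first few lines of the extraction rewritten in free-spacing mode (untested against live data; the only changes are the (?x) flag and \s in place of the literal spaces, with field names taken from your original pattern):

# props.conf (sketch only)
EXTRACT-haproxy_httplog = (?x) haproxy\b.*?\s(?<client_ip>\d+\.\d+\.\d+\.\d+):(?<client_port>\d+)\
 \s\[[^\]]+\]\s(?<frontend_name>\S+)\s(?<backend_name>[^/]+)/(?<server_name>\S+)\
 \s(?<request_time>\d+)/(?<queue_time>\d+)/(?<connect_time>\d+)/(?<response_time>\d+)/(?<total_time>\d+)

The remaining lines would continue the same way, with every literal space replaced by \s.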

But here is how I would approach your extraction with delimiters. I built this based on your regex without having the raw data, so I'd expect a few glitches with the delimiters, but they are fixable. Basically I am doing multikv.


#transforms.conf
# main DELIMS extraction for the events
# can't tell if you are space or tab delimited. I named the fields as best I can, but you can change them.
[haproxyfields]
DELIMS = " "
FIELDS = haproxy_id,client_info,date_time,frontend_name,backend,request_info,status_code,response_size,val1,val2,flags,connection_info,queue_info,req_header,resp_header,method,uri_info
CLEAN_KEYS = true


# the following stanzas extract values from the fields created by the previous extraction
[clientinfofields]
SOURCE_KEY = client_info
DELIMS = ":"
FIELDS = client_ip,client_port

[backendfields]
SOURCE_KEY = backend
DELIMS = "/"
FIELDS = backend_name,server_name

[requestinfo]
SOURCE_KEY = request_info
DELIMS = "/"
FIELDS = request_time,queue_time,connection_time,response_time,total_time

[connectioninfo]
SOURCE_KEY = connection_info
DELIMS = "/"
FIELDS = process_connections,frontend_connections,backend_connections,server_connections,retries

[queueinfo]
SOURCE_KEY = queue_info
DELIMS = "/"
FIELDS = server_queue_size,backend_queue_size


# You can still use regex for those extractions that need it.
[uriinfo]
SOURCE_KEY = uri_info
REGEX = (?<uri>[^"]+?)


#props.conf
[haproxy]
MAX_TIMESTAMP_LOOKAHEAD=40
NO_BINARY_CHECK=1
SHOULD_LINEMERGE=false
TZ=US/Pacific
REPORT-haproxyfieldextract=haproxyfields,clientinfofields,backendfields,requestinfo,connectioninfo,queueinfo,uriinfo


phemmer
Path Finder

Performance is why I added the haproxy\b at the beginning of the regex; that way the regex fails right there instead of trying to match all the other components.
Unfortunately the haproxy log format cannot be modified. The next version of haproxy supports customizing the log format, but it's not out yet.


bmacias84
Champion

Long regexes can be very costly if they are not written effectively. Also, if you are using field discovery, every field extraction and transform containing a regex for your source will be run. You could also try changing your haproxy log delimiter settings to use a pipe "|" instead of spaces. I'll be talking to one of the Splunk SDK developers tonight to see if there is a better way to make field extractions more readable.
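If you could change the log format like that, the main transform from my earlier example might collapse to something like this (hypothetical, reusing the same field names; assumes haproxy could actually emit pipe-delimited logs):

#transforms.conf
[haproxyfields]
DELIMS = "|"
FIELDS = haproxy_id,client_info,date_time,frontend_name,backend,request_info,status_code,response_size,val1,val2,flags,connection_info,queue_info,req_header,resp_header,method,uri_info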


phemmer
Path Finder

It's very close. The only problem is the req_header and resp_header fields (the ones in the curly braces {}). They aren't always there, so when they're missing the fields after them shift over. Also the enclosed values can contain spaces.
What I might do is use a regex, but only extract the groups of data (client_info, for example), and then use a SOURCE_KEY stanza on that to split it. I'm not sure whether this would be more efficient, or whether, since it has to be a regex anyway, it's better to just do it all in one shot.
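Something like the following rough sketch is what I have in mind (the stanza names are made up, the grouping regex only covers the first few fields, and I haven't tested it against real data):

#transforms.conf
# regex pulls out whole groups, delimiters split them afterwards
[haproxy_groups]
REGEX = haproxy\b.*? (?<client_info>\d+\.\d+\.\d+\.\d+:\d+) \[[^\]]+\] (?<frontend_name>\S+) (?<backend>[^/]+/\S+)

[haproxy_clientinfo]
SOURCE_KEY = client_info
DELIMS = ":"
FIELDS = client_ip,client_port

[haproxy_backend]
SOURCE_KEY = backend
DELIMS = "/"
FIELDS = backend_name,server_name

#props.conf
[syslog]
REPORT-haproxy_httplog = haproxy_groups,haproxy_clientinfo,haproxy_backend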


bmacias84
Champion

@phemmer, Let me know if this helps or if I am totally off base about the regex free-spacing mode or extracting your fields.


phemmer
Path Finder

@bmacias84, while I am still interested in why breaking the extraction regex over multiple lines doesn't work, if you were to provide an alternate answer explaining how to accomplish the same goal, I would accept it. Though keep in mind, part of the difficulty with a delimiter is that the log fields in braces ({}) don't always exist.


bmacias84
Champion

@phemmer, I'm going to say yes, it is. I have very similar complex data from another system, and I would be happy to show you how. You would delimit once, then use SOURCE_KEY to delimit again against your first search-time extraction. You can use SOURCE_KEY multiple times.


theouhuios
Motivator

You might want to try doing it this way:

EXTRACT-xxx-system = (?i)^\S+\s+\S+\s+(?P<timestamp>\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+):\s+(?P<system>\S+)
EXTRACT-ip_address = (?i) host (?P<ip_address>[^ ]+)

I am not sure why yours wouldn't work, but whenever I have to do it in a multi-line way, I split it up like above.
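Applied to your haproxy extraction, that might look something like this (just a sketch reusing your field names and assuming the standard haproxy HTTP log layout; each EXTRACT runs independently against the raw event):

# props.conf
EXTRACT-haproxy_client = haproxy\b.*? (?<client_ip>\d+\.\d+\.\d+\.\d+):(?<client_port>\d+)
EXTRACT-haproxy_backend = \[[^\]]+\] (?<frontend_name>\S+) (?<backend_name>[^/]+)/(?<server_name>\S+)
EXTRACT-haproxy_request = "(?<method>\S+)\s+(?<uri>[^"]+?)(?: HTTP\S+)?"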


phemmer
Path Finder

@bmacias84, heh, it's nowhere near capable of being parsed by a delimiter. You can see a lot of examples at http://code.google.com/p/haproxy-docs/wiki/Logging . Regex extraction really is the only way.


bmacias84
Champion

This data looks space or tab delimited. If that is correct, why not use a field extraction with delimiters in your transforms.conf? Then use SOURCE_KEY in transforms.conf to extract the multi-valued fields. This will result in a performance increase.
