I'm on a standalone Splunk environment. I've got some .csv files, and I'd like to use indexed extractions for them as well as pseudo-/anonymize the field "Meldender" contained in them (either via SEDCMD or with a transform).
The way I understand it, indexed extractions take place before a SEDCMD/transform is applied (based on the detailed diagram here), which is why I end up with masked data in my raw event while the (pre-SEDCMD extracted) field still contains the initial unmasked data if I simply use INDEXED_EXTRACTIONS = csv and SEDCMD.
As I would prefer not to change to search-time field extraction, I would like to change the indexed field with a transform as well. I thought this could be done with a simple props and transforms, so I tried the following. Here is my props.conf:
[stoer_csv_meta]
FIELD_NAMES = ...
KV_MODE = none
NO_BINARY_CHECK = true
PREAMBLE_REGEX = ...
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = ...
INDEXED_EXTRACTIONS = csv
SEDCMD-meldender = a working sed
TRANSFORMS-meld = meld
and here my transforms.conf:
[meld]
REGEX = (.{0,2}).*?(.?)
FORMAT = Meldender::$1-X-$2
WRITE_META = true
SOURCE_KEY = field:Meldender
[accepted_keys]
Meld = Meldender
This made the field "Meldender" multivalued, containing the original data as the first entry and my masked data second. I had a look at the actual content of _meta, and indeed it seems the above settings just add the entry "Meldender::masked data".
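To make the multivalue effect concrete, here is a rough Python sketch of what seems to happen. It assumes _meta is a space-separated string of key::value tokens; the sample values (Datum, Ort, "Mustermann") are made up for illustration, and Python's re module stands in for Splunk's regex engine.

```python
import re
from collections import defaultdict

def parse_meta(meta):
    """Split a space-separated string of key::value tokens into field -> list of values."""
    fields = defaultdict(list)
    for token in meta.split():
        key, sep, value = token.partition("::")
        if sep:
            fields[key].append(value)
    return dict(fields)

# Hypothetical _meta content after the CSV indexed extraction.
meta_before = "Datum::2018-02-24 Meldender::Mustermann Ort::Berlin"

# REGEX and FORMAT from the [meld] stanza, applied to the field value.
match = re.search(r"(.{0,2}).*?(.?)", "Mustermann")
masked = match.expand(r"\1-X-\2")

# WRITE_META = true appears to append the new token; the old one stays in place.
meta_after = meta_before + " Meldender::" + masked

print(parse_meta(meta_after)["Meldender"])  # ['Mustermann', 'Mu-X-s']
```

As a side note, the unanchored lazy `.*?` matches nothing here, so the second group captures the third character of the value rather than the last one.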
I also tried changing WRITE_META to false and using DEST_KEY = _meta, once adding $0 to my FORMAT in order to keep the existing metadata and once leaving it out entirely. The first variant left the field "Meldender" unchanged (it still contained the unmasked data), while the second, just as you would expect, erased all fields except "Meldender". So none of these three attempts solved the problem.
I believe this question is in the same vein as this or this one, which didn't get any satisfying answers so far. I have one ugly solution coming up, but please share your thoughts!
I would also try (not tested):
[meld]
REGEX = (.{0,2}).*?(.?)
FORMAT = Meldender::$1-X-$2
WRITE_META = true
SOURCE_KEY = field:Meldender
DEST_KEY = field:Meldender
[accepted_keys]
is_valid=field:Meldender
I downvoted this post because the answer does not solve the problem. The accepted answer should be moved to jeffland's below.
This is actually a really good solution. It didn't occur to me that you can access fields with field:field_name both as SOURCE_KEY and DEST_KEY at the time I wrote the question, but this should work well.
@jeffland @Dan Did you get this to work? I've the exact same problem, but following the approach above I always end up with a multi-value field containing the original field value and the replacement value specified by the FORMAT command. If I put FORMAT = $1, then I just get the original value. I can successfully use a SEDCMD to remove the value from _raw, so that part of the problem is fixed, but I'm struggling with the indexed field. This seems to be much harder to do than it should be! Any help would be much appreciated! I'm using 7.0.2. What I'd ideally like to do is drop the field completely, but I'm happy if I can at least mask it.
The above worked for me (accessing the field in metadata with field:name and applying REGEX and FORMAT to it).
If you get a multi-valued field, you're probably using both KV_MODE and INDEXED_EXTRACTIONS in your sourcetype at the same time. Make sure that KV_MODE = none to avoid search time field extraction.
If you want to remove a field from indexed fields, you'll have to re-write the metadata information like this:
REGEX = (?m)^(.*)<your_field_name>\:\:<regex matching your field values>(.*)$
FORMAT = $1$2
WRITE_META = false
SOURCE_KEY = _meta
DEST_KEY = _meta
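If you want to sanity-check such a regex before restarting Splunk, a small Python sketch like the following can help. It is only an approximation: it assumes _meta is a space-separated string of key::value tokens, uses Python's re module in place of Splunk's regex engine, and the helper apply_transform is a made-up stand-in for a transform with SOURCE_KEY = DEST_KEY = _meta. The field name and values are taken from the TSV sample discussed in this thread.

```python
import re

def apply_transform(meta, regex, fmt):
    """Hypothetical model: on a match, the expanded FORMAT replaces the whole key."""
    match = re.search(regex, meta)
    return match.expand(fmt) if match else meta

# Assumed _meta layout for one event.
meta = "DialedNumber::4789226712 Masked_CLI::07123456789 WithheldFlag::N"

# <your_field_name> = Masked_CLI, value regex = \S+ (no spaces in the value).
regex = r"(?m)^(.*)Masked_CLI\:\:\S+(.*)$"

result = apply_transform(meta, regex, r"\g<1>\g<2>")
print(result.split())  # ['DialedNumber::4789226712', 'WithheldFlag::N']
```

The value regex `\S+` is an assumption that only holds for values without whitespace.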
You can probably optimize that regular expression. If you're okay with just replacing the value of your field, this should be faster:
REGEX = .
FORMAT = <your_field_name>::-
WRITE_META = true
SOURCE_KEY = field:<your_field_name>
DEST_KEY = field:<your_field_name>
[accepted_keys]
is_valid = field:<your_field_name>
If that doesn't work, you should probably ask a new question with more details about your settings. Feel free to tag me in it.
@jeffland Thank you very much for taking the time to answer this, it's really appreciated. The first approach works - I end up with the field missing from both the indexed fields and the _raw which is exactly what I need, so thanks for that. The second method doesn't work though. I get the original value of the field as a single-value. Here is the props.conf for that test:
[MySourceType]
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = TSV
KV_MODE = none
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = DateTime
TIME_FORMAT = %Y-%m-%d %H:%M:%S
category = Structured
disabled = false
pulldown_type = true
TRANSFORMS-dropMaskedCLI = dropMaskedCLI
SEDCMD-maskedCLI = s/^([\S ]+\t)(\S+\t)(\S+\t)(.+)/\1\2\4/
The transforms.conf:
[dropMaskedCLI]
REGEX = .
SOURCE_KEY = field:Masked_CLI
DEST_KEY = field:Masked_CLI
FORMAT = Masked_CLI::-
WRITE_META = true
[accepted_keys]
is_valid = field:Masked_CLI
I'm assuming you meant the hyphen in the FORMAT command as a string literal in your example? The data I'm loading looks like this (tab-separated):
DateTime DialedNumber Masked_CLI WithheldFlag
2018-02-24 00:00:02 4789226712 07123456789 N
If you've any idea why this does not work I'd be interested to hear, but you've given me a working solution. Once again thanks very much for your help!
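For reference, the SEDCMD from the props.conf above can be checked against the sample row outside of Splunk. This is just a sketch using Python's re.sub as a stand-in for sed, assuming the columns really are separated by single tab characters:

```python
import re

# Sample row from the post, with tabs between the four columns (assumed).
row = "2018-02-24 00:00:02\t4789226712\t07123456789\tN"

# Same pattern as in SEDCMD-maskedCLI: keep groups 1, 2 and 4, drop group 3.
sed_pattern = r"^([\S ]+\t)(\S+\t)(\S+\t)(.+)"
masked_row = re.sub(sed_pattern, r"\1\2\4", row)

print(masked_row.split("\t"))  # ['2018-02-24 00:00:02', '4789226712', 'N']
```

The first group uses `[\S ]+` so that the space inside the timestamp is allowed while the tab still ends the column.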
You're right, it doesn't work as I said it would - when you use WRITE_META=true, it doesn't overwrite any existing fields. It just appends to _meta, same as if you used DEST_KEY=_meta with FORMAT=$0<something>. I'm sorry for the confusion.
@jeffland No probs. Thanks again for your help.
So I took to a drastic way and changed my transforms to this:
[meld]
REGEX = (?m)^(.*Meldender\:\:)(.{0,2}).*?(.?)(\s.*)$
FORMAT = $1$2-X-$3$4
WRITE_META = false
SOURCE_KEY = _meta
DEST_KEY = _meta
This changes _meta instead of adding to it. I don't really know what this does to the time needed to index data, using a SEDCMD and a very ugly regex on the entire metadata on top. I'm lucky I only have to do this once in a while with small amounts of data... this can't be the solution.
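Here is a quick way to try the regex outside of Splunk. It is a sketch under assumptions: _meta is modelled as a space-separated string of key::value tokens, Python's re module replaces Splunk's regex engine, apply_transform is a made-up helper, and the field names Datum and Ort and all values are invented.

```python
import re

def apply_transform(meta, regex, fmt):
    """Hypothetical model: on a match, the expanded FORMAT replaces the whole key."""
    match = re.search(regex, meta)
    return match.expand(fmt) if match else meta

regex = r"(?m)^(.*Meldender\:\:)(.{0,2}).*?(.?)(\s.*)$"
fmt = r"\g<1>\g<2>-X-\g<3>\g<4>"

# Meldender is followed by another field: the value gets masked in place.
meta = "Datum::2018-02-24 Meldender::Mustermann Ort::Berlin"
print(apply_transform(meta, regex, fmt))
# Datum::2018-02-24 Meldender::Mu-X-n Ort::Berlin

# Meldender is the last token: (\s.*) cannot match, so nothing is changed.
meta_last = "Datum::2018-02-24 Meldender::Mustermann"
print(apply_transform(meta_last, regex, fmt))
# Datum::2018-02-24 Meldender::Mustermann
```

The second case suggests the regex relies on at least one more key::value pair following Meldender, so it is worth checking the field order in your own data.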
This answer, not the accepted answer above, is the correct solution. I was able to get it to work with the following:
[mask_ssn01_cs_uri_query]
SOURCE_KEY = _meta
REGEX = (?i)(.*(?:ssn|SearchValue)=)\d{0,5}(\d{4}.*)
DEST_KEY = _meta
WRITE_META = false
FORMAT = $1XXX-XX-$2
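The same kind of offline check works for this stanza. Again a sketch only: the _meta string and the sc_status field are invented, apply_transform is a hypothetical helper, and Python's re module stands in for Splunk's regex engine.

```python
import re

def apply_transform(meta, regex, fmt):
    """Hypothetical model: on a match, the expanded FORMAT replaces the whole key."""
    match = re.search(regex, meta)
    return match.expand(fmt) if match else meta

regex = r"(?i)(.*(?:ssn|SearchValue)=)\d{0,5}(\d{4}.*)"
fmt = r"\g<1>XXX-XX-\g<2>"

# Made-up _meta content with a nine-digit number in the query string.
meta = "cs_uri_query::ssn=123456789 sc_status::200"
print(apply_transform(meta, regex, fmt))
# cs_uri_query::ssn=XXX-XX-6789 sc_status::200
```

Because the two groups capture everything before and after the digits, the rest of _meta survives the rewrite.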