To identify web content that hasn't been requested in a while, I thought I would use Splunk, since a) my Apache logs are already in Splunk, and b) I can easily create a scripted input to get a list of files under the various directories. Initially, I'm going to do this for our .cgi and .pl files.

So, I have one index for the standard Apache access logs. I do have a field extraction for this called file. More on that later.

I then created a scripted input that runs once per day to pull a list of files under our content sub-directory (we're talking 13,000+ files). An example of the input looks like this:

09/29/10 15:42:46 -0400,file=actDefaultAccSet.cfm,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=liferayLogin.html,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=favicon.ico,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=favicon.gif,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=Cps_Doc_Upload_Rules.doc,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=ordocs-index.jsp,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=contact_me2.cfm,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=orprefs-index.html,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=ppsathanks.html,app_root=public,dir=/cfmx_files/cfmx61/public
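
The script behind this input isn't shown in the post. As a rough sketch only, a scripted input that emits lines in the format above could look like the following. The function name `list_file_events`, the choice of Python, and the rule that `app_root` is the first directory level below the content root are all assumptions made for illustration.

```python
import os
import sys
import time


def list_file_events(content_root, now=None):
    """Yield one 'timestamp,file=...,app_root=...,dir=...' line per file."""
    # One timestamp for the whole run, e.g. "09/29/10 15:42:46 -0400".
    stamp = time.strftime("%m/%d/%y %H:%M:%S %z", time.localtime(now))
    for directory, _subdirs, filenames in os.walk(content_root):
        rel = os.path.relpath(directory, content_root)
        # Assumption: app_root is the first directory level below content_root.
        app_root = "" if rel == "." else rel.split(os.sep)[0]
        for name in sorted(filenames):
            yield "%s,file=%s,app_root=%s,dir=%s" % (stamp, name, app_root, directory)


if __name__ == "__main__":
    # Hypothetical default; pass the real content directory as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for line in list_file_events(root):
        print(line)
```

Splunk indexes whatever the script writes to standard output, so one line per file gives one event per file per run.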

I can do a query that looks like this:

index="prod_ohs_logs" [search index="prod_coldfusion_files" file="*\.cgi" OR file="*\.pl" | fields file ] | table file | dedup file

This returns only 36 of the 125 .pl / .cgi files out there, which is not what I'm looking for.

Basically, I want to take a list of files from a specific query and check which of those files are found in the Apache logs, including the ones with zero results.

I've spent a couple of days trying to get this working, and I haven't been able to. Any ideas on how to do this? Is it even possible?

asked 29 Sep '10, 19:48

Brian Osburn


One Answer:

Your best strategy here is to use an OR search to load data from both prod_ohs_logs and prod_coldfusion_files at the same time and see, for each file, whether it is in one, the other, or both of the indexes. For example:

index="prod_ohs_logs" OR (index="prod_coldfusion_files" file="*\.cgi" OR file="*\.pl") | chart count by file index

answered 29 Sep '10, 20:55

Stephen Sorkin ♦

Great, it's a starting point. I need to figure out how to list only the files that have 1 as the result under prod_coldfusion_files.

(30 Sep '10, 15:31) Brian Osburn
Just add "... | search prod_coldfusion_files=0" to your search.

(30 Sep '10, 16:06) Stephen Sorkin ♦

Pure awesomeness Stephen. Thank you!

(30 Sep '10, 17:27) Brian Osburn
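
For readers who want to see why the OR search keeps zero-hit files while the subsearch version drops them, here is a small Python sketch of what a count-by-file-and-index table works out to. It is only an illustration of the idea, not how Splunk implements chart. The sample events and file names are made up, and the final filter follows one reading of the question's goal (files in the listing with no hits in the Apache logs).

```python
from collections import Counter, defaultdict


def chart_count_by_file_index(events):
    """Count events per file, with one zero-filled column per index."""
    counts = defaultdict(Counter)
    indexes = set()
    for event in events:
        counts[event["file"]][event["index"]] += 1
        indexes.add(event["index"])
    # A Counter returns 0 for a missing key, so every row gets every column.
    return {f: {i: c[i] for i in sorted(indexes)} for f, c in counts.items()}


# Hypothetical events: two files in the listing, only one of them requested.
events = [
    {"index": "prod_coldfusion_files", "file": "report.cgi"},
    {"index": "prod_coldfusion_files", "file": "old.pl"},
    {"index": "prod_ohs_logs", "file": "report.cgi"},
    {"index": "prod_ohs_logs", "file": "report.cgi"},
]

table = chart_count_by_file_index(events)

# Files that are in the listing but never appear in the access logs.
never_requested = sorted(
    f for f, row in table.items()
    if row["prod_coldfusion_files"] > 0 and row["prod_ohs_logs"] == 0
)
print(never_requested)  # ['old.pl']
```

Because both sources are loaded side by side, a file that exists only in the listing still gets a row, with 0 in the access-log column, which is exactly the row a filter on that column picks out.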

Asked: 29 Sep '10, 19:48

Seen: 299 times

Last updated: 29 Sep '10, 20:55

Copyright © 2005-2012 Splunk, Inc. All rights reserved.