Splunk is a leading log discovery platform, used by many small-to-medium companies as an operational and/or application discovery service.
Last week I was trying to expose login stats to a BI dashboard for one of my clients by extracting the events from application logs, so that business/product teams have more insight into how many successful/failed login attempts happen on a day-to-day basis (in fact, close to real time).
But login events are written only to the web application log and are not logged/tracked through any other event management system. The client has tens of application servers, and when you have tens or hundreds of servers (if not thousands), things can get really complicated.
Without implementing a new queue service and introducing additional code to write events to it, we were left with only two options for extracting the login events from the log files (it could be any event, for that matter):
- On individual servers, process the logs using filters for known events and write to a central queue service; this can be easily implemented using logstash + RabbitMQ/Amazon SQS.
- Log all events to a central logging system (using rsyslog or syslog-ng), then filter known events there and write them to the central queue service (same as above, but in a single place, using logstash with output to RabbitMQ/Amazon SQS).
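Both options boil down to the same step: match known events in a stream of log lines and push only those to a queue. Here is a minimal sketch of that step in Python, not the logstash setup itself. The pattern, the "Failed login" line format and the function name are assumptions for illustration, and the standard library queue only stands in for RabbitMQ/Amazon SQS.

```python
import queue
import re

# Hypothetical pattern for the events of interest; adjust to the real log format.
LOGIN_EVENT = re.compile(r"\[UserLogin\] (Successful|Failed) login")


def forward_known_events(lines, out_queue):
    """Put every line that matches a known event on the queue.

    Returns the number of lines forwarded.
    """
    forwarded = 0
    for line in lines:
        if LOGIN_EVENT.search(line):
            out_queue.put(line.rstrip("\n"))
            forwarded += 1
    return forwarded


# Made-up sample lines, shaped like the login log shown later in this post.
sample = [
    "04/05/2013 00:34:54,459 [INFO ] [UserLogin] Successful login attempt by email@example.com\n",
    "04/05/2013 00:34:55,101 [DEBUG] [Cache] refreshed 12 entries\n",
    "04/05/2013 00:34:56,020 [WARN ] [UserLogin] Failed login attempt by email@example.com\n",
]

events = queue.Queue()  # stand-in for RabbitMQ / Amazon SQS
forwarded_count = forward_known_events(sample, events)
```

The only difference between the two options is where this filter runs: on every application server, or once on the central log server.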
You don’t really need a queue if you don’t have complex event types that get consumed by different services; the events can instead be written directly to the analytical data store. The advantage of a redundant, distributed queue is that it makes it easy to handle real-time events written directly by the application layer. In general, the application layer should avoid writing events directly to the analytical store; it should simply write to the queue service for optimal performance and scalability.
Once the events are available in the central queue system, depending on the size/scale of events, one can load them into HDFS, analytical stores (Vertica, Greenplum, etc.), MongoDB (logstash has a mongodb output plugin), or even MySQL/PostgreSQL. Both MongoDB and Vertica fit well for log event storage and retrieval with minimal effort, and MySQL/PostgreSQL can also be used, especially for small data sets with the right partitions and indexes in place.
Here is the visual representation of both the models:
- Log Parsing from Individual Servers:
- Log Parsing from Central Log Server:
But instead of implementing yet another log transport for event extraction, we decided to use Splunk. It is already in place, indexes all unstructured logs from all servers close to real time (a few seconds' delay), and has an efficient REST API that supports extraction of events with a simple search.
Log parsing from the Splunk REST API:
- It is simple to use (you can use either curl or Splunk’s SDK, which is available for all major languages)
- You can search on any keyword or multiple keywords, and also use regex
- You can specify the time range using earliest_time (it’s important not to re-read old events)
- You can limit how many events are fetched in a single call using count
- You can restrict the returned fields using field_list (in most cases you only need selected fields for a given event, like IP, user, geo, operating system, etc.)
- You can sort the events on any given field using sort
- You can fetch results in csv, xml or json format by specifying output_mode, or even raw in some cases
- It has a few analytical functions built in, in case you want count/group/order etc. (count can be useful for real-time counters without fetching the actual events)
- You can even create a search job for frequently used searches
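To make the parameters above concrete, here is a minimal Python sketch that only assembles the URL and form body for a search call and sends nothing. The function name is hypothetical, and the endpoint path (/services/search/jobs/export) and management port 8089 are my assumptions rather than something taken from the client setup; whether a given endpoint honours every parameter should be checked against the REST reference for your Splunk version.

```python
from urllib.parse import parse_qs, urlencode


def build_search_request(host, query, earliest_time="-10m", count=None,
                         field_list=None, output_mode="json", port=8089):
    """Return (url, form_body) for a Splunk search call; nothing is sent."""
    # Assumption: the export endpoint on the default management port.
    url = f"https://{host}:{port}/services/search/jobs/export"
    params = {
        "search": query,
        "earliest_time": earliest_time,  # do not re-read old events
        "output_mode": output_mode,      # csv, xml or json
    }
    if count is not None:
        params["count"] = count          # cap the number of events returned
    if field_list:
        params["field_list"] = ",".join(field_list)
    return url, urlencode(params)


url, body = build_search_request(
    "splunkhost",
    "search UserLogin | search Successful login | head 1",
    field_list=["_cd", "_time", "_raw"],
    count=1,
)

# Decode the body again to inspect what would be posted.
decoded = parse_qs(body)
```

The resulting body can be posted with any HTTP client using basic authentication, which is what the curl -u and -d options do.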
Apart from this, Splunk also has a unique ID for every event, which makes it easy to load events into any analytical store using an upsert mechanism that handles old, overlapping events.
More than all of this, it removes the hassle of maintaining another log forwarder and log processing tools and/or services.
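As a rough illustration of the upsert idea, here is a sketch using SQLite from the Python standard library as a stand-in for the analytical store. It assumes the _cd field is used as the event ID (on a multi-indexer deployment you would likely need to combine it with the index and server name), and the table name, function name and sample values are made up.

```python
import sqlite3


def upsert_events(conn, events):
    """Insert events keyed by event_id; re-loading an event replaces its row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS login_events ("
        "event_id TEXT PRIMARY KEY, event_time TEXT, raw TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO login_events (event_id, event_time, raw) "
        "VALUES (?, ?, ?)",
        [(e["_cd"], e["_time"], e["_raw"]) for e in events],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")

# Made-up events; the second batch overlaps with the first one.
batch_1 = [
    {"_cd": "12:345", "_time": "2013-04-05T00:34:54", "_raw": "Successful login A"},
]
batch_2 = batch_1 + [
    {"_cd": "12:346", "_time": "2013-04-05T00:43:00", "_raw": "Successful login B"},
]

upsert_events(conn, batch_1)
upsert_events(conn, batch_2)

row_count = conn.execute("SELECT COUNT(*) FROM login_events").fetchone()[0]
```

Because the event ID is the primary key, an extraction window that overlaps the previous one does not create duplicate rows.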
Here is a simple curl example that searches for “DefaultAuthenticationLog && Successful login” in the last 10 minutes (limited to 1 event):
venu@ ~ 00:35:25# curl -k -u user:password -d 'search="search UserLogin | \
search Successful login | head 1"' -d "earliest_time=-10m" -d "output_mode=csv" \
"04/05/2013 00:34:54,459 [INFO ] [UserLogin] \
Successful login attempt by email@example.com (xx.xx.xx.xxx) on 04/05/2013 00:34 AM
Another example, using search.py from the Python SDK:
venu@ ~ 00:43:19# ~/tools/splunk-sdk-python/examples/search.py --username=user --password=password \
--earliest_time="-10m" --host=splunkhost --field_list "_cd,_time,_raw" --count=1 \
--output_mode=json "search UserLogin | search Successful login | search firstname.lastname@example.org"
"_raw": "04/05/2013 00:43:00,459 [INFO ] [UserLogin] Successful login attempt by \
email@example.com (XX.XX.XX.XXX) on 04/05/2013 00:43 AM"
Which is pretty good, and the API response time from Splunk is also quick, especially if you are looking for the latest data, which is served from the “hot bucket”.
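Before loading, the _raw value usually needs to be split into fields. Here is a minimal Python sketch for the login line shown above. The regular expression, the field names and the "Failed" alternative are assumptions based only on that one sample, and the date is assumed to be month/day/year.

```python
import re
from datetime import datetime

# Assumed layout, derived from the single sample line above.
RAW_PATTERN = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}),\d{3} "
    r"\[(?P<level>\w+)\s*\] \[(?P<source>\w+)\] "
    r"(?P<outcome>Successful|Failed) login attempt by (?P<user>\S+) "
    r"\((?P<ip>[^)]+)\)"
)


def parse_login_event(raw):
    """Return a dict of fields for a login line, or None if it does not match."""
    match = RAW_PATTERN.match(raw)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: the log writes dates as month/day/year.
    fields["timestamp"] = datetime.strptime(fields.pop("ts"), "%m/%d/%Y %H:%M:%S")
    return fields


raw = ("04/05/2013 00:43:00,459 [INFO ] [UserLogin] Successful login attempt by "
       "email@example.com (XX.XX.XX.XXX) on 04/05/2013 00:43 AM")
parsed = parse_login_event(raw)
```

Lines that do not match return None, so unexpected events can be skipped or logged instead of breaking the load.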
With output_mode=json, the data can be loaded directly into MongoDB using mongoimport; with output_mode=csv, it can be loaded into MySQL using LOAD DATA.
Now this event extraction process can be scheduled to run every 5/10/30 minutes through ETL and load into any analytical store (with an appropriate earliest_time parameter, so that we don’t re-read the same events again and again). These events can then be consumed directly by the reporting service, or by another ETL process if the data is big and needs to be re-aggregated on a daily/hourly basis.
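One way to pick the "appropriate earliest_time" is to remember where the previous run stopped. Here is a minimal Python sketch of such a checkpoint, assuming epoch seconds are passed as the time value; the file name and function names are hypothetical, and sending the second value as latest_time is my assumption, not part of the setup described here.

```python
import json
import os
import tempfile
import time


def next_window(checkpoint_path, now=None, default_lookback=600):
    """Return (earliest, latest) epoch seconds for the next extraction run."""
    now = int(now if now is not None else time.time())
    try:
        with open(checkpoint_path) as fh:
            earliest = json.load(fh)["latest"]
    except (FileNotFoundError, KeyError, ValueError):
        # First run or unreadable checkpoint: fall back to a fixed lookback.
        earliest = now - default_lookback
    return earliest, now


def commit_window(checkpoint_path, latest):
    """Record the end of the window once the load has succeeded."""
    with open(checkpoint_path, "w") as fh:
        json.dump({"latest": latest}, fh)


checkpoint = os.path.join(tempfile.mkdtemp(), "splunk_checkpoint.json")

first = next_window(checkpoint, now=1000)    # no checkpoint yet
commit_window(checkpoint, first[1])
second = next_window(checkpoint, now=1600)   # resumes where the first run ended
```

Committing the checkpoint only after a successful load means a failed run is simply retried over the same window, and the upsert on the event ID takes care of any overlap.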
Basically, if you have a Splunk deployment, Splunk can be used as the single source for all (unstructured) log events in big data analytics, which avoids the log processing needs described in the big data architecture. More than that, Splunk only charges for storage and not for API usage.