Göteborg, 25 Oct, 2011
Welcome to SIEM
As of late here at Secode, we have been rolling out our new SIEM solution. Having worked for a major SIEM vendor for 5 years, I’m personally excited to be on the other end of things for once. So for the next few weeks I’m going to be writing a series of blog posts explaining SIEM and trying to take the mystery out of what they are, how they work and the considerations you need to remember when looking for one.
For those who have heard of SIEM but might not know exactly what it is, let me take a second to explain. SIEM is the convergence of SIM and SEM. SIM stands for “security information management”, or basically a log management solution. It’s that big computer you send all your security log data to, from syslog streams on your Unix systems to the alerts coming off your IPS devices. It’s good for going back and seeing who logged into what devices 12 months ago. Then on the other side we have SEM, which stands for “security event management”. Back in the day, this used to be done by a very pale person in a dark room who sat behind a wall of screens, but now it is usually accomplished by a clever application. This program basically looks at your security events as they are produced and tries to make correlations across them. Say a firewall allows some traffic through, that traffic then hits your IPS sensor, which registers a possible attack signature, and then a log on the targeted device shows someone gained access. Tying those three events together is correlation, and that’s SEM. SEM is more real-time, while SIM is more historical. SIEM is simply a combination of both technologies, and leveraged correctly, SIEM can be the most useful tool in your infosec toolbox.
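To make the firewall, IPS and login example a bit more concrete, here is a minimal Python sketch of that kind of correlation. It is only an illustration under my own assumptions: the `Event` fields, the source and kind labels, and the five-minute window are all hypothetical, and a real SEM engine is far more elaborate than this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    time: datetime
    source: str   # hypothetical label: "firewall", "ips" or "host"
    kind: str     # hypothetical label: "allow", "attack_signature" or "login"
    target: str   # address of the device the event concerns

def correlate(events, window=timedelta(minutes=5)):
    """Return targets where allow -> attack signature -> login occur in order within the window."""
    wanted = [("firewall", "allow"), ("ips", "attack_signature"), ("host", "login")]
    progress = {}  # target -> (index of next expected step, time of first step)
    hits = []
    for ev in sorted(events, key=lambda e: e.time):
        step, started = progress.get(ev.target, (0, None))
        # Forget a half-matched chain once it is older than the window.
        if step > 0 and ev.time - started > window:
            step, started = 0, None
        if (ev.source, ev.kind) == wanted[step]:
            if step == 0:
                started = ev.time
            step += 1
            if step == len(wanted):
                hits.append(ev.target)
                step, started = 0, None
        progress[ev.target] = (step, started)
    return hits

t0 = datetime(2011, 10, 25, 9, 0, 0)
events = [
    Event(t0, "firewall", "allow", "10.0.0.5"),
    Event(t0 + timedelta(seconds=2), "ips", "attack_signature", "10.0.0.5"),
    Event(t0 + timedelta(seconds=30), "host", "login", "10.0.0.5"),
    Event(t0 + timedelta(seconds=40), "host", "login", "10.0.0.9"),
]
print(correlate(events))  # ['10.0.0.5']
```

The login on 10.0.0.9 is ignored because nothing suspicious led up to it, which is the whole point of correlating events instead of alerting on each one in isolation.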
I’m going to guess that if you’re interested in this blog post, then you probably have some driver that is causing you to consider a SIEM. That’s OK, it’s not very often someone wants to tackle something as complex as SIEM just for fun. For large SIEM deployments we used to tell customers it’s similar to rolling out an ERP system. If you think about it for a minute you start to see the similarities: with an ERP you need to touch all the devices and systems that are going to feed it, and the same goes for a SIEM. Yes, the data is different, but think about going to every device and application on your network and configuring them to send their logs to a central place. You not only need to configure syslog feeds, install agents on your Windows systems, and use custom retrievers for other network devices, but you also have to reconfigure routers and security devices to allow the traffic. And yes, you’ll run into all kinds of unexpected issues along the way. A normal SIEM deployment, in my experience, takes several months from start to finish. So as a word of advice, if anyone ever tries to sell you a SIEM “appliance” and pitches it like you just drop it in and walk away, I’d seriously wonder if they really understand what they are talking about.
But back on topic, I’m going to focus first on the components of SIEM and what they are really going to do for you. Specifically, we’re going to talk about the “SIM” side of things, and naturally start with the data collection and storage process.
As we mentioned before, a SIEM is designed to take data feeds from pretty much any device you can throw at it. This can include syslog feeds from Unix systems, Windows logs, firewall and IPS logs, and even devices you may not think of, like the PBX that runs your VoIP phone system, or the system that tracks when people swipe their access cards to get into and out of your building. A good SIEM should be able to eat anything without complaint. This feature is something the vendors will always tout in their pitches, but there is one thing a lot of them leave out – storage. So say you have 10 firewalls, each producing 1 GB of data a day. Those devices alone produce 10 GB of raw data per day. Combine that with all the other devices on your network and all of a sudden you’re capturing 30 GB of data per day, 24 hours a day, 7 days a week, 365 days a year. At the minimum you’re looking at almost 11 terabytes of raw data per year. All of a sudden you have what’s called “big data” – that’s data you’re not going to store in a conventional relational database, like MySQL or Oracle (unless you enjoy pain and torture).
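If you want to plug in your own numbers, the arithmetic is easy to sketch in Python. The figures below are just the illustrative ones from this post, not measurements, and the `retained_tb` helper is a name I made up for the example.

```python
# All figures are the illustrative ones from the text, not measurements.
FIREWALLS = 10
GB_PER_FIREWALL_PER_DAY = 1
TOTAL_GB_PER_DAY = 30   # firewalls plus everything else on the network
DAYS_PER_YEAR = 365

firewall_gb_per_day = FIREWALLS * GB_PER_FIREWALL_PER_DAY
raw_gb_per_year = TOTAL_GB_PER_DAY * DAYS_PER_YEAR
raw_tb_per_year = raw_gb_per_year / 1000  # decimal terabytes

def retained_tb(gb_per_day, retention_days):
    """Raw storage needed to keep `retention_days` of logs, in decimal TB."""
    return gb_per_day * retention_days / 1000

print(firewall_gb_per_day)  # 10
print(raw_gb_per_year)      # 10950
print(raw_tb_per_year)      # 10.95
```

Swap in your own daily volume and retention period, for example `retained_tb(30, 90)` for three months of logs, to get a first rough idea of what the back end has to hold before compression.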
So how should your SIEM solution address this issue? First off, compression – it should compress the raw data. Even if your SIEM is using open source algorithms, the raw data can shrink by anywhere from 10% to upwards of 50%. I’m sure you can see the advantage of that, so we’ll quickly move on to the second issue with storage, and that’s the engine it actually uses.
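You can get a feel for this yourself with a few lines of Python and the standard library’s `zlib` module (an open source compression library). This is only a sketch: the log lines are synthetic and made up for the example, so the ratio it prints says nothing about what your own logs or any particular SIEM product will achieve.

```python
import zlib

# Synthetic, made-up syslog-style lines; real logs will compress differently.
lines = [
    f"Oct 25 09:{i // 60 % 60:02d}:{i % 60:02d} fw01 kernel: ACCEPT "
    f"SRC=10.0.{i % 256}.{(i * 7) % 256} DST=192.168.1.10 PROTO=TCP DPT=443"
    for i in range(5000)
]
raw = "\n".join(lines).encode("utf-8")
packed = zlib.compress(raw, 6)  # 6 is zlib's usual default compression level

ratio = len(packed) / len(raw)
print(f"raw={len(raw)} bytes, compressed={len(packed)} bytes, ratio={ratio:.2f}")

# Compression is lossless: the original bytes come back unchanged.
assert zlib.decompress(packed) == raw
```

Try it on a sample of your own logs before trusting anybody’s compression figures, including mine.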
So second, a flat key-value database (like HBase), a columnar data store (like SenSage or MonetDB), or a hybrid of the two (like Cassandra or Splunk) should be used in place of a row-based relational database. Conventional row-based databases use indexing, and this will cause them to fall to their knees and die a painful and slow death. Why? As the indexing grows, it causes search times to grow exponentially. Yes, exponentially. A search that would take 10 minutes could now take 100 minutes with twice as much data. Double the data again and you’re looking at 10,000 minutes. Of course this is just a simplistic example, but you get the idea. The advantage with flat or columnar databases and data stores is that the search times stay linear. Why? With those types of databases you narrow down the range of your search to the particular part of the data you’re interested in, and then just search that. To give a practical example, say you want to know every time user X has logged into your web server over the last year. In a relational database, you would have to traverse every – single – row – one – at – a – time. While in the flat and columnar databases, you would first narrow the data down to the time range you are interested in (1 year), then further narrow it down to the device you are interested in (web server), and then search for that user within that subset of data. So that same search that takes 10 minutes, then 100 minutes with twice the data in your relational database, would take 10 minutes, then 20 minutes respectively. Again, these aren’t perfect examples, but you get the idea (exponential growth is f(x) = 2^x, while linear is simply f(x) = 2x).
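The “narrow it down first, then search” idea can be sketched in a few lines of Python. To be clear, this is a toy illustration and not how HBase, Cassandra, Splunk or any other product actually lays out its data: the `(day, device)` bucketing, the `store` and `search` functions, and the sample events are all assumptions I made up for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical layout: events are bucketed by (day, device) as they arrive.
partitions = defaultdict(list)

def store(day, device, user, action):
    partitions[(day, device)].append({"user": user, "action": action})

def search(user, device, start, end):
    """Scan only the buckets for `device` between `start` and `end` inclusive."""
    hits, scanned = [], 0
    day = start
    while day <= end:
        for event in partitions.get((day, device), []):
            scanned += 1
            if event["user"] == user and event["action"] == "login":
                hits.append((day, event))
        day += timedelta(days=1)
    return hits, scanned

# One login per device per day for a year; userX shows up every 30th day.
start = date(2011, 1, 1)
for offset in range(365):
    day = start + timedelta(days=offset)
    for device in ("webserver", "firewall", "mailserver"):
        store(day, device, "userX" if offset % 30 == 0 else "someone", "login")

total = sum(len(bucket) for bucket in partitions.values())
hits, scanned = search("userX", "webserver", start, start + timedelta(days=364))
print(f"{len(hits)} logins found, {scanned} of {total} events scanned")
# 13 logins found, 365 of 1095 events scanned
```

Only the web server buckets in the requested time range are ever touched, so the work grows with the slice of data you ask about rather than with everything you have ever collected.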
So what’s the first thing you’re going to remember when talking about SIEM? What type of datastore the SIEM uses on the back end. Sure, they have a fancy GUI with blinking lights and nice graphs, but it’s the unglamorous back end that will make or break a SIEM.
Well, I think this is enough information for this post. Next time we’re going to go into the “SEM” side of SIEM, which will dive into what to actually do with that fire hose of data being shot at you, and the mountain of data that has already been collected.