Log Management & SIEM at Threat Stack
In this webinar, we’ll find out why Threat Stack adopted Graylog, how a log management tool improved their security operations, and how it was used to solve a number of common problems. We will look at how to roll these programs out and at some real use cases of how Graylog and Threat Stack work together through various workflows. In fact, Graylog eventually became Threat Stack’s SIEM platform, allowing the team to leverage automation and benefit from a larger, more advanced architecture that could cope with all kinds of threats.
What does Threat Stack do?
Threat Stack provides continuous security monitoring to customers operating in the cloud. They check every single one of their customers’ servers in real time to see what software they are executing, what users are doing, and what happens with their file system and customer data. Threat Stack also looks at customers’ AWS CloudTrail information to understand the relationship between the infrastructure control plane and what’s actually happening inside the workloads.
As a consequence, Threat Stack is an interesting target for a broad range of malicious actors who want to steal data from its customers. A security team is in place to protect customers, and to ensure maximum security, Threat Stack has set up automated log aggregation and a Security Information and Event Management (SIEM) platform by integrating its own product with Graylog.
Why did Threat Stack need Graylog’s advanced log aggregation?
The journey of Threat Stack began in 2014, when the company and the product were launched by a small team of just four people. Then, in 2015, they got the chance to go on stage at AWS re:Invent to launch the product and company with Werner Vogels, the CTO of AWS. At that point they had to deal with terabytes of raw customer data per day, and had to figure out how to handle much larger problems with their SaaS platform. They needed to find a way to leverage log data, and their first approach was to stream it into a centralized, SaaS-based log aggregator.
However, that was just a temporary solution, and as they entered 2016, the team at Threat Stack realized that SaaS-based solutions were not customizable, secure, or granular enough. Streaming, storing, and performing some core log aggregation operations wasn’t enough anymore. Outsourcing to a SaaS-based provider also carried the risk of weakening the security of customer data and of the internal platform. Clients would not be comfortable knowing that a third-party SaaS provider was dealing with their log data, so pulling it inside their own environment was the only option at that point.
Being a SaaS company themselves, the team at Threat Stack didn’t have virtual clients and didn’t install anything into their customers’ environments other than a little agent. They figured out how to operationalize technology and employed their own internal DevOps and security teams. However, most engineering teams start digging into log data only when there’s a problem with a customer that needs to be fixed. They needed an efficient way to work on logs automatically, beyond just debugging operations. Therefore, they needed an on-prem solution that could aggregate logs more efficiently without running into cost, visibility, and granularity issues. Graylog was able to deliver a lot of incremental benefit beyond the core log aggregation functionality, and it did so in a way that allowed Threat Stack to keep control over all operations with no risk of accidental leaks.
Why are larger companies switching to self-hosted log management solutions?
When a company is still young and small, starting with something that runs in the cloud, such as a SaaS log solution, is certainly a good idea. At that stage of a company, you have other problems than your logs. But once the company starts growing into a more serious business, SaaS-based solutions are no longer enough. Working with an on-premise solution that operates on servers and data centers owned by the company itself becomes the natural answer. Here are the four main reasons why larger companies eventually switch to self-hosted solutions such as Graylog:
Escalating costs
Many SaaS-based solutions offer pricing sliders that start really cheap, making you feel like you’re saving a lot of money. However, the costs of storing large amounts of log data in a software-as-a-service solution escalate rather quickly. So, once a company wants to collect more logs or handle larger amounts of data, the costs can become prohibitive.
Lack of options and flexibility
When the logging solution is completely hosted, there’s no real way to extend the system with plugins or hook it into other systems. You’re always limited to what the vendor offers you, depriving your company of its much-needed agility and flexibility.
Handling data efficiently
When the amount of data is sizeable enough, handling it can be complicated due to physics and bandwidth problems. Getting large amounts of logs up into the cloud, or from one cloud into another, can be close to impossible if your company is itself a sizeable software-as-a-service provider with its logs living in AWS or Google Cloud.
Fear of the open source ecosystem
Today, many SaaS-based companies are afraid to embrace more open ecosystems, fearing that people are just going to pull everything out of their product. Other companies, like Threat Stack, have instead embraced this openness in terms of open APIs, webhooks, and external automation. They did it because they realized that the more their customers build on top of them, the more value those customers get, making Threat Stack more important to them.
Implementing a SIEM – What’s the correct way to do it?
Back in 2017, the Threat Stack team had to run a Type 2 gap analysis across the company for security and availability. Auditors tend to like SIEM solutions, and more often than not they include one as a requirement. But Threat Stack didn’t decide to implement such a platform simply because the auditors told them they needed to. Although everyone seems eager to share horror stories about faulty SIEM platforms (especially in more traditional enterprise settings), they knew these platforms could be really beneficial. Their philosophy was “if we’re going to do a SIEM, let’s do it well”, so rather than pulling in a traditional enterprise one, they started working out how to operationalize a SIEM inside their environment.
Their principal applications and requirements for the platform to be effective and efficient were:
Detection – They wanted to generate useful alerts based on application logs rather than on generic systems. Log data contains much more useful and pertinent data that can be used to hone and refine the system.
Alert Management – A good system should store data from multiple sources and only provide an alert when a known pattern is detected. A classic example is multiple failed logins followed by a successful one.
Analytics – A good SIEM needs to rely on actionable data that is used to set security program priorities and determine control efficacy. Decision making should rely on robust data instead of “magic” or “mysticism”.
The first step was to use their own product and system as the biggest use case. Threat Stack itself acted as the experiment, since the team worked and lived inside the product every day. They pulled data from the product as well as logs and other information from across the entire infrastructure. This data had to enable data-driven security decisions rather than programs and strategies driven by mysticism. When solving a single issue or de-risking a behavior may require up to 30 days, priorities must be set correctly, and decisions must be made using a solid input base extracted from application logs.
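The failed-login pattern mentioned under Alert Management can be sketched in a few lines of Python. This is only an illustration, not Threat Stack’s or Graylog’s implementation: the LoginEvent type, the threshold of five failures, and the five-minute window are all assumptions chosen for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class LoginEvent:
    """Hypothetical, simplified login record for this example."""
    timestamp: float  # seconds since epoch
    user: str
    success: bool


def detect_bruteforce_success(events, threshold=5, window_seconds=300):
    """Return successful logins preceded by >= threshold failures in the window."""
    recent_failures = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for event in sorted(events, key=lambda e: e.timestamp):
        failures = recent_failures[event.user]
        # Drop failures that have fallen out of the time window.
        while failures and event.timestamp - failures[0] > window_seconds:
            failures.popleft()
        if event.success:
            if len(failures) >= threshold:
                alerts.append(event)
            failures.clear()  # a success resets the streak for that user
        else:
            failures.append(event.timestamp)
    return alerts
```

A single failed attempt followed by a success produces no alert here, which is the point of waiting for a known pattern instead of alerting on every raw event.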
How to leverage your data correctly
One of the main mistakes made by companies that jump into the SIEM space is to expect the system to be some sort of Oracle of Delphi. You don’t just dump your data into a massive churn, create a data lake, add some machine learning and a pinch of data science, and then watch valuable advice magically come out like a block of cheese.
You can’t expect a small and constrained DevOps team that is already putting a lot of effort into multiple reliability panels to spend its precious time harvesting useless data. If you keep throwing every single piece of detection technology into your system, your team will be blasted by noise, and alert fatigue will be just around the corner. Only specific pieces of data that are actionable enough should be collected and used. If your dedicated security team isn’t able to work with it, there’s no point in storing it and streaming it into the platform.
Establishing your internal rules
A structured approach is mandatory. A simple way to establish your internal rules is to define a hypothesis and then collect data from a signal that can prove or disprove it. Another effective method is to make assertions about your environment.
For example, one of the favorite IDS rules at Threat Stack checks when someone runs the ping command. Four years ago, they disabled ICMP Echo (the ping protocol) in their environment because they had better availability checks and nobody had a reason to run ping anymore. However, ping is a very common tool that is likely to be run when someone who doesn’t know the environment accesses it. When a ping command is run, whether by a malicious actor or by a junior engineer, an alert is sent out warning about a potential issue.
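An assertion of this kind (“nobody runs ping here”) can be expressed as a very small rule. The sketch below is a hypothetical illustration: the event dictionaries and their keys (exe, user, host) are assumptions for the example and do not reflect the actual Threat Stack event schema.

```python
import posixpath

# Assumption: commands that should never run in this environment.
FORBIDDEN_COMMANDS = {"ping", "ping6"}


def violates_assertion(event):
    """True if the process event ran a command we asserted is never used."""
    exe = posixpath.basename(event.get("exe", ""))
    return exe in FORBIDDEN_COMMANDS


def build_alerts(events):
    """Turn violating process events into short human-readable alerts."""
    return [
        f"{e.get('user', 'unknown')} ran {posixpath.basename(e['exe'])} "
        f"on {e.get('host', 'unknown')}"
        for e in events
        if violates_assertion(e)
    ]
```

Because the rule encodes a statement about the environment instead of a signature for a specific attack, any hit is worth a look, whoever triggered it.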
Using the right security tools
When a problem comes up, enterprises often just throw more tools at it. However, buying new tools is a good idea only if you have the wrong ones. Since you need people and processes to back new tools, duct-taping multiple open source solutions together will just lead to shelfware. A good security tool doesn’t necessarily come with the word “security” on its label.
Most of them will simply run more of the same commands, such as doing host config checks or assessing whether your filesystem is configured according to NIST standards. And since your Ops team is probably living inside its tools already, it is collecting all the data needed by your (usually smaller) security team, so there’s no need to deploy anything else.
Carving out specific alerts
One of the issues that worries companies and auditors alike is making sure that when an alert is fired, only the right people can see it. For privacy and security reasons, you don’t want the entire company to be able to see those alerts. For example, when an employee starts acting as an insider threat or bad actor, you don’t want him or her to see every alert that you’re firing about potentially harmful behavior. It is therefore necessary to carve out specific indices and dashboards restricted to the security team.
The natural lifecycle of data
Putting data into Graylog is quite simple: you just need to hook up your tools and ship data into it. However, at Threat Stack, the goal was to set up a series of automated actions based on the most valuable data that could be correlated and digested, in order to build actionable and practical insights. So, they established a natural workflow cycle that pulls data out of Graylog to look at what caused a given alert and how the responder eventually reacted, and then puts that data back in to enrich it even more.
For example, one use case is setting up a Python script that triggers when a certain alert is sent to Graylog. Once this script starts, a memory dump is automatically pulled from the machine that generated the trigger, and then the whole system is secured and shut off from the rest of the network. In a nutshell, the system is isolated immediately and a state that can be investigated later is captured on the spot. This way, you don’t end up forcing your whole system to stop, and you can keep all your standard business traffic flowing through your architecture. Just make an API call to AWS and apply a new security group rule that blocks all network egress; if malware landed on that box, it won’t be able to reach its command and control server or send anything else back out.
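The isolation step could look roughly like the sketch below. It is not Threat Stack’s actual script. It assumes a boto3 EC2 client is passed in, that a quarantine security group with no egress rules already exists (the group ID is hypothetical), and that the instance has a single network interface. The memory capture step is left out.

```python
def isolate_instance(ec2_client, instance_id, quarantine_sg_id):
    """Swap an instance's security groups for a single quarantine group.

    ec2_client is expected to behave like boto3.client("ec2").
    quarantine_sg_id is a hypothetical, pre-created security group
    that allows no egress traffic.
    Returns the previous group IDs so the change can be reverted later.
    """
    described = ec2_client.describe_instances(InstanceIds=[instance_id])
    instance = described["Reservations"][0]["Instances"][0]
    previous_groups = [g["GroupId"] for g in instance["SecurityGroups"]]

    # Replace every attached group with the quarantine group only.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[quarantine_sg_id],
    )
    return previous_groups


# Example wiring (not executed here; requires boto3 and AWS credentials):
#   import boto3
#   old = isolate_instance(boto3.client("ec2"), "i-0123456789abcdef0", "sg-quarantine")
```

Returning the previous group IDs makes the isolation reversible, which matters when the alert turns out to be a false positive.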
Integrating SIEM and SOAR together
In what is often viewed as an anti-pattern, the Threat Stack team decided to adopt an additional new tool at the same time: a security orchestration (SOAR) platform. One of the reasons for this choice was the presence in their security team of many programmers and security engineers who knew the code of the system well. However, all the commercial SOAR products felt too cookie-cutter, and they wanted a much more flexible solution.
They started pulling Cloudflare logs for active DDoS mitigation, as well as Threat Stack alerts and featured data sources. Whenever something bad happens, Slack alerts are sent automatically. But everything that the orchestration system sees, it also drops into Graylog together with internal logs. This way, all relevant information coming from the application, host, or infrastructure layer is centralized in one place. Here, the security, ops, and engineering teams can collaborate to get a full picture of an issue before it even hits the server, and route any potential issue more efficiently across the organization.
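One way an orchestration tool can drop what it sees into Graylog is by posting GELF messages to a GELF HTTP input. The sketch below shows the idea only. The helper names, the source_tool field, and the example URL are assumptions, and it presumes a GELF HTTP input has been configured on the Graylog side.

```python
import json
import time
import urllib.request


def build_gelf_message(host, short_message, level=6, **extra_fields):
    """Build a GELF 1.1 payload; extra fields get the required '_' prefix."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity, 6 = informational
    }
    for key, value in extra_fields.items():
        if key == "id":
            raise ValueError("GELF does not allow an additional field named _id")
        message[f"_{key}"] = value
    return message


def send_gelf_http(message, url):
    """POST the message to a GELF HTTP input and return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Example (hypothetical URL, not executed here):
#   msg = build_gelf_message("soar-1", "DDoS mitigation triggered", level=4,
#                            source_tool="cloudflare")
#   send_gelf_http(msg, "http://graylog.example.internal:12201/gelf")
```

Tagging each message with the tool it came from keeps the centralized view searchable by source once everything lands in one place.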
Real examples and day-to-day use cases
Let’s walk through a couple of real, ground truth examples coming from Graylog and Threat Stack that blue teams are living every day.
Analyzing privilege escalations
The privilege analysis histogram is a pretty straightforward one. In some cases, engineers need to get into production because they help run the application. The first dashboard helps to understand what they do when they access production. The graph shows some spikes, so the question is: why are they escalating privileges? The answer can simply be an update requiring a series of debugging steps. So, it can be something bad or something completely fine, but it is really important to tell the difference.
Using the Quick Values functionality in Graylog on that search, we can pull out the users and find all those who were escalating privileges across that 7-day period. In this example, just two users accounted for nearly 79% of all the privilege escalations in the environment.
We need to dig even deeper, so we then modify the search to aggregate on arguments from the Threat Stack alerts and stack the result with the username. This way we can determine whether different users were escalating privileges for legitimate operations or maintenance (as in this example), or as a smoke screen to cover suspicious activities.
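The same two-step analysis can be reproduced outside the UI on exported alerts. The sketch below is an illustration only: the alert dictionaries and their user and arguments keys are assumed for the example and are not the real Threat Stack alert schema.

```python
from collections import Counter


def top_share(alerts, field="user", top_n=2):
    """Return the top_n values of a field and their share of all alerts."""
    counts = Counter(alert[field] for alert in alerts)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share


def stack_by(alerts, first="user", second="arguments"):
    """Count alerts per (first, second) pair, e.g. per user and command."""
    return Counter((alert[first], alert[second]) for alert in alerts)
```

The first function answers “who is escalating the most”, while the second shows what each of those users was actually running, which is what separates routine maintenance from something suspicious.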
Config files being edited
In our second example, the team at Threat Stack started receiving a lot of alerts because config files in the /etc file system were being edited. The first step is to check the commands that were editing the file system. The Threat Stack agent listens for file system events, so the team can see when a file is edited and when it is opened. Once again, it can be something serious, like malware on a host scanning the file system to get information, or just a summer intern logging into production and poking around where they shouldn’t because they’re “learning”. In this case, files were being edited, the spike was coming from ldconfig.real, and there was some manual file editing happening in the environment.
The next step is to switch to Threat Stack and do a quick forensic analysis to find out what ldconfig does and what it is editing. As the heat map shows, ldconfig is running out of /sbin and modifying ld.so.cache in a fairly repeatable and deterministic way. This means that it is likely normal automation in the environment that repeatedly performs this action on a schedule. By starting in Graylog and pivoting over to Threat Stack, we got the answer we needed in roughly a minute. If we want, we can suppress the alert to keep the count down.
Knowing your system better
One of the biggest benefits of living inside the environment with these tools is that an analyst or responder gets to understand it at a much deeper level. When malware does land, you will know much better how your environment normally behaves, and you will recognize what is not okay almost immediately. If customer data is accidentally leaked and a forensic analysis is necessary, for example, a better understanding of the environment allows you to dive in exactly where it is needed and be much more effective in that incident response. Instead of reviewing tons of alerts and indicators of compromise, you can dismiss most of them and focus only on those that matter.
Questions & Answers
To close the webinar, let’s have a look at a couple of questions.
In addition to the internal orchestration map at Graylog, what other monitoring tools are implemented at Threat Stack for visibility?
At a high level, the team frequently runs multiple experiments with different open source or commercial tools at the same time across the environment. In terms of core functionality and the alerting/visibility stack, they use Threat Stack, Graylog, Glasswire, and some customized app logic. They also leverage Sensu, Grafana, and Graphite.
Using an orchestration app and log aggregator seems like one way of applying DevOps to security. How else, if possible, can companies set up their tech stack to achieve similar results?
If you have a more traditional information security team that does not specialize in coding, it might make sense to use more of Graylog’s built-in functionality. You can forego the automation route and work inside the tool, customizing it as you see fit.
The bottom line is that Graylog is a very generic and flexible log management platform, so you’re not stuck with security use cases alone. You can use it to monitor your development environment or the health of the application. Or you can integrate other tools to leverage the expertise of your operations team, using Threat Stack and Graylog as the center of your ecosystem. If you want to find more examples of real Graylog use cases, you can check our marketplace or blog.
I recently implemented Graylog and I am quite happy with its capabilities. I’ve been experiencing a corrupt journal lately and have often had to delete it. Have you experienced this issue? If so, can you advise on a permanent solution?
The Graylog journal is a feature that writes every incoming message to disk so that it persists until it has been processed. It is built on Kafka’s log code, so it’s great at writing append-only files to disk extremely fast.
The main reason why a journal gets corrupted is running out of disk space. On some occasions, the journal can also get corrupted if you’re mounting it from a remote location and the file system is not being mounted properly for some reason (for example, if the connection isn’t reliable). If you’re still experiencing issues, you can check our community here and we will find a solution together.