Log Management & SIEM at Threat Stack
Hello everybody and welcome to today’s presentation: Security TestedOpt to Prove Log Management at Threat Stack, brought to you by Threat Stack.
Before we get started, I’d like to take care of a couple of housekeeping items. First off, I’d like to make sure y’all can hear me okay, so if you would, just send me a quick message in the chat panel to let me know if you can hear me. (PAUSES) Great. If you have any questions for our presenters today, please post those into the chat panel or the question panel so we can get to them after the presentation. If you experience any technical difficulties today, like the audio dropping out or the screen sharing not displaying properly, please post a message in the chat panel and I’ll work with you to resolve them. Lastly, we are recording today’s presentation and will be making it available shortly after today’s event concludes.
Now that we’re through the housekeeping items, I’d like to introduce you to our presenters. Sam Bigsby is Threat Stack’s Chief Security Officer and is responsible for security and compliance for the company and its cloud security product. The Threat Stack security team also functions as an incubation center, developing new techniques and tools for defending rapidly evolving cloud infrastructure. Sam is a security and technology executive with proven success scaling fast startups; he joined Threat Stack early to design and develop the Threat Stack platform and bring the product to market. Previously, he was CXO at Cloudant, which was acquired by IBM in February 2014. Now, I’d like to do a quick sound check for Sam. So Sam, if you would say hello; and audience, let me know if you can hear Sam by posting a message in the chat panel.
Perfect. Our (INAUDIBLE) today is Lennart Koopmann. Lennart founded the Graylog project in 2009 and has since worked with many organizations on log management and security-related projects. He has an extensive background in software development and architecture. His skills include Java, Ruby, Ruby on Rails, PHP, MySQL, MongoDB and Elasticsearch. Lennart attends many InfoSec conferences and has spoken at many, including DerbyCon. Now, I’d like to do a quick sound check for Lennart. So Lennart, if you would say hello to our audience; and audience, please let me know if you can hear Lennart by posting a message in the chat panel.
Hey, good morning everyone.
Awesome! And with that, we’re gonna turn things over to Sam and Lennart to begin the presentation.
Great; thank you very much! Well, good afternoon, or whatever time zone you’re in; hello and thanks for joining. Today, we’re going to be talking about the log management journey and how we adopted Graylog here at Threat Stack. Lennart and I are gonna be swapping in and out here, telling some stories and talking about our points of view on how and why we approach the problems the way that we do. One interesting thing that happened here at Threat Stack is that we kind of turned Graylog into our SIEM. We leveraged automation, and we have a larger, more advanced architecture than most shops of our size, because we have very real and interesting threats that are attacking the Threat Stack platform.
To give you an idea, Threat Stack provides continuous security monitoring to our customers, who are operating in the cloud. We’re looking inside every single one of their servers to see what software is executing, what users are doing, and what’s happening with their file systems and customer data. We look at their AWS CloudTrail information, the metadata, to understand the relationship between the infrastructure control plane of the cloud and what’s actually happening inside the workloads. We do all this in real time. So, as you can imagine, we’re an interesting target for a lot of people who want to get at our customers. We take security very seriously here and we’ve been investing in a security team, both to protect our customers and also for cases like this, where our customers also need log aggregation. Our customers also need SIEMs and more advanced tools and automation. So, we’re gonna go through this and look at the journey of how we rolled these programs out and some lessons that you can learn, and then we’re gonna walk through actual real use cases from Threat Stack, looking at actual screenshots pulled over the last couple of days from our production environment, so you can get an idea of how Graylog and Threat Stack combine and go through various workflows.
But this journey really began in 2014, when we launched the company and the product. Back then, there were about four of us building and scaling the SaaS platform, and we were bringing on a couple of customers. We were in this unique position where we had an early product, and we got a call in December of 2014 saying that we had a chance to go on stage at AWS re:Invent to launch the product and company with Werner Vogels, the CTO of AWS. So we were in kind of a “hair on fire” mode, and coming out of re:Invent that year and going into early 2015, we actually started to experience the first terabyte of raw data per day from our customers. And so, while we were busy building and scaling this multi-_______ SaaS platform and figuring out product and figuring out market (INAUDIBLE) and all those early “A-round” company problems, we also needed to get insight into what was happening in our logs and on our servers. In that summer of 2014, we weren’t really doing anything. We just kind of had _______ sitting on boxes, and maybe sometimes we would stream them to different locations, but they weren’t really being leveraged. Pretty quickly in 2015, we knew that we needed to start leveraging those logs, and we started to stream them into (INAUDIBLE) centralized, SaaS-based log aggregator. That was because we were “hair on fire”, building product, getting our company out there and bringing on our first customers, and we really didn’t have the time at that point to think about installing and building log aggregation. However, that changed as we entered 2016, when things started to stabilize; right? And we had multiple reasons, which I’ll be getting into, for why we needed to get off of the SaaS-based (INAUDIBLE).
With that cheaper SaaS-based product, it wasn’t customizable for us, there were some concerns about what we were doing with those logs, and we were at the point where we needed to bring it on (INAUDIBLE). When we looked at the market, we knew that we didn’t want to build our own. We looked at Graylog and a couple of (INAUDIBLE) outliers, brought them in, and adopted them for our log aggregation on the ops team. We’ll be talking through how security then began to leverage that investment.
But it was a very straightforward progression for us. If you look at a lot of our internal monitoring here at Threat Stack in the ops world, a lot of it goes down a similar path, such as with metrics. We started with a hosted metrics provider, and we ran that as long as we could, until we were running into cost issues and visibility and granularity issues. Then, when we were ready, we brought that investment inside, into our own cloud environment, and ran it ourselves, and we got a lot of benefits from there.
Now, this might seem kind of weird, because we are a SaaS company; we provide a solution only as (INAUDIBLE) to our customers. We don’t have virtual appliances, and we don’t install anything into our customers’ environments other than our little agent. So, it might seem kind of weird that in some of these places we were deciding to pull some of our infrastructure into our own environment. The first reason is that, being a SaaS company, we’ve figured out how to operationalize technology. We do the (INAUDIBLE) thing and we like to think that we do it very well. And as we succeeded in operationalizing our own internal software and systems, it started to make sense to bring some of these systems on-prem, or into the cloud, I should say. I’ll be using on-prem (INAUDIBLE) our environment interchangeably.
So, we had the people and the process ready to take this tool on. The other core piece was that the core functionality of log aggregation, around this time in 2015 and 2016, started to become commoditized, where you have people starting to embrace (INAUDIBLE), maybe (INAUDIBLE), Elasticsearch and all these different (INAUDIBLE), and if you spend enough time on Google, I’m sure you could find 20 other providers and ways of just doing basic core log aggregation functionality: streaming, storing and performing some simple search on those logs.
So, we knew that we’ve got the people; that’s good. This functionality is not so unique that we need to outsource it to a SaaS-based provider. We didn’t feel that we were going to get a lot of really strong incremental benefit, especially when we found vendors like Graylog, who were delivering a lot of that incremental benefit on top of the core log aggregation functionality, except in a way that we could control and bring inside of our shop. And I think, of those last four points, given what we do as a company, that was one of the most critical. Since we had the people, and since that core functionality was seen as commoditized by our team, it didn’t make sense for us to be put in a position where we were potentially leaking our customer data, internal platform (INAUDIBLE) and intellectual property out to a third party. This is something that Threat Stack works on with our own customers, and we’re able to make them quickly feel comfortable that their data is not going to be leaked out to Threat Stack, but we couldn’t get that comfort with some of the other SaaS providers that were out there in the log aggregation space. So, we decided to pull it inside of our environment and started to control that ourselves. Now, a lot of teams will say, well, you can pad-log data, you can go through and manage it yourself and understand it. But the reality is that most engineering teams, when they’re building applications, are using logs as deep debugging functionality as they’re building and scaling in their environment. And they really do need or want that functionality in production, so that when a customer hits a problem, they can quickly resolve that problem for that customer. Here at Threat Stack, we talk about raving versus raging customers. Any opportunity for us to optimize that support cycle and that time to customer value is gonna be a benefit.
So, we don’t want to put our engineers in the position where they have to be worrying about, oh, is it okay for me to log this out? Or, oh, if I have an exception or I hit an error state in this part of my application, do I need to worry that an uncaught exception or uncaught error is going to accidentally leak out internal information or customer data? We wanted to remove all those concerns and just provide this logging and debugging functionality internally, as a kind of utility that they didn’t need to think about, and make our customers feel safer that we were treating their data properly.
And Lennart, I think you’ve had some experiences with this with your other customers, as well?
Yeah, absolutely. So, when the two of us started talking about this topic a good while ago, I thought that what you just described is kind of the way a younger company starts: with something that runs in the cloud, a software-as-a-service log solution. That makes total sense, because at that stage of a company, I think you have other problems than your logs. But we see this transformation quite often, where people say, we have grown and we are a serious business now, and we feel like we can’t do everything with our logs that we wanna do, because it’s in the (INAUDIBLE) service, and they also run into other problems. Usually that’s really around three things. We see a lot of people say that it can be very, very cost prohibitive to store large amounts of logs in a software-as-a-service solution out there. These pricing sliders that they have tend to start really cheap, and you feel like you’re saving a lot of money, but when you as a company grow and you have more data and you wanna collect more stuff, this can get pretty cost prohibitive at some point. Then we also hear people saying that because it’s completely hosted, they don’t have a real way to extend the system with plugins or hook it into other systems. You’re always limited to what the vendor offers you. And of course, there’s another problem that comes with the sheer amount of data. There can be a problem with physics, I think, where the question is how do you get large amounts of logs up into the cloud, or into another cloud, if you’re running an (INAUDIBLE), for example, and your software-as-a-service provider for logs is living in AWS or the Google Cloud. You are running into bandwidth problems, traffic costs, all kinds of stuff. So, I just wanted to say that I agree 100% with you; it’s a great starting point, and there are really good products out there.
The moment you grow, or you’re getting more serious about your logs, we see a lot of people switching to self-hosted solutions. And when you hear Graylog or another on-premise solution, you hear the word on-premise and you kind of think, oh, I have to run this in (INAUDIBLE) data center, but we’re already running in AWS! How’s that gonna work together? Something that we learned very early on is that the real difference is software as a service versus “you own it”! And we see a lot of people running Graylog on a public (INAUDIBLE). Because in the end, it’s pretty much like your own data center; it’s just someone else’s server. As long as you have a Linux system, you’re completely fine to run this stuff yourself. So, the transformation that you did internally, I think, is something that we see a lot, and there are good reasons to do that at some point.
Yes, absolutely. And you know what? It’s funny you bring up some of those points, the volume of data and customization. As a SaaS provider, we are operators ourselves, and a lot of us came from environments where we could always just plug and play and do everything we wanted, and then all these SaaS black boxes started to pop up on the market. At Threat Stack, we saw that happening; we were consumers of SaaS and non-SaaS and we knew the pains of that. I think that’s why, on the agent side, where we’re building data and consuming all the telemetry from our customers, we work really hard to shrink that and make it as small as possible. We’re monitoring hundreds of thousands of workloads, and I’ve never had a customer notice anything about bandwidth, because the data is so much different than log data. And then on the extensibility side, it still baffles me how many SaaS-based products, especially in the log space, to your point, don’t embrace that openness, because they’re afraid that as soon as they embrace openness, people are just going to pull everything out of their product, which says more about their product, I think, than they want to admit. Whereas here at Threat Stack, we embraced that open ecosystem, open APIs, webhooks, external automation, ’cause we realize that the more our customers build on top of us, the more value they get, and it just makes Threat Stack more important to them, and it (INAUDIBLE) speaks to that way of doing business that so many people in the (INAUDIBLE) community have really embraced. That’s great.
So, I think there’s a very common question that we get from (INAUDIBLE), which is, why did you decide to bring in a SIEM in 2017? Everybody’s got horror stories of SIEMs, and this is basically the rough meeting transcript that we had back in 2017, when we were starting to become a more mature company. We had done our SOC 2 Type 2 gap analysis across the company for security and availability, and we were putting all these pieces together to be able to support our Type 2 audit period. One of the pieces was that auditors like to see SIEMs, and oftentimes that’s a requirement, depending on what you’re doing. So the initial reason why we were gonna do a SIEM, ’cause we never wanted to, was, well, the auditors told us we needed to. That both was and wasn’t true, ’cause there are real benefits to doing a SIEM; it’s just that you quickly fall down to the second bullet point of this meeting, which is everybody rattling off a million different horror stories about SIEMs and different experiences, especially in more traditional enterprise settings. Off of that, we’ve always had this philosophy here at Threat Stack that if we’re gonna do something, we may as well do it right. So, if we’re gonna go do a SIEM, let’s not just default to pulling in a traditional enterprise SIEM, though we did look at them. Let’s look around our environment, see what existing investments we have, and see, realistically, how we are going to operationalize the SIEM inside of our environment. Now, in 2017, there were really only two of us who were doing security full-time here at Threat Stack. That number has now grown quite a bit. But realistically, if we were to go make a six-figure investment in a tool, that was going to be far outside what was reasonable for us to live in every single day.
So, we knew that we needed to philosophically rethink this and hone in. These were our core use cases: we needed to detect, we needed to do alert management, and we wanted to do analytics. But the key is that we wanted to do it from many different data sources, including our own product. Increasingly, we had customers of growing sophistication and size who did not necessarily want to do 100% of the alert management inside of Threat Stack itself. We still have plenty of customers who do all of the alert management directly inside of Threat Stack, because that makes sense for them and how they work. Whether it’s because they’re small, or they’re bigger but their cloud is separate from their traditional IT security, whatever their reason might be, they’re living inside that product every day doing alert management. But we looked at it and said, well, let’s try that bigger use case, because we’ve been living inside the product every day for the prior couple of years, dogfooding. So now, let’s start to dogfood the bigger and more advanced use cases. If we’re gonna do that, we obviously need to put data from our own product in. But we also wanna put in logs and other information from across our infrastructure. And if we’ve got all this data sitting in there, we may as well try and get some longer-term use out of it, rather than just triggering alerts to Slack, though that’s a perfectly valid use case. We wanted to drive analytics, because in security, too often programs are driven off of mysticism instead of data. You’ve got a security crowd and traditional practitioners who have a hard time explaining what they need to do, why they need to do it, and the investments that they’re making. So they just get really good at painting a picture of blood in the streets, and that creates contentious relationships inside the organization.
Here at Threat Stack, we’ve got internal transparency as a cultural point that we really strive for and care about. But we also didn’t wanna be operating that way. We like to make data-driven decisions internally, including on security. So, data drives our priorities. If we have somebody that needs to go and address (INAUDIBLE) escalation and (INAUDIBLE), or what software is running in our environment, they wanna go in and look at the data to see what’s actually happening, and then we say, okay, if we’re gonna go spend 30 days to solve this problem or de-risk this behavior, what do we want this graph to look like after 30 days; right? And how can we do that across multiple inputs, whether it be from our CDN, from our application logs or from Threat Stack itself? We’re gonna go through some use cases like that at the back here.
Now, I think it’s important to know that a lot of companies just jump into the SIEM space. In a prior life, working in larger data systems and distributed systems, you always had this preference to just start throwing data in, whether you knew its value or not, whether you knew what it was doing or not: we’re just gonna keep streaming data into this massive (INAUDIBLE) cluster, where you’re gonna create a, quote, data lake, or whatever people call them now. Then we’re just gonna rub some machine learning on it, we’re gonna rub some data science on it, and the oracle of (INAUDIBLE) is gonna rise out of it and tell us all these important things and prove this is valuable. We knew that was a bad strategy and that it wasn’t gonna work for us. So, we went into this eyes wide open, knowing that we were only gonna pull in specific pieces of data that we knew were gonna be actionable and that we knew we were gonna be able to work with. If it wasn’t actionable, if we weren’t able to work with it on some regular (INAUDIBLE) on the team, there was no point in us streaming it in, storing it and doing anything with it.
We also had a dedicated security team. I think a lot of ops teams that are small and constrained are having a hard enough time spending their life inside of multiple ops availability and reliability panels that are telling them interesting things for their day job. Thinking that they’re then gonna spend a lot of time in security (INAUDIBLE) doing the same thing can get tricky. Some teams are able to do it and they do it very well. But not all of them are, and I think it comes back to the old adage that, sure, everybody is responsible for security, but not everybody can own security. So, we had a core group of people who truly owned security and were living it every single day, and they needed their dashboards; so that made sense. But we also had a mature detection program, where we were not just throwing out every single piece of detection technology. When we were deploying IDSs and other components in our environment, we would never just turn on every single rule, get blasted by noise, and say, oh well, this isn’t gonna work, we’re just gonna have alert fatigue, let’s not do anything about this. We take a more structured approach. We have a hypothesis; let’s deploy a piece of detection on that signal to prove or disprove that hypothesis and see what happens. Or we’ll make assertions about our environment. So, one of our favorite IDS rules internally here at Threat Stack is that we send a page, even at (INAUDIBLE), doesn’t matter, if somebody runs the command (INAUDIBLE). That’s because four years ago, we disabled ICMP Echo, or the ping protocol, in our environment, because we have better availability checks than that. We have advanced SIEMs (INAUDIBLE) and all these different checks, so nobody ever had a reason to run ping. And we could find our servers very easily with automation.
But ping is a common tool that somebody might run if they don’t know that. So, that’s a quick assertion for us to make. The two most common things that it catches in our environment are, first, more junior engineers and operators who don’t know our tool stack as well. If they were running ping and just messing around in production at 2 a.m., we definitely wanna know about that, so we can correct that behavior and teach them the proper tools. Or, in our last pen test, that was one of the rules that fired within seconds once we intentionally gave the pen tester access onto a host. They were on our host and they started running all these commands, including ping. That was an assertion about our environment that failed, and so we got alerted to it immediately. Alright.
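A minimal sketch of the "assertion about the environment" idea described above, written in Python. It is not Threat Stack's actual rule: the ProcessEvent shape, the field names and the FORBIDDEN_COMMANDS set are assumptions made for illustration. The point is only that a rule can encode "nobody here should ever run this command" and fire whenever that assumption is violated.

```python
from dataclasses import dataclass


@dataclass
class ProcessEvent:
    """Hypothetical shape of a process-execution event from a host agent."""
    host: str
    user: str
    argv: list


# Commands we assert are never run in this environment (assumption: ping only).
FORBIDDEN_COMMANDS = {"ping"}


def violates_assertion(event, forbidden=FORBIDDEN_COMMANDS):
    """Return True if the event's executable is one we assert is never used."""
    if not event.argv:
        return False
    # Compare on the executable's base name so /bin/ping and ping both match.
    executable = event.argv[0].rsplit("/", 1)[-1]
    return executable in forbidden


def alerts_for(events):
    """Build one human-readable alert line per violating event."""
    return [
        f"{e.user} ran {' '.join(e.argv)} on {e.host}"
        for e in events
        if violates_assertion(e)
    ]
```

Because the rule encodes an expectation rather than a signature, it tends to stay quiet in normal operation and is cheap to page on, which is what makes it useful for both the junior-operator case and the pen-test case.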
So, we had these different components already in place before we brought a tool in; alright. Too often people just throw more tools at the problem. Like, oh, I’ve got more alerts, oh, I’ve got more things, let me just buy more tools, tools, tools! Well, if you have the wrong tools, then buying new tools is the right idea. But you always need to remember to have people and process to back those tools; otherwise, you quickly get shelfware.
We knew that we did not want to build a (INAUDIBLE), multiple open source solutions together. We run enough Elasticsearch in our lives, we run enough Cassandra, we’ve got this very large platform. We’ve got bigger problems to solve than running more Elasticsearch and running more tools to do log aggregation. Too often, people will say about security tools, like a SIEM or something, that if it doesn’t say security on the label, it’s not good for security. It’s kind of like how a lot of tools out there will do host config checks: is your file system configured according to (INAUDIBLE) standards? Well, I don’t need to go buy a product for that. I run Chef in my environment and I can just have Chef basically assert my environment (INAUDIBLE). Or I could use SIEMs to do that already. The point here is that we were leveraging investments that our ops team had (INAUDIBLE). The ops team was larger than the security team; they’d already selected, deployed and operationalized the technology, and they were living in those tools. We looked at them and said, oh, we can get exactly what we need from them without going and buying anything more, without going and deploying anything else, and then we get the added benefit that both teams are living and breathing in the same tool. That’s pretty cool, because we can start to share a lot of value.
Lennart, I think this is where you’ve been working with some of your customers? (PAUSE)
(NO VERBAL RESPONSE HEARD).
Nope, we may have lost Lennart; alright. No worries. Going on to the next one. So, as we were progressing with the SIEM, we decided this is how we were gonna roll everything out; right? We carved out (INAUDIBLE) in Graylog, where our auditors, and companies in general, are very worried about whether, as alerts are firing and as you’re doing security monitoring in your environment, the right people can see them; that’s most important. But then there are occasionally cases where you don’t want the entire company, or everybody, to be able to see those alerts. There could be HR issues where (INAUDIBLE) a rule that says, hey, we’re about to decommission and terminate this employee. So, that gets logged out, or that gets alerted, or you’re starting to see different behavior. So, (INAUDIBLE) HR applications. Or, if somebody’s acting as an insider threat or a bad actor, you might not want them to see every alert that you’re firing about their behavior internally. So, there’s always a bit of a transparency question, and we walk that line very carefully internally here at Threat Stack, but we knew that we needed to carve those out. So, we have a special space inside of Graylog for us.
Then we looked at the data and went through that middle school science lab process of having a hypothesis, proving those hypotheses and moving on them over time. As we did that with our application logs, we then started bringing it to (INAUDIBLE) Stack. And over time, we just keep adding more and more data sources, so that we can look at them, understand them, prove them out and do a lot of those analytics pieces that we talked about.
As far as how we actually did it on the technical side, we already had the existing application logs, and they’re from multiple vendors. We could have just gone in and shipped the data straight to Graylog; there are a lot of different ways you can ship data into Graylog with these other solutions. But we knew that not only did we want to put data into Graylog, we wanted to take automated actions based on what was in Graylog. Whether that’s doing correlation across multiple event streams or reacting to alerts, whatever that might be, we also wanted to pull data out of Graylog, enhance it even more, and then either route those insights to our responders, or maybe to the employee who caused that alert, or we might want to put data back into Graylog. So, it’s a natural workflow cycle, and we knew that we didn’t wanna just tackle the problem of streaming data into Graylog. Adopting multiple new techniques or multiple new tools at the same time is often viewed as an anti-pattern. One of the reasons we decided to do it, though, is because we index very heavily on programming and security engineering on our security team. We have not invested very heavily in more traditional information security analyst types of roles. Most of the team is writing code and living in these systems every day. So, it made sense for us to leverage automation and invest heavily there at the same time that we’re investing heavily in the log aggregation and SIEM side. And I think I actually messed up on the last slide. I think this is where, Lennart, you want to talk about some of the pieces that you’re working on with your customers, and their stories.
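One of the "different ways you can ship data into Graylog" is GELF, Graylog's JSON log format, sent to a GELF UDP input. The Python sketch below is illustrative only and is not the pipeline described in the talk: the helper names build_gelf and send_gelf_udp are made up here, the server name in the usage note is a placeholder, and it assumes a GELF UDP input listening on port 12201 and messages small enough to fit in a single datagram (no chunking or compression).

```python
import json
import socket
import time


def build_gelf(host, short_message, level=6, **extra):
    """Build a GELF 1.1 message; extra fields get the required '_' prefix."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity, 6 = informational
    }
    for key, value in extra.items():
        if key == "id":
            # The GELF spec reserves the additional field name "_id".
            raise ValueError("'id' is not allowed as an additional field")
        message["_" + key] = value
    return message


def send_gelf_udp(message, server, port=12201):
    """Send one uncompressed, unchunked GELF message over UDP."""
    payload = json.dumps(message).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))
```

A call such as send_gelf_udp(build_gelf("app-1", "login failed", level=4, user="alice"), "graylog.example.internal") would ship one enriched event. Putting data back in after enrichment, as described above, is the same call with more additional fields attached.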
Absolutely. And---so, I wanted o—I wanted to say (INAUDIBLE)---on the previous slide, that I think that was for the next slide. So, go to—go to webinar, which just show me when I click the unmute button, it just showed me this empty dialogue box. So, computers are still on our site today barely; ---(LAUGHS)—it seems to work now! So, ---so we do have—so, I think this isa—this is a fantastic story and I think this something where—where a lot of---E-software as a service solutions really falls short. That is simply because if you want to automatically interact with things in your environment based on things that happened in your logs, you really need to have something that runs in your environment. You don’t want to—for example, for a use case that I really like is that---one of our users is based on alert in Graylog, he is triggering a python script, which is sent automatically pulling a memory—from the machine that triggered the alert. So, they have a bunch of system on deployed and work stations, on Window’s workstations and they—automatically—you log into that box and immediately pull a memory that will secure the whole system and then shut it off—from the-–from the rest of the network. So, they basically pull a state that they can investigate later and then immediately isolate the system. You can do this on things that you have very high confidence on, obviously. You don’t wanna do this for every kind of low-confidence alert and shut peoples’ work stations down (GRINS), but these kind of integrations I think are really interesting and I can say that Graylog as a product, we are about to come out with our 3.0 version, which—which makes a big leap forward, visualizations and—especially security workflow is making a security analyst life easier. Besides a lot of other things—but we’ll be—pretty much right after that, we’re gonna be---start---or we will start to work on new alerting functionality. 
And the plan is to extend this alerting functionality and move more toward a concept of triggers, which means that you will be able to automate things directly out of Graylog. So I really like these use cases. This is where on-premise software really shines, because you integrate it and then, like you mentioned, by keeping your data open and keeping it accessible to the rest of the organization, you can really build these super nice use cases and do stuff like that.
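The confidence gating described above, where only very high confidence alerts lead to a memory capture and isolation, can be sketched roughly as follows. This is a minimal illustration, not the user’s actual script: the `Alert` fields, the step names, and the 0.9 threshold are all assumptions made for the example, and nothing here is a Graylog API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert shape; these field names are not defined by Graylog."""
    host: str
    rule: str
    confidence: float  # 0.0 to 1.0, assigned by whoever wrote the rule


# Assumed cutoff for "very high confidence"; tune to your own environment.
ISOLATE_THRESHOLD = 0.9


def plan_response(alert: Alert) -> list[str]:
    """Return the ordered response steps for an alert."""
    if alert.confidence >= ISOLATE_THRESHOLD:
        # Capture state first so it can be investigated later, then isolate.
        return ["capture_memory", "isolate_host", "notify_responder"]
    # Low-confidence alerts only notify; no workstation gets shut off.
    return ["notify_responder"]


if __name__ == "__main__":
    print(plan_response(Alert(host="ws-17", rule="suspicious-process", confidence=0.95)))
    print(plan_response(Alert(host="ws-18", rule="odd-login-time", confidence=0.40)))
```

Keeping the decision separate from the actions makes it easy to review which alerts are allowed to trigger disruptive steps before anything touches a real machine.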
Absolutely. And so this is the architecture that we ended up landing on. I like that use case you mentioned of going and doing a (INAUDIBLE) dump on that endpoint. One of the versions of this that is often talked about in cloud environments, where you’re supposed to treat all your servers (INAUDIBLE) not (INAUDIBLE), is that people say, oh, you just go terminate the server. And I think that’s a very cool use case that a lot of people talk about, and when you get really sophisticated, that’s a great one. I think most ops people hear that and they just hear availability nightmares all over. Which, to be fair, is really no different than back in the 90s, when our routers and firewalls started putting in active DDoS prevention. Your web page would be featured on (Slashdot??) or whatever, and you would suddenly be taken offline. Why? Because the business was working. So, for example, internally or with our customers, we play with different use cases where maybe, instead of just terminating that server, going around your infrastructure terminating servers and creating a DoS event for yourself, you just go update a security group on that EC2 instance and you say, hey, you just did something weird. So instead of going onto the host where the malware is running or the actor is, where they might be able to block you from doing that, you just make an API call to AWS and apply a new security group rule to that (INAUDIBLE) that blocks all network egress. Now I’ve still got my standard business traffic coming in and flowing through my architecture, but if malware landed on that box, it’s not able to do command and control or anything back out. There are a lot of different use cases where you can go in there. But this is the kind of architecture that we landed on, where we built our own internal orchestration system.
Again, the primary reason we did that was because we index so heavily on coding abilities that we really wanted that flexibility. When we looked at pretty much every commercial security orchestration product, or SOAR, the unfortunate acronym for that market, a lot of it was kind of cookie cutter and (INAUDIBLE), square peg, round hole, and we just didn’t really want to do that. There are ways to do it very successfully; we just didn’t feel like that was the place where we wanted to plug in commercial products that were too cookie cutter.
But we started pulling Cloudflare logs (they do all of our active DDoS mitigation), our Threat Stack alerts, and future data sources that we keep bringing in. Depending on what we see, in an automated way, we can issue Slack alerts or page people. But everything that the orchestration system sees, it drops into Graylog. And internal logs are landing there too. That’s how we can get a full picture from out at our very edge, before it even hits our server or the (INAUDIBLE) layer, all the way into the application, our host, and our infrastructure layer. All this rich telemetry, alerts, and indicators are landing in one place where the security, ops, and engineering teams can collaborate. And then when things are seen and issues happen, because issues always happen, we’re able to route them more efficiently across the organization.
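The egress-blocking response mentioned earlier, where an API call to AWS changes the security group instead of terminating the server, could look something like the sketch below. It is one possible way to do it, not Threat Stack’s implementation. It assumes a pre-created quarantine security group (the ID shown is hypothetical) that mirrors the inbound rules you still need and has no egress rules. The function takes any object exposing `modify_instance_attribute`, which is the call the boto3 EC2 client provides for replacing an instance’s security groups.

```python
def quarantine_instance(ec2_client, instance_id: str, quarantine_sg_id: str) -> dict:
    """Replace all of an instance's security groups with one quarantine group.

    Assumption: the quarantine group already exists, allows the inbound
    traffic you still want, and has had its egress rules removed.
    """
    request = {
        "InstanceId": instance_id,
        # Groups replaces the full list of security groups on the instance.
        "Groups": [quarantine_sg_id],
    }
    ec2_client.modify_instance_attribute(**request)
    return request


# Usage with boto3 (not executed here; IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   quarantine_instance(ec2, "i-0123456789abcdef0", "sg-0123456789abcdef0")
```

Passing the client in as a parameter keeps the function easy to exercise with a stand-in object before it is pointed at a real account.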
So now we’re going to walk through a couple of real examples. When I say real examples, that means we’re not looking at Hollywood-movie nation states, organized crime, craziness. (INAUDIBLE) look at some day-to-day, ground-truth examples that blue teams are living every day, because they don’t live in L.A. in a Hollywood movie. These are also screenshots that I took directly from our production Graylog and Threat Stack instances. I would love to show you the full UI in every shot, but we had to scope it down a little bit to just specific areas, so I apologize that I can’t show you the full beauty of Graylog’s UI and Threat Stack’s UI. Our designer will be sad, but understanding.
So here’s an example where we were analyzing privilege escalations. This was a project where, on a regular basis, we look in and say, okay, in some cases we need to let engineers into production because they help run this thing. And so when they do access production, what do they do? How often are they escalating privileges, and why are they escalating privileges? As you can see over a timeline like this, there’s actually not that much happening. But then you see a spike in your graph. I say, okay, well, that’s interesting. There’s a lot of privilege escalation on that date. The next question is why. Again, this is not Hollywood; we don’t have some magical hologram saying, well, it’s these people and this is what’s happening. Anything in incident response is just like in ops: it’s a series of debugging steps. So, okay, we see this jump; what’s going on? It could be something bad, or it could be completely fine.
So, using the quick values functionality in Graylog off of that search, we said, okay, for all these Threat Stack alerts about privilege escalations, let’s start to pull out the users. And so here’s a quick values view of all the users that were escalating privileges across that 7-day period. You can see right there, with the red arrows, that the top two users represented 78 to 79% of all the privilege escalation in the environment. So it’s a good bet that that relates to that big spike in the previous graph. Now this is where we’re going to punch in again. Here’s where we then modified it to aggregate on arguments and also stack it with the username, right? So off of the quick values in Graylog, we say, okay, this is an interesting search: pull out the arguments from the Threat Stack alerts, aggregate them, and then re-aggregate by the user who is performing those escalations, so that we can determine whether or not different users were escalating privileges for different reasons in that period. One user might have been going in and doing perfectly legitimate operations or maintenance; the other might have been using that availability event as a smoke screen to go in and start to fiddle internally in the systems. But when we look at it, it’s all pretty straightforward. It’s a fairly flat distribution across what they were doing. They were having to go in and do some manual installation where there was probably a bug in a Chef cookbook or something(??), and they had to pause the automation for a moment, go in, make the servers act correctly to restore some functionality for a user, and make sure there was no business impact before they turned the automation back on. But this to me is a (INAUDIBLE). I look at this and I say, this is perfectly reasonable. This is why we have break-glass access into our environment.
If I see this happening on a regular (INAUDIBLE), where they’re having to go in and break glass daily or weekly, then I might have an issue, and I might decide, you know what, I need to invest some resources here in more automation so that they don’t have to break glass as frequently. Maybe I create an automated task, an easy button on the jump(??) host, where they can log in to the jump(??) host, mash the button and say, hey, stop automation, make things calm, and then move on. But I can make an informed decision about that investment as an executive because I see this right here. And it does not happen that frequently.
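The two aggregation steps in this example, first counting escalations per user and then stacking arguments by user, can be reproduced outside of Graylog in a few lines. The sketch below is only an illustration of the idea: the event records and the `user` and `args` field names are made up for the example and are not the actual Threat Stack alert schema or Graylog’s quick values implementation.

```python
from collections import Counter

# Made-up sample of privilege escalation alerts; field names are illustrative.
SAMPLE_EVENTS = (
    [{"user": "alice", "args": "apt-get install"}] * 4
    + [{"user": "bob", "args": "service restart"}] * 3
    + [{"user": "carol", "args": "apt-get install"}] * 2
    + [{"user": "dave", "args": "id"}]
)


def top_share(events, field, n=2):
    """Return the n most common values of a field and their share of all events."""
    counts = Counter(event[field] for event in events)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share


def stacked_counts(events, outer, inner):
    """Count values of `inner` within each value of `outer` (a stacked view)."""
    stacked = {}
    for event in events:
        stacked.setdefault(event[outer], Counter())[event[inner]] += 1
    return stacked


if __name__ == "__main__":
    top, share = top_share(SAMPLE_EVENTS, "user")
    print(top, f"{share:.0%}")
    for args, users in stacked_counts(SAMPLE_EVENTS, "args", "user").items():
        print(args, dict(users))
```

A flat distribution of arguments across the top users, as in the talk, suggests they were doing the same kind of work; one user with a very different set of arguments would be the thing to look at next.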
So here’s another fun one, where at some point we started to get a lot of Threat Stack alerts because config files in the /etc file system were being edited. That’s never really a good feeling, and when we saw that spike, I said, alright, what is that? Go in, run quick values on the commands that were editing the file system. The reason we were able to do this in Threat Stack is that we’re not hashing your file system like a traditional product would. We actually listen for file system events. So not only can we see when a file is edited, we can also see when a file is opened. That gets really powerful because, one, most adversaries are intelligent enough not to edit too many files, but you also tend to find things like malware on a host that’s scanning your file system for interesting information. Or, the far more common occurrence, a summer intern logging into production and poking around where they shouldn’t because they’re, quote, learning. But in this case, that’s not what was happening. We actually saw files being edited. And when we looked at it, the spike was coming from ldconfig.real. Before I drill in on that, while I’m going through this activity of understanding what’s happening in my environment, I also see down below that it looks like there’s some manual file editing happening in my environment. So I’m going to make a mental note to myself that that’s another piece of analysis I’m going to come back and do later. So whoever it is in my company that runs nano, that’s somebody I need to go and talk to and say, hey, here’s your real (INAUDIBLE).
But I then punch into Threat Stack because, you know, I could do some additional recon and analysis in Graylog, but this is a weird enough event, all these config files being edited by ldconfig. Let’s assume that I don’t know what ldconfig does. This feels really bad to me, and I want to get to the forensic view of the deeper (INAUDIBLE) really quick. I pivot over into the Threat Stack interface, into my alerts, and I look in and I see, okay, here’s my ldconfig, it’s running out of /sbin, and it looks like it’s modifying ld.so.cache. Alright. Let’s also assume that I don’t know what that means. I look down at the heat map and I see, okay, this is actually happening in a fairly repeatable and deterministic way. Yeah, it’s spread across my environment, but my environment is spinning up, spinning down, expanding; Chef’s always running. So maybe there is something happening at that time, on those days, on those systems. Clicking through the interface, this is the actual (INAUDIBLE) that caused the event that you get with your SIEM event. I see, okay, yes, ldconfig ran, it was running as root, it was not in an interactive session, and it (INAUDIBLE) a -p argument. I don’t remember everything about ldconfig, so I run man ldconfig and look at the documentation, and apparently what that does is print the list of directories in (INAUDIBLE) (INAUDIBLE) stored in the current cache, right? So all of this, likely, is automation in my environment that’s going in and performing this action, probably in a Chef cookbook; it’s probably in some part of our automation. At this point, all of this pivoting that I just did for this one alert that was happening so frequently in my environment, that looked so scary, you know, magical (INAUDIBLE), (INAUDIBLE), having to go in to the (INAUDIBLE), whatever; this actually took me roughly a minute, right?
From the time I saw it in Graylog to pivoting over to Threat Stack, and then probably another minute or two, I got my answer. And now I know all this is perfectly normal. Depending on the type of analyst and responder I am, I could maybe spend the next hour or two digging through Chef to figure out what’s actually causing this so I can fully understand it. But probably not, because we’ve all got very busy lives. And so in Threat Stack I can just suppress it: hey, if root, running out of the root directory, runs ldconfig with a -p argument and it’s not an interactive session, please do not alert me. Just suppress that right out of the environment. And then you see your alert count go down, right? So that’s just another example where you (INAUDIBLE) signatures; you’re making assertions, understanding your environment, and suppressing and honing it as you go. I will also comment that this has a great advantage for when something does go bad and does go wrong, as an analyst or responder. This is my week-to-week, not even day-to-day, but my week-to-week or month-to-month life: being able to understand my environment so quickly, make these assertions, and understand how it behaves. When malware does land, when I do get a letter that says, hey, customer data was leaked from you, please go in and do forensics, whatever that might be, I’m going to understand my environment so much better, and I will be so much more effective on that incident response that I’ll most likely be able to dive exactly to where I need to go. Or, if I need to review alerts, I’ll be able to dismiss and (INAUDIBLE) through a lot more data a lot more quickly, because I’m not looking for indicators of compromise. I’m not looking for those big positive, scary things that probably never happened in my environment. I’m actually living my life trying to understand my environment and understand the security and operations.
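The suppression described here is essentially an assertion over a handful of event attributes: user, working directory, command, arguments, and whether the session is interactive. A rough sketch of that matching logic follows. It is not Threat Stack’s rule syntax; the dictionary keys and the rule format are assumptions chosen only to make the idea concrete.

```python
# Hypothetical rule mirroring the conditions named in the talk; the key names
# are illustrative and not a Threat Stack rule format.
LDCONFIG_SUPPRESSION = {
    "user": "root",
    "cwd": "/",
    "command": "ldconfig",
    "args": ["-p"],
    "interactive": False,
}


def is_suppressed(event: dict, rule: dict = LDCONFIG_SUPPRESSION) -> bool:
    """True only if every attribute named in the rule matches the event exactly."""
    return all(event.get(key) == expected for key, expected in rule.items())


def remaining_alerts(events: list[dict], rule: dict = LDCONFIG_SUPPRESSION) -> list[dict]:
    """Drop suppressed events so only the ones worth a human's time remain."""
    return [event for event in events if not is_suppressed(event, rule)]
```

Because every condition must match, the same command run interactively, or by a different user, still raises an alert, which keeps the suppression narrow.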
So that’s our webinar. That’s how we’re using Graylog and Threat Stack here at Threat Stack. At this point, Lennart and I are happy to answer any questions that you have. We covered a lot of material, and it looks like we left exactly 10 minutes for questions, if anybody has them.
Alright. Our first question is: In addition to the internal orchestration app and Graylog, what other monitoring tools are implemented at Threat Stack for visibility?
Yep, another great question. So at a high level, we get to play with a lot of different things. We are probably running multiple experiments with multiple different tools at the same time across the environment, whether it’s open (INAUDIBLE) or commercial. But when you look at our core alerting and visibility stack, it’s obviously Threat Stack, it’s obviously Graylog. It is glassware. We also do a lot of work to actually modify our application logging to act as a kind of pseudo (INAUDIBLE), an internal business logic indicator where somebody might be attempting to access another customer’s data from their account. So, (INAUDIBLE) more customized (INAUDIBLE) there. But then the rest of it is kind of that note that I made: we leverage (INAUDIBLE) tools. We leverage (INAUDIBLE), we leverage Grafana, Graphite and (INAUDIBLE), pretty much the exact same building tools that our ops team is using. We’re actually looking to bring on one or two other commercial vendors in the (INAUDIBLE) space pretty shortly here; hopefully I’ll be able to talk about that pretty soon, in the coming months.
Great. Our other question is: Using an orchestration app and log aggregator seems like one way of applying (INAUDIBLE) to security. How else, if possible, can companies set up their tech stack to achieve similar results?
Yeah. So I think it kind of goes back to those core decisions about how your team is structured, how you’re making those investments, and understanding what makes sense. If you have a more traditional information security team who does not index very heavily on coding and Chef, or whatever it is that you’re using in your environment, it might make sense to use more of the built-in functionality inside of Graylog. As Lennart was talking about, you can customize it very heavily and bring in all those different functionalities. So maybe it makes sense to forgo the automation route and just live and breathe inside of that tool itself. You still (INAUDIBLE) Threat Stack directory and Graylog and all this; it’s just with less custom code. I don’t know, Lennart, if you want to talk about some other common deployment models in your customer base and, if they’re not rolling automation on top of it, how (INAUDIBLE) typically leveraging Graylog.
Yeah. So we really see a lot of people running automation on top of that. We also see a lot of people using Graylog across the stack, not only to monitor for security but also to monitor their development environments and the health of the application, all of that together, and we see a lot of cases where people actually use automation even in these other use cases. Because Graylog in the end is a very generic, I would say, log management platform, and it’s very popular for security use cases, but it’s in no way limited to only that. Like you said, you already had a big investment internally in Graylog in the operations team; we see that a lot, that it’s being used really across the stack. And then there are so many ways that you can use Graylog with other integrations. For example, we see a lot of integrations with other security tools. Another system, or maybe even another (INAUDIBLE), would send logs into Graylog, and then people go into Graylog for the incident response and stuff like that. We have a few good examples on our blog, and we also have the Graylog Marketplace, which is our marketplace on Graylog.org, where people can link to their (INAUDIBLE) repositories with guides, plugins, and stuff like that. So there are a lot of ways that you can build a whole ecosystem there with something like Threat Stack and Graylog in the middle.
Great! That looks like it was our last question. Is there anything you guys would want to add before I close things off?
I’ve seen one question in the question form. I don’t know if you guys can (CROSSTALK).
Oh, yep. So the question is: I recently implemented Graylog and I am quite happy with its capabilities. I’ve been experiencing a corrupt journal as of late and I have had to delete it often. Have you experienced this issue? If so, can you advise on a permanent solution?
Yeah, I think I can take that. I think (STAMMERS) you (INAUDIBLE) it would be great to post this into one of our community channels so we can help you with that. What I think is interesting for everyone, though, is that the journal is what’s sitting in front of Graylog. Basically every message that comes in is written to this journal, which is a (INAUDIBLE). So if you have issues with Graylog itself or the architecture behind it, you’ll never lose your messages there, and they’re basically persisted on disk for a very long time. We are actually using (INAUDIBLE) internally. We’re using code from (INAUDIBLE) because (INAUDIBLE) is great at writing append-only files to disk extremely fast, and this is what the journal is doing. And if that gets corrupted, that is of course very unfortunate, and that is what this question is about. Maybe as a starting point: we usually see that when people run out of disk space, you end up with a corrupted journal. The other reason could be if you’re somehow mounting that journal from a remote location; maybe look into whether that is a reliable connection, whether the proper file system is being used, and whether it’s mounted the proper way. If none of that is the case, then just post this question to our community (INAUDIBLE) and we’ll be able to help you there. In general, we only see a corrupted journal when people are on a (INAUDIBLE) space or there’s an issue with the mounting. This should not be a permanent case, so I’m sure we’ll find a solution.
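Since running out of disk space is named above as the usual cause of a corrupted journal, a simple free-space check on the volume holding the journal is one way to catch the problem early. The sketch below uses only the Python standard library; the journal path in the usage comment and the 10% threshold are assumptions, so substitute the directory your own Graylog configuration points at.

```python
import shutil


def journal_disk_ok(journal_dir: str, min_free_fraction: float = 0.10) -> bool:
    """Return True if the volume holding journal_dir has enough free space.

    min_free_fraction is an assumed safety margin, not a Graylog recommendation.
    """
    usage = shutil.disk_usage(journal_dir)
    if usage.total == 0:
        # Some virtual file systems report zero size; treat that as unhealthy.
        return False
    return usage.free / usage.total >= min_free_fraction


# Example (the path is a placeholder; use your configured journal directory):
#   if not journal_disk_ok("/path/to/graylog/journal"):
#       print("Journal volume is low on space")
```

Running a check like this from whatever monitoring you already have gives a warning before the journal volume fills up, rather than after the journal has to be deleted.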
Awesome! Well, with that, I’d like to thank Sam and Lennart for a great presentation today. I’d like to thank today’s sponsor, Threat Stack, for providing our audience with a great presentation. And lastly, I’d like to thank the audience for attending, and we hope that you learned something that’ll help you in your developer career! Have a great day and we’ll see you next time.
Thank you very much; you have a good day.
Thanks. Bye everybody.