Talos Takes

When synthetic logs don’t lie: Generating coherent attack stories for better detection

Cisco Talos

Runtime: 19:29

Are your detection rules failing because your test data lacks the nuance of a real-world network? In this episode of Talos Takes, Amy sits down with David Bianco to discuss why traditional synthetic data often falls short and how his new open-source project, EvidenceForge, is changing the game.

Synthetic datasets often look like telemetry but lack the critical causal links and realistic background noise that define actual adversary activity. EvidenceForge solves this by creating data that tells a coherent, causal story. From simulating complex attack chains to modeling realistic, "bursty" human behavior, this tool helps threat hunters and detection engineers to sharpen their skills with reproducible, high-quality telemetry.

EvidenceForge blog: https://blog.talosintelligence.com/introducing-evidenceforge-synthetic-security-logs-that-dont-look-as-fake/

PEAK Threat Hunting Assistant episode: https://www.buzzsprout.com/2018149/episodes/18825324 

Welcome to the Talos Takes Podcast, where we discuss Talos' latest research and security news. This podcast is for everyone, from the C-suite to the front lines.

You're hunting through a data set, looking for the telltale signs of an adversary, but the logs feel hollow. The DNS query doesn't line up with the connection, and the process tree is missing its parent. It's the classic synthetic data problem. It looks like telemetry, but it doesn't behave like a real network, leaving your detection rules untested and your analysts frustrated.

Welcome to Talos Takes. I'm Amy Ciminnisi, and today David Bianco is joining me to talk about why current synthetic data sets like LANL and OpTC are failing our security teams. We discuss how his new open source project, EvidenceForge, is changing the game. EvidenceForge creates logs that both look real and tell a coherent causal story that gives your analysts a better foundation to sharpen their skills. David, welcome to the podcast.

Hi, thanks for having me.

Absolutely. Happy to have you back. David, I'm sure a lot of practitioners have been there, trying to test a new detection rule and realizing that the only data they have is either stale, super anonymized, or just wrong. Was there a specific moment that led you to build EvidenceForge?

Yeah, there actually was, and it's kind of funny. I have another project, also open source, that we released a little earlier this year, called the PEAK Threat Hunting Assistant. The PEAK Threat Hunting Assistant is an agentic experiment to help threat hunters research and plan for their threat hunts.

Yes. If this is sounding familiar to you listeners, we did an episode with David a few weeks back about the assistant. I will put a link to that and the blog in the show notes as well.
Part of the planning in the PEAK framework is that you have to understand what data you have available, and you check it to make sure that the types of things you're going to hunt for will actually show up in that data. I was using some big data sets that had been published. You mentioned the OpTC, the Operationally Transparent Cyber data. I forget what it stands for, but it's a fairly well-known data set. And it's large, something like four terabytes of data. I had ingested it into our Splunk instance when I was testing against enterprise-scale data: does this thing work against enterprise-scale data? And it did work, but for development purposes, I needed smaller data sets that I could iterate through more quickly. I realized that there were not very many good publicly released data sets that would fit exactly what I wanted.

So I decided to look into the state of synthetic data to see if there were good synthetic data generators around. There are a number of them that are pretty good that only produce one kind of log, and they produce mostly random formats or random fields. So it looks good if you look at a particular event, but it doesn't really tell a story. They're more for, hey, can I ingest all this data at scale, just testing to see if my SIEM pipelines will keep up or something like that. But they didn't really do what I needed for threat hunting, which was to have the same activity show up in different logs and to support pivoting between the logs. And specifically, most of them didn't have the ability to have both malicious activity going on and realistically generated user background noise. And so I realized that I had my next project in front of me.

Yeah.
You talk in your blog, which I'll put a link to in the episode notes, about how most synthetic log generators are a collection of independent emitters. This is what you were just referring to, where each one knows how to produce its own format. Yeah, it doesn't share state with the others, but EvidenceForge is different. So can you explain the model that it uses? How does that change the way that logs relate to each other, and how does it make a difference for the person on the other end?

I think the biggest thing is the design started with knowing that we needed to have both the simulated malicious activity and the simulated user background noise, and that we were going to be generating multiple types of evidence. So you could get Windows event logs for several different event IDs, but also Windows Sysmon, again for several event IDs. Not every single one, of course, but the most common ones that you would use for threat hunting. But also Zeek sensor logs, Snort IDS, Cisco ASA firewalls, Linux syslogs. It's a bunch of different kinds.

Given that I knew I needed to make all of those, I realized it was going to be difficult if I approached it as a log generator, like, okay, I'm going to do the syslogs now, and then I'm going to do the Windows event logs, all separately. So we built this world model where you have an event that happens, and that event is an abstract view of what happened. Then it attaches individual contexts to that event, depending on which log sources would naturally see it, and each of the contexts shares its core of information with that event.
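To make the world-model idea concrete, here is a minimal Python sketch of one abstract event rendered into two log views. It is an illustration only: the class, function, and field names are hypothetical and not taken from EvidenceForge's code, and the output shapes are only loosely modeled on a Zeek connection log and a generic EDR record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionEvent:
    """Abstract view of one network connection (hypothetical schema)."""
    ts: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


def render_zeek_like(ev: ConnectionEvent) -> str:
    # Tab-separated line loosely modeled on Zeek's conn.log;
    # a real conn.log has many more columns.
    return "\t".join([
        f"{ev.ts:.6f}", ev.src_ip, str(ev.src_port),
        ev.dst_ip, str(ev.dst_port), ev.proto,
    ])


def render_edr_like(ev: ConnectionEvent, host: str, process: str) -> dict:
    # Host-side view of the same connection, with extra host context
    # that a network sensor would not see.
    return {
        "timestamp": ev.ts,
        "host": host,
        "process": process,
        "local_ip": ev.src_ip,
        "local_port": ev.src_port,
        "remote_ip": ev.dst_ip,
        "remote_port": ev.dst_port,
        "protocol": ev.proto,
    }


# One record, two renderings: the shared fields cannot disagree.
event = ConnectionEvent(1700000000.0, "10.0.0.5", 49152, "203.0.113.10", 443, "tcp")
zeek_line = render_zeek_like(event)
edr_record = render_edr_like(event, host="ws01", process="chrome.exe")
```

Because both renderers read from the same immutable record, an analyst can pivot from the network line to the host record on IP, port, and timestamp and find matching values.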
So if you have, I don't know, a network connection, you will see it in Zeek connection logs, but you probably will also see it in our simulated EDR, and it will have all the same ports, IP addresses, and protocols on both sides, because they're all built from the same record. No matter which log source it appears in, it'll be format-correct for that log source, but the details of the fields will match up with the other appearances.

Got it. One of the most interesting parts of this blog is your focus on causality, the idea that a logon shouldn't just exist, but should have the Kerberos exchange that preceded it. Can you speak to that a little bit, maybe with regard to creating the kill chain?

It's half of the EvidenceForge promise that you're going to get a coherent story. The half that this fulfills is that if the same event shows up in multiple logs, like different viewpoints, they will agree with each other. The other part, though, is that that's one event, but now look at all the events. An attack or some kind of malicious activity isn't just one event, right? It's usually a life cycle, like a whole cyber kill chain kind of thing. So the other part that EvidenceForge provides is this AI-assisted creation of the scenario.

The scenario is the environment that you're simulating: what kind of organization it is, what the technology stack is, what computers, users, and files or data are involved, what the simulated internet environment is, where the sensors are. There's a whole bunch of things. You don't really have to remember all of these, because the AI assistant guides you through it. It'll give you an interview, like we're doing, actually, and it'll ask you questions like, hey, what kind of sensors do you want, or what visibility would you like them to have, or anything like that.
It can make things up if you want it to, so you don't have to specify every single thing. You could literally say, hey, I would like to model a phishing attack in a branch office of a financial institution, and it will do the right thing and ask you, here's what I have in mind. What would you like to correct? Or should we go ahead and write out this scenario file, which is just a config file in YAML format?

The nice thing about that is that the AI helps you create the config file, but when it comes to generating all the data, it's a deterministic Python script. So the same config file will give you the same output every time you run it. It's random, but it's seeded, and the seed is in the config file, so it's reproducible. I could give you the config file and you could reproduce the same data set.

That's so sick. So why keep the AI out of the generation phase? Are there any risks in letting an LLM write the logs directly?

The main two really are, first of all, there's no way the LLM is going to do the job correctly. It's just too large. We can generate literally several gigabytes of log data, and that would just overfill your context window. Whatever you said at the beginning of the scenario, it would have totally forgotten by the time you got 10 minutes into the scenario, let alone by the end of it. It wouldn't look coherent, wouldn't look like a good story. But even if you could do that, it's just an inefficient use. It's slower, and it costs a lot of money because of all the tokens. So it's just terrible.

So we targeted AI for the piece that AI is good at, and that is helping you create the scenario. It can either know or even do research on how this typically works: if you have this kind of scenario, this malicious attack that you want to model, what are the typical phases of that?
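The "random, but seeded" point can be sketched in a few lines of Python. This is an assumption-laden illustration, not EvidenceForge's generator: the `scenario` dictionary stands in for a parsed YAML scenario file, and its keys (`seed`, `event_count`, `mean_gap_seconds`, `users`) are hypothetical.

```python
import random


def generate_events(scenario: dict) -> list:
    """Produce a pseudo-random event list that depends only on the scenario."""
    # A private generator seeded from the config, so nothing outside
    # the scenario file influences the output.
    rng = random.Random(scenario["seed"])
    events = []
    t = 0.0
    for _ in range(scenario["event_count"]):
        # Exponentially distributed gap with the configured mean
        t += rng.expovariate(1.0 / scenario["mean_gap_seconds"])
        user = rng.choice(scenario["users"])
        events.append((round(t, 3), user, "logon"))
    return events


# Stand-in for a parsed scenario file (hypothetical keys)
scenario = {
    "seed": 1337,
    "event_count": 5,
    "mean_gap_seconds": 30.0,
    "users": ["alice", "bob"],
}

run_a = generate_events(scenario)
run_b = generate_events(scenario)  # identical to run_a
```

Sharing the scenario file is then enough to share the data set: anyone who runs the same generator version against it gets the same events.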
You don't have to know. If you do know, great, you can tell it, but it'll figure it out and ask you, hey, did I get this right? So the AI is really good at that and it keeps you on track. One of the key safeguards we have is built-in automatic validation to make sure the AI produced a good scenario file. If it's not a valid scenario file, the validation gives good information about what was wrong with it, so that the AI can correct itself. And you can generate the scenario for pennies, and it'll take minutes, rather than thousands of dollars of tokens, at which point just pay a red team.

Yeah. So this synthetic data is purposefully not perfect. With EvidenceForge, you built in a way to simulate gaps in network visibility, like missing SPAN or TAP data. Why is it important for this data to be imperfect?

I will back up for just a second and say there are two kinds of imperfection. One is that it's a simulation, so we can never simulate a real network perfectly, right? There will be mistakes. We are explicitly not trying to make perfect data. We're trying to make data that's good enough for the purpose, but an experienced analyst looking closely at it would probably still be able to tell that it was synthetic. But it's good enough for training, good enough for detection engineering, and maybe a little bit more than just good enough.

But the part I think you're talking about is that I have this built-in evaluation loop that helps me. On the development side, I also have the AI help me evaluate the quality of the generated data, because that's very hard to do when you're generating several gigabytes of data on every iteration to see if you fixed the problem. So I have the AI look at each iteration of the data through four lenses, so four different agents, really. One is network, one is host forensics, one is a detection engineer, and one is an experienced threat hunter.
One thing that was surprising to me: at the beginning of this, I set these up and they worked for a while. Every time it went through, it would point out the things it found were the strongest tells that it was synthetic data, and then we could fix them. But it eventually got to the point where one of the strongest tells, coming up over and over again from most of the reviewers, was: you have every event, and that's unrealistic. You're not missing any. And I was like, well, yeah, we're simulating something. I would like to get all the data. But they kept coming back and saying this is clearly synthetic, because in the real world you have things that drop, pipelines that don't work. They work 99.9% of the time, maybe, but not a hundred percent.

So I put a model in there where you can optionally tell it, I want to be smart about how I drop. It doesn't necessarily just drop random events, although it could; you could drop a few minutes of data out of one of the logs or something like that. It's configurable via profiles. By default it ships not doing anything like that, but you can turn on something like the typical enterprise profile, and it would fix that too-perfect data a little bit.

I was really shocked. At first it was a compliment. It's like, oh, the AI says my program's data is too good. Yeah, I don't know how to take that.

Yeah, I mean, I guess take it as a compliment, but that's really funny. Okay, I'm going to get even more into the nitty-gritty, because I'm a nerd and I found this really interesting. In the blog, you talk about Hawkes processes for background noise. For people who aren't data scientists, how does that make background noise feel more human compared to just random traffic?

Yeah, I'm not a data scientist either. I just pretend to be one as a threat hunter.
I always tell people there's a big overlap between data science and threat hunting. Threat hunting is almost just data science applied to cybersecurity. So yeah, there's a lot of overlap. It's not that important that you know exactly what the Hawkes algorithm is or how it works. The basic idea is that we have certain things that we're simulating. On the computers we're simulating, there are usually things that are driven only by the computer, like automation, scheduled tasks, or whatever. They recur on a schedule: five minutes after the hour, every hour, or at 8 a.m. on Monday mornings, or whatever it is. They're kind of metronomic unless you lose events.

But the users are not like that. They come in in the morning, and let's say you're typing commands at the Linux command line, or clicking on icons in the Windows environment. You're doing things: launching applications, sending emails, browsing the web. You don't do that like a metronome. It's in bursts, basically. I'm going to type this command, then this command, then this command, but in between I have to review the output and make a judgment about what I need to do next. That takes a little bit of time, and it's not always the same amount of time. It's also bursty because I might be doing things and then get distracted, or go get some coffee for a little while.

So the Hawkes algorithm is just trying to model that. The short form is that every action a user takes makes it slightly more likely that they will take another action shortly after. So it's making a burst. That's it. It doesn't snowball to where eventually the user has to take all the actions without a break. It's just a way of trying to model that burstiness. It's not perfect.
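For readers who want to see the burstiness idea in code, below is a minimal Python sketch of a self-exciting (Hawkes) process with an exponential kernel, simulated with Ogata's thinning method. The parameter values and function names are illustrative assumptions, not EvidenceForge's settings or API.

```python
import math
import random


def hawkes_intensity(t, events, mu, alpha, beta):
    """Baseline rate plus a decaying boost from every earlier action."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events)


def simulate_hawkes(mu, alpha, beta, horizon, seed):
    """Simulate action times on [0, horizon) using Ogata's thinning."""
    rng = random.Random(seed)
    events = []
    t = 0.0
    while True:
        # Between actions the intensity only decays, so its current
        # value is a valid upper bound until the next accepted action.
        bound = hawkes_intensity(t, events, mu, alpha, beta)
        t += rng.expovariate(bound)  # candidate time from the bounding rate
        if t >= horizon:
            return events
        # Accept the candidate with probability intensity(t) / bound.
        if rng.random() * bound <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)


# Illustrative values, in seconds: roughly one spontaneous action per
# 100 s, and each action briefly raises the rate. alpha / beta = 0.5 is
# below 1, so bursts die out instead of snowballing.
action_times = simulate_hawkes(mu=0.01, alpha=0.05, beta=0.1, horizon=3600.0, seed=42)
```

The ratio `alpha / beta` is the expected number of follow-up actions each action triggers directly. Keeping it below 1 is what matches the point that the user never ends up taking every action without a break.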
Again, you can't model the whole thing perfectly, but it's good enough that it doesn't obviously look synthetic just because of the timing.

Yeah, and it kind of reminds me of a blog article that we put out several months back about good-enough fuzzing. You need to know what your intention is when you're creating these things or when you're using them. It doesn't have to be perfect for your use case in order for it to be a successful element of what you're doing. So to close out, if someone is listening to this and wants to start using EvidenceForge today or tomorrow, what do you think would be a good place for their first experiment? Where should they start?

It's all available on GitHub, so they could just check out the GitHub repo. In fact, there is a quick start in the README file there, and we have published a scenario that is in the repo. You could just ask your favorite AI, Claude or Codex, it works with either of them, to say, hey, please generate this example scenario, and just see what the process is like. Or you can literally just say, I want to generate an EvidenceForge scenario, and the AI will lead you through it. So it's really easy to get started making the scenarios. There's hopefully very little learning curve if I did my job right. You might have to install the skills and whatever, but I tried to make it really easy. I really encourage everyone to go and check it out.

David, thanks so much for joining.

Hey, thanks for having me. It's always fun to talk about my projects. I'm a typical researcher.

Same, same, same. Yeah, me with my knitting all the time. Just like, let me show it off to you.

Thanks for tuning in to Talos Takes. We'll see you in two weeks with a new episode. And until then, stay safe out there.