1 00:00:00,000 --> 00:00:05,120 If you really look at the modern digital world, it's just this massive web of event 2 00:00:05,120 --> 00:00:05,680 chains, right? 3 00:00:05,680 --> 00:00:06,320 Totally. 4 00:00:06,320 --> 00:00:11,040 A customer clicks something over here, and that triggers a process way over there, 5 00:00:11,040 --> 00:00:14,400 which then, you know, alerts some monitoring system. 6 00:00:14,400 --> 00:00:20,320 Yeah, and for anyone in DevOps or for the SREs listening, managing all that, it can 7 00:00:20,320 --> 00:00:21,760 feel impossible. 8 00:00:21,760 --> 00:00:24,800 Exactly. You're just trying to stitch it all together and make sure the right thing 9 00:00:24,800 --> 00:00:28,150 happens every single time. And you're definitely looking for a shortcut through all 10 00:00:28,150 --> 00:00:29,040 that complexity. 11 00:00:29,040 --> 00:00:32,080 And that is, well, that's exactly what we're diving into today. 12 00:00:32,080 --> 00:00:36,000 Our mission here is to really unpack StackStorm. 13 00:00:36,000 --> 00:00:38,480 The famous IFTTT for Ops. 14 00:00:38,480 --> 00:00:43,520 That's the one. We're going to get past the jargon and explain this really powerful, 15 00:00:43,520 --> 00:00:48,280 event-driven automation platform so that, you know, beginners can get it. Not just 16 00:00:48,280 --> 00:00:48,880 what it is, 17 00:00:48,880 --> 00:00:52,240 but how it actually fundamentally changes IT operations. 18 00:00:52,240 --> 00:00:56,160 We're basing this on its GitHub docs and the core features overview. 19 00:00:56,160 --> 00:00:59,230 And before we jump in, we have to thank the people who make these deep dives 20 00:00:59,230 --> 00:00:59,760 possible. 21 00:00:59,760 --> 00:01:00,480 Of course. 22 00:01:00,480 --> 00:01:05,200 This deep dive is supported by SafeServer. SafeServer assists with digital 23 00:01:05,200 --> 00:01:06,320 transformation, 24 00:01:06,320 --> 00:01:11,280 helping you move to the future of IT infrastructure. 
They also provide, well, 25 00:01:11,280 --> 00:01:14,800 amazing hosting for software like the platform we're discussing today. 26 00:01:14,800 --> 00:01:21,680 So if you're serious about modernizing your stack, find out more at www.safeserver.de. 27 00:01:21,680 --> 00:01:27,200 Again, that's www.safeserver.de. Okay, let's start with that core idea. 28 00:01:27,200 --> 00:01:33,760 IFTTT for ops. We all know if this, then that for, like, our personal apps. 29 00:01:33,760 --> 00:01:36,640 Right. Linking your smart lights to your email or whatever. 30 00:01:36,640 --> 00:01:40,720 Yeah. But applying that same simple logic to massive enterprise-level 31 00:01:40,720 --> 00:01:41,600 infrastructure, 32 00:01:41,600 --> 00:01:43,440 that seems like a total game changer. 33 00:01:43,440 --> 00:01:48,080 It is. And it's mostly because of the scale and the criticality of the work. 34 00:01:48,080 --> 00:01:51,470 StackStorm is fundamentally a platform that's built to integrate all your different 35 00:01:51,470 --> 00:01:52,400 services and tools. 36 00:01:52,400 --> 00:01:53,040 All of them. 37 00:01:53,040 --> 00:01:56,080 All of them. Your monitoring systems, your ticketing platforms, cloud providers, 38 00:01:56,080 --> 00:02:00,000 deployment tools, you name it. And its whole focus is on event-driven automation 39 00:02:00,000 --> 00:02:00,560 for the really 40 00:02:00,560 --> 00:02:05,830 tough tasks. Think auto-remediation, really sophisticated incident response, or 41 00:02:05,830 --> 00:02:07,280 these complex 42 00:02:07,280 --> 00:02:11,120 multi-stage deployments. It basically brings structure to the chaos. 43 00:02:11,120 --> 00:02:15,040 And the scale. I mean, the sources mention some huge numbers, something like 160 44 00:02:15,040 --> 00:02:16,240 integration packs. 45 00:02:16,240 --> 00:02:20,400 With over 6,000 actions available on what they call the StackStorm Exchange. 
46 00:02:20,400 --> 00:02:23,280 So no, this is not just about running a few simple Bash scripts. 47 00:02:23,280 --> 00:02:27,360 And it has a full rules engine, workflow capabilities, and it even supports 48 00:02:27,360 --> 00:02:27,680 ChatOps, 49 00:02:27,680 --> 00:02:29,840 right? So you can manage everything from Slack. 50 00:02:29,840 --> 00:02:33,520 Exactly. But what's really crucial here, and this is a thing I think you, 51 00:02:33,520 --> 00:02:36,960 the learner, should really internalize, is the philosophical shift. 52 00:02:36,960 --> 00:02:37,520 Okay. 53 00:02:37,520 --> 00:02:41,430 All the content, the rules, and the workflows that dictate all the automation 54 00:02:41,430 --> 00:02:41,840 logic, 55 00:02:41,840 --> 00:02:44,080 it's all stored as code. 56 00:02:44,080 --> 00:02:48,800 Wait, let me stop you there. If the operational logic, the if this, then that, is 57 00:02:48,800 --> 00:02:50,000 stored as code, 58 00:02:50,000 --> 00:02:53,840 how does that actually help an SRE? I mean, isn't that just adding more complexity? 59 00:02:53,840 --> 00:02:57,600 It's actually the opposite. By treating your automation logic just like you treat 60 00:02:57,600 --> 00:02:57,760 your 61 00:02:57,760 --> 00:03:01,830 application code, you immediately get all the benefits of the modern DevOps 62 00:03:01,830 --> 00:03:02,640 lifecycle. 63 00:03:02,640 --> 00:03:06,160 Ah, so you're talking about version control with Git, code reviews? 64 00:03:06,160 --> 00:03:11,410 Code reviews by peers, testing environments, auditability, clear governance, all of 65 00:03:11,410 --> 00:03:11,840 it. 66 00:03:12,560 --> 00:03:16,320 You're not relying on that one engineer who knows the magic fix at 3 a.m. anymore. 67 00:03:16,320 --> 00:03:20,240 You're relying on a process that's been written down, reviewed, and versioned? 68 00:03:20,240 --> 00:03:23,520 Precisely. 
It just elevates your operations to the same 69 00:03:23,520 --> 00:03:26,160 standard of reliability as your actual code base. 70 00:03:26,160 --> 00:03:30,320 That makes so much sense. Moving from like hero knowledge to a codified process, 71 00:03:30,320 --> 00:03:34,480 that's a huge unlock. Okay, so let's move from the philosophy to the real world. 72 00:03:34,480 --> 00:03:34,640 Give 73 00:03:34,640 --> 00:03:38,510 us some concrete examples of how this works. Absolutely. Let's start with something 74 00:03:38,510 --> 00:03:38,640 that 75 00:03:38,640 --> 00:03:45,520 happens way too often. The dreaded alert. The 2 a.m. page. The 2 a.m. page. So our 76 00:03:45,520 --> 00:03:45,760 first 77 00:03:45,760 --> 00:03:49,840 pattern is facilitated troubleshooting. So picture this. A monitoring tool, let's 78 00:03:49,840 --> 00:03:50,160 say 79 00:03:50,160 --> 00:03:55,700 it's Sensu or New Relic, it captures a system failure. Normally, a human gets that 80 00:03:55,700 --> 00:03:56,000 alert, 81 00:03:56,000 --> 00:03:59,120 logs into four different systems, runs diagnostics. It's the swivel chair 82 00:03:59,120 --> 00:03:59,920 integration. 83 00:03:59,920 --> 00:04:05,470 The swivel chair, exactly. With StackStorm, that alert is the trigger. So instead 84 00:04:05,470 --> 00:04:05,920 of a 85 00:04:05,920 --> 00:04:11,060 human logging in, StackStorm acts instantly. It becomes a kind of digital first 86 00:04:11,060 --> 00:04:11,680 responder. 87 00:04:11,680 --> 00:04:13,200 And what's its first move? 88 00:04:13,200 --> 00:04:16,640 It just immediately runs a whole series of diagnostic checks. It pings the physical 89 00:04:16,640 --> 00:04:17,040 node, 90 00:04:17,040 --> 00:04:22,240 it checks the state of AWS or OpenStack instances, verifies app components, pulls 91 00:04:22,240 --> 00:04:23,280 log snippets. 92 00:04:23,280 --> 00:04:23,920 And then what? 
93 00:04:23,920 --> 00:04:28,560 It puts all of those results, all correlated and with context, directly into a 94 00:04:28,560 --> 00:04:29,360 shared space 95 00:04:29,360 --> 00:04:35,440 like Slack or a Jira ticket. So the human engineer steps in already fully informed. 96 00:04:35,440 --> 00:04:37,200 It saves critical minutes. 97 00:04:37,200 --> 00:04:41,840 I can see that. That turns the 2 a.m. panic from an investigation into just a quick 98 00:04:41,840 --> 00:04:42,560 verification. 99 00:04:42,560 --> 00:04:45,200 But what about actually solving the problem? 100 00:04:45,200 --> 00:04:48,720 That brings us to our second pattern, automated remediation. 101 00:04:48,720 --> 00:04:52,240 And this is where StackStorm's multi-step workflows really shine. 102 00:04:52,240 --> 00:04:53,680 Okay. 103 00:04:53,680 --> 00:04:55,760 Let's take an OpenStack compute node failure. 104 00:04:55,760 --> 00:04:57,680 The goal here is graceful failure handling. 105 00:04:57,680 --> 00:05:00,880 The hardware failure kicks off a really complex workflow. 106 00:05:00,880 --> 00:05:02,320 It doesn't just reboot the machine. 107 00:05:02,320 --> 00:05:03,040 What does it do? 108 00:05:03,040 --> 00:05:05,520 First, the workflow verifies the failure. 109 00:05:05,520 --> 00:05:09,360 Then, if it's confirmed, it properly evacuates 110 00:05:09,360 --> 00:05:11,760 all the running virtual machines onto healthy nodes. 111 00:05:11,760 --> 00:05:12,720 And it tells people. 112 00:05:12,720 --> 00:05:16,960 Yep. It automatically emails the VM owners about potential downtime. 113 00:05:16,960 --> 00:05:18,480 But here's the really smart part. 114 00:05:18,480 --> 00:05:20,320 The failsafe. 115 00:05:20,320 --> 00:05:23,440 Right. Because it can't just keep going blindly if something goes wrong. 116 00:05:23,440 --> 00:05:24,400 Exactly. 117 00:05:24,400 --> 00:05:26,800 These workflows have explicit conditions. 
118 00:05:26,800 --> 00:05:29,280 So if the evacuation step times out, 119 00:05:29,280 --> 00:05:31,760 or if it detects it was only partially successful, 120 00:05:31,760 --> 00:05:33,680 the workflow just freezes. 121 00:05:33,680 --> 00:05:37,360 It saves its state and calls PagerDuty to alert a human engineer 122 00:05:37,360 --> 00:05:39,040 with all the data it's collected so far. 123 00:05:39,040 --> 00:05:40,480 So it knows its own limits. 124 00:05:40,480 --> 00:05:42,720 It knows exactly when a human needs to step in. 125 00:05:42,720 --> 00:05:45,520 That balance is fundamental for building trust in the automation. 126 00:05:45,520 --> 00:05:46,800 That is robust. 127 00:05:46,800 --> 00:05:48,000 Okay. Last one. 128 00:05:48,000 --> 00:05:52,320 How does this apply to something as high-stakes as CI/CD, continuous deployment? 129 00:05:52,320 --> 00:05:56,220 Well, CI/CD with StackStorm goes way beyond what a tool like Jenkins can do on its 130 00:05:56,220 --> 00:05:56,560 own. 131 00:05:56,560 --> 00:06:00,000 Jenkins can handle the build and test phase. 132 00:06:00,000 --> 00:06:02,480 StackStorm takes over for the whole orchestration. 133 00:06:02,480 --> 00:06:06,480 The automation can provision a new AWS cluster, deploy the code, 134 00:06:06,480 --> 00:06:11,920 and then it starts to carefully shift traffic over using the load balancer. 135 00:06:11,920 --> 00:06:13,920 And it's watching what happens. 136 00:06:13,920 --> 00:06:14,640 Constantly. 137 00:06:14,640 --> 00:06:17,680 It's pulling real-time performance data from a tool like New Relic. 138 00:06:17,680 --> 00:06:21,120 Based on the metrics you define, latency, error rates, 139 00:06:21,120 --> 00:06:24,800 it intelligently decides whether to fully roll forward the new deployment 140 00:06:24,800 --> 00:06:27,120 or to instantly trigger a rollback. 141 00:06:27,120 --> 00:06:28,960 It manages the full lifecycle. 
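[Editor's sketch] The roll-forward-or-rollback decision described above can be illustrated with a few lines of plain Python. This is a minimal sketch, not StackStorm's API: the names `Thresholds` and `decide_rollout`, the metric names, and the threshold values are all hypothetical, and in practice the metrics would come from a monitoring tool such as New Relic rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical limits an operator might define for a new deployment."""
    max_latency_ms: float
    max_error_rate: float


def decide_rollout(latency_ms: float, error_rate: float, limits: Thresholds) -> str:
    """Return "rollback" if any observed metric exceeds its limit, else "roll_forward"."""
    if latency_ms > limits.max_latency_ms or error_rate > limits.max_error_rate:
        return "rollback"
    return "roll_forward"


# Illustrative values only.
limits = Thresholds(max_latency_ms=200.0, max_error_rate=0.01)
healthy = decide_rollout(latency_ms=120.0, error_rate=0.001, limits=limits)
degraded = decide_rollout(latency_ms=450.0, error_rate=0.001, limits=limits)
```

In a real workflow this check would presumably run repeatedly while traffic shifts, with each result feeding the next transition; the sketch shows only the single comparison at the heart of that loop.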
142 00:06:28,960 --> 00:06:32,400 Wow, the value there is just so clear when you put it like that. 143 00:06:32,400 --> 00:06:34,560 You're getting consistency, incredible speed, 144 00:06:34,560 --> 00:06:38,800 and you're freeing up your best people from doing these repetitive stressful tasks. 145 00:06:38,800 --> 00:06:42,400 You shift from doing the work to writing smarter operational code. 146 00:06:42,400 --> 00:06:45,440 Okay, so now that we know what it does, let's get under the hood. 147 00:06:45,440 --> 00:06:47,920 You said it was a modular architecture. 148 00:06:47,920 --> 00:06:51,520 Can you break down those core components for us, the building blocks? 149 00:06:51,520 --> 00:06:52,000 For sure. 150 00:06:52,000 --> 00:06:56,320 It's all built on loosely coupled microservices that talk over a message bus, 151 00:06:56,320 --> 00:06:58,480 so things can fail or scale independently. 152 00:06:58,480 --> 00:07:03,520 So to follow the flow, think of it in terms of sensory input, brains, and then 153 00:07:03,520 --> 00:07:04,000 actions. 154 00:07:04,000 --> 00:07:06,240 First up, you have sensors and triggers. 155 00:07:06,240 --> 00:07:07,520 The eyes and ears. 156 00:07:07,520 --> 00:07:09,680 Exactly, the eyes and ears of the platform. 157 00:07:09,680 --> 00:07:13,440 Sensors are just Python plugins that are always watching external systems. 158 00:07:13,440 --> 00:07:17,440 When an event happens, a host goes down, a GitHub repo changes, 159 00:07:17,440 --> 00:07:19,520 the sensor fires off a StackStorm trigger, 160 00:07:19,520 --> 00:07:24,320 and the trigger is just the platform's internal version of that event. 161 00:07:24,320 --> 00:07:28,720 So the trigger is the if this, then you need hands for the then that. 162 00:07:28,720 --> 00:07:29,840 The hands are the actions. 163 00:07:29,840 --> 00:07:33,520 These are the outbound integrations, what StackStorm actually does. 
164 00:07:33,520 --> 00:07:36,960 They can be super simple, like an SSH command, or really complex, 165 00:07:36,960 --> 00:07:39,040 like an integrated call to Docker or Puppet. 166 00:07:39,040 --> 00:07:40,720 And anything can be an action. 167 00:07:40,720 --> 00:07:41,840 Basically, yeah. 168 00:07:41,840 --> 00:07:45,520 Any script or command line tool can become a first-class action 169 00:07:45,520 --> 00:07:47,040 just by adding a little bit of metadata. 170 00:07:47,040 --> 00:07:49,760 And what decides which action runs for which trigger? 171 00:07:49,760 --> 00:07:50,880 That's the rules engine. 172 00:07:50,880 --> 00:07:51,600 That's the brain. 173 00:07:52,400 --> 00:07:54,560 The rules are that coded link. 174 00:07:54,560 --> 00:07:57,360 They map a specific trigger to a specific action. 175 00:07:57,360 --> 00:08:00,000 The rule can apply matching criteria, 176 00:08:00,000 --> 00:08:03,040 like only run if CPU usage is over 90 percent. 177 00:08:03,040 --> 00:08:06,720 It also maps the data from the trigger to the inputs the action needs. 178 00:08:06,720 --> 00:08:09,440 Okay, but what if I need to run five different actions in a row? 179 00:08:09,440 --> 00:08:10,880 Do I need five different rules? 180 00:08:10,880 --> 00:08:14,640 No. That is where workflows come in. 181 00:08:14,640 --> 00:08:15,680 They're the assembly line. 182 00:08:15,680 --> 00:08:16,640 The uber actions. 183 00:08:16,640 --> 00:08:18,000 The uber actions, exactly. 184 00:08:18,000 --> 00:08:20,800 Workflows are how you stitch multiple actions together. 185 00:08:20,800 --> 00:08:24,240 They define the order, the complex transition conditions we talked about, 186 00:08:24,240 --> 00:08:28,240 and they make sure the output from step one becomes the input for step two. 187 00:08:28,240 --> 00:08:28,800 Got it. 188 00:08:28,800 --> 00:08:30,320 And finally you have packs. 189 00:08:30,320 --> 00:08:32,640 Correct. Packs are just the shareable kits. 
190 00:08:32,640 --> 00:08:34,000 They group everything together. 191 00:08:34,000 --> 00:08:38,000 Sensors, actions, rules, workflows into one simple unit. 192 00:08:38,000 --> 00:08:40,000 And that's why the StackStorm Exchange exists, 193 00:08:40,000 --> 00:08:42,080 so people can share these operational patterns. 194 00:08:42,080 --> 00:08:45,520 And you can access it all through an API, a command line, or a UI. 195 00:08:45,520 --> 00:08:48,800 Full REST API, a really powerful CLI, and a web UI. 196 00:08:48,800 --> 00:08:53,120 Yep. That sounds technically impressive, but who's actually using this? 197 00:08:53,120 --> 00:08:55,200 I mean, real-world adoption is the true test, right? 198 00:08:55,200 --> 00:08:58,160 Oh, absolutely. And this isn't just theory. It's very established. 199 00:08:58,160 --> 00:09:02,320 It's Apache 2.0 licensed, actively developed, 200 00:09:02,320 --> 00:09:06,320 and used by companies with just massive infrastructure needs. 201 00:09:06,320 --> 00:09:09,200 Like who? Well, Netflix, for one. 202 00:09:09,200 --> 00:09:13,280 They use StackStorm to build their own internal platform they call Winston. 203 00:09:13,280 --> 00:09:14,640 Winston. Yeah. 204 00:09:14,640 --> 00:09:19,840 And it's dedicated specifically to event-driven diagnostics and auto-remediation. 205 00:09:19,840 --> 00:09:24,880 They just needed that reliable, super-fast response time in their cloud environment. 206 00:09:24,880 --> 00:09:28,640 StackStorm gave them the framework. Wow, okay. That's a huge name. 207 00:09:28,640 --> 00:09:31,840 What about in the security space? Target is a big user there. 208 00:09:31,840 --> 00:09:34,560 They realized that the real power was in its flexibility. 
209 00:09:34,560 --> 00:09:37,120 So they used the existing integrations to get up and running fast, 210 00:09:37,120 --> 00:09:41,360 which freed up their teams to focus entirely on building custom security features 211 00:09:41,360 --> 00:09:43,920 and automating their unique compliance checks. 212 00:09:43,920 --> 00:09:46,960 So they used it to accelerate their own security development. 213 00:09:46,960 --> 00:09:50,560 Exactly. And then you have someone like Pearson, who uses it for internal 214 00:09:50,560 --> 00:09:51,120 efficiency. 215 00:09:51,120 --> 00:09:55,680 They basically took all these small, specific operational tasks 216 00:09:55,680 --> 00:09:57,840 and turned them into individual actions. 217 00:09:57,840 --> 00:10:00,080 Then they could easily orchestrate them into bigger, 218 00:10:00,080 --> 00:10:04,080 reliable macro tasks that they share across the whole organization. 219 00:10:04,080 --> 00:10:07,200 It really does keep coming back to consistency and speed. 220 00:10:07,200 --> 00:10:10,480 And the fact that it's mostly Python, I think the source said 94%, 221 00:10:10,480 --> 00:10:13,600 that makes it so much more accessible for engineers today. 222 00:10:13,600 --> 00:10:15,280 Right. They can write automation logic 223 00:10:15,280 --> 00:10:18,000 instead of clicking through some opaque vendor tool. 224 00:10:18,000 --> 00:10:21,840 It's all about making ChatOps better, automating security response, 225 00:10:21,840 --> 00:10:26,240 and letting teams focus on innovation, not just running the same old playbooks. 226 00:10:26,240 --> 00:10:29,760 So to kind of sum it all up for you, the learner, 227 00:10:29,760 --> 00:10:33,600 StackStorm lets you transform those ad hoc operational patterns 228 00:10:33,600 --> 00:10:38,960 and that tribal knowledge into defined, code-based, event-driven processes. 
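[Editor's sketch] To make "defined, code-based, event-driven processes" concrete, here is a rough plain-Python model of the flow the hosts describe: a trigger arrives, a rule checks its criteria, and the rule maps trigger data onto an action's inputs. This is an illustration under stated assumptions, not StackStorm's actual rule format or API (StackStorm defines rules as YAML files inside a pack). Every name here (`Rule`, `dispatch`, `restart_service`, the `example.cpu_high` trigger type, the payload fields) is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]


@dataclass
class Rule:
    """Hypothetical rule: links one trigger type to one action."""
    trigger_type: str
    criteria: Callable[[Payload], bool]               # should the rule fire?
    action: Callable[..., str]                        # what to run
    map_inputs: Callable[[Payload], Dict[str, Any]]   # trigger data -> action inputs


def restart_service(host: str) -> str:
    """Stand-in action; a real one might run a command over SSH."""
    return f"restart requested on {host}"


def dispatch(trigger_type: str, payload: Payload, rules: List[Rule]) -> List[str]:
    """Run every rule whose trigger type matches and whose criteria pass."""
    results = []
    for rule in rules:
        if rule.trigger_type == trigger_type and rule.criteria(payload):
            results.append(rule.action(**rule.map_inputs(payload)))
    return results


RULES = [
    Rule(
        trigger_type="example.cpu_high",
        criteria=lambda p: p["cpu_percent"] > 90,   # the "over 90 percent" example
        action=restart_service,
        map_inputs=lambda p: {"host": p["host"]},
    )
]

fired = dispatch("example.cpu_high", {"host": "web01", "cpu_percent": 95}, RULES)
skipped = dispatch("example.cpu_high", {"host": "web01", "cpu_percent": 50}, RULES)
```

Because the rule is ordinary data plus functions, it can be reviewed and versioned like any other code, which is the point the hosts keep returning to.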
229 00:10:38,960 --> 00:10:41,680 And because everything is code and it all goes over that message bus, 230 00:10:41,680 --> 00:10:45,680 the entire system is fully audited. Every single action is recorded. 231 00:10:45,680 --> 00:10:48,720 Every action, manual or automated, is recorded and stored. 232 00:10:48,720 --> 00:10:51,840 You can send it all to Splunk or Logstash or whatever you use. 233 00:10:51,840 --> 00:10:53,920 That audit trail is huge. 234 00:10:53,920 --> 00:10:57,240 And it actually brings us to our final thought for you, the listener, to kind of 235 00:10:57,240 --> 00:10:57,600 chew on. 236 00:10:57,600 --> 00:11:01,680 Given that all this automation logic, the workflows, the rules, 237 00:11:01,680 --> 00:11:04,880 is stored and managed as code with full version control, 238 00:11:04,880 --> 00:11:09,600 how does that change the very definition of a bug in the future? 239 00:11:09,600 --> 00:11:10,560 That's a great question. 240 00:11:10,560 --> 00:11:16,000 Right. Is a failure in the system a traditional coding error, like a syntax mistake? 241 00:11:16,000 --> 00:11:20,310 Or is it actually a flaw in the operational logic that you wrote into the 242 00:11:20,310 --> 00:11:21,040 automation? 243 00:11:21,040 --> 00:11:24,540 It feels like it requires a totally different kind of peer review, something to 244 00:11:24,540 --> 00:11:25,040 think about. 245 00:11:25,040 --> 00:11:26,080 Absolutely. 246 00:11:26,080 --> 00:11:28,720 And thank you once again to our sponsor, SafeServer, 247 00:11:28,720 --> 00:11:32,720 for supporting this deep dive and for all their work in digital transformation. 248 00:11:32,720 --> 00:11:36,160 They provide incredible hosting for platforms just like StackStorm. 249 00:11:36,160 --> 00:11:42,240 You can find out more about their services and how they can help your team at www.safeserver.de. 250 00:11:42,240 --> 00:11:45,200 That's www.safeserver.de.