I met up with my friend Matt Brender (@mjbrender) yesterday, the day before the Intel Developer Forum 16 (IDF16) in San Francisco, to talk about Snap – an open source telemetry framework that encompasses a “deep stack” that starts at the hardware level and reaches all the way up through hypervisors, operating systems, containers and applications. If you wonder how you might ever be able to instrument everything in a software defined datacenter, this is likely the best path forward.
MF: Hi, this is a technology ridecast, and our special guest today is Matt Brender.
MB: Hey Marc, fancy meeting you in San Francisco!
MF: I know, right! How you doing?
MB: I’m good. I’m glad to get such a good Uber ride.
MF: I hate to tell you, I’m kind of lost.
MF: Yeah, so you’re working on Snap.
MB: Yeah, so I’ve joined the Intel Software Defined Infrastructure team, and what we open sourced back in December is an open telemetry platform called Snap. It allows you to collect, process and publish telemetry, so all the interesting little metrics of the datacenter sit behind a single API, and that enables people to do some magic things downstream, either on the analytics side or the orchestration side.
MF: Yeah, so what does the basic architecture look like? There are things like collectors, receivers, what are they called…
MB: Yeah, so, a basic plug-in architecture for some sort of agent. You’re running an agent on each of the servers that you want to collect information from, or you can run it remotely if there’s an API you’re going to query. But then for each layer of the datacenter that you want to collect from, there are particular libraries you need to talk to that. I mean, we can’t abstract that away. So if you want to get your SMART data off a disk, you need something that talks SMART, and you usually have that library built into whatever Linux distro you’re using. You might want to pull some things from psutil alongside that and then grab some stuff from the Docker daemon on top of it. So there are multiple plugins that load into that one agent. That agent will then schedule tasks, and tasks will collect the information you’ve requested, then pass it to the processors that you’re using, and ultimately pass it to someplace to persist it through a publisher.
MF: What’s the frequency that you can get updates?
MB: Yeah, so it’s customized per task, which is a pretty cool functionality of Snap. You can go as low or as high as you want. We’ve tested it down to 20 milliseconds.
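[Editor’s sketch: the flow Matt describes, where a task runs on its own interval, gathers metrics from several collector plugins, hands them to processors and then to a publisher, can be illustrated roughly as below. This is a toy model in Python, not Snap’s actual API. The `Task` class, the two fake collectors, the rounding processor and the metric names are all hypothetical stand-ins chosen for illustration.]

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Task:
    """Hypothetical task: collect -> process -> publish on a fixed interval."""
    interval_s: float
    collectors: List[Callable[[], Metrics]]
    processors: List[Callable[[Metrics], Metrics]]
    publishers: List[Callable[[Metrics], None]]

    def run_once(self) -> Metrics:
        metrics: Metrics = {}
        for collect in self.collectors:      # each plugin contributes its own metrics
            metrics.update(collect())
        for process in self.processors:      # processors transform the collected batch
            metrics = process(metrics)
        for publish in self.publishers:      # publishers persist the final result
            publish(metrics)
        return metrics

    def run(self, iterations: int) -> None:
        for _ in range(iterations):
            started = time.monotonic()
            self.run_once()
            # Sleep for whatever remains of this task's interval.
            elapsed = time.monotonic() - started
            time.sleep(max(0.0, self.interval_s - elapsed))


def fake_disk_collector() -> Metrics:
    # Stand-in for a plugin that would read SMART data from a disk.
    return {"disk/temperature_c": 35.0}


def fake_cpu_collector() -> Metrics:
    # Stand-in for a plugin that would read CPU statistics.
    return {"cpu/load_percent": 12.34}


def round_values(metrics: Metrics) -> Metrics:
    # Example processor: round every value to one decimal place.
    return {name: round(value, 1) for name, value in metrics.items()}


published: List[Metrics] = []  # an in-memory list plays the role of the publisher target

task = Task(
    interval_s=0.02,  # 20 milliseconds, the lowest interval mentioned in the interview
    collectors=[fake_disk_collector, fake_cpu_collector],
    processors=[round_values],
    publishers=[published.append],
)
task.run(iterations=3)
```

[Because the interval lives on the task rather than on the agent, two tasks in a design like this could sample the same machine at very different rates.]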
MF: So this is obviously, you know, great technology for Internet of Things kinds of stuff, but how else are people using it?
MB: Yeah, that’s a really good point. The Internet of Things space is fascinating, but we built this initially just for cloud infrastructure, actually: what cloud operators could do to increase their utilization per node. If you can start seeing the information coming off of the hardware, the operating systems and then the applications on top of that, and you collect all of it through a standard API, we can start automating much better dashboarding. The goal is certainly to feed that into a system that makes things smarter. So smarter placement of workloads in a cloud environment, smarter analytics, and predictive analytics on whether you need to buy new hardware or you’re not utilizing some portion of your servers well enough.
MF: So this sounds really interesting. It’s full-stack, operating systems and applications, but it also includes the hardware. I mean, it’s a fuller stack, a deeper stack, if you will.
MB: Yeah, that’s what’s super interesting about it. I mean, we are Intel, so we wanted to make sure that you can see your hardware statistics alongside all of your application and operating system level statistics. I’m starting to learn the ins and outs of CPU architectures, obviously for the first time, out of necessity, because I work at Intel and I’d better know what LLC occupancy means, or, like, the least used cache, or the last cache on a CPU. And it’s super fascinating, because these have real ramifications for how your applications run. Some of those little magic blips that we’re used to just accepting, as in “Oh well, that was just a monster or gremlin that ate my workload”, there’s actual hardware that correlates to that. So when you can query in a meaningful way, we can probably make more intelligent systems.
MF: Matt, this is terrific. Thanks for coming on.