Video: Effectively measuring Copilot impact and ROI | Duration: 3652s | Summary: Effectively measuring Copilot impact and ROI | Chapters: Welcome and Introduction (3.68s), Panel Introduction (122.31s), Copilot Metrics Overview (183.445s), Measuring AI ROI (496.13s), Measuring AI Adoption (577.305s), Setting Realistic Benchmarks (695.27s), Measuring AI Output (1053.01s), Cohort Analysis (1202.345s), Context Engineering (1671.805s), Agentic Workflows (2059.645s), Qualitative Feedback Integration (2305.255s), Measuring AI Productivity (2687.015s), Closing Remarks (3127.715s)
Transcript for "Effectively measuring Copilot impact and ROI": Hey. Thank you for coming on. My name is Jose Palafox, and I'm gonna be your moderator today. We're gonna take just a minute or two to let folks, get on the session here. So So if you're here already, thank you for coming. We will be out in just two minutes. While we're getting going, why don't you take a minute and start thinking about, how you're measuring your developer productivity? You've got three, four really, experts in sort of dev productivity metrics and, you know, measuring sort of ROI on AI usage at companies. So as you're thinking about how your company is measuring, think about some questions you wanna ask this group. We're gonna have a really long and healthy, open q and a here at the end. So, we'd love to love to have you toss those in the q and a here in the right hand side, and we'll get going in just maybe two more minutes here. Just letting a few people, get, get on the webinar, and then, we'll get rolling. There is gonna be a whole team of GitHubbers, that are monitoring the q and a. So if you have generic Copilot questions and you ask those in the q and a, someone will try to, get back to you there. We've got four or five, awesome solutions engineers that are monitoring the chat. So a big team of us are here, to to help help answer any other any other generic question. But but for the panelists specifically, if you've got, questions on developer metrics or, productivity metrics generally, would love to get those populated in the q and a tab, while you're coming on and thinking about it. Alright. Well, we're two minutes after here. Let's, let's bring out our full panelists and, get going here. Awesome. Thanks, everybody. So my name is Jose Palafox. I've been here at GitHub for the last six years, and my role is really to work with customers on adopting GitHub Copilot. So I'm really excited to get to chat with you today. I brought, three sort of industry experts as well as our product manager for our metrics solutions on to talk with you today about developer metrics. We've got a couple of interesting topics. I'll be kind of moderating the chat and getting folks introduced. But first, we wanted to kick off, maybe with a quick overview of recent metric shifts that we've had in the platform in the last few months that you may not be aware of. So, Eric, if you wouldn't mind, I'd love for you to introduce yourself and, maybe share kind of what we've been working on at GitHub for the last few months that that the audience may not, may not be aware of. Yeah. Absolutely. So I'll do that now. You'll see yourselves for a second. Do you see what I want you to see? Do you see my browser? We see Infinity. There we go. Okay. Great. So, I just quickly me, I am our PM for metrics at GitHub. And I've been in this role officially for about two months on it, but I've been at GitHub for ten years. And I've been, directly with developers at at the pretty much every step where I'm trying to think about the ways that you need us to be better. So right now, my focus is metrics, and I just wanna run you through some of the things that we're doing. So the first thing that we did in February was we made our Copilot metrics offering, GA. And that meant that we brought new usage dashboards in front of you with lines of code, code complete. activity, IDE usage, models, languages. 
We added more APIs at the enterprise, org, and end-user level, and pretty much looked to bring a lot of the things we'd been cooking since Universe to you at GA in February. And since then, we've been very, very busy, because I know there are a lot of data gaps and a lot of different ways to think about data. We're gonna cover that with you today. There's not one metric that you really should be over-indexing on, but there's a lot that you need to help you understand your own business and your own work. So if you even just scroll back through our change log and type in metrics or usage, you'll get a lot back from us. Some of the highlights I want to tour you through — I don't even have a slide deck, I just want you to see the change logs, because you can take them — start with per-user CLI activity. If your teams aren't using the CLI yet, they absolutely should be; this is where a lot of effective agentic work happens. (Eric, I don't know if you're trying to show something new — it still just says Copilot metrics generally available. Oh, there we go, now we move. Okay, I'll do that every time. My bad.) Okay, so this was what I was saying about our change log. There's a lot in here, and now you can see it with your eyes instead of just hearing it. The next thing I was gonna share was this tab: Copilot CLI activity metrics are now available to you, which is important — I think you should start making sure your teams are using this. You can also see CLI activity at the totals level, the top level, and you can compare across IDE and CLI so you get a more holistic view of use across your dimensional breakdowns: feature, model, languages, and language by model — just more granularity in the data we give you. We also are a full platform, so there's more available to you for code review, and I think it's important to think about agentic use on every surface. There's the CLI, which is generally for coding, but you can do other tasks. There's code review, which you can apply across your repos, depending on what level of administration you have, so that everyone gets a review from Copilot by default. And then you can tell whether your business is more passive about that or actively engaging with it and applying suggestions. You can also see pull request throughput parity. I think this one's very exciting and valuable for you to keep being mindful of: your baseline pull request activity for your business, and then whether you have Copilot participation with it — total review suggestions, total applied, Copilot-created pull requests that were merged, median time to merge for those, and median time to merge without. There are meaningful metrics constantly coming out from us, and you should be subscribed to our change log to keep up. There's plan mode: are people using agents in sophisticated ways beyond just the default? Are they using specific measures to help them set up agentic use effectively? We give you the ability to see that. And then we're also doing other things to make sure there are quality improvements under the surface, where we're changing our custom domains to be more stable and making it so you have easier management if you're an admin.
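For teams that want to pull this data themselves rather than read it off the dashboards, a minimal sketch of calling the org-level Copilot metrics endpoint might look like the following. The org name and token handling are placeholders, and the exact response fields should be checked against the current REST API docs, since the schema keeps expanding (CLI activity, code review, and so on):

```python
# Minimal sketch: pull daily Copilot metrics for an org and print a quick summary.
# Assumes a token with the appropriate scope in the GITHUB_TOKEN environment variable.
import os
import requests

ORG = "your-org"  # placeholder
resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/copilot/metrics",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "X-GitHub-Api-Version": "2022-11-28",
    },
    timeout=30,
)
resp.raise_for_status()

for day in resp.json():  # one object per day in the reporting window
    print(day.get("date"),
          "active:", day.get("total_active_users"),
          "engaged:", day.get("total_engaged_users"))
```

The enterprise- and team-level endpoints Eric mentions follow a similar daily-array shape, so the same loop can be pointed at those roll-ups; verify the paths and fields in the documentation before relying on them.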
And then overall, we're also rolling this up so that it's not just at the user level: we're doing this at the aggregate levels for you, so you can see it as a monthly report, a weekly report, or a daily report for your business and your enterprise and teams. I'm gonna stop sharing and come back to you now. But overall, I want you to see the pace and the frequency. The point I'm making is there are a lot of ways for you to consider measurement and consider your business, and we're gonna talk to you across the full suite of that. My perspective as GitHub's PM is to give you as much as possible directly in the API so that you can do what you need to do with it to better understand your business. This conversation today is gonna help you scope that, shape that, and think about that, while I'm still feverishly working in the background to give you even more data to parse for your own insights. Jose, back to you. Thanks, Eric. I know the pace of work — I know for myself, I feel like I've gone through a time vortex since November. The speed the industry is moving at is just really, really fast, and I feel this constant go, go, go. So I really appreciate you walking everybody through what's been coming out and our strategy with our APIs here. And I think this helps lead into the first question I wanted to ask the panelists. Maybe I'll toss it over to you, Krishna, to get us started. Introduce yourself, introduce Jellyfish a little bit, and let's talk about a situation a lot of our audience members are probably in: you read some article, or your board reads some article, that says at Google 80% of the code is written by agents, or at Netflix 80% of the code is written by AI assistants. Now you've got some board member coming down, talking to your leadership or talking to you and asking: how much of our code is attributable to AI? And maybe you don't know yet, or you don't even know how to start measuring that type of thing. What advice do you have for somebody in that situation, where they're under pressure, people are asking them to start articulating ROI on the AI investment they have, and maybe they don't have a clear picture of what it looks like today? Yeah, that's a great question. So I'll introduce myself. My name is Krishna Cannon. I lead product over at Jellyfish. Jellyfish is an intelligence platform for R&D teams for the AI era. Our job is to help you understand your adoption, your productivity, and then whatever the downstream impact of using AI is on your company's software delivery. And obviously, we've all seen over the last several months the industry move very fast towards agentic use — this is the time warp Jose is talking about, where from November to now it's gone all agentic. I really love the question because it's one that the customers I'm talking to every day are getting a version of. And my advice to them is a couple of things. First: don't fight the question. They may not be asking the right question in the most nuanced way, but don't fight the question.
You as an engineering leader, as a platform team leader, or as an engineer also need to know the answer to: are you using AI successfully at your org? What does success look like, and how do you get better? They're just trying to prompt that discussion, so look at it through that lens. Then you get over the hump of measuring — you use the tools Eric's describing, or a tool like Jellyfish. The one thing I'll note here, and we'll come back to this, of course, is that it's really important to benchmark your data and provide context to the person asking the question. Boards only know what they're reading in the news; they don't know what's actually true on the ground. We know at Jellyfish that the ninetieth percentile, for example, does 20% of their PRs agentically. So the board member can read all the articles they want — those are awesome, they get them pumped up — but come back with measured facts and data, and put your organization in context to try to drive improvements. Thanks, Krishna. Why don't we go to Beth and then to Justin? If you could introduce yourself and your company, and then: how do you advise someone coming into this situation? Yeah, absolutely. Firstly, thanks very much for inviting us. My name is Beth Twigger. I'm the customer success director for the Americas at BlueOptima. We provide a software engineering intelligence platform, specifically our suite of metrics around productivity, quality, and cost as the triple constraint that enterprises need to operate through. What we've typically been seeing is the expectations that get set through the hype. We all see it in the news, and LinkedIn can set the bar seemingly very high. So the first thing we recommend is to approach everything with some healthy skepticism, which typically comes naturally to those in engineering fields, and to understand what's realistic. To Krishna's point about context: what's possible and what's realistic can be two very different numbers. If you are a very large enterprise laden with legacy architecture, mainframes, and historic processes, and you find yourself comparing what you want to achieve to an AI-native startup in a relatively low-governance industry, those capabilities are so different — what you'll be able to achieve is so different. So set yourself a realistic benchmark of organizations that are very similar to yourself. It can feel a little bit like we come in as the party poopers to set people up with that realistic expectation, but it's so important to give people that baseline and then help them understand how to build from there. Comparison can be the thief of joy, so comparing yourself to yourself historically is a really great place to start. And obviously that's what we've been focused on with customers from the BlueOptima side: what was your performance pre-Gen AI? What was it in the early Gen AI stage? And then what do we see the forerunners within your organization actually achieving, and helping people set that as the goal. Justin? Alright — yeah, thanks. Thanks so much, Jose, for having me on as well. Very happy to be here; good to talk to everybody today. I'm Justin Reock. I am the deputy CTO at DX. My background is pretty much in software development.
I spent a long part of my career writing code. I was an enterprise architect for a while after that, and then about eight years ago I joined a company called Gradle as their chief evangelist — a lot of you, I'm sure, are familiar with the build tool. On the commercial side of that company, we had pioneered a practice called developer productivity engineering, which was the first time I really saw a very scientific approach to improving developer experience and understanding it as a leading indicator of developer productivity in an organization. And since then, I have just not been able to quit this field. I find it such a compelling problem — a challenging problem, absolutely, but very satisfying when we can get this right for an organization. DX is a research-backed product that effectively measures developer experience, which, again, as we see, is the leading indicator of developer productivity in an organization. We tend to think of developer experience as more of a systems problem than a people problem, in line with a lot of the work of folks like W. Edwards Deming and Eliyahu Goldratt. We have chief researchers like Dr. Nicole Forsgren, who created the DORA metrics, and Margaret-Anne Storey — Peggy Storey — who created the SPACE framework, and then, of course, our CEO and cofounder, Abi Noda, who recently published a new book called Frictionless with Nicole Forsgren. And then, of course, everybody wants to know how AI is impacting developer productivity. So over the last year or so we've built our own AI measurement metric framework, meant to be complementary to our core metric framework, the Core 4, which is a distillation of DORA, SPACE, and DevEx. That's a lot of the conversations we're having right now, as I'm sure you can imagine. So in terms of advice for measurement, first of all, I would echo what the other panelists have said: we really have to be careful about setting realistic benchmarks. We just released a sort of state-of-the-union longitudinal study — literally this Tuesday, earlier this week — that's been tracking gains specifically in the weighted PR throughput velocity metric from about November 2024 to February. We found a median (p50) PR throughput increase of about 7.76%, and an average of about a 13.1% uplift in PR throughput, which of course suggests there are some outliers up at the top in that p95, p99 area. But I think there are a couple of takeaways there. First of all, if that's what you're seeing: good, that's pretty much in line with what the industry is seeing right now. You haven't missed the boat on the 5x or the 10x, which is a whole other story. I also would say that it's really important to remember, despite all this hype, that this technology is still about improving developer experience and improving overall developer productivity. So our core and foundational productivity metrics are still the right metrics to look at in terms of whether or not these investments are actually working.
But there's new data we have around telemetry and utilization and things like that, coming out of, for instance, the Copilot API, that can be very useful for separating cohorts of users — trying to figure out, okay, are daily active users of Copilot seeing better PR throughput, or larger PRs, or any of those other core productivity metrics, compared to cohorts who might be using the tech at a different frequency? Yeah. So if I summarize what I heard from you all: don't stress — you probably are more or less in line with where the industry is, even if the media hype seems to portray a different picture. And take time to reflect on where you were before you started adopting AI, so that your comparison is yourself to yourself, incrementally improving, rather than some external benchmark — which I think is really sound advice. One maybe spicy question that came into the Q&A, and I'll toss this up — I don't know who wants to jump on it — but there was a question around: if agents are writing a huge amount of code today, do you feel like that's survivable code? Is that indicating a lot of code churn? Do you even feel that's a thing to aim for right now? I don't know if anyone wants to raise their hand or just pile on, but I'd be curious to know: do you think high agent output without strong guardrails is really even a thing to aim for today? Right. I think the principle remains the same whether it's a human or an agent. If somebody's writing 3,000 lines of code that don't actually give you any business value, or aren't relevant to a task the business is moving towards, it doesn't matter how it was written. It matters that there is a misalignment, either with your human developer or with how you've prompted and designed your agent flows. So the thing that should be measured is not necessarily just the number of lines written, but what the effort involved in delivering that source code change was. Because if you were only looking at lines of code — oh, great, that sounds really successful. But if it comes with a gigantic token cost, not only have you upset your business because you've not worked on the things that were agreed upon, you've probably also upset your CFO. So being able to operate and measure not just output in terms of code, but also the quality and the costs associated with those changes, is really critical to ensure you're not over-optimizing or getting tunnel vision with your KPIs. People talk about a North Star metric, but there are other points on the compass that are important to make sure you're going in the right direction. I want to come back, Justin, to something else you said — sorry, I have to keep us moving just so we finish on time — about cohort assignments and thinking about how different cohorts are using the product. One of the things I've been really excited about when I've seen all of your products is that you integrate workforce data into the platform. So I wanted to come back to you, Justin: can you expand on that a little more? How do you think about cohorts in the data? What are you looking for?
You know, particularly, I also see some of the tools right now favoring the top 5% power users — these are the CLI products that make an individual go really, really fast, and that's where a lot of the media hype is focused. When I step back and look at the GitHub platform, I see something that helps everybody contribute, whether it's in the cloud agent, in a PR comment, or all over the platform. And that all has to do with a lot of context engineering. So I'm wondering if you can help people think about the different cohorts they'll see in their business and what that means for their enablement of the product, or of AI tools generally. We'll ask this to everybody, so maybe you want to kick us off, Justin, then we'll go to Beth and Krishna, and if more questions pop up, we can talk through those. Yeah, sure thing. I think it's a great question. At the most surface level, we look at utilization metrics coming out of the API telemetry itself. Copilot is able to surface through the API, for instance, when users are checking in, when they're starting sessions, and things like that, which means we can get a baseline frequency for how often people are using the tool. We like to separate those cohorts into heavy, moderate, and light users of the tool, but it really just means daily, weekly, and monthly active users that we see checking in and starting sessions with that back-end telemetry. I think that last year, maybe even over the last couple of years as we were focused more on coding assistants and just starting to move into agents and getting all of our infrastructure ready for that, we were all hyper-focused on having 100% utilization: everyone in the business has to be using these AI tools. And I think some of us have learned that doesn't really tell us all that much. First of all, that's an incredibly easy metric to game — we'll run right into Goodhart's law if we're requiring, and even tying to performance reviews, daily use of AI. If you want me to make the telemetry look like I'm using AI every day, I absolutely can do that. I can have it update my readme file 20 times a day if you'd like, and it looks like I'm checking into a session. So there are a couple of aspects there, and we'll get more into the measurement later. It's important to cross-reference those users against our core and foundational productivity metrics. I want to understand if daily users of Copilot are shipping more PRs. I want to understand if their defect ratio is higher, to Beth's point about quality. I want to understand where our qualitative metrics like change confidence and code maintainability sit. And maybe more importantly, where we're not seeing utilization — because, absolutely, we have measured that daily, engaged users of this technology are outperforming their peers. There's a very clear learning curve: when we see people going from no utilization to light utilization, productivity and quality metrics across every firmographic and every demographic of developer tend to go down.
Then it normalizes, and then it gets higher, and ultimately people are outperforming. I think these skills — becoming good at prompt engineering, at context engineering, and at writing agents — are skills that are gonna benefit most engineers for the rest of their careers. And as leaders, I think we really need to take it seriously to provide materials, time to learn, understanding of best practices, room for experimentation, and all those things. As a person who's been writing code professionally since the late nineties, I'm on this journey too — I've had to build new muscles around this workflow, so I know it can be difficult. But I think as leaders we really want to find out why: why are certain engineers more reticent to use the tooling? Is it that we haven't been providing them with enough education? Is it that they haven't found enough compelling use cases? Is there a lack of psychological safety around the use of these tools, which we know is deeply linked to productivity? So I think the utilization metrics are handy when we can create these cohorts and then cross-reference them to our core productivity metrics. But I also think that a lack of utilization is an opportunity for leaders to figure out what's going on and what we can do to make the tech more accessible, because these skills are very important. Yeah, I would completely agree with that, on the basis that the first layer you want to look at is usage. It's great to see all the additional metrics GitHub is producing that we can then layer over productivity metrics like coding effort to see — in our case, in quartiles — that those in the top quartile were seeing around a 20% increase in productivity, while those with very little, sporadic usage saw about 3%. So it's exactly as Justin says: it's not just having the license, but how you're able to leverage it and build it into daily workflows that really starts to show results, and those things take time. We've been seeing monthly iterative gains as far out as twelve months, and that continues to grow. So give people that space not just to experiment, but then to start building it into their true workflows. The second layer is how that differentiates across tenures. Do we have younger talent coming in already able to leverage these types of tools? Have we actually taken time to understand our code base before we start leveraging them? Or even things like technology stacks: reasonably early on — back towards 2023 now — we were seeing a much faster uptick in Gen-AI-authored source code specifically within Python engineering. That's now ramping up across more technologies as people's awareness, and the tools' features, improve as well. So once you understand that top layer, it's definitely worth digging in a little more, because people are going to have a different experience depending on the technology stacks and the context they're working in. That gives you the opportunity to identify who those individuals are and really start to share knowledge within the organization of practical use cases — how somebody working in a similar environment to you has leveraged that tool and seen a benefit that can be measured and spoken to effectively.
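A minimal sketch of the cohort cross-referencing Justin and Beth describe might look like the following, assuming you've already joined two hypothetical per-user series for the same 28-day window: days on which each developer used Copilot, and PRs each developer merged.

```python
# Minimal sketch: bucket users into usage cohorts and compare PR throughput per cohort.
# All inputs are hypothetical; swap in whatever activity and throughput data you trust.
from statistics import median

def cohort(active_days: int) -> str:
    """Rough daily / weekly / monthly-active buckets over a 28-day window."""
    if active_days >= 20:
        return "heavy (near-daily)"
    if active_days >= 4:
        return "moderate (weekly)"
    if active_days >= 1:
        return "light (monthly)"
    return "non-user"

def throughput_by_cohort(copilot_active_days, prs_merged):
    """Median merged-PR count per usage cohort for the same window."""
    buckets = {}
    for user, days in copilot_active_days.items():
        buckets.setdefault(cohort(days), []).append(prs_merged.get(user, 0))
    # Median per cohort, so one prolific outlier can't skew a bucket.
    return {name: median(prs) for name, prs in buckets.items()}

if __name__ == "__main__":
    usage = {"ana": 24, "bo": 9, "cy": 2, "di": 0}  # hypothetical Copilot-active days
    prs = {"ana": 14, "bo": 9, "cy": 6, "di": 5}    # hypothetical merged PRs
    print(throughput_by_cohort(usage, prs))
```

The same structure works for the quartile view Beth mentions: sort users by usage, split into four groups, and compare the same throughput or quality metrics across them.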
I think Beth's on the right track here, because what we're seeing is that the instance of developers not wanting to try AI tools, or being afraid of the AI tools, has actually dropped a ton. Last year we were seeing a lot of psychological-safety-type concerns or existential concerns, and in my experience those have largely passed. When we see gaps in adoption or gaps in usage, they're much more likely to be around challenges in the code base you're working on, the type of work you're doing, or the enablement you have for the tools. We've been in this sink-or-swim environment for so long, and now that you need to standardize, it requires enabling those who haven't figured out how to do it by themselves. It's not a lack of willingness. It's that their code base is different, their context is different, and they need more teaching and more tools. So it's not enough to segment by metadata like where they're located — you have to segment by the type of work they're doing and the repo they're working in to identify those pockets. Yeah, so let me play back a couple of things I heard here that are good advice. One, it sounds like there are multiple paths for people using the tools, so integration across multiple surface areas is an important kind of segmentation — it's not just what interface you're using, but where you're using it in the SDLC that's interesting to track. Two, I heard really clearly that daily use is the key thing, because there's a learning curve that has to happen. So if you want to get successful with the tools, you've got to integrate them into all these different workflows, and that means touching them every day and figuring out how to use them in different situations. And the third thing I pulled out from listening to you was that different technology stacks may be primed to adopt sooner or later. The folks working on Python or JavaScript were ready to go, and maybe now some of the back-end systems are finally starting to catch up and are getting ready to go; but the complexity of the environment sometimes causes a time delay in how fast they can adopt the technology. So thanks for all of those — I think those are really good things for folks to keep in mind as we move through here. One question that came in that I think is kind of adjacent — oh, sorry, go ahead. Yeah, one point I just wanted to add real quick is that we're hearing a lot about agentic readiness: how ready is the organization overall to enable its engineers? Have they invested in context engineering, invested in a test suite, invested in all those things that support the engineer? That's adjacent to all the points we were making before, so I just want to make sure we don't drop that one. Yeah. And is there a starting point you would tell people to look at for context engineering? When I talk with customers, we have a couple of different surface areas in the product: you can put custom instructions at an org level, you can put them in the repo, you can put them locally, and we do a lot of advising on it. But I'd be curious how you tell people to approach context engineering. Maybe we can just go in reverse order again — we'll sneak back from you to Beth to Justin.
And I think this is really important for people to think about. Do you have advice for them on how to get started on that? Yeah. We largely see folks starting at the repo level, and then they can have individual overrides that add on to that — that's been the most common pattern we see. What we also see being successful is having a rubric for the four or five things you want your context engineering to account for, and having those known, so that when your infrastructure teams or platform teams are providing them, they have a checklist of sorts they can go through and say: hey, we've provided tests, we've provided documentation, we've provided READMEs, etcetera. We actually did an experiment recently where we trained an agent to look for elements of good context and then set it loose on a bunch of repos to have them scored. That created additional data that could be benchmarked for context engineering purposes; we gave it to our infra teams and used it to help them bolster things. So the summary is: get started at the repo level, have a plan, and then, just like anything, benchmark and compare. Beth or Justin, do you have anything you want to weigh in on — the context engineering plan, or, Eric, anything else you'd advise folks as they start thinking about how to get ready for agents inside the code base? I have a lot — but let's let Beth go first. That's very kind, thank you. So, a couple of things on top of what Krishna just mentioned: great foundations are specifically about the maintainability of code, as well as some of the best practices around security. Don't leave those things up to assumption; ensure that any human and any agent has a good understanding of what foundational, maintainable code looks like — best practices around complexity, single-purpose principles, ensuring there are no hard-coded secrets. Things that maybe you take for granted as seemingly obvious: ensure those are part of your core agentic principles and are clearly outlined for your agents as well. So there's a strange irony happening, especially for somebody like me who's been clawing at this developer experience optimization space for almost a decade now, and that is that what's good for humans is also good for agents. We have scorecard capabilities within DX, and we have an out-of-the-box AI readiness scorecard. When you look at the criteria on that scorecard, it's things like accessible, well-structured documentation; data models with clear relations; easy-to-maintain, modular code; fast CI and fast feedback cycles. And if any of this sounds familiar, it's because it's also indicative of a very good developer experience. So the irony is that after a decade of us saying, hey, you really need to be thinking about these things for your overall, more frictionless delivery, all of this investment we're making in AI at the executive level may finally drive us to make these improvements — and I've come to grips with that. I don't really care how we finally get there. I'm just glad that we're finally getting there.
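As one way to make that kind of rubric concrete, here is a minimal sketch of a repo-level readiness check. The specific signals are illustrative, not a standard — the file names follow common conventions (repo custom instructions, an agent instructions file, tests, CI workflows), but the checklist should mirror whatever your platform team actually standardizes on:

```python
# Minimal sketch: score a checked-out repo against a small, illustrative readiness rubric.
from pathlib import Path

CHECKS = {
    "readme": ["README.md"],
    "repo custom instructions": [".github/copilot-instructions.md"],
    "agent instructions file": ["AGENTS.md"],
    "tests directory": ["tests", "test"],
    "CI workflows": [".github/workflows"],
}

def readiness_score(repo_root: str) -> dict[str, bool]:
    """Return pass/miss per check for the given repository root."""
    root = Path(repo_root)
    return {name: any((root / p).exists() for p in paths)
            for name, paths in CHECKS.items()}

if __name__ == "__main__":
    results = readiness_score(".")  # run from the repo you want to score
    for check, ok in results.items():
        print(f"{'PASS' if ok else 'MISS'}  {check}")
    print(f"score: {sum(results.values())}/{len(results)}")
```

Run across many repos, a simple score like this gives platform teams the kind of benchmarkable baseline Krishna describes, which they can then work to raise.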
But we are also doing something pretty wild at DX. We dropped a new feature a couple of weeks ago which we're calling our agent experience index. Just like we've been gathering qualitative signals and experience sampling from engineers, we're now doing the same thing with agents. We are literally surveying agents after they complete tasks and asking them questions about the human-agent interaction — we call that steering — and about code base predictability: how easy was it to work with the code base in various parts of the platform? We're starting to measure this because, just as we've done with the signals we get for developer experience, we want to use that data to anchor cultures of continuous improvement around the developer experience — and we want to be able to do the same thing for agents. We want to make it easier for agents to operate with less feedback and less context, which of course has the offshoot of burning fewer tokens because we're having a more optimal experience with the agents in the platform. I think a thing you're all hitting on is that you can use agents to help prep for agentic development, and this pulls us into something I wanted to talk about at the back end, but I think it's relevant now. We've got this cool new feature — or, I guess, framework — called agentic workflows. Agentic workflows lets us stick the Copilot CLI inside of a GitHub Actions environment and then trigger it based off events happening in the platform. And what I'm seeing a lot of customers do with this is create workflows that update architecture files or context files autonomously for people. So I'm curious: if you had a way of automatically probing and updating, what would be best practices for doing that? Because I think this is a new frontier — using the agents to get ready for more agent work. What specifically would you have as an autonomous task that could help manage the context files, or other information the agents need, to be able to contribute effectively in a repo? Anybody can jump on it. Sorry — I realize this format's a little new for us, so I apologize for not asking someone directly. Why don't we go back — do you want to go first? You haven't gone first yet. I'm actually gonna pass to Justin, just because I'm not as hands-on as he is. If we're going into practical cases for engineers themselves, I'll pass to Justin, and then maybe I can touch on how we're measuring that as an impact after he's shared some wisdom. Yeah, sure. Thanks, Beth. It is an interesting question. I would go back to this: we have this AI readiness scorecard, and what's interesting is that so much of it overlaps with what we would call developer experience. But there are some areas that are specific to agents — things like having agent markdown in place and associated with each repository and each service that we're working with. That is something that's very specific, obviously, to agents.
I think it's becoming very good practice, especially when dealing with something like a spec-driven design use case, to treat this as ongoing maintenance when we're updating our documentation. When we make an update to a service, or to something in a repo, we're just saying: hey, agent, when you go and update the readme file, make sure you're simultaneously updating the agent markdown file as well. So now we've forked and we're maintaining both of those things at the same time, and that's helpful. I think for us it's about looking at this from two directions. First of all, we have the opportunity, as our input, to make sure we're looking at the readiness of the platform across these different vectors I mentioned before. But then we're also taking these agent experience measurements and using them to validate that we've done what we can to improve within the platform. If we wanted an agent to go and do that for us, I think we have to be pretty careful, because obviously our agents try to be very, very helpful, and they love to go and rearchitect our entire code bases for us when we don't necessarily want to make that invasive a change, especially to an existing or legacy repo. But I do very much like the approach of a diff-only, plan-and-apply type of strategy, where it's like: okay, we've got this big code base, and we want to improve some aspect of it — modularity, readability, whatever that is. Go tell us what you would do. Have an agent scan the repo and tell us what it would do in the form of a unified diff patch, so that if we want to apply those changes, we can do it with a one-liner on the command line. We're still saving time, but we're not going through and making massive, invasive changes to our existing code base, which burns a lot of tokens and is pretty risky. One thing I want to pull us back to — and apologies for thrashing us a little here — but I didn't want to lose sight of: how do we help people get better at using these tools? One of the things we've got in the platform is a way for our tool to give advice to users about how they're using it. Inside the CLI, there's a slash command called chronicle, and chronicle will provide tips on how the user is using the CLI, maybe help them set up their custom instructions files, and it also gives a lot of qualitative advice. So I was curious to ask you all: how do you think about integrating qualitative advice into how you're enabling developers, or into what you're tracking? I feel like a lot of the time we're looking at numbers, or PR metrics, that sort of thing, but there is a human element — a manager in a one-on-one needs to be able to coach somebody on how to use these tools. So do you programmatically collect qualitative advice? Are you advising people to monitor session data? How are you thinking about helping managers coach people into getting better at these tools? Krishna, do you want to take this one? Yeah, I love the question.
We think it's really important to blend the qualitative with the quantitative, as you're describing, Jose. So we have a full qualitative platform to collect developer experience data — I think that's sort of standard now. We've augmented that with a specific AI-impact developer experience motion, so you can collect data around what the blockers behind using AI are and what challenges people are facing. A lot of the feedback we get there is: hey, I just don't know how to use this tool correctly, or I got stuck on this, that, or the other thing. We've found that to be very rich for our customers. And then most recently, we added an agentic feedback collection mechanism, so that when developers are using the AI tools, at logical stopping points in their workflow we collect data on what was hard or easy about the last step they accomplished, so you can later give verbatims to them, their managers, and their teammates about things to improve upon or things to train on. So I think the data tells you where to look, but the qualitative tells you how to solve the problem, and we're invested in making sure we have dynamic ways to get that data. Beth, Justin, do you have anything to add there? Yeah, I think it's great to see that kind of capability, because we've talked earlier about the need for people to have that type of enablement and understand how they can drive their own improvements. It's not just a case of making that type of information available only during very structured online courses, for example. Those are a great resource an enterprise can provide, but being able to do it from your own laptop, from your CLI, and get that exact feedback when you need it is great, because it gives everyone the opportunity to be the master of their own success and to keep growing. I think one of the other very interesting pieces here will be the capability to measure how much of your coding effort is actually authored by an LLM, versus authored by a human who might be enabled by LLMs but then needs to make significant updates to that code to make it palatable, functional, or fit within your repo. As teams improve more and more in how they use the tools — whether through self-directed learning or more formal, structured styles — being able to see the amount of source code actually authored by AI increasing incrementally over time will be a great measure of success for these types of initiatives. I would agree with that. Obviously, at DX we really believe in the power of qualitative signals, and in using quantitative and system metrics as a way to validate what we see from those qualitative signals. And I think across the board we would all agree that you have to have a very high signal there. Almost half of our customers get 100% participation in these surveys, and the rest usually fit somewhere in that 90 to 95% range — really to the point where we can trust the qualitative signals even more than some of the system metrics. And I think that absolutely carries over to the way developers are experiencing this new technology.
I really like Beth's point about looking at the percentage of AI code that's actually making its way into production versus what's still being written by a human. I think that is one really good metric for how effectively we're using these tools. But I would also stress the importance of not only providing educational materials, but also providing time to learn. Engineering leaders really need to understand how important it is to give engineers adequate time to absorb these materials and actually put them into practice. We have a couple of free guides available — a prompting guide as well as an advanced AI prompting guide on the website — which were built doing this type of research. We surveyed SVPs of organizations who had rolled out various tooling to thousands of engineers, as well as surveying developers directly and asking them what they thought their most valuable use cases were. From those results we put together a top 10, with coding examples and prompting examples. And then we found out immediately: okay, this is great, and developers really like this, but for some of this we need to get a little more sophisticated when we're dealing with large legacy code bases. That's what the advanced guide helps with: how do we de-risk what we're doing with these code bases? How do we build better agentic validation loops so that we can trust the output when it first comes out? So I think the qualitative signals, in my opinion, remain some of the most important data we can get, but we have to follow that on with good education, good materials, time to experiment, and time to learn. There are three questions here in the chat that I want to pick up, and I think they're all pretty closely related. The first one is: what metrics should we use to identify power users? Because right now, just looking at raw token consumption, it's unclear if that's productive. And the second and third questions, which are very closely related, are: what does good look like for leveraging AI, and how do you translate activity into productivity gains — what does that journey look like? So I think all of those are hunting around the same idea: what's the good signal I should be picking up? And if I'm not picking that signal up, what do I need to do, or what's the message to bring to my engineers about what they need to change? I don't know if anyone feels like jumping on this, but I'll toss it out to all of you. Yeah, I mean — go ahead, Krishna. No, go for it; you go for it, I'll follow. You sure? Okay, thank you. So, yeah, I think the first thing is actually to agree on a metric. That's the first hurdle a lot of organizations meet initially: you'll have teams that have token usage, teams that have the number of chats, the number of lines of code, accepted ratios. There are so many different data points, and it's so important to be clear on what question each of those things is answering.
And then tie that to some sort of primary output metric that is not just applicable in the case of AI or agents, but for humans as well, so you have that full organizational view. For BlueOptima customers, that's coding effort, our metric of intellectual effort going into source code change, so that we can tie usage to something actually meaningful that reaches a code base. So the first thing is to make sure you've got a clear view, applicable to the whole organization, of what we mean when we talk about developer productivity. And then ensure, as we've talked about, that you've got those quality and cost angles — other metrics that give you a more holistic picture overall. That would be my primary advice, because otherwise you fall into the trap of counting lines of code or counting the number of commits. You can say, yeah, we want everyone using AI, and — I think it was Justin's point earlier — somebody's gonna do something that's not particularly meaningful to business value in order to game a metric. So make sure it's really tied to an outcome, so you get more meaningful results and reduce the opportunity to game the system. Yeah, I agree with a lot of that. I think knowing what you're trying to measure is important, and knowing what your goals are as an organization is critically important. The one area where maybe I'd build, and maybe be slightly on a different page, is that I'm not as concerned that engineers are trying to game the system right now. With the CTOs and VPs I'm talking to, I do think that was last year's problem, when folks felt very threatened by AI and you legitimately did have to worry about setting a goal that was gonna create bad behavior. I think Justin cited this earlier: you get what you measure, and if you measure a dumb thing, you might get a dumb result — especially if folks feel threatened. And that is still the case in some places. But the view I would add is that I don't think most engineers are there. I think most engineers see the power of these tools and are realizing that you want more engineers to be enabled with this rather than fewer, and so they're generally trying to do the right thing. That's the part I would build on. If I had to boil it down to two metrics, I look at throughput and cycle time. Throughput in terms of actual raw things being done — and you can normalize that by size, complexity, etcetera — and then cycle time, because that's the measure of value: did you complete projects and ship meaningful code to production? If those things are going up, then you're making a good investment. We can worry about cost on day two. But throughput and cycle time are the ones I recommend first and foremost. Yeah, and I would just add to that: where we do see some of this gamification is that you absolutely have engineering managers now who are putting this in performance reviews — hey, we need you to be checking in with the tools every day. I hope we see that decline, because, to Krishna's point, it's not the right goal. I agree too that looking at cycle time is important.
Looking at PR throughput is important — this is ultimately what we're trying to do: accelerate things. But as we've all hit on already, I think we can agree there is no one metric, and there should be no one metric; we should be looking at multiple metrics to get the full picture. However, a metric that wasn't mentioned yet that I really like is what we refer to as innovation ratio: the percentage of time engineers are able to spend working on new features as opposed to keeping the lights on in maintenance. The reason I like this specifically for AI is that I have definitely spoken to plenty of engineers who say: oh, I absolutely love this agent stuff, I love doing spec-driven design, I feed the agent a spec, I go play PlayStation for half an hour, and I come back and my work is done. Well, that's good — you're maintaining the status quo and doing less work. But what I like about innovation ratio is that it's a great output metric for asking: wait a minute, are we actually improving our concept-to-cash pipeline? Is the full cycle now actually creating more value for the business? Are we able to respond to customer feedback more quickly? Are we able to add new, valuable features to the tool, ultimately resulting, hopefully, in a better revenue outcome? So I don't like to pick a single metric, and I won't pick a single metric, but I do really like innovation ratio as an output metric telling us whether or not we're shipping more value. I think that touches as well on what types of tasks are the best or most appropriate for using AI, because some of what we've seen, even in pre-Gen-AI eras, is that teams that have a balance — not just new features and improvements, but about 20-odd percent of their coding effort going into refactoring and maintenance work, keeping the house tidy so you're not routinely building on a shaky foundation — those teams are the most productive. So actually have a little bit of that time. If, in that case, you've got an agent going off and running against your beautifully written spec, then instead of a couple of hours on — I'm guessing — Crimson Desert or Arc Raiders at the moment, that time can go into putting human effort where it's really valuable, which is refactoring that hairy, scary legacy code base you don't want an agent touching, because it does require your human, and maybe more thorough, input to make those fixes. One thing you all touched on here was the potential for gaming local usage, or having people just token-max or whatever. When I've been talking with leaders, what I've been talking about a lot is moving the workload off the workstation and into the CI pipeline, and having what agents and AI do be really tightly defined as a step in the CI process, rather than free-form exploration by everybody in how they're working. But one thing that really turns this on its head is that it moves the conversation from the individual up to the repo. And so I'm curious: how are you thinking about measuring agent performance?
Is it just how many tokens the CI pipeline consumes — meaning that's the cost to the business of maintaining that line of business — or how are you thinking about making the transition from measuring what people are doing to measuring what agents are doing independently in the pipeline? Justin or Beth, whoever wants to jump in. I think it's a great question. First of all, when you can shift workloads to CI — and this was true before AI — it's because you have an efficient CI pipeline. If engineers have to push everything to CI and then go and wait thirty minutes or whatever for CI to finish doing what it's doing, that increases the context-switch window, and that's not an ideal situation; as much as possible, you still want engineers to be able to do things locally. So if we can isolate to CI because we have nice, effective, maybe distributed CI in some way, and we're able to get tight feedback loops that aren't pulling engineers out of their flow too often, then I think that's a good approach. It allows us to set better guardrails around consumption and things like that. In terms of how we're measuring agents, again, we're looking at multiple things. We have our new agent experience index, which I think is really kind of wild and interesting. But before that, we were just looking at basic telemetry. Certainly there are off-the-shelf solutions, but we're still building a lot of our own inference pipelines and things like that, and it's just not too hard to get the agent to check in with some type of back end. And of course, I know that's available in Copilot, and other people have been building custom inference pipelines where you check in with something like an OpenTelemetry back end or whatever. So: figure out how many tasks are actually being completed by agents, and then try to map that to what we would call the human-equivalent hour cost of each task. If we think this task would have taken a human x number of hours to complete, how can we cross-reference that to the number of equivalent tasks being completed by agents? And of course we cannot forget about quality and human oversight and all of these other things. Yeah, absolutely. I think an important point, just before we deviate too far into the full agentic experience — having spoken about setting what's realistic for your organization — is that most companies at the moment are very much in a hybrid phase. There's a lot of human development still happening, and agents are starting to appear; especially in very large, high-governance organizations, they're fairly early stage. So the focus initially is on: can I get my agent to be as productive as my average developer? That's a great first step. And then can we start to max it out to be a high-performing developer, or at least be equivalent? The cost piece then shifts from where you were previously looking — in BlueOptima's case, at the cost per hour of coding effort.
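As a back-of-the-envelope illustration of that cost comparison — the same unit of work product, costed once at a blended human rate and once at token prices — a sketch might look like the following. Every number is a made-up placeholder, and it deliberately ignores review, rework, and oversight time, which the panelists flag as essential to keep in the picture:

```python
# Back-of-the-envelope sketch of human-equivalent cost vs. agent token cost per task.
# All values below are placeholders; plug in your own rates and measured usage.

BLENDED_HOURLY_RATE = 110.0       # salary + benefits per engineer-hour (placeholder)
EST_HUMAN_HOURS_PER_TASK = 3.0    # your estimate of human-equivalent effort (placeholder)

TOKENS_PER_TASK = 450_000         # measured agent consumption per completed task (placeholder)
PRICE_PER_MILLION_TOKENS = 12.0   # blended input/output token price (placeholder)

human_cost = BLENDED_HOURLY_RATE * EST_HUMAN_HOURS_PER_TASK
agent_cost = TOKENS_PER_TASK / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"human-equivalent cost per task: ${human_cost:.2f}")
print(f"agent token cost per task:      ${agent_cost:.2f}")
print(f"ratio (human / agent):          {human_cost / agent_cost:.1f}x")
```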
The cost piece then shifts from where you were previously looking, in BlueOptima's case, at the cost per hour of coding effort. Your input there is your blended cost or planning rate of salary plus benefits. From an agent perspective, it's token cost. So your output, your work product, is effectively the same in that it is source code change, but your cost input is the variable. That's where, particularly when we're looking across different models, you really want to start thinking about token optimization and the LLMs you're choosing for specific tasks, because not all are created equal: where one shines, another may struggle, and vice versa. So as we grow in this skill space, being able to understand, whether you're using a coding assistant or an agent, which models you're using for which tasks lets you cost optimize. BlueOptima has done an analysis across more than 50 different models, specifically looking at the capability of different LLMs in refactoring existing complex enterprise code. And, well, the obvious adage held that cheapest is not always best, but most expensive wasn't always best either. There were models that were lower cost in terms of tokens but had the same success rate for refactoring and the same amount of coding effort successfully produced that could be committed to a repo. So particularly as that skill set evolves, future engineers are going to need to be able to decide which LLMs are best for the job, and achieve that while, obviously, staying on the right side of your CFO. Yeah. Well, I think that's one of the, sorry, two seconds. I think the point about use cases is really important too. I think another metric is how much of the SDLC we're applying this to, and that's obviously going to mean different models are better at different tasks. So I really like that point. Yeah, I was going to pile on the same thing. We haven't plugged this at all in this conversation, but one of the key things that I love about Copilot is that as soon as a new model comes out, I can switch to it and try it, and I can experiment with different models from different providers really readily. So it's interesting to hear that an important factor for you all in getting to the right efficiency or ROI on the tools is being able to pick which model you're applying to which problem.
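As a rough illustration of the per-model comparison described above, here is a minimal sketch, assuming you know each model's token price, typical token usage per attempt, and the rate at which its refactors actually land in the repo; the model names, prices, and success rates are placeholders, not BlueOptima's published results.

```python
# Hypothetical sketch: rank models by expected cost per successfully committed
# refactor. Prices, token counts, success rates, and names are placeholders.

models = {
    # name: (price per 1M tokens in USD, avg tokens per attempt, success rate)
    "model-a": (15.0, 300_000, 0.62),
    "model-b": (3.0, 420_000, 0.58),
    "model-c": (0.8, 500_000, 0.15),
}

def cost_per_successful_refactor(price_per_m: float, tokens: int, success: float) -> float:
    """Expected token spend per refactor that actually lands in the repo."""
    cost_per_attempt = tokens / 1_000_000 * price_per_m
    return cost_per_attempt / success

# Cheapest per token is not necessarily cheapest per committed refactor.
for name, (price, tokens, rate) in sorted(
    models.items(),
    key=lambda kv: cost_per_successful_refactor(*kv[1]),
):
    print(f"{name}: ${cost_per_successful_refactor(price, tokens, rate):.2f} per committed refactor")
```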
We're coming up on the last ninety seconds of the time we've got with our audience here, so I really wanted to say thank you all for joining. I think what we'll do is take all the questions we have here, aggregate them for the team, and then maybe use them as a blog post follow-up, or add some additional commentary in a follow-up email to the folks who joined. But I really wanted to say thank you to you three, four, for joining here. And, yeah, thank you to everyone attending. We had, I think, over 4,000 people sign up today to come and attend, so I really appreciate you all being here and supporting the group. Any closing thoughts? I guess we can maybe just go around the horn really quickly. My last thought is that I love the ROI stuff we're talking about now with tokens. And I also think that if you're not there yet, that's fine too. We have an AI impact framework, a maturity framework; I'm sure our friends on the call do also. But use those to step through it, and we'll all get to those questions eventually; just keep stepping through it and measuring. I would say too, remember what these metrics are really there for. They're not there to put up on a dashboard. They're there to create cultures of continuous improvement. The audience still matters more than the data. So taking this data and using it to improve the platform, using it to improve the developer experience, and then ultimately the agent experience, I think is the right path forward. And, to Krishna's point, our data shows, hey, you haven't missed the boat. If you're seeing eight to 10% right now, great, that's what the industry is seeing. So keep plugging at it; we're going to keep seeing improvements. Yep, I would agree. I think we're coming full circle back to maintaining healthy skepticism, continuous improvement, and realistic goals. So don't always believe the hype, but create the hype for yourself internally and measure against that. Awesome. Alright. Thanks, everybody. We're going to get out of here. Really appreciate your time today. And, like I said, we'll try to follow up on some of these questions, whether that's in a blog post or an email or something like that. But keep them coming. If you're a GitHub customer, feel free to reach out to your AE. We can bring any of these folks in to talk to you one on one, or I'm available to chat with different accounts one on one as well. So thanks again, and we'll see you next time.