Show Us Your (Agent) Skills Episode 1 - w/ Wes McKinney, Jeremiah Lowin, & Randal Olson

Transcript

This transcript was generated automatically and may contain errors.

What are people at the top of their game building with AI agents, and how are they doing it? Are they Claude-maxing with eight terminals open at once, or adversarially testing Opus 4.7-generated code with Codex? Do they define suites and swarms of subagents, or use AGENTS.md and agent skills? We are on a mission to find out.

First on the roster are Wes McKinney, the creator of pandas, Jeremiah Lowin from Prefect, and Hilary Mason of Hidden Door, along with a few more surprise guests. Join us to find out what all these token maxers are doing. Think Excel World Championships meets Eurovision. See you all in the first episode.

All right. Welcome, everybody, to our very first episode of Show Us Your Agent Skills. So excited to be here today with a lot of old friends and familiar faces: Thomas Wiecki, Randy Olson, Jeremiah Lowin, and Wes McKinney. What is up, everybody?

Hello everybody

Hello, thanks for having us. Thanks for sharing that video of your workflow that we could put in the trailer as well. It's really great to see you in '90s 8-bit side-scrolling glory. Of course, yeah. It was a surprise, but I enjoyed it. I showed it to quite a few people. I even changed my GitHub avatar, so thank you.

Fantastic. All right. I'd love to welcome everyone. We've already got 30 people watching, so thank you so much for joining, everyone, for our first episode. The real motivation here is that we're all agent maxing at 3 a.m., just surrounded by glowing lights. It feels like, you know, we're alone in Times Square at 3 a.m., building, and we want to share what everyone's working on and what everyone's doing. Not always the hottest, shiniest workflows, but things that are really impactful for everyone.

To that end, feel free to ask questions in the chat. We're doing it in Discord so that we can continue the conversation afterwards. YouTube live chats are ephemeral, but agents are forever, as we know. So please join the Discord, in the Show Us Your Agent Skills channel. Also, please like and subscribe if you like this type of stuff, and leave a comment if you do. If you don't like it, don't leave a comment. It's really easy to not do that, you would think. Check out our Luma calendar and our Substack as well. And we've linked to a GitHub repository in the description where we're going to start putting stuff from the livestream. So please star that and share it with friends.

And Thomas and I, one of the reasons we started working on this is we're building an agentic data science course and started sharing stuff with each other, and wanted to share those things more publicly. The link to that is in the YouTube description as well. So without further ado, I would love just to jump in. Maybe first I'll do a brief introduction for everyone, and then we'll get into it. Oh, yeah, please say hi in Discord. Let us know who you are, what you're up to, and what you're interested in.

Introductions

Wes McKinney, a man who perhaps needs no introduction at all, is an entrepreneur and software developer focusing on many things, including analytical computing. He famously wrote a bunch of the beginnings of pandas surrounded by rats in a small East Village apartment. Correct me if I'm mistaken. His greatest work yet, though, is spicytakes.org, which we're going to look at today.

We have Jeremiah Lowin, the founder and CEO of Prefect, where they've been building, since the before times as well, some of the most popular tools for managing data and AI workflows. And I still have my Prefect rubber... I actually have two Prefect rubber duckies that Jeremiah sent me during the pandemic.

Jeremiah is also a strategic advisor to Spotify positive some OSV global ambassador of Compass coffee among other things. And we have the one, the only Randy Olson, co-founder and CTO of Goodeye Labs, pointing frontier AI models at solving a lot of business problems today. Longtime AI/ML researcher, been building AI, you know, for 15 to 20 years. One of the first podcasts I ever did was with you, about TPOT, back in the day. But Randy is a dataviz legend and longtime moderator of the subreddit on data visualization as well.

Three questions for the guests

Without further ado, that's definitely enough out of me. I wonder, Wes, if we just want to jump in. Sure. And we do have a little thing we want to do, which is ask everyone three questions before we dive in. The first question is: what do you love most about working with agents?

I guess not having to write code anymore. I'm somebody who has loved writing code, and I wrote, you know, an enormous amount of code over the last 20-plus years. But I didn't expect that not writing code would be such a relief. And I think mainly what's nice is that it has relieved me from the things that I didn't like about software engineering, like a lot of the boilerplate and the tedium that's involved with building and delivering software projects.

Beautiful. I can relate to that a lot. I'm actually someone who's never enjoyed writing code so much. I mean, it caused me to have a little bit of a personal identity crisis last year, as it probably did for a lot of people, but it's all good now. I'm good. Just last year? I was going to say, far out. I was kind of tuned out of LLMs for all of 2023 and 2024. I was just like, F this. Yeah, so that explains why you have this resurgence now. You're making up for lost time. Yes.

Like they weren't solving a they weren't solving they weren't like they weren't that good. Yeah, they weren't that good.

What do you find most frustrating about working with agents? They don't listen. They lie. You know, they make the same mistakes over and over again. Just the ignoring instructions and the occasional stupidity, which I think is sometimes the result of, you know, there's lots of conspiracies about the AI labs quantizing and doing things to try to get more capacity out of their existing GPU clusters, and in the process dumbing down the models. And so you never know. Each day you wake up and you fire up Claude Code, and it's like, what kind of Claude am I going to get today? Is it going to be smart Claude or dumb Claude? And a lot of days it's dumb Claude, and that's pretty annoying.

Without a doubt. Sometimes I'm trying to do something simple and it will bamboozle or discombobulate for seven to eight minutes, then come out with nonsense. Third and final question for now: if all AI agent conversations somehow got leaked to the public, what would you be most concerned about in your conversations becoming public?

I mean, mostly the private personal projects that I work on becoming public. Not that there's anything so crazy, but it would expose a lot of private details of my personal life and projects that I built for myself that just contain a lot of non-public information about, you know, things that I don't publish on the internet. Basically, nothing bad, but just stuff that I wouldn't want published on the internet.

Well, you may be one of the only people who has nothing bad in there. Yeah, your questions must be less embarrassing than what I put in. When I told Thomas, I was like, we should ask this question. Well, yeah, I guess having all of my Claude.ai or my ChatGPT sessions public, you know, that's another matter, since people ask everything about, you know, health issues and all kinds of things. I'll just delete my LinkedIn. Yeah, exactly. I thought that's why you've moved to New Zealand, Thomas, to get off the grid for the upcoming agentic leak.

That's right, exactly. Uncle Elon, as Wes alerted me to his new name, and Uncle Peter Thiel are about to perform. Enough out of me. Wes? Well, you know, Claude is now running on a Grok data center, so maybe coming sooner. Well, he who controls the flops controls, et cetera. But yeah, there's a not-so-distant future in which we're all just paid in flops by agents, too. And we'll probably be paid in shit flops half the time. So we'll see what the oncoming currencies are. Yeah. All right. So you want me to take it away here? I can kind of reveal my stack.

Wes McKinney's agentic stack

Well, first of all, I've been building a lot of stuff with AI the last six months or so, probably generated on the order of a million lines of code across like a dozen projects or so. It's a lot of, it's been a lot of work. It's been some serious projects. Like, let me, let me share my screen here, which is like a little more fun here. I think I'm not showing anything super, super weird on my screen.

So, yeah, as Hugo mentioned, one of my fun things that I made is this website called spicytakes.org, which I recommend to everyone. It's trying to solve the problem that people don't have time to read blog posts anymore, and I actually really like reading blog posts. And so I made a system, using AI, of course, to summarize and pull the spiciest quotes out of 11,773 blog posts across 34 different blogs from people that I follow online. And so you can go here and read Martin Fowler's takes on The Mythical Man-Month, just published this week. Not very spicy. John Gruber. Armin Ronacher increasingly has been quite spicy. If we look at Joe Reis: why token maxing is for fools, a rant on fake productivity. So lots of fun stuff to check out there.

Speaking of which, I am quite the token maxer myself. This is one token maxing leaderboard. And in the last seven days, I've averaged something like, let me see if I can do math, 1.3, 1.4 billion tokens per day. So I'm burning quite a lot of tokens. So I wanted to explain a little bit of my setup and how exactly I'm doing that, and maybe try to do some stuff in the terminal if we have time.

But I'm working on a talk that I'm giving next week in San Francisco at AI Council, so I'll give a little bit of a preview of that. Jesse Vincent, who created the very popular Superpowers skills framework, has this quote, which he gave me permission to use: the difference between vibe coding and agentic engineering is planning architecture and caring about the output. And so essentially I've been building systems to do agentic engineering and not just vibe coding.

I think when I started out, it was just like, okay, I'm going to open Claude Code in the terminal and type prompts and see what happens. And at a certain point you're like, wait, okay, this is not creating things that are good. I can look at the output, I can run the code, I can see if it makes sense, but how do we manage all of this at scale? If you work on six projects in parallel, how do you keep the train from, you know, crashing into the side of a mountain, or whatever the right metaphor is?

But, you know, my agentic productivity stack basically centers around the Superpowers skills framework, which was created by Jesse. And then over the last six months, I've been building a series of agentic engineering projects which solve peripheral problems that are associated with developing lots of projects in parallel: wanting to have observability of what all the agents are doing, as well as quality assurance on the work that they're producing, and then being able to manage the development pipeline. Essentially, how do you observe the software factory that you're producing? And then, more recently, I've even gotten into more granular project management and issue tracking, after being burned by Beads and having Beads destroy some of my Git repositories, very annoyingly. I hear that it's gotten better. Now that it's on Dolt, it's gotten a lot better.

But yeah, I'll kind of give you a brief view of those things. A lot of this stuff runs in the terminal. I'm going to launch a Claude and I've got to do something. Can you just zoom in a bit on the terminal?

Yeah. Give me a minute here. It's got to sort out the 12 terminals there. Yeah, I do. Relatable. I've never been able to figure out those multi-terminals in one. I always just do tabs in iTerm, and that feels relatively the same for me. At some point I became very comfortable with the split, I guess split vertically, and that's all I can handle. And I know when I run out of vertical panes, then it's a new tab. Yeah. The whole horizontal, I can't, my brain doesn't handle it.

I think to be maximally productive with agents requires you to be like a little bit unsafe.

And so, yeah, it's going to take some time to figure that out. And in the meantime, I'm just kind of praying to the token gods to be kind to me, like, let me not be prompt injected. And that being said, supply chain attacks are also a big problem. And I think as the agents get more sophisticated and the nature of the security attacks gets more sophisticated, that's even more of a problem.

But anyway, so I will, I will see, I will yield, yield, yield the floor to the, to the next, to the next guest, but I appreciate you all listening and yeah, I look forward to seeing you all in the token mines. All praise be unto the token gods. We need a token god totem.

Before you leave, I know you'll pop back in a bit and you've got family dinner, but I do, we did create...

We did create another little video using CDAS. So pretty good summary of what we just saw. Exactly. Thank you for coming out of the token mines and bringing your, your AI lasers to the streets to help us battle all the agentic systems we're working with as well. And do feel free to pop back and show us. Of course, of course I will.

Introducing Jeremiah

Yeah, but appreciate you, man. And lots of people in the chat saying loved it and thank you. And so now we're going to have Jeremiah, who Thomas will introduce again in a second. I will just remind everyone, if you've just joined, come and chat with us in Discord so we can continue the conversation afterwards. Links in the description. Share with a friend, hit like and subscribe, all of those things.

Well, very excited to welcome Jeremiah. I think we met over 10 years ago at an amazing startup called Quantopian in Boston. And already back then, I remember we talked about the sorry state of data pipelining and orchestration. And then, of course, you went ahead and actually fixed it with Prefect and built an amazing company around that. And now also with FastMCP, I think you just made MCP usable for the first time. And Horizon, which I'm very excited about. So yeah, welcome to the show. Very excited to have you. Thank you, Thomas. So good to be here.

What Jeremiah loves about working with AI agents

And yeah, so we have our usual questions that we start out with. What do you love most about working with AI agents? You know, I thought about this, right? Because it feels too obvious to be like the productivity gains. And I think Wes has already covered a more interesting second answer. For me, what it is, is I use it as like a second brain. And so I have big expectations about the information that I put in at a moment coming back out later.

And so this is one of the reasons that I use OpenClaw, for example, so that I can go muck around with its memory in a way that works for me. I don't know if my way would work for anyone else. I think that's kind of a theme of what I want to talk about today. It's sort of like custom software, just-in-time software, something like that. I prefer to use it because, while I'll go into Claude or ChatGPT as a very convenient, and sometimes especially mobile, way to answer a question or investigate something, most of the real things I want to know about or work on, I have trickled information in over weeks or months and, you know, I have this idea and I want to spill it in.

And so I sort of started prepping for this like years ago. I started recording my meetings, which I think many of us do now, but the thing that I do that I think folks are still surprised by, but it's very important to me, is I record a voice memo almost every morning of what I'm thinking about, what I want to do. You know, if I'm driving to my office and dropping off kids at school, I got like 30 minutes in the car. And so I kind of tee up my day. When I'm working from home, it's a much shorter voice memo, but I kind of tee up my day and I put that in. And that's sort of, in a previous era of LLMs, I'd be like, that's my system prompt for the day. And it kind of affects everything. Now with more modern memory systems, it's sort of this update on like what I've accomplished that might be out of sight for my agents.

And that's been a huge unlock for me and it's been a really valuable habit to fall into because I'd love to tell you that like, ah, this meeting, I put it in and now I know everything. But the truth is, it'll be like I'm walking down the street and I'm like, oh, here's an idea. Oh, we should name it that. Oh, we should do this. And that can go in, if not in the moment, it can go into sort of this, I prefer to do in the morning, maybe I would do it in the evening. And now I've like teed up my agents. And so that's huge. That's what I really love is just pouring information in and then working to get it out. And that's sent me on down all kinds of rabbit holes around like memory systems and knowledge and retrieval and all this other nonsense that I don't know if we'll have time to get into.

Pet peeves: agents reviewing open source frameworks

So then, obviously, we are all fans of AI, but we all know the frustrations of it all too well. So what have been your pet peeves? I have a particular one that I think goes beyond the mainstream hatred of the things that it does, which we'll get into in a moment when we talk about my workflow and stuff. When you maintain open source software, and when you build software that's a framework in particular, how you take modifications and contributions into that software is, in my opinion, very different from how you take contributions into an application or something that does a thing, right? A framework is for building applications, and then we have applications. And the way agents review code, for whatever reason, seems to have a bias that the PR should be accepted, and if only you change these things, it will be accepted. And this is the wrong approach for a framework. A framework should not be modified unless whatever's coming in is so overwhelmingly useful to an overwhelming majority of users that the framework should take it on.

And so I lived this with Prefect. I'm living it now even more with FastMCP, which is a great framework, but people are doing a lot of things with it. And frankly, they're doing a lot of really cool things with it that don't belong in the framework itself. And so we get a lot of PRs which are like, oh, it should do this thing that is useful to me. And I agree it is useful. It is cool, but I don't want it to go in the framework. And so this has been a real problem with deploying agents to review the code: this thing will come in and the code's fantastic, but what it does is not something that belongs in the larger framework. And the agent will be like, if you just do these two things, I will accept this. And I'm like, no, this is rejected on its face. And I wrote a blog post about this that somehow climbed the Hacker News ladder and had a surprisingly positive reception. I should have called it up for this talk, and I'll get it up for when we share, but it's called An Open Source Maintainer's Guide to Saying No.

And it was like, this is hard enough when code was expensive to write. And so the act of writing and contributing code was sort of the admission ticket to interacting with a maintainer. Now that code is so cheap and it's just kind of getting lobbed over, there's this real imbalance as a maintainer, which you feel and you don't want to take out on the contributors just to be clear. But how do you say no to something that is technically good, but not aligned with the purpose of your framework? So agents turn out to be really bad at that. And that's become, that drives me nuts.

That's a great post and I'll just link to it in the Discord. And everyone should check out Jeremiah's everything on his blog called Mostly Harmless, Truly Artificial Intelligence. I did message Jeremiah earlier that probably should change the name these days to Mostly Harness. I meant to do it just for you, Hugo, before this, but I ran out of time.

I quite like this movement towards saying don't send PRs anymore, send issues. Because code gets stale so fast, right?

If your agent skills got leaked

All right, well, then we got to ask. It's the worst answer I have. If your agent skills got leaked, what would you be most concerned about? I mean, it's the same second brain that I love so much. It's all there, right? All my deepest and darkest fears and how I hate spiders and all that is probably in the brain of this thing. And I don't want folks to know about that. So that's, I mean, I think it speaks to how prevalent these things are right now, that I almost think that's the most boring question. Because it's everything. It's worse than my internet history, right?

Anatomy of an agent skill

Here we are in the FastMCP repo. And Randy, because you asked about it, I want to give a lot of credit for this to Bill Easton, who is one of the maintainers of FastMCP, who wrote a blog post which is literally like, shut down PRs, we do not accept PRs. We didn't go quite that far, but I think we wrote this contributing doc a couple of months ago. And this is kind of the headline: the best contribution is a great issue. Bill and I did a FastMCP maintainers podcast about a month ago, and we were talking about this. And he was like, provokingly, how dare you say your agent is better than my agent at actually doing this? Why is it that my Claude implements the MRE in the issue better than your Claude opening the PR? And I think that's where some of what you guys want to talk about on this show comes in, which is clearly, nakedly just about stealing people's agent skills, but we're cool with that. That's okay. Why is it? What skills have I put in? What biases have I put into it? And I think the funny thing is, not that much, but just enough that it's stylistically aligned in a way that's going to make it easy to maintain the code later. And I think folks forget about the importance of the maintenance of this stuff.

So let me see, where should I take you? There's a lot of skills that I use in particular to work on FastMCP, where I'm the core maintainer, which are hard to show. So let's talk through that. My workflow here is extremely agent-driven, but as I said, I don't like how the agents actually interact with a lot of the contributions and stuff. So my workflow is surprisingly manual on FastMCP. I will spin up agents to both write my code and review others' code, but ultimately I will step in and I will look at it and I will work on it. And it's probably a lot slower than it would be otherwise, but I keep wishing for an agent that won't make this sentence true, but the code is a lot better off, the code base is a lot better off as a result.

And as a sole maintainer who does not yet trust an army of agents to do this, I kind of need that. I kind of rely on that. And this is one of those just, like, affirmative, you should see the things that the agents have tried to sneak into this code base. So that's a real trade-off. And one of the skills that I value most, because I operate massively in parallel in this hybrid agent-human workflow, is this explain skill I wrote, which I think I made public at some point, but if I didn't, we'll do it under the banner of this podcast. And there's nothing grand about this skill. This is not like the Superpowers collection that Wes was talking about a moment ago. This is a minor, stupid, natural language prompt for the agent. It's, I don't know, 80 lines long, but as far as I can tell, there's only one sentence that actually matters in it. And it's been in every version of the skill. And it says, talk to me like you're explaining this to your colleague who knows about your project, but wants to understand what you just did.

And that sentence changes the tenor, because I'm running 10 different agents on 10 different things over 10 different timeframes, and I'm cracking open my laptop with a coffee, and I'm like, what's this one? And if I just ask, what are we doing here? It'll be like, oh, well, we renamed this variable on this line to do this. And that's no good. And if I step back and I'm like, well, what does this code do? I'll get an explanation of the code, but not the change we're making. And so this skill, with that key sentence in it, ends up giving me the right mental model, not just grounding in the broad, but what are we up to here, in a necessarily technical but not overly verbose way. And so this skill has become my workhorse. I reference this skill.

Skills vs. MCP servers

And while you're changing that, I suppose, I mean, a lot of people are insiders here, but a lot of early agentic users or people who do a bunch still ask me, when do I use skills and when do I use MCP servers and that type of stuff? And I often like to think of MCP as when you want to solve distribution problems, essentially, and serve data or certain types of tools to other people. But as someone who works so deeply with both these things, I'd be interested in your thoughts.

Yeah. So skills are awesome ways to steer behavior. They go into the agent's brain in the exact same way that a message from you does. And so they're phenomenal ways to steer behavior. MCPs are great ways to distribute business logic from a central place. And so obviously, this leads into the MCP versus CLI style debate.

Which I think is stupid. If you're an individual, use whichever one you prefer. And it's a non-starter if you're an enterprise: we're not installing a bunch of CLIs on people's machines. We never have. We never will. It's a nightmare. We're going to distribute the business logic centrally. So I find that debate to be kind of uninteresting. And I think the popularity of it, or the prevalence of it, speaks to this interesting problem in MCP land, where most people think of MCP as a way for companies to reach their customers. It's for reaching out to third parties and fanning out. And I will tell you that overwhelmingly, the use case for MCP is distributing internal business logic to internal teams in enterprises.

And so you have this real representation problem, where the MCP versus CLI debate, to most users of MCP, is insane. It's just, we don't talk that way. But for the average person who knows what MCP is, and probably uses it in their IDE, in their OpenClaw, or something like that, the representative use case is actually using something like the GitHub MCP server, where it is a third-party relationship. And this debate over CLI actually makes a lot of sense when you're one user deciding how you want to access that. So that's my soapbox. I'll put it back now.

Walking through the explain skill and ship-it skill

I love it. And can you talk us through your skill now? This is the first time in this session we're looking at an agent skill. And some people may not have even seen agent skills like this. So maybe you can talk us through the markdown and the idea of progressive disclosure and what happens when an agent looks at this. Let's do a little anatomy of a skill, and then I'm going to leave some intentional mystery. So Randy has a lot of interesting things to talk about. Or Randy, just jump in. We'll do this together. Whatever you want to do here.

Skills are shockingly simple for how effective they are. They have two front matter fields. A name, that's really important, that's how you invoke it. And a description. The description is always going to be seen by the agent. The rest of it, and this is I think the genius thing about skills, why they're not just canned prompts: everything down here is only seen by the agent if and when it decides to invoke the skill. So your description actually really matters, and that's why in a lot of skills you'll see this very clear instruction of when to use this, because there is a tool call, in a sense, to load the skill and see all the rest of it. But that progressive disclosure, as you just said a moment ago, is the magic of skills. You can have tons and tons of skills that are all shown as these little tiny descriptions, and then the agent's like, I need this skill.
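
The anatomy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular agent implements skills: the file contents, the `Skill` class, and the `parse_skill`, `catalog`, and `invoke` helpers are all hypothetical names, and the description text is invented for the example. Only the two front matter fields and the quoted instruction sentence come from the conversation.

```python
from dataclasses import dataclass

# Hypothetical skill file. The description is made up for illustration;
# the body reuses the one sentence quoted in the conversation.
EXPLAIN_SKILL = """\
---
name: explain
description: Use when the user asks what a change does or why it was made.
---
Talk to me like you're explaining this to your colleague who knows about
your project, but wants to understand what you just did.
"""


@dataclass
class Skill:
    name: str
    description: str
    body: str


def parse_skill(text: str) -> Skill:
    """Split a skill file into its front matter fields and instruction body."""
    _, front, body = text.split("---\n", 2)
    fields = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return Skill(fields["name"], fields["description"], body.strip())


def catalog(skills: list[Skill]) -> str:
    """What the agent always sees: one short line per skill."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def invoke(skills: list[Skill], name: str) -> str:
    """What the agent sees only after it decides to load a skill."""
    for s in skills:
        if s.name == name:
            return s.body
    raise KeyError(name)


skills = [parse_skill(EXPLAIN_SKILL)]
print(catalog(skills))            # always in context: names and descriptions
print(invoke(skills, "explain"))  # loaded on demand: the full instructions
```

The point of the split is that `catalog` stays small no matter how long each skill body grows, so many skills can be listed at once while only the invoked one costs its full length in context.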

The downside, if you will, is that while a skill can tell an agent what to do, and in fact more complicated skill zips can include scripts and other actual business logic, so there is a way to do that, fundamentally the skill is natural language instruction masquerading as a workflow. The way I like to put it is that it's a polite note to your agent, and usually it does what the skill says. And they've gotten a lot better as well, right? Even compared to six months ago, agents and LLMs have gotten a lot better at interpreting descriptions than they were even when, you know, I had our moment last October.

It's true, and yet I think a lot of times I have to say "use your explain skill." Even when I say "explain this to me," it's not perfect. Maybe that should have been a slash command as well for that. And something else really lovely here with skills, I think you mentioned, is that you don't just throw everything in context. We're told we have million-token context windows, but effective context windows are far smaller, so we still need to be mindful. We need to context engineer constantly as developers and users of AI agents. And as you mentioned, you can put anything there. You can even say, if you need to do this, use this MCP server, or also involve the human in the loop, ask the user, et cetera. Exactly. I also love that you've included what not to do. So maybe you can talk us through.

Yeah, I mean, you can probably tell by reading it that the skill itself at this point is MCP authored. It's MCP authored? My god, it's late. LLM authored. It has all the little telltale negative contrasts and all of that. But I like to think of skills as living documents, which is one of the reasons it's nice that they're on your machine, though that also kind of makes distribution hard. So here are my skills. I opened some folder here, I don't even know what it is, where I have 20-odd skills. Most of these skills change as I go, when I'm like, oh, this didn't work. I have one here called github reply, which is intended to make it respond the way I like and use my voice in a reply. It covers little annoying things like: don't say "great work" followed by a rejection. That's confusing. It's not because I'm trying to pass the reply off as me. It's usually pretty obvious if I'm using an LLM to draft a reply. It's because I think there's a right way to treat people and the LLM doesn't do it. This goes back to that thing I said before about how it just says no and closes the PR. No, we have to politely say no.

This is probably the first skill I ever wrote, the one I use the most, and the one that gets me in the most trouble. I'm always like, "ship this," and as you can see I say a lot here. It means open a PR. It does not mean merge the code. That's bad. "Does not mean merge the code" has bitten me a lot. Most LLMs think "ship it" means merge it. That's not how I use it. Sorry for confusing a lot of folks. This worktree stuff is probably a holdover from when these things didn't know about that. You can see from the arguments here that this started as a slash command. But why does this skill exist? It's not a useful skill. I haven't even read this skill in like a year, probably. It seems like a really stupid skill, actually. But the reason it exists is that I want to write the words "ship it" and have the right outcome happen, and this skill is my bridge to ensuring that. That's how we use skills. Here's a skill for creating skills. I don't think I wrote this, actually. I think I cobbled it together from a lot of things I found online. So this is how we make living documents: this skill is how to write an effective skill.

Personal software and the future of contributions

I'm curious. There are a lot of things coming together, right? One is that GitHub makes it too easy to open huge PRs that you then need to review or decline, and another is having your own preferences for how you want things done. I think it was Peter Steinberger who said that in GitHub repos he would like a feature to submit not PRs but only prompts. Those prompts are easier to review, and once you're happy with a prompt, it kicks off code generation that uses your machinery and your custom skills to create the code. Do you think that would be a good model?

Yeah, I mean, I'm into personal software, basically in the truest sense. And in a funny way, I think that's also resulting in very personal decisions about how you achieve that software. You have your harness: OpenClaw, Pi, Claude, whatever you want to use as your base substrate. And then the way people pile on functionality, features, customizations, and skills is deeply personal. Look at things like ClawHub, which is the skills repository for OpenClaw: there's a huge power law in the popularity of those skills, so a lot of people are using a small number of them. If you go look at them, like, I ask my OpenClaw all the time, "do you want this skill?" and it's like, "no, this is garbage, I already know that, it doesn't affect my behavior." So I think there's a lot of work to do in how people acquire these customizations, but we'll see.

Prefab and generative UIs

Let me blitz through two things and then I'm going to yield some of my time. These are two things that have been exciting to me. This one I'm calling up because it's more in Randy's world. I desperately wanted to create MCP apps in Python, and that meant I needed a Python front-end framework that didn't require a back end. Almost every one of them assumes a very specific back end, which we don't get here. We have an MCP server. So this started as a mini project inside FastMCP and became its own thing as of a couple of weeks ago. What was cool is that, by the same insight that MCP has product-market fit in enterprises, MCP apps also have product-market fit as data dashboards and as a way of distributing information internally. So we wrote this: a Python DSL that generates these generative, interactive UIs. This is sort of our hero here, and it's all Python, and it looks pretty bizarre. Generative UIs have become my new sort of spelunk, more than a side project, not quite a career, all for the goal of building interactive dashboards that can stream from an agent's brain and don't have to be hand-coded. That's been super fun, and there's a very open invitation here to come play and contribute.

And then this is...

Well, can I just say I'm really excited about Prefab too, because I really think that is the future of where things are heading. We're starting to see Claude pop up interfaces right in the chat, and even when it's developing and scoping things you can put a prototype right there. Enabling that through MCP is super cool.

My team uses this for slides sometimes. I think we even turned it into a theme, and then someone... oh, I have to find it here.

And while Jeremiah's bringing that up, I will say this is something we'll chat a bit more about with you, Randy, but something we're kind of talking around is why we should be building for agents as well as humans, and why everything we build for a browser should be headless as well.

So Andres, who works over at dbt Labs, published a dashboards overview, I guess, and included Prefab in it and styled his Prefab. This is one he made. I should have just brought up his website, but I couldn't find the link at the moment, so I brought up the LinkedIn post I just made about it. One is this MySpace theme and the other was a Windows theme, and I just love this. This is just base Prefab underneath. I saw this literally three hours ago, as you can see, and I want to talk to Andres and make these default themes in Prefab. So that's been super fun on the agentic space.

On the theme of custom software, this is another experiment of mine. This started a few months ago when I wanted custom slide software, mostly because of the way I give a talk: I want the speaker notes a different way. This is very live. It came off the agent presses last night and I haven't fully used it, so we're not going to fully use it yet. This is the latest version of a piece of software that I call cardboard, which is for laying out my conference talks as cards on a board. It follows a vocabulary that I've developed purely for me, like no one else should use it, where I think of my talks as having acts and beats, and then within those beats we have slides. So this is the top of the cardboard where I lay things out. These are the five acts, and then these are the beats. Oh, there are six acts. I guess this was a keynote I gave at PyCon. If we scroll down we get the actual slides, which here are just mocks, but I'm working on the actual speaker notes, which for me always have this blue, gray, yellow, and pink format, which we don't need to go into. So this is my custom slide software, and then this turns into the deck. The images are broken, but if we bring up the speaker notes for it, they all carry over in that same blue-gray style.

Apologies that my images are broken in this one. This has been really fun, to have a piece of software that makes talks exactly the way I want them, the way I like to give them. I don't think anyone else would use it. Maybe they would, I don't know. Maybe, open call: if folks would use this, we'll open it up. But this has been great. And most importantly, this is not an interactive UI. This is read-only. When I click on this, it jumps down; that's the one interaction it has. I interact with this entirely over an API or an MCP server, talking to it from any agent I want that can connect to it. So this is exclusively recording a mini voice memo, putting it into an agent, seeing the changes here, and reacting in voice. I couldn't edit this by hand if I wanted to.

And the brain behind this is OpenClaw. This one I principally work on with my OpenClaw, because that way I can work on a talk over time. There's a talk I'm giving in three weeks for PyData London, so I can feed in something tonight, close it, not worry about it, talk to the agent about a thousand other things, and then come back and pick right up because of the memory substrate there. So I typically use my OpenClaw for stuff like this, which is really distributed over time. That's actually my main use of my OpenClaw.

And is that the only one you use, or do you mix between OpenCode and...

I use OpenClaw as my main personal interface because of how I've customized its memory. When I'm working on code, I use Claude desktop and Codex desktop, which I migrated to from the CLIs, mostly because of how much better they are at managing parallel sessions. So those are my two, and I probably have my own heuristic about which is better for what when I assign work. Those are for "let's work on a thing" that's memory bounded, and I use OpenClaw as my memory absorber for more asynchronous work.

And that is some of my agent skills.

Very cool, and just feedback on cardboard: it looks super useful, so I definitely would be up for using it.

All right, maybe folks want to see it. We'll get it out.

Amazing, thank you so much. That was really insightful. So I think we can move on to the next contestant. Yeah, thank you once again, Jeremiah, for coming and showing us your agent skills plus plus. It's amazing to see everything.

Introducing Randal Olson

We should talk more about the future of ephemeral software, just-in-time software. I think the ability for everyone to generate things that are helpful to them and their communities, we've just got the tip of the iceberg here. As one Australian politician said to another one back in the day, you're all tip and no iceberg. Now, I'm not saying that to anyone here, but it's the type of thing that happens in the political landscape here. I am so excited for Randy Olson to show us what he's up to.

I do want to just quickly show everyone what Randy does. There's a spider. Watch out, Jeremiah. Yeah! These are amazing. Thanks for sending that through, Randy. I know how busy you are. I really appreciate you taking the time from Washington State to send that through.

Look, there are so many ways to introduce Randy. One of our first conversations, one of the first podcasts I did nearly a decade ago, was about automated machine learning and all the wonderful work you were doing on TPOT. I actually remember I asked you, if you want builders to take away one thing from this conversation, what would you want that to be? You said, just stop doing basic grid search. There are so many other interesting things to do. Yet look where we still are. LLMs are still doing grid search.

But a lot of the other things that I've loved about your work are all the amazing things I've learned from you about data visualization, all the amazing things I've learned from you about the world through data visualization. You've been moderating the subreddit for a long time on data viz. But also your work at GoodEye, everything you're thinking about with respect to evals and all the work you've done over the past couple of years, helping non-technical people build with AI and even building basic retrieval systems for executives and that type of stuff. So before, and we'll be talking about, one thing you'll show us today is teaching agents to produce beautiful and grounded data visualizations, which they are not good at, right? So you've built loops to do this. And it's an evolving set of skills, which we'll get into soon.

Three questions with Randal

So our three questions, what do you love the most about working with AI agents? I mean, I agree with what everyone has said so far, which is like, of course, the productivity gains are huge. You know, like I am just I'm a builder at heart. I love building things. I have so many ideas and I've always been limited by capacity in the past. And now it's just fire up a new chat, fire up a new tab. Boom. You can build that in an hour and it's just so easy. And then also treating as a thought partner. So like I also have like a growing memory base where I throw in ideas and track everything. That's so invaluable, especially because like most LLMs, even nowadays, still tend to be sycophantic and they'll just agree with you. And when I load an AI agent with my, what I call a digital twin, it knows I actually want someone who's not necessarily harsh, but won't just agree with me, but will push back. You know, like I want a thought partner. And you can induce that, you know, by prompting LLMs.

But I think another really interesting thing that I really like about them is like how you can be kind of, I'm getting cut off here, working while being like outside touching grass. That's also been really nice. You know, some days I'm better at that than others. But because you can, you know, like Wes was showing us earlier, I think we all use the superpowers flow to do development. It's kind of slow, but, you know, you can have four tabs running in the background and be like, you know what? It's beautiful out. It's time to go for a walk. And you're still getting stuff done. You can even check in remotely while you're off on a walk or you're spending time with family and so on. So I think I have really appreciated that. It actually gives you, because I am, I'll self-admit, I'm a workaholic and I'm also a hyper-focused person. I can spend 12 hours straight just hacking away at code. It's great to be able to step away.

Without a doubt. And I totally agree with being able to do it while I go for walks on the beach and jump in the water, come out, check what's up. I do think it can mean my time away isn't always time away. But I'm also interested in a future where we can chat with these things better, so I don't need to look at my phone all the time. Not trying to get too future music, and I also don't want to be a glasshole, but some Jarvis-style augmented reality stuff could be super interesting. Also, I was chatting with Vincent Warmerdam about this last week. I've been to Amsterdam quite a lot recently, and I was walking along the canals with him, talking about building stuff. Just imagine having an AI assistant there that would build the stuff we're talking about in real time. That may seem futuristic to some people, but I actually don't think it's necessarily that far off.

Yeah, I think that's feasible nowadays. You just invite a meeting recorder to listen and then send it off to an AI agent, right? Go build.

Totally. Well, yeah, the thing is that they don't necessarily... Specificity is so important now, like do this, don't do this, do that, don't do that. But having skills around that can be super useful. Second question, what do you find most frustrating about working with agents? Gosh, I think it is the natural tendency for them to be extremely agreeable. Like we were talking earlier about how the default assumption is, oh yeah, we're going to merge this PR, right? Default assumption is, yes, we're going to do something. Yes, we're going to give in to this demand or whatever. I think it's like they're too agreeable by default and you really have to fight them to get them to not be agreeable. I've written about this a little bit, like there are ways you can prompt them. You can take advantage of their agreeableness to be disagreeable or be constructive. But yeah, it's really frustrating. I think that's getting better over time. I think the post-training is starting to focus more on combating it. But yeah, that is a very ongoing frustrating thing for me.

Totally. And if all agent conversation somehow got leaked, what would you be most concerned about yours becoming public? Gosh, I mean, of course, all the personal detail stuff, I put a lot in there too. Maybe it's just how, maybe some of the tics that I have when I'm interacting with AI agents. I always find myself typing the same things. It's like one of my most typed phrases is probably like, be concise and unambiguous. I'm like typing that over and over and over again. I should just turn that into a skill. So probably funny little quirks like that, that I have developed while interacting with LLMs. That would be a pretty funny thing to see.

Super cool. And I feel it's unfair to ask that question without answering it myself. So I will just say, I think the thing I'd be most concerned about is some of the words I use when I lose it. Like when I've been engineering for too long, at the end of a long session, and I'm done, it's done, and I'm just hoping that we can wrap up soon, and it does something really dumb. Some of those words would get me banned from YouTube if I said them live now. They're words that are acceptable in Australia, but definitely wouldn't be over there.

Teaching agents to make beautiful data visualizations

Without further ado, let's jump into all the amazing visualization stuff you're up to. Yeah, first I thought I'd share. So of course I have this visualization skill workflow that I want to show. But since we were talking about skills in general, and there might be some folks who are just very new to skills. I mean skills are frankly a pretty new concept. I figured maybe I'll hopefully screen share successfully here on my screen. And maybe the entire screen. And I'll just like show. Is this showing up okay? Is it plenty large enough? It's showing up. And if you could maybe zoom in. Zoom in even more. Okay. How about now? A couple more. It's not zooming in for us. It might be... Oh no. Is it showing up? Maybe it's just doing very small zooms. I think it got a little bit bigger. Yeah, that's awesome. Yeah. Okay, good.

Okay, so now things are going to be really, really jankified in there. I did pull this up in a terminal, but this is still a skill, and I think you kind of got the lay of the land. This is the actual skill, which we can also make available after this. Basically you have your basic front matter that is used for progressive disclosure. And then here are some things that I think about when I'm working with skills. Actually, first off, at a high level, with AI agents in general: you can open up your Claude, or whatever AI agent tool you're using, and I highly recommend connecting as many data sources as you can to it, simply because the more information you give your AI agent, whether that's through MCPs, CLIs, whatever, the more it's going to be able to do on your behalf. If you don't have it connected to anything, and you restrict it down to just one folder or one code base, it can still do a lot. But connected, it can read your code base, check all the changes you made today, and then write a report automatically to Slack. I have a ton of cron jobs that do things like that, that automatically send off reports to colleagues, of course after getting their permission to send an AI thing their way. So connecting everything through MCP and CLI, of course with permissions that your IT admin is okay with (if you're the IT admin, then whatever you're okay with), is a huge one.

Another one that's really useful, and I guess it is in this skill: I highly recommend that every skill that needs to do anything has a phase at the start that makes sure your environment is actually set up. In this case, this skill is going to be using some unsurprising libraries, and hey, some familiar libraries, so it needs to install those first and make sure they're there. I highly recommend you have the skill set up the environment, because otherwise, if you just tell it, hey, go do this thing, and it's trying to write Python code, it's going to crash and have to figure it out on the fly. If you can tell your skill ahead of time, hey, make sure everything's set up, then that is a great start for a smooth skill. Another thing I highly recommend, which is not in this skill but is done in many well-designed skills, is to design your skill as a thin driver. This is the idea of progressive disclosure, right? This skill is actually relatively very long and has lots of stuff in it. Instead, every single one of these phases could be its own reference markdown file. Then, if half of what's in the skill is not relevant to what you're doing right now, it doesn't have to load that into context. If I just want to jump straight to phase four, it can load just the phase four file into its context and not waste context on phases one through three. If you ask Opus to do that, it'll usually do it by default nowadays, and that's really good skill design as well.
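The environment-setup phase could look something like the following minimal Python sketch. It is not the code from the skill shown on screen; the function names and the idea of returning a status string for the agent to act on are assumptions for illustration.

```python
import importlib.util


def missing_packages(required: list[str]) -> list[str]:
    """Return the required top-level modules that cannot be imported here."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def preflight(required: list[str]) -> str:
    """First phase of a skill: report whether the environment is ready."""
    missing = missing_packages(required)
    if missing:
        # A skill would tell the agent to install these before continuing.
        return "missing: " + ", ".join(missing)
    return "ok"


# 'json' ships with Python; the second name is a deliberately fake module.
print(preflight(["json"]))
print(preflight(["json", "not_a_real_module_xyz"]))
```

Running a check like this first means a missing dependency surfaces as one clear message instead of a crash halfway through a later phase.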

I would also say design your skill to be super unambiguous. This one here is actually relatively unambiguous. It's very specific: at this step do this, at that step do that. It tells it exact commands to run and provides exact code snippets to run when it's checking things. The more specific you can be in your skill, the more repeatable it is in the future. That helps a lot, because when it comes down to it, a skill plus an LLM is kind of a program, right? LLMs are good at handling stuff that's a little bit ambiguous, whereas for the other things, like if you need an environment set up, you can just tell it to run these commands, and then you've got it in code in here.

Right, and providing it deterministic scripts to run at certain points as well, so we have degrees of agency there, and autonomy.

100%, yeah. I think if I were to decompose the skill into local files, all this code would be separate, like a separate Python file. I just put it all in one file so it's relatively easy to see here. And the last thing I'll comment on is: always, at the end of your skills, or even as its own skill, do a reflect-and-improve loop. I completely agree with Jeremiah here: you really should treat your skills as a living artifact. They're not a one-and-done thing. They're something that you need to improve over time. When you run through a new use case, you're probably correcting it. All the time with data visualizations I'm having to correct how it does things: how it visualizes data, how it thinks about data, how it makes a good data visualization. So if you explicitly put in your skill, like phase seven here or whatever, "reflect and improve: how did this run go, how can we improve the skill to make it better," then it just becomes part of your loop. Every single run you're learning something new and putting it into the skill, and it's compounding. I think that compounding nature is a super, super powerful thing about skills in particular.
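One way to picture the reflect-and-improve step is as a small function that folds each run's lesson back into the skill text. This is only a sketch under assumptions: the "Lessons learned" heading, the function name, and the example lesson are invented here, and in practice the agent itself would usually edit the skill file.

```python
LESSONS_HEADING = "## Lessons learned"


def add_lesson(skill_text: str, lesson: str) -> str:
    """Append one lesson to the skill, creating the section on first use."""
    text = skill_text.rstrip("\n")
    if LESSONS_HEADING not in text:
        text += "\n\n" + LESSONS_HEADING
    bullet = f"- {lesson}"
    if bullet in text.splitlines():
        return text + "\n"  # already recorded; keep the skill from bloating
    return text + "\n" + bullet + "\n"


# Hypothetical skill text and lesson, for illustration only.
skill = "# Chart skill\n\nPhase 1: find data.\n"
skill = add_lesson(skill, "Label lines directly instead of using a legend.")
skill = add_lesson(skill, "Label lines directly instead of using a legend.")
print(skill)
```

Skipping duplicates matters because the skill is loaded into context on every run, so the compounding should add knowledge rather than repeated lines.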

So I don't know if we wanted to chat about any of that, but that was me throwing a bunch of ideas out there about how I think about skills, and I think a lot of best practices around skills.

So many wonderful tips and best practices. Something did come to mind that's tangentially related. There were a couple of questions in the Discord chat about harnesses that I plan on getting to later, and what made me think of it is the idea of a skill being something that's continuously updated and iterated on. Something I'm trying to help people understand, and convince people of, is that a harness isn't just something that a lab ships and you use. For those who haven't heard this term, think of the LLM as the brain and the harness as all the things that do stuff for the brain: the tools it uses, whether that's writing to a file, executing something, connecting to an MCP server, or using a skill. What happens, though, is that when you write a skill or connect to a different MCP server, they essentially become part of your harness. So the idea of iterating constantly on a skill is part of the mental model of continually building and rebuilding your harness, and I'm interested in whether this is how you think about it.

Yeah, I totally agree. Unless you're hardcore and running your own local open source models, the model's fixed, right? You can choose between a few frontier models, but there's not much you can do with that. Even the harness, unless you're also hardcore and making your own harness or using a local harness, is kind of fixed too. The harness helps orchestrate things, like using sandboxes, accessing the MCP, when to hit the model, when to think, when to take an action. Whereas the skills are really the thing that can evolve with you, and they can guide the harness plus the model and everything around it towards what you actually want it to do. I think that's the amazing thing about skills. And they're just text files. That's great.

Totally.

And the models have got... I mean, we had that moment, right, with 4.5, and then that week where the three labs all released models, which put us at what Simon Willison calls the inflection point. I think they sucked up the harnesses that we were all building. They did, whatever it was, reinforcement learning via verifiable rewards or whatever on all the traces, and that meant we didn't need all those extreme tool calls we were using and that type of stuff. At that point, the amount you can get away with: a bash tool and a search tool and a bunch of skills, and Bob's your uncle, as we say. I'll link to a post by Amp, which is a really interesting coding agent and product, where they wrote, you know, harnesses are dead. It's provocative, but the basic idea is that models are getting so good that harnesses are better when extremely minimal.

I think I agree with that too. It still makes sense to keep building harnesses and keep building skills on top of harnesses, but the great part is that it can just become training data for the frontier model, and then, just like you said, straight out of the model it becomes able to make beautiful charts. Hey, I would love it if they trained on my skill. Then we wouldn't need this skill anymore. Hugo, did you want me to just jump in really quick? I know we're...

I would love you to. And look, we can go a bit over. Each guest has gone five minutes over, so if we take that cumulatively we can go for another 24 hours or something, I think.

Actually perfect. Let's do it. Let's token max.

Randal's data visualization background and the skill workflow

Cool. So, for those who are unfamiliar, or it's just been ages and you forgot about me and all the very important data visualization work I've done: back in the early 2010s I was in grad school, so I was looking for any excuse to do anything other than work on my PhD. Hopefully my advisor is not listening, although I think he knows. I was working on data visualizations like crazy. Like I mentioned, I'm a builder, I'm a very curious person, and so I'd always go and find a data set, try to find an interesting question, and make a blog post about it. So I worked on very, very insightful things, like: what are the deadliest films of all time by on-screen death count? There's a cool forum where people actually watch movies and count every single death. I solved things like Where's Waldo (or Wally, in other parts of the world). If you remember, you try to find Waldo in the books, and I used machine learning to find out where he's biased towards being and then optimized the search path. I optimized road trips, which was another data visualization and machine learning problem. I worked on really, really important spurious extrapolations, like the frequency of the word "novel" in academic research papers: if we extrapolate that, 100% of research papers will use the word "novel" by 2130. And then there was a short period, I guess it's dead on Twitter or something, where we were all, or at least I was, obsessed with marble racing, and I did data visualization on that. What I was always trying to do was find something interesting, or sometimes something silly, and through data visualization make machine learning, AI, or even just data topics more accessible, to get people into visualizing data, analyzing data, and so on.

I did this for a long time, and eventually I realized, oh hey, AI agents can do this. When I first tried to have an AI agent do it, just like, "hey AI agent, make a beautiful data visualization, make no mistakes," what popped out was honestly unimpressive. That's when I realized I still have to encode what makes for a good data visualization, at least in my opinion. And of course, having done this a long time, I'm highly opinionated on what makes a good data visualization. If you're familiar with Edward Tufte, he's a giant in the field of data visualization, and he's very opinionated too: no chart junk, a clear story, clear annotations, straightforward use of color, and so on. And I said, what if I encoded those principles into an AI agent? That's essentially what I've done in this post series, and you see I've done a whole bunch of them. The idea of what I built, and I'll jump back to the skill over here, is that you start with an idea, and then the AI agent researches the web to find data and other stories and everything else that touches on it, and then it tries to create a valid data visualization and tell a valid data story from that. If we jump over here to this tab, maybe this text is a little too small, but this is essentially a prompt into Claude Code. Of course I always recommend still using the frontier models. I'll say, you know, "run that template" (this is a tool that I use to pull it, and if we have time I can talk about that) "and then execute this workflow to visualize the history of marriage and divorce in the USA." I decided to run this ahead of time because it does take a while.

But if you run that, it loads the workflow, which is just this prompt over here, right? And then it immediately starts: it sets up the environment, it does the data set discovery. And so I've biased it towards certain sources, you know, the CDC, government and educational institutions, for more reliable data rather than just some random data set that a rando posted out there. And you see it researches a whole bunch. It finds marriage rates between this period, "oh, I need to find divorce rates," and it researches. It looked at a whole bunch of things: it looks at PDFs, it looks at various web pages, and so on. I'm just going to keep scrolling here because it's a whole bunch of research that it does, and then finally it pulls it together into a single data set.

So this is already pretty cool, right? It took an idea, researched the web, and then turned it into a data set, and we can of course go back and look at the sources. It actually just went and found that data. This is amazing, because I've done this chart before, I've done this work, and it literally took me an entire weekend of scraping through PDFs. This data is not cleanly organized out there, and it just did it. I think this part took maybe five or ten minutes, right?

And then the next stage is it creates several variants to try out different chart types. So it'll do line charts, it'll do small multiples, it'll do an area chart, and then it'll look at them and, using the data visualization criteria I give it, it'll say which one it thinks is better. This time it chose small multiples, "I like that one better." I actually ended up overriding it. But essentially it starts with that prototype, because the first thing you want to do when you're visualizing data is you just want to look at it, right? How does this actually look, what is the distribution, et cetera. This is essentially what it's doing, right? It's looking at it, it's getting a sense of what stories it can actually tell, what forms of visualization are actually useful, and so on. And then it picks one and it moves forward with that.

The next step is it runs what I call a verifier loop. This is another thing I didn't mention, but another thing I've learned with skills is you don't want to just tell it what to do, you also want to tell it how to check it. So that's why there was all that code in that skill. That code is used to verify certain things about the data visualization, like a good DPI. I also use LLM as a Judge, or evals, that encode certain Tufte principles to actually look directly at the image and then say pass or fail, and if it fails, to give it feedback on how to improve the data visualization, because you can't programmatically assess images, right?
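As a rough illustration of the "tell it how to check it" idea, here is a minimal sketch of one deterministic check, the DPI one mentioned above. It is not the code from the skill itself; the function names `png_dpi` and `check_dpi` and the 200 DPI threshold are assumptions made for this example. It reads the resolution straight from a PNG's `pHYs` chunk using only the standard library and returns a pass/fail plus a feedback message an agent could act on.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254


def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if absent."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # unit 1 means pixels per meter
                return None
            return (round(x_ppu * METERS_PER_INCH),
                    round(y_ppu * METERS_PER_INCH))
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs, when present, precedes the image data
        pos += 12 + length
    return None


def check_dpi(data: bytes, min_dpi: int = 200):
    """Deterministic check: returns (passed, feedback). min_dpi is an assumed threshold."""
    dpi = png_dpi(data)
    if dpi is None:
        return False, "PNG has no usable pHYs chunk; set an explicit DPI when saving."
    if min(dpi) < min_dpi:
        return False, f"DPI {dpi} is below the minimum of {min_dpi}."
    return True, f"DPI {dpi} OK."
```

In use, something like `check_dpi(open("chart.png", "rb").read())` would give the agent a concrete message to fix rather than a bare failure.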

And it basically loops over and over, I think this time this was an easy one, so it just like passed pretty quickly, but it'll loop on this, and keep fixing and keep fixing, and these are basically guardrails on the skill, right?

Because what I've told it to do, is it opened up this massive realm of possibility, and this scopes it back down to say like, hey, I want a chart with this style, with these qualities and so on, and then that zoomed it in to that.
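The loop described here can be sketched roughly as follows. This is a toy under stated assumptions, not the skill's implementation: `verifier_loop`, `toy_generate`, `check_axis_labels`, and `judge_stub` are all hypothetical names, the chart is a plain dict rather than an image, and the "judge" is an ordinary function standing in for an LLM that would look at the rendered chart.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str


def verifier_loop(
    generate: Callable[[List[str]], dict],
    checks: List[Callable[[dict], Verdict]],
    max_rounds: int = 5,
) -> Tuple[dict, int, bool]:
    """Regenerate until every check passes or the round budget runs out."""
    feedback: List[str] = []
    chart: dict = {}
    for round_no in range(1, max_rounds + 1):
        chart = generate(feedback)
        verdicts = [check(chart) for check in checks]
        # Only failures are fed back to the generator.
        feedback = [v.feedback for v in verdicts if not v.passed]
        if not feedback:
            return chart, round_no, True
    return chart, max_rounds, False


def toy_generate(feedback: List[str]) -> dict:
    """Hypothetical stand-in for the agent: fixes exactly what feedback names."""
    chart = {"x_label": "", "y_label": "", "gridlines": True}
    for note in feedback:
        if "axis" in note:
            chart["x_label"], chart["y_label"] = "Year", "Rate per 1,000 people"
        if "gridlines" in note:
            chart["gridlines"] = False
    return chart


def check_axis_labels(chart: dict) -> Verdict:
    """Deterministic check."""
    ok = bool(chart["x_label"] and chart["y_label"])
    return Verdict(ok, "" if ok else "Label both axis titles.")


def judge_stub(chart: dict) -> Verdict:
    """Stand-in for an LLM judge applying a no-chart-junk rule."""
    ok = not chart["gridlines"]
    return Verdict(ok, "" if ok else "Remove heavy gridlines (chart junk).")


chart, rounds, passed = verifier_loop(toy_generate, [check_axis_labels, judge_stub])
```

The `max_rounds` cap is a design choice in this sketch: without it, a generator that never satisfies the checks would loop forever.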

So that is essentially a quick run-through of the skill, now of course we've been staring at terminals and whatnot for a while, and this is supposed to be about data visualization, so what I ended up doing was, I looked at the variants, and I said, I actually prefer the dual-line variant, let's open it, and let's just do like an open, open the, I think I have a tab over here, here we go, open chart.png, and can everyone see that okay, is the box on the bottom right blocking anything, or is that not sharing?

It is showing it, but it's fine, nonetheless, it's perfect.

Okay, so, and so this is the resulting data visualization, and so this one is pretty decent, I would say, it's pretty okay, it's like, well, for one, it got that data, it shows an interesting story, it calls out interesting aspects, you know, like these are well-known stories in the marriage and divorce rates over time in the US, like post-World War II peak, and how it's been going down over time, and it tells, you know, puts a little minimalist annotation here, and so on, but like other things that are heavily enforced is like very minimal background details, no chart junk, labeled axes, making sure that it's, you know, using just the ranges that it needs, making sure that everything is clear, and that it's telling a clear story, which is this part here, right on the chart to make it basically like a shareable thing.

And I just want to be very clear, this is a result of this agent running the skill and no human intervention yet.

Exactly, yeah, there is zero, there is zero, well, and in this case, there is... Verification loop. Exactly, on the verification loop, this was completely out of the AI agent itself, and if we go back...

Well, can we just go to the chart that you just showed, because I do, I think it's amazing, I do want to say, this is something you and I have discussed before in DMs, but, so the actual image, it's incredible that you've been able to encode all of this in a skill, so we can still see the terminal, if we could see the image.

Oh, you want to open the image. Yeah, because what I can see, there's one thing that I'd change, and it's the final frontier of, you know, making sure data viz looks cool, what is it? It's this freaking overlap.

It's so annoying, isn't it? It's so annoying! This is a problem in like all kinds of image-related things. If anyone solves this consistently, you're a billionaire. I don't know why, AI models, everything is so bad at detecting when like an annotation is over a line.
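To be clear about scope, the hard problem being described is spotting the overlap from the rendered image alone. When the chart is built in code and the coordinates are still available, a plain geometric check is possible. The sketch below is one such check under that assumption: the names `segment_hits_box` and `overlapping_annotations` are hypothetical, and the line points and annotation boxes are assumed to be in the same pixel coordinate space. It uses the Liang-Barsky clipping test to see whether any segment of a line passes through an annotation's bounding box.

```python
def segment_hits_box(p0, p1, box):
    """True if segment p0-p1 touches the box (xmin, ymin, xmax, ymax)."""
    (x0, y0), (x1, y1) = p0, p1
    xmin, ymin, xmax, ymax = box
    dx, dy = x1 - x0, y1 - y0
    t_enter, t_exit = 0.0, 1.0
    # Liang-Barsky: one (p, q) pair per box edge: left, right, bottom, top.
    for p, q in ((-dx, x0 - xmin), (dx, xmax - x0),
                 (-dy, y0 - ymin), (dy, ymax - y0)):
        if p == 0:
            if q < 0:
                return False  # parallel to this edge and outside it
            continue
        t = q / p
        if p < 0:
            t_enter = max(t_enter, t)
        else:
            t_exit = min(t_exit, t)
        if t_enter > t_exit:
            return False
    return True


def overlapping_annotations(polyline, boxes):
    """Return names of annotation boxes crossed by any segment of the polyline."""
    hits = []
    for name, box in boxes.items():
        segments = zip(polyline, polyline[1:])
        if any(segment_hits_box(a, b, box) for a, b in segments):
            hits.append(name)
    return hits


line = [(0, 0), (5, 5), (10, 0)]
labels = {"peak": (4, 4, 6, 6), "note": (0, 6, 3, 8)}
hits = overlapping_annotations(line, labels)
```

A check like this only covers line-versus-box overlap with known geometry; it says nothing about text overlapping text or about charts that exist only as pixels.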

So this is the final frontier of human in the loop, is this is basically what I do. By the way, I run this skill every single morning, and that's how I make that post series, and most of what I do is like, oh, no, I'd like that post, or I'd like that image more. Oh, hey, an annotation's overlapping, otherwise, looks good, post it.

Maybe that would be another embarrassing thing if all my chats got out, is like, you assume I'm typing genius stuff into my prompts, but actually I'm like, hmm, looks good, let's proceed.

Running the skill live

So, Randy, would you also mind if I just quickly shared my screen, because I ran this skill the other day when you sent it to me, and I think it works so wonderfully for me. I just wanted to, and I've shared this with you, but I wanted to run everyone else through it very briefly, I hope.

Can you see Secretariat data? And now, yes, okay, good. So, first of all, this was the result, which we'll talk about in a second. The idea I got, so, remember Randy said, all you need to do is put in an idea, and so I put in "no horse has broken Secretariat's 1973 Kentucky Derby record." That was the idea.

Now, where did I get that idea, and I've linked to this in Discord, but from Randy's blog, so it was, the other day it was the top one, but he does it daily, so it was this one here. So I put in that idea, and you can see the setup it did, and I'm sorry, just to step it back a bit, this is a log I wrote, and I often get my agents to write brief logs, I've got a skill, which is write log, make it human readable, don't make it too, make it concise yet comprehensive, yada, yada, yada, so I can quickly see everything that happened.

It found a Randal Olson CSV. What it also did, so then it did three parallel variants, et cetera, verifier loop. Now, it put notes for the human reader, "the Randal Olson coincidence," classic LLM-style language there, but the very first search for the data set surfaced a blog post called "no horse has broken Secretariat's 1973 record," and found Randy's, and then it said his post used the same source, "I'm considering downloading it, but I'm going to use Wikipedia instead," and I found that really interesting. And so what we'll see also is that, once again, pretty good, not quite as nice as Randy's, and a bit of overlap here.

The one other thing I'll mention is Randy's generator evaluator loop is something that, you know, when Anthropic originally wrote their building effective AI agents blog post, they recognized this as a really important workflow, one of five workflows, and I'll link to this in Discord as well, but it's really great to, you know, see this in the wild, because this is something that's incredibly important, so I just wanted to share a few of those things and show how easy it was for me to use your skill.

Encoding judgment into skills

I do have a question, Randy, which is we talked with Wes briefly earlier about encoding, turning judgment into intelligence, and you've done something amazing here, well, many amazing things, but one thing here is that you've taken things which normally we'd need to be applying judgment in the loop and encoded them in a skill, you've got the verification loop, but how many things can we do that with, and how do you approach encoding judgment in intelligence?

Yeah, I think I really approach it like a data scientist, which is, I think, a fun part. We're all data scientists to some degree here. And so I think about it as, you know, let's just try it first, right? Like, literally the whole Tufte test idea, where I created an eval for a Tufte test, was I was like, okay, I know the principles, let's put that into an LLM that can look at an image and apply it, right? And just see if it works. And then I just keep tuning it from there, right? I try more examples. I have this data visualization, I have that data visualization. I build a full set of data visualizations, ones that I know are good, ones that I know are bad, and I just keep tweaking it until, you know, like, essentially treating the eval itself as sort of like a living document, you know?

Like, try things out. This worked well, this didn't work well. Oh, let's encode that, you know, and let's keep building that into this eval. And now I have, you know, a thing, not just a skill, but like an eval, I guess I'll just call it, or a verifier that is hyper-focused on judging that one thing, you know, and saying like, yes, no, this is good, and here's why.

So that's how I approach a lot of that, you know, as like, it's all down to data experimentation and continuously improving things, you know? Like, accepting that it's never going to be amazing at first, but if you have that compounding loop, then it's going to get better over time.
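One way to picture the tuning process described above is scoring a judge against a small hand-labeled set of good and bad charts, then revising the judge where it disagrees. The sketch below assumes a lot: charts are reduced to simple dicts, `judge_v1` and `judge_v2` are hypothetical rule-based stand-ins for an LLM judge prompt before and after one revision, and `score_judge` is a name invented for this example.

```python
def score_judge(judge, labeled_examples):
    """Compare judge verdicts with human labels; return accuracy and misses."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    misses = []
    for chart, is_good in labeled_examples:
        verdict = judge(chart)
        if verdict and is_good:
            counts["tp"] += 1
        elif not verdict and not is_good:
            counts["tn"] += 1
        else:
            counts["fp" if verdict else "fn"] += 1
            misses.append(chart["id"])
    accuracy = (counts["tp"] + counts["tn"]) / len(labeled_examples)
    return {"accuracy": accuracy, "counts": counts, "misses": misses}


# Hand-labeled examples: (chart description, is it a good chart?)
EXAMPLES = [
    ({"id": "a", "title": True, "gridlines": "none"}, True),
    ({"id": "b", "title": False, "gridlines": "none"}, False),
    ({"id": "c", "title": True, "gridlines": "heavy"}, False),
    ({"id": "d", "title": True, "gridlines": "light"}, True),
]


def judge_v1(chart):
    """First attempt: rejects any gridlines at all."""
    return chart["title"] and chart["gridlines"] == "none"


def judge_v2(chart):
    """Revised after reviewing the misses: light gridlines are acceptable."""
    return chart["title"] and chart["gridlines"] != "heavy"


before = score_judge(judge_v1, EXAMPLES)
after = score_judge(judge_v2, EXAMPLES)
```

Keeping the list of misses, not just the accuracy, is what makes the loop useful here: each miss points at a rule to add or relax in the next version of the judge.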

Agents gaming evaluators

Yeah, evaluation and verification. We'll wrap up in a few minutes. I don't know if Wes is going to join again to show us what happened with his runs, but I am interested in people's general thoughts around when we do build evaluators and verifiers, the ability of agents to game them as well, and how vigilant we need to be. Do you have any opinions about that, Jeremiah?

I was like, quite, quite vigilant. I mean, it depends on your objective and your risk tolerance, right? Some years ago, we were working on an agent framework called Marvin, which to this day I still think is the greatest way to work with agents, but didn't quite catch on, and therefore, I don't, you know, it is what it is. We had a lot of LLM as judge situations in there, especially because we didn't know how to unit test our own, like, we didn't know, evals wasn't a thing, and so we reached for the thing we did know, which is we have a magic talking computer, now let's ask it if it did a good job.

And I think the thing that I took away from that is that to the extent that LLMs have a little wiggle room in what they return to you, when you're trying to nail down something and it's kind of subjective and it's like, needs to be run very frequently, that variation really plays out in a way that you don't notice in a one-off conversation. Like, it sort of forces you to watch what happens over 10,000 runs of the same thing. Now, today we call that, like, an evals suite, and there's a science around it, and we understand it better.
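The point about variation only showing up across many runs can be illustrated with a small simulation. Everything here is assumed for the sake of the example: `noisy_judge` is a stand-in for an LLM judge, not a real one, and the 15% flip probability is an arbitrary number chosen to make the effect visible.

```python
import random


def noisy_judge(quality: float, rng: random.Random, noise: float = 0.15) -> bool:
    """Stand-in judge: the 'true' verdict is quality >= 0.5,
    but it is flipped with probability `noise` (an assumed value)."""
    truth = quality >= 0.5
    return truth if rng.random() >= noise else not truth


def pass_rate(quality: float, runs: int, seed: int = 0, noise: float = 0.15) -> float:
    """Fraction of `runs` independent judgments that come back as a pass."""
    rng = random.Random(seed)
    passes = sum(noisy_judge(quality, rng, noise) for _ in range(runs))
    return passes / runs


single = noisy_judge(0.9, random.Random(1))   # one verdict: just True or False
many = pass_rate(0.9, 10_000)                 # the spread only appears in aggregate
```

A single call returns a clean-looking pass or fail, which is why the variation is easy to miss in a one-off conversation; running the same input thousands of times is what exposes how often the verdict flips.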

Randy could talk a lot more about this than I can, honestly. He's the expert. But back in those early days, I just got this very healthy respect for the fact that the variation really is there, you just don't really notice it in any one instance, and maybe just not suspect of that loop, because it's, I mean, Randy just showed us how valuable it can be for honing, it's like hill climbing to an outcome, so it's really useful. I think you just have to, like, temper it for the precision you require in your outcome, I guess, like any good model.

Yeah, I mean, I also think about it, like, it's so dependent on the context. Like, if it is, you know, like, it's a guardrail, and something absolutely cannot happen, then you're probably going to want to try to find something more deterministic. But I think even still, if, like, an eval is, you know, flips or is wrong, whatever, 80% of the time, like, it's still directionally valuable, you know, like, it's like, the Tufte test, I've not measured it, but I have seen sometimes it judges things wrong. But still, on the whole, it's really valuable to have in the loop, because it catches, especially a lot of the obvious stuff. And that's one less time that I have to spend my thought tokens, I'd much rather spend Claude's thought tokens, you know, doing something like that, especially if we think about doing this at a really massive scale.

Yeah, absolutely. And the other point, I think, which we're talking around, is evals are things that aren't static, like skills, like Johannes, they're something that you constantly iterate on, and perhaps more at the start, and then build trust with them, and so on, as with any of these products. And we're actually, in a few weeks, Hamel Husain is going to join us to take us through his AI eval skills, which is going to be super fun.