Transcript
This transcript was generated automatically and may contain errors.
Well, let's go ahead and get started because we have not one lab manager today, but three because we have a selection of fun project showcases from the community to share with you today. And to get us started, we have Damie Pak. Damie, would you like to introduce yourself?
Hello, my name is Damie Pak. So I am dressed up as Lou Bega. I don't have any of the clothes required for it. But I was going to talk about the most serious question of our human existence. If you did not know when Mambo No. 5 came out, how would you know? And the answer is through the women's names in Mambo No. 5, if you remember the bridge that goes, like, a little bit of Jessica or a little bit of Erica.
You should know all the lyrics by now.
Well, I'm just like, you know, I have one wife, and she does not like me, you know, saying all of these other women's names. Anyway, so my real wife, not Lou Bega's wife. Anyway, so I'm going to be talking about, like, how I kind of thought of the project and my kind of, like, philosophy about how to do stupid projects, because I think they're really important for learning. And then just kind of like, more of a, I'm not sure if this is the right way, but this is how I do it. And maybe feedback from the community. And then more of a, where do we go on from here? Like, maybe we could do this with a Bad Bunny song that goes like titi, my Spanish is terrible.
Damie's Mambo No. 5 Bayesian project
All right, Damie, you have 15 minutes on the clock, just as a reminder.
All right. So let's look at my RStudio code. I have not looked at the Mambo No. 5 code since I published it. So we're going to go and then see if we could even, like, figure out what's happening, if I actually documented it well. And then just talk about, like, how I did the structure of it.
So this is how I did my Mambo No. 5. And the weird thing is, is that I was in academia before, and a lot of academic coding is a lot about, like, not really reproducibility, or I mean, it is, but I think there's more of, like, this focus on just, like, a script, like just figures, and we don't really care about whether it's modular or not. And I've been trying to, like, learn from that.
So basically, how I set up these kind of, like, data analysis scripts, especially if there's a lot of math, is that I really enjoy having the math as a supplement to the GitHub repo. I think that's really important. Because when I do a lot of analysis, or when I did a lot of mathematical modeling in R for my postdoc, I mean, the code is whatever, but the math, the fundamental part that moves the machinery is so important. So, you know, I would love, I love Quarto, I love how you could just, like, can you guys see my, sorry, let me just...
Well, we were looking at RStudio. I couldn't see RStudio.
Oh, yeah. Okay. So this is, like, my, you know, the full mathematical model. So, you know, even the silly side project, where the question was, how do we use Bayes' formula? How do we use the analysis to figure out what year Mambo No. 5 was released, based on the women's names that appear in the song? So, you know, that's a silly question, but that's actually kind of, like, a really interesting, you know, way of thinking about Bayesian statistics. I never did Bayesian science, ever. Like, this was my first time ever doing it. And so I was so scared of starting the project. And I was like, oh, no, what if I mess up? And everyone thinks I'm stupid. They'll send me to jail. And then I just tried to think of, like, the silliest project, and that was Mambo No. 5. But even with the simple act of taking that simple, you know, question, like, there is a lot of math involved. And so, you know, I really enjoy having Quarto, and then just having the math, and then showing it to people in the community to make sure I'm doing it right.
Anyway, so let's just move to the coding project. And so let's see. So let's see. I see SRC. So I must have been really smart, and actually made all of my functions. Yeah, I must have made all of my functions, you know, like, documented, and I'm actually really proud of that. I was really mad at Hadley Wickham, because what I did was I used Social Security data to figure out the name popularity of babies. And I was going through, like, the actual US government, pulling that data. And then Hadley Wickham had to ruin my day and actually have an R package already out. So I did this for nothing. You can see that that's why I got a little frowny face here. And basically, with that data, I could then try to calculate the likelihood. So the likelihood tells me, you know, given this year, you know, what is the probability that I would see these names, the joint probability. And I'm actually proud that I commented. Like, I'm actually really surprised that I actually put comments, because I generally don't, especially for stupid side projects. Oh, my God, I actually put like, oh, the parameter names, too. Wow.
Past Damie was, I think she was trying to impress someone. But anyway. So basically, what you would do is that you would give like a data frame of the names across the United States. And then, you know, have like a general prior belief of what the release year was, you know, did the song come out in 1970, versus like 2000. And then like, maybe some belief of how old the women are, because like, they're in the club, they should be like over 21. Or maybe they're like 18. They're using fake IDs. And then so, wow, I actually put a lot of comments. And I am so surprised at this. Um, but basically, um, I could see that I was like trying to compute the joint probabilities here. And then I was marginalizing by age, I'm not sure what that means.
But basically, I think, from this project, I was really forcing myself to actually think more modularly. So if I want to, you know, adapt this to a different song, how am I going to make it so that it could be completely adapted to another song with like women's names in it. And then one thing that I'm not quite sure is the best way, I'm not sure how people in the community do this, is that I try to keep like, the analysis scripts separate from the functions, like using like, two different R script files. I'm not sure if this is the right way of doing it. This is the way that works for me, that's the most organized. I know people like doing the, you know, actually having like the Quarto document, and then just having this one main, you know, one main like file that shows everything. But I kind of prefer the script method more, with like, the numbers, I can like kind of see like, in which order I should run them. Not sure if this is the best way. But this is kind of how I do it. And then you can see here that this script is just simply for processing the names, like we all know that 99% of the things that you do is processing, cleaning.
And then this is kind of like a really bad script, because you can see this is where I got really tired, and did not put many comments on it. I have no idea what's happening here. But from what I gather, this script just calculates the posterior distribution. So I'm not going to go too much into the Bayesian math. But, you know, from here, I could see that I processed the data, and then calculated the posterior that kind of gives me an idea, or that kind of gives me the probability, the posterior probability that the song was released in this year.
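As a rough illustration of the calculation described here (a prior over release years, a belief about the women's ages, name popularity by birth year, marginalizing over age, then normalizing to a posterior), the following is a minimal Python sketch. It is not the code from the talk, which was written in R. The name frequencies, the age prior, the candidate years, the `floor` fallback for missing years, and the assumption that names are independent given the release year are all hypothetical choices made for the example.

```python
from math import prod

# Hypothetical toy values for P(name | birth year). Real values would come
# from US Social Security baby-name counts; these numbers are made up.
NAME_FREQ = {
    1970: {"Jessica": 0.002, "Erica": 0.001},
    1975: {"Jessica": 0.012, "Erica": 0.004},
    1980: {"Jessica": 0.019, "Erica": 0.005},
}


def likelihood(release_year, names, age_prior, name_freq, floor=1e-6):
    """P(names | release year).

    Each name is marginalized over the age prior, and names are treated as
    independent given the release year (an assumption of this sketch).
    `floor` is a hypothetical fallback for birth years missing from the table.
    """
    per_name = []
    for name in names:
        p = 0.0
        for age, p_age in age_prior.items():
            birth_year = release_year - age
            p += p_age * name_freq.get(birth_year, {}).get(name, floor)
        per_name.append(p)
    return prod(per_name)


def posterior(release_years, names, year_prior, age_prior, name_freq):
    """Posterior over release years: prior times likelihood, normalized."""
    unnorm = {
        y: year_prior[y] * likelihood(y, names, age_prior, name_freq)
        for y in release_years
    }
    total = sum(unnorm.values())
    return {y: v / total for y, v in unnorm.items()}


names = ["Jessica", "Erica"]
release_years = [1990, 1995, 2000]
year_prior = {y: 1 / len(release_years) for y in release_years}  # flat prior
age_prior = {20: 0.5, 25: 0.5}  # hypothetical belief about the women's ages
post = posterior(release_years, names, year_prior, age_prior, NAME_FREQ)
for year, p in post.items():
    print(year, round(p, 3))
```

With real name data, the candidate years and ages would cover every year of interest, so the `floor` fallback would rarely be needed.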
And then I think this is like the giant script that made it possible. So we're gonna see if this even runs, who knows, like, let's all pray and hope that it runs. It does not. Oh my gosh. Live coding is so dangerous. Oh my god, what do we do? We're going to go to jail. Okay, so I'm just gonna, oh, I guess I'm doing this live and I can just do it the worst way where I just source all of this by hand, but I don't really care. This is bad coding.
If you'd like to watch Damie do other things live, you can watch her episode of VizBuzz where she coded plots live. That was pure chaos and tons of fun.
And they muted me so I could not swear. It was terrible. Okay, so let's see. Let's see if that worked. Not the best way to do it. But you know, we can't all be saints.
Yeah, there is no best way. And also everybody on Discord is saying we do the same things. We keep our stuff separate, our functions separate from our analysis. You're not alone.
Okay. And then so this was kind of like the main figure. So this was a figure that I had in mind. So the posterior distribution and then the different release years. So for example, from this figure, when do you think Mambo No. 5 was released?
'99. That was when it was released. And so this, I think this represents a credible interval, from what I remember. But you know, I was kind of proud because I had actually never done much Bayesian analysis. I was like, oh, this actually worked. Um, I think I should have gone further, because I should have really like checked some of like my assumptions and like actually made sure that I was doing things correctly. And I'm actually not sure if the math is fully correct. And I think that's fine. So I think a lot of my hang-ups on like open science or like sharing data projects is that I'm terrified of being wrong, again, that you're gonna send me to jail. But I think my way of now thinking is that we're all human, we're all going to make a mistake. Like it's just inevitable. Like a lot of the things that we do are really complicated. There are so many moving parts. Like even Mambo No. 5, the silly side project, does have some moving parts. And I think my philosophy changed into actually making it easy for people to catch my mistake. So you know, I think that's what I was like trying to do here with like, a lot better like coding and, or sorry, commenting and structure, is that I want people to find my mistake and to find it easily because I think it's inevitable.
I'm not sure if that's like the best philosophy to have. But I think this is better than to, you know, be shy about your mistake and try to hide all of the complexities from people. I think it's just a lot easier to just lay it open. If people find a mistake, they're going to forgive you for letting it be simple. How many minutes do I have?
You have five minutes and 20 seconds left. And I say showing your mistakes out loud is kind of the entire point of the science lab. So I'm in full agreement that that's a good idea.
Yeah. So I guess, you know, that philosophy of like showing your mistake, like, that's why I made the math so like visible, like, you could go through the math, you could write it out, you could actually try to make sure that the math makes sense. Like I made my wife, who is a mathematician, actually go through it, like she had a huge deadline. And there's just me being like, you know, hey, babe, can you actually like look at the math for me? Um, but, you know, I think that's kind of like the main philosophy that I kind of like, got after I left academia, like, when knowledge is kind of like the capital of like, career investment, you kind of get shy about like, letting people know, like, your mistakes, but it's something that just happens. And so I've been a lot more like, very, like, pro, like, just find my mistakes for me, like, you know what, you know, I'm just gonna make it happen. And then one thing I'm gonna say in my last minute, I think I really have to thank Mark, because he was actually the guy who looked at a lot of my math for me. And I think it shows the importance of community, just like, it's nice having people who see your stupid project and are like, you know what, I'm going to treat this as seriously as possible, and actually, like, check your math for you. Like, I think that's the importance of a data science community. It's not really like the technical know-how, or I mean, it is the technical know-how, but just like, having people who just like, accept and just like, actually are interested and like, actually care about what stupid thing you're doing. So and you know, in this kind of age of AI, I don't think like AI coding can replace the human connections of like, the silly projects. So that's why I encourage everyone to do silly projects. Because you should not let machines take away your stupidity. That should be something very human. And I will stand on this.
Do not let the machines take away your yearning.
Like Lou Bega's number five, number five, number, Mambo No. 5. A machine cannot write that.
I mean, people, I mean, you could try, do a comparison. Now go, go work in Claude.
I was gonna ask you, did you use Claude or any other tools to help you with this at all as you were writing?
So I try not to for a lot of my stupid projects, because it's just a way of learning. So I give myself three tries. If I'm like, hey, I tried this three times, it does not work, then I'd like use Claude and try to see, not really ask it to like, code me anything, but just to give me like, maybe like an outline of how to code it. So I do use Claude for that. And I think I've done it for like, calculating the posterior, but I always make sure to have someone, you know, who does mathematics to check.
Amazing. Yeah, I think that the community aspect is really, really important too, and like, there are so many stories of people working out loud in the community and sharing their stuff, and getting jobs from that, right, like having really life changing things happen from that. And in all of them that I know, it was all very imperfect. It was like, I put out this crappy thing. And the magic happens with the feedback and people engaging with it and people finding it fun and kind of wanting to use it or taking it in a different direction.
So definitely work out loud. There's a place on the Discord server where you can share projects. But really anywhere on the Discord server is a good place to share projects and show people things. Ask them questions if you share. Well, Damie, you do have like a minute and a half left.
Okay. Oh my god. Oh my god. I had something really great. Oh my god. Oh my karaoke talk. Okay, no, not that. Um, I guess. Okay, so in my job interview, so I work at a pharma company making vaccines. And he asked me, have you done anything with Bayesian modeling before? And I swear, I was thinking about bringing this project up. And I just didn't. I didn't. I was like worried. I don't know. They're gonna be like, so it is a French company. So I was like, oh, do they have Mambo No. 5 in France?
I think Mambo No. 5 made it all over the globe.
Yeah, because it's German. Lou Bega is German. It's a German song.
Who thinks that Damie should have brought this up at an interview? I think so.
Well, they'd be like, hey, you have to make vaccines. Do you know Bayesian? I'm like, oh, yeah, I did Mambo No. 5. Like, do you think that would have gone well?
I think it would have. I think it would have been amazing. Well, Damie, thank you so much for sharing your project with us. It was so much fun. Damie, you have a blog.
Isabella has been sharing it, but that project is there. So if you want to go check out Damie's math, please go take a look at it. And yeah, oh, I love the picture that Mark just shared in the Discord server, which is D Rob's 2019 talk about working out loud and sharing public work or working in public, which is just amazing. And I fully agree with it. Like the more you can do out in the world, the more valuable your work is. Okay. We are on to our next lab manager in our lab manager rotation today, which is Mark Rieke. Everybody give a round of applause for Damie and say hello to Mark. Mark, would you like to introduce yourself?
Mark's Um! Actually Bayesian skill model
Yeah. Hi, guys. I normally have a standing meeting on Tuesdays during the data science lab. So this is actually my first time joining. But I'm a data scientist and a Bayesian statistician and whatever work needs me to be for the time being. But I'm a big, big Bayesian nerd. So I was very glad when Damie reached out. She was like, oh, hey, can someone look at this? And I was like, oh, yes, absolutely. It's my time. Yeah, exactly. Bayesian stats applied to something like niche and weird and fun. Yes, please.
But yeah, so I'll be sharing, I guess like my assumption is that like the Venn diagram of like people who like come to the data science lab and people who are interested in Dropout, or at least like are, I don't know, I'm guessing there's a lot of overlap in those two groups. Hold on, let's do it. Let's do a poll.
Noor and I raised our hands on camera like nerds. Do you watch Dropout TV? Let's see. Yes or no.
All right, poll in Discord. Let us know if you do watch Dropout and you know what Dropout is. Can you do like a thumbs up or some sort of like affirmative emoji reaction when you're in Zoom?
Oh, we have a few. Yeah, Jared, Greg, Isabella, amazing. I also subscribe to Dropout. However, I do not watch the one show that Mark is going to be talking about today, I think. So I'm ready to learn. All right. Yeah. Let's go. I'm going to put 15 minutes on the clock for Mark starting now.
Okay. I may actually need to quit and rejoin because Zoom is telling me that I needed to update permissions and it's not going to update them. So I'll be back in like 15 seconds.
It was fun to come back just to Lou Bega playing. But yeah, so for folks who don't know, Um! Actually is a show on Dropout. It's just a game show. And the whole premise is the host will make statements about nerdy things, pop-culture-y things. And in each statement, there is something that is wrong. And the job of the contestants is to buzz in, be quick on the buzzer, and then make the correction by saying, "Um, actually, the quote isn't, like, 'Luke, I am your father.' It's 'No, I am your father.'" It's that kind of level of pedantry.
It's a great doing dishes background noise thing. But in season, in 2024, they swapped hosts from longtime host Mike Trapp to the person, to Ify Nwadiwe, who had-
I love both of these people. Mike and Ify are both amazing.
Yeah. Ify was referred to as the best contestant. And it was purely based on the fact that he had the most wins. And I was like, well, you can rack up a lot of wins just by playing the game a whole bunch, or by playing against, quote unquote, lower skilled players. Or, you know, just looking at win count treats a one point win and a massive nine point win as the same. So I wanted to see, if you did the math, if Ify Nwadiwe really was the best Um! Actually player. And that led me down this wonky rabbit hole that is building out this model. It's a Bayesian model. And the nice thing about it is, let me make this bigger. Sorry. The nice thing about doing stuff in a Bayesian way is that, like, there's all of these sort of philosophical reasons that I am a dork for Bayesian statistics. But there is also the practical benefit that the software to implement the model itself is super flexible and malleable, such that, rather than try and like hemorrhage your data or, you know, bash it with a hammer to try and get it to fit your modeling software, you can tweak your model to fit your data generating process. And this is like one of the weirdest models that I've ever fit. Normally there's three contestants and one person gets a point in each round. And that has its own sort of wonky math that goes along with it.
And sometimes, though, multiple contestants get a point on the same round, and you have to handle that edge case. And sometimes everyone gets a point. So this handles that edge case. Sometimes, oh, and this is, I am just scrolling through and rediscovering all of the old nightmares of trying to get this to work. This is like the flow tree of, if two players are awarded a point, what is the probability that a player is awarded a point based on their underlying skill that the model is calculating? So, I don't know. This is all kind of a long way of saying that this really wacky, weird, nonsense project had a lot going on under the hood.
Much like Damie's.
Yeah, exactly. I'm gonna scroll down. Yeah, and there was a single one-off four player game that got included in this as well. And then there's team games. Oh, my gosh. There's so many edge cases in this. And again, the nice thing about doing this all in one big model is that this can all be one big model rather than like, oh, this is a
How do you figure out what points were awarded to in a team game?
So, in a team game, it is the case that we say I'm so off track.
Yeah, no worries. Sorry. You have 10 minutes left.
In a team game, so the way that this thing works underneath the hood is it says every individual player has some skill value. And that's kind of the core of what the model is estimating. And in a team game, you're putting up two players who are on the same team, you're just adding their skill values together, against the two players on the other team, adding their skill values together. So it awards a single point per team, but it's determining the probability of which team is going to get awarded a point based on the underlying skill of the members of the team.
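The idea of adding teammates' skill values and turning the totals into a probability of being awarded the point can be sketched in Python as follows. The talk does not say which link function the Stan model uses, so the softmax here is an assumption of this sketch, as are the player labels and skill values.

```python
from math import exp


def point_probabilities(sides):
    """Probability that each side is awarded the point.

    `sides` is a list of sides, each a list of individual skill values.
    Skills on the same side are summed, and the totals go through a softmax.
    The softmax link is an assumption; the actual model may differ.
    """
    totals = [sum(side) for side in sides]
    m = max(totals)  # subtract the max for numerical stability
    weights = [exp(t - m) for t in totals]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical skill values on an arbitrary latent scale.
skills = {"A": 1.2, "B": -0.3, "C": 0.4, "D": 0.1}

# Team game: A and B together against C and D.
team_probs = point_probabilities(
    [[skills["A"], skills["B"]], [skills["C"], skills["D"]]]
)

# Standard three-contestant round: each side is a single player.
solo_probs = point_probabilities([[skills["A"]], [skills["B"]], [skills["C"]]])

print(team_probs)
print(solo_probs)
```

The same function covers the usual three-contestant round and the two-team round, which loosely mirrors the point about handling every format inside one model.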
But yeah, I guess all of that to say, though, that like doing this in a single Bayesian model lets you oops, this is the wrong one. This is going to look really nightmarish because it's just it's accounting for all of these extra edge cases and whatnot. Sorry, are we looking at Stan?
I don't know if anyone knows what Stan is.
Oh, Stan. So, Stan is like a it's a hyper focused like probabilistic programming language that has like an interface to R, an interface to Python, but it's really just focused in on like purely writing models. And the really, really nice thing is that Stan code is almost a one for one map to like if you write LaTeX, if you write out a model on pen and paper, it's really, really, really similar looking. So, there's almost like a one to one direct translation. It is like a little bit of a hurdle to get up and running with.
And it is to get it installed. Yeah.
Yeah. Well, and I mean just like getting used to the workflow and whatnot. So I have to, at any given opportunity, show Richard McElreath's Statistical Rethinking. I don't know, everyone should go read that book and it'll unwire your brain about how to think about modeling. But yeah, sorry, I'm just rambling. So, does anyone have any questions? I'm just yapping about Stan and math and I don't know.
We did have a question, which was, have you shared your analysis, and you have not shared the end of your analysis yet, but have you shared the analysis with the Dropout slash Um! Actually people? I know that they do pop into Reddit quite a lot.
I don't think I, my working assumption was that they would not see something from the little, little likes of me. So I had not. Although I did, I guess, I think I shared it, at the time I was still on Twitter. I shared it on Twitter and tagged Brennan Lee Mulligan, who, as this model found, would have the best skill.
If you, you know, take into account all those considerations.
But I'm sure he gets like a bajillion people tagging him every day and stuff.
Oh, I'm sure. Absolutely. So you found out that Brennan, followed by John Jeremy and Erica, oh, and Matt is up there as well, actually had the highest skill.
Yeah. Relative skill relative to other people.
Right. Right. So this is looking at, if you were to rank order everyone's underlying skill. The way that Bayesian models work is that you generally run thousands and thousands and thousands of simulations. So in each of those simulations, we can get an estimate for an individual player's skill. If you rank those skill values within each simulation, from best to worst, what's the average across all the simulations of where that person ends up in that ordered ranking? So Brennan has an average rank of like six, which just means he's very often up at the top, out of, I think it's like 187 or so. So he is, on average, the highest ranked player. This accounts for things like, if I scroll down, there are some players here, like Jamel Wood, for example, I think he was a guest contestant who just blew everyone out of the water, but he was also only on like one episode. So his uncertainty interval is really, really, really wide, versus someone like Erika Ishii, who has been on several different episodes. So she has a little bit of a tighter uncertainty interval than someone who's only on once. So doing this sort of across-all-of-the-simulations comparison adjusts for this difference in uncertainty between different people.
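The "rank within each simulation, then average" summary described here can be sketched in a few lines of Python. This is an illustration rather than the code from the talk: the player labels and their (mean, standard deviation) values are hypothetical, and independent normal draws stand in for real posterior draws from the fitted model.

```python
import random


def average_ranks(draws):
    """Mean rank of each player across posterior draws (rank 1 = best).

    `draws` is a list of dicts, one per draw, mapping player -> skill value.
    """
    totals = {player: 0 for player in draws[0]}
    for draw in draws:
        ordered = sorted(draw, key=draw.get, reverse=True)
        for rank, player in enumerate(ordered, start=1):
            totals[player] += rank
    return {player: total / len(draws) for player, total in totals.items()}


# Hypothetical stand-ins for posterior draws: (mean, sd) per player.
# "one_episode" has a high mean but a very wide interval, as a guest
# contestant with a single appearance might.
rng = random.Random(0)
posteriors = {
    "veteran": (1.0, 0.2),
    "one_episode": (1.2, 1.5),
    "average": (0.0, 0.3),
}
draws = [
    {player: rng.gauss(mu, sd) for player, (mu, sd) in posteriors.items()}
    for _ in range(4000)
]
avg_rank = average_ranks(draws)
print(avg_rank)
```

Because ranks are computed inside each draw, a player with a wide interval lands in many different positions across draws, which pulls their average rank toward the middle compared with a ranking on point estimates alone.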
Okay. One more question. When we're talking about relative skill relative to other people, does that mean that if somebody of high skill is playing against other contestants of low skill, that that win is discounted?
It just means that you would expect them to do pretty well. And if they don't do well, then it's probably the case that the model will update their skill estimates to be lower. And then the inverse for the person who is low skill and then did really, really well. I looked at a couple of hypothetical examples. So this match has never happened, but if you put Brennan Lee Mulligan, Ify Nwadiwe, and then Ally Beardsley all in a match together, Brennan is likeliest to win. But obviously, Ify and Ally have a non-trivial chance of winning.
They at least have a lot of overlap in there.
Yeah, exactly. Yeah.
Amazing. Well, do we have any questions, Isabella, that I have missed? I feel like it's hard to compare anybody to Brennan Lee Mulligan, because he's a superhuman brain. But I think that there's something to be said for the instinct that everybody has that says that Ify is the best, that there's something that is non skill related that makes us want to just say that Ify is the best, and I love Ify, so I would probably be on that same train.
I think that there's, I was gonna appear because we did have, just like on that note, there's a big benefit of sort of what Damie did, what this is doing, is that, rather than constructing a proxy from things that you have sort of readily available, because Bayesian models themselves are so flexible, and you can implement really, like, whatever, there's a really good example of it. This is my favorite paper.
I was just gonna say this feels really applicable to sports.
Yes. So this yeah, the sorry, I'm like trying to do too much all at once. You're fine. No pressure. Two minutes, 20 seconds.
So yeah, here we go. So one of the things that's nice about Bayesian models is that, if you have some sort of obscure thing that you're interested in, like, the math nerds will say, if your quantity of interest is difficult to measure, you can still directly encode that in the model, rather than constructing sort of a weird proxy that kind of gets at it, kind of doesn't. And that's kind of what, you know, counting up total wins is. It's a proxy for player skill, but it isn't what player skill is directly. So you can include that in the model, even though there aren't any observations for player skill. This example is from the Bayesian model workflow that Gelman and some folks on the Stan dev team put together a while ago. As an example of flexible software implementations, rather than fitting a logistic regression to a set of golfers, you know, taking putts, they fit what they call a geometric model, where they encode the actual geometry, in terms of distance and shot angle and the radius of the hole in golf, in the model directly. You know, if you were to use a pure statistical model, you get kind of a bad fit. You do this geometric fit and it's much, much better.
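For readers who have not seen the golf putting example, the geometric success probability can be sketched in Python as below. The formula follows the published Stan golf putting case study as best recalled here: a putt goes in when the angular error is smaller than asin((R - r) / x), with the error assumed normal with standard deviation sigma. The function names and the sigma value used in the example are hypothetical and are not fitted estimates.

```python
from math import asin, erf, sqrt

# Standard dimensions, converted from inches to feet.
BALL_RADIUS_FT = (1.68 / 2) / 12
HOLE_RADIUS_FT = (4.25 / 2) / 12


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def putt_success_probability(distance_ft, sigma_rad):
    """P(putt goes in) under the geometric model.

    The shot succeeds if the angular error is within
    asin((R - r) / distance), where the error is Normal(0, sigma_rad).
    Only valid for distances greater than R - r (about 0.11 ft).
    """
    threshold = asin((HOLE_RADIUS_FT - BALL_RADIUS_FT) / distance_ft)
    return 2.0 * normal_cdf(threshold / sigma_rad) - 1.0


SIGMA = 0.03  # hypothetical angular error in radians, not a fitted value
for distance in (2, 5, 10, 20):
    p = putt_success_probability(distance, SIGMA)
    print(distance, round(p, 3))
```

In the full model, sigma is the only unknown and is estimated from counts of made putts at each distance, which is what lets the geometry do most of the work.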
All right. I'm going to interrupt you in your last 30 seconds and say, the people demand you talk about your Arc browser.
Oh, I like Arc. I like that it has I like that it does like mirroring across multiple things. The only thing I don't like about it is that sometimes if I pull up something and then like this screen goes like kind of gray and it always throws me off.
It's like when your shiny app goes to sleep.
Yeah. But I love it. Just quick and easy and snappy.
Yeah. I think a lot of people have said like, Hey, it's not supported anymore. It's just not being actively developed, but I think that they are still like applying security patches and stuff, which is good enough for a lot of people.
Yeah. Yeah. At some point I will have to I don't know when also at some point I'll have to migrate off of Arc to whatever the new flavor of thing is, but for now it still works.
All right. Well, also Zach had said this probably isn't possible or realistic, but what do you think about doing something like this for Game Changer? And I said, I pity the fool who tries to analyze Game Changer.
Oh, Zach, I have thought about that. And it is all of the asterisks.
Honestly, I have too. I think that Mark and I have talked.
Yeah. Everything that I was like, oh man, the Um! Actually thing was so weird and wonky and strange. There's like a myth like that times a million for Game Changer.
Yeah. Often Game Changer is so chaotic. If anybody has not watched Game Changer, this last season of Game Changer was probably the funniest that I've ever seen. And Game Changer is a game show where the rules are constantly changing according to things that are happening in the game show. And oftentimes rules are abandoned. Things go crazy and you end up with like this inverse reality. I highly recommend it if you have not watched it. I will also plug one more thing for Dropout. If you like D&D and you enjoy watching actual play podcasts, the production value of Dimension 20 is amazing. Go watch the early ones, especially Crown of Candy, Unsleeping City. Like they just have, Fantasy High, just have amazing production value with their sets.
Sam's Quarto extensions
Okay. We have one more lab manager for today and that is Sam Parmer. And we still don't know if his mic works. So let's find out in real time.
Sam, does your mic work now? Hello. Can you hear me now?
Yay. Oh, he's the Verizon guy. Okay. Let's do it. Sam, would you like to introduce yourself?
Yes. So Sam Parmer. I was a, I guess, member of the data science community. It's been a while since I rejoined, just because of work and other obligations. And I'm a part of some external working groups and then I also work in pharma. And yeah, I'll be sharing some of my adventures with Quarto Extensions. So I'm going to share my screen. That's okay if you want.
Absolutely. Let's do it. I will put 15 minutes on the clock for you, but hey, our time is going to run out around that time anyway. So let's do it. Reminder, everybody, you can ask questions in Discord if you would like to.
I just want to quickly show Quarto, if you've never used it before. It's a flavor of Markdown as well as a command line interface. And this is a general scheme of what's going on. You have Quarto, and then you have some sort of engine that converts R or Python code over to Markdown.
