I'm proud to have worked with Elan for several years. As you can tell, he always puts a great deal of effort into preparing for his presentations. Amazingly though, this is actually his normal level of conversational speed, clarity and humor :)
13:00, this is a really good one. With Minecraft, Mojang came to a realization that very few players had ever been to the Nether (based on the percent of the population that had the achievement "We need to go deeper!" which is received upon entering the Nether). They ended up realizing that very few non-hardcore players (players that didn't consume game related content outside of the game, like videos, guides, articles, etc...) knew that the Nether existed. This is why they added obsidian monoliths and broken portals around the Overworld to give you hints.
@@Sarmachus Sorry I can't remember exactly. I saw it in a game development video a year or so ago, and I believe he flashed a tweet from one of the MC devs on the screen. Regardless of its authenticity, it still serves as a good example and valuable lesson.
@CruzZ what are you talking about. This guy provided a ton of real world examples where statistics could help solve a problem. That doesn’t mean people will always use statistics for good though, he even mentions that in the presentation with an example. Just because big gaming companies suck at stats doesn’t mean his presentation wasn’t phenomenal!
@@dontfk Big gaming companies most definitely do not suck at stats; if anything, that's the one thing they master above all else. It's just that most statistics are not relevant to the players' enjoyment. They are very relevant to shareholders tho.
"Any actual statisticians are totally cringing." Yep. It's not just pedantry. People will literally not know what their test means, and then they will judge whatever change they make in hindsight anyway.
Thanks for the talk. I had to learn this stuff on the job at Big Software, Inc. when we started measuring PC boot time impact. There were large variances between each boot.
@12:53 A thing to note is that in that example, people have been playing the "hard" puzzle before and the "easy" puzzle is a novelty, which may cause players to spend more time on it for the experiment, without it being the better solution long term.
It matters a lot for procedural generation, as statistical distribution is a huge part of random number generation. It can also be used to approximate various things... replacing physics in some cases. Sometimes called an "analytical" solution, you can see this show up in some games' oceans, etc. The ocean is based on statistical analysis of real oceans, instead of trying to actually simulate fluid dynamics. I'm sure there are more uses than that, especially outside the game.
Yeah, that thing he did was called hypothesis testing. That took me a good 20 hours during a single week to figure out how to do it by hand at school. Finding out that it could be done in a minute in Excel blew my mind.
As a data scientist... it's a LOT. But really, you don't need all the math to do it practically. You really just need to know the basic definitions, and what the test does. And there you go you got analysis. If you're a game dev, assuming you got some programming experience, you can already do a lot of these things in the language R, with very little effort, and even very easily build some machine learning models.
I essentially got most of this stuff during my semester of statistics class. As he said, he pretty much blazes through it, you mostly need time to understand when what is used, why to use it, what the downsides of using it are, etc and lastly of course, HOW to use it.
While I wouldn't do everything identically, I didn't have any large complaints, which is not generally what happens listening to quick statistics intros. A good talk.
Love this talk! I spent quite some time trying to derive the 8.14 confidence interval in the first example and finally had to install Excel to verify. I couldn't see it at first but the slides actually mix five and six observations. At ~7:38 there are five observations. At 8:19, the confidence interval is calculated using six observations, i.e. T_DIFFMEANS(A2:A7, ...) rather than the 2 x 5 observations shown on the left.
This video has existed for almost 4 years and it feels like not a single game dev has ever watched it. Their sales division has warehouses of supercomputers simulating human brain functions trying to figure out how crap a game can be before you will buy it, and just how much you will spend on DLC just to play the game at all.
People are awful at five-star ratings whether that be a game, book, movie, show, item, etc. Basically, people will give 4-5 if the product was at all fun or engaging, or a 1 if there was a problem/complaint/issue or any offense taken. Good video. Statistics are fun.
God, statistics is why I can't ever tell anyone I am sure of something. "Hey does this code work like X?" "Well, I was there during requirements gathering, I wrote the code, deployed it, and no one has changed it since. So I think so!" "Yes or no?" .... uhhhhhh
At 19:00, do you care about the median? I think that's a rather brazen assumption! Sometimes it's better to have some people who are really invested and really care, and thus are willing to spend on your product, rather than a lot of people that will play for free but don't care enough to spend money, or come back repeatedly. Great talk overall though!!
This depends on assumptions; the assumption here probably is "I am optimizing my game for ability for at least most of the initial group to make it through".
That wasn't actually the point he was trying to make. Especially with a small sample size, outliers greatly skew the mean. As for the point of a few dedicated players willing to spend money, that only works if it's a game that does not depend on having an active online community.
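To put a number on the outlier point, here is a minimal sketch with made-up hours-played values (the data is hypothetical, not from the talk). One very dedicated player drags the mean far away from what a typical player did, while the median barely notices.

```python
from statistics import mean, median

# Hypothetical hours-played sample: four casual players and one outlier.
hours = [1.0, 1.5, 2.0, 2.5, 40.0]

sample_mean = mean(hours)      # pulled upward by the single 40-hour player
sample_median = median(hours)  # the middle value, unaffected by the outlier

# Same sample with the outlier removed, for comparison.
trimmed_mean = mean(hours[:-1])

print(f"mean={sample_mean:.2f} median={sample_median:.2f} "
      f"mean without outlier={trimmed_mean:.2f}")
```

With only five observations, a single player moves the mean from under 2 hours to over 9, which is why the median is the safer summary for small, skewed samples.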
I know I'm really late to finding your comment, but I thought the same thing! Also, Mark Rosewater (of Magic: The Gathering) has a presentation on YouTube about Game Design and in HIS opinion, that highly polarized distribution is better. It's better to make something that SOME people love even if some other people hate it, instead of something that everyone gives a 'meh' to. In game design, I think it's the difference between "cult classic that some people love and play forever" and "totally forgettable game that disappears in two weeks". If at least some group loves it, it can spread by word-of-mouth and certain reviews. Provided your budget was appropriate to build a niche game, you can have a success... while some game that everyone merely tolerates probably makes no impact and loses money.
@@stuartconrod8364 exactly :) Better to be hated by 90%, ignored by 5% and loved by 5% than hated by 20%, ignored by 80% and loved by none. Who is going to spend money on a product they don’t love when they have so many alternatives? Plus all those haters are free press too! I think we should lean into the fans more. Look at Dark Souls: it's brutal, unforgiving and very niche, but clearly doing fine. League of Legends: unforgiving, brutal player interactions, but doing fantastically well. Counter-Strike, same thing. Yes, I do think we should keep games accessible, but not at the cost of what the fans love. I think, for example, what Halo Infinite is doing is great: bringing back bots to practice offline before going into the fray. That allows the multiplayer to be as cutthroat and great as it always was, not with unlockable weapons that give you an edge at the start of the round. No, everyone starts with the same weapons, and you need to earn and fight over better ones, so it's a true skill matchup. That's why it's so unforgiving to new players, but also why it's so incredibly good.
@@FreekHoekstra it really REALLY depends on how you make money off your game. If it's some recurring revenue, then you need to retain a decent number of players. If it's a game which has interaction between users, then you need a decent player pool. If it's a one-time purchase, you can keep it mediocre across the board. If it's for e-sport publicity, you better make that as balanced as possible. Make the goals easy to understand and controls simple enough to get players pouring in. Overall, no matter the game, a larger pool of players will bring more potential spenders, and of those players only 20% of them will be providing your entire income. Money keeps a business going. So making a game for only 2 people is a ridiculous endeavor unless each piece of content is a guaranteed buy and they cannot continue into the new 'season' without making their purchases... Though if only two people are playing they'll need to be spending hundreds of thousands each time you release content. in a free to play game competition drives purchases. You need some fodder for the big spenders to show off their purchases/power to, or they have no reason to buy the newest released item/cosmetic the day it comes out.
"p values" aren't just complicated; they're a root cause of reproduction problems in studies with small sample sizes, and a general frequentist foible. Bayesians of the world, unite! (Interestingly, the "pick sub-samples" illustrations could lead to an IMO much better solution!)
Bayesians can play around with their Bayes factors all they like, but at the base, they’re still operating under a frequentist model if they're gunna do any form of null hypothesis testing. Without a criterion to reject the null (p val), you can’t falsify a hypothesis. So collect all the data you want and build up those Bayes factors, but you’re not escaping the problem of induction. :) Frequentists of the world, unite (and not be undermined by a single black swan)!
@@hamm8934 The belief that you can "reject" the null hypothesis based on a single yes/no measurement IS THE PROBLEM. (Sorry, got a little loud there.) Look at the PDF. Draw conclusions about underlying behaviors. Make better predictions and test again. Do not pretend that "there's a 96% probability in this case" and "there's a 94% probability in this case" are vastly different, binary outcomes.
@@jonwatte4293 what statistician or scientist worth their salt believes that a single positive or negative outcome is sufficient? That’s a bit of a straw man. Of course you either (1) directly replicate the result or (2) perform an extension with a different operationalization of the same hypothesis. If it isn’t replicating approximately 95% of the time, it’s quite safe to say the effect isn’t there (assuming adequate power). If it is replicating approximately 95% of the time, it’s quite safe to say the effect is there. The point I (and other frequentists) make is you have to have a criterion of falsification for null hypothesis testing. If you don’t, the very logic of hypothesis testing collapses as you are no longer able to discern a success from a failure. You have to make a judgement call for null hypothesis testing to exist. This whole notion that Bayesian stats somehow avoids or overcomes this judgement call is a complete failure to acknowledge that you are still making a judgement call, just with a different threshold. (See chp 1 and 2 of The Logic of Scientific Discovery). Get those Bayes factors as juicy as you want. It just takes 1 falsification for them to be undone. We’ll see which method is more fruitful :)
@@hamm8934 bruh I feel like you're still being incredibly disingenuous about this whole thing. The key issue with NHST is that a p-value *only* tells you p(Data | H0 = TRUE)-that's it, full stop. The far more interesting question is p(H | D), and that's entirely beyond the realm of classical frequentist methods. 'Rejecting the null' with p < .05 doesn't mean that there's a 95% chance the null is indeed false, or that the alternative is actually true. What we should be doing is systematically pitting models against each other, and this, I think, is something Bayesian methods are exquisitely well-suited for. And sure, there are some rules of thumb when you're doing Bayesian model comparison and trying to figure out how 'meaningful' the difference between models is, but it's a laughably false equivalence to say that the process of multi-model inference (literally comparing the evidence in favor of competing models) is anything close to a binary NHST decision based on differences in means or a correlation. Not to mention you can compare models based not only on the parameters you include, but on your priors, or the underlying likelihood function... Shit, you don't even need to use Bayes Factors-it's super trivial to compare models via their posterior predictive densities using Bayesian cross-validation with PSIS-LOO. All of this ranting is basically just to say that 'all models are wrong, but some are useful'-and I think if we really want to find the best models that explain (or even better, can *generate*) our data, you're gunna have a bad time with frequentist NHST.
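The gap between p(D | H0) and p(H0 | D) is easy to see with Bayes' rule. The sketch below is only an illustration: the prior and the probability of the data under the alternative are made-up inputs, and `posterior_null` is a hypothetical helper name, not anything from the talk or a library.

```python
def posterior_null(prior_null, p_data_given_null, p_data_given_alt):
    """Bayes' rule for two competing hypotheses: returns p(H0 | D)."""
    joint_null = prior_null * p_data_given_null
    joint_alt = (1.0 - prior_null) * p_data_given_alt
    return joint_null / (joint_null + joint_alt)

# Assumed inputs: a "significant" result has probability 0.05 under H0
# and probability 0.5 under the alternative (i.e. 50% power).
even_prior = posterior_null(0.5, 0.05, 0.5)
skeptical_prior = posterior_null(0.9, 0.05, 0.5)

print(f"p(H0 | D) with a 50/50 prior: {even_prior:.3f}")
print(f"p(H0 | D) with a 90% prior on H0: {skeptical_prior:.3f}")
```

The same p(D | H0) = 0.05 leads to quite different values of p(H0 | D) depending on the prior and the power, which is why "p < .05" cannot be read as "95% chance the null is false".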
I always tell people that basic statistics and sourcing should be taught at age 11. Would reduce the number of no-argument-freds and would reduce the fake news plausibility rate
Actually your boss wants to know how large the probability is of being wrong: that you pay more than you save. So you want the t-test of the SSDs compared to (HDD minus the time difference needed to pay for the SSDs). You’re not below 0.05 for that with your 4 runs, so your boss cannot be sure enough that she’ll be right. But that’s nitpicking and I really like your video :-)
But Fred didn't hypothesise that SSDs don't make any difference to build times, he was questioning the return on investment the SSDs would bring. Or am I off the mark here?
He needed to prove SSDs had any improvement at all first. After that he had a good idea on how much it improved, and eventually he proved Fred right. It would take too many daily builds for SSDs to be worth it. But before that, he needed to know what the difference even was, and after that he used a simple formula to see how much money it saved. Poor Fred just had some words put in his mouth to make the presentation go a little smoother at the beginning.
To be fair, that wasn't daily builds, it was total builds, since SSDs are a one-time investment. Getting even the lowball estimate of 210 builds out of the lifetime of the SSD is probably easily achievable, so SSDs would be a worthwhile investment.
He covered that briefly with the discussion of dev time cost and how many builds you'd have to do for the SSD to pay for itself. You have to have a null hypothesis to test, and "X isn't worth it" isn't possible, IIRC. It's been a while, but I think your test *has* to basically 'touch zero'; either x=0, x>0, etc. An "even if it does save time, does it save *enough* time" hypothesis requires a test that is basically "is x >= y" (where y is the 'threshold' where SSDs pay for themselves). It's either easier to first prove that there *is* a time difference, then calculate the 'value' of the time difference, or it's not even possible to do it the other way (or at least not with 101 statistics).
Wasn’t he wrong in choosing two tailed t-test? Since he is testing whether SSDs are faster, not just that SSD load times come from a different population than HDD’s
Fair question. His reasoning was pretty sound. He would want the one-tailed t-test if it were a safe assumption that SSDs are always either faster or the same (an assumption about the underlying distribution). Making that assumption (which is a bad assumption) is not the same as being mostly interested in finding out if they are faster (which is valid, but does allow for them being slower). His test concluded that they were different distributions, and he could also see that the difference was to SSDs' benefit.
@@ArsenicDrone The boss was specifically asking if SSDs were worth it (i.e. sufficiently faster that their mean speeds come from a different, faster, population than HDD mean load speeds). Wouldn't it be a mistake to intentionally test a broader hypothesis than you require just to verify your actual, narrower hypothesis by observation at the end?
@@mrichards Ah, one of many not-so-intuitive things about statistics. It really comes down to only making the assumptions that you can justify. What the boss was interested in doesn't determine what's possible to test or what assumptions are valid. Notice that his p-value is half as large for the one-tailed test (the result is even more significant). The test got substantially more powerful, but that power doesn't come for free, it comes by making this unjustified assumption. (It's not justified because before he runs the test, he really doesn't know which outcome will happen, and it could actually be slower.)
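For anyone who wants to see the "half as large" relationship, here is a rough sketch. Assumptions, stated loudly: the build times are invented, and it uses a normal (z) approximation from the standard library rather than the t-test from the talk, so the exact p-values would differ from what Excel reports. The halving itself holds for any symmetric test statistic.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical build times in minutes (not the numbers from the talk).
hdd = [9.8, 10.4, 10.1, 10.9, 10.3]
ssd = [9.6, 10.0, 9.7, 10.5, 9.9]

# Standardized difference of means (z approximation, unequal variances).
standard_error = sqrt(variance(hdd) / len(hdd) + variance(ssd) / len(ssd))
z = (mean(hdd) - mean(ssd)) / standard_error

standard_normal = NormalDist()
# One-tailed: only "HDD builds take longer" counts as evidence.
p_one_tailed = 1 - standard_normal.cdf(z)
# Two-tailed: a difference in either direction counts as evidence.
p_two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))

print(f"z={z:.2f} one-tailed p={p_one_tailed:.4f} two-tailed p={p_two_tailed:.4f}")
```

When the observed difference points in the predicted direction, the one-tailed p is exactly half the two-tailed p. If the SSDs had come out slower, `z` would be negative and the one-tailed p would be above 0.5, which is the price of committing to a direction up front.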
@@ArsenicDrone No, he really is mistaken. Whether or not it is a safe assumption that SSDs are always faster is actually irrelevant. What is relevant is that the hypothesis he's testing is a one-sided hypothesis--that SSDs are faster. If he had measured SSDs to be slower, by any magnitude, the hypothesis would have been rejected.
Hours played for different versions being polarised is pretty normal and there are often very good reasons for that because games have lots of humps or steep curves or brick walls. There might be something _terribly_ wrong in the tutorial that makes x% of people just not get past that. And, honestly, I prefer 20% of players go "this is amazing" and the other "bad game" than everyone saying it was "just ok".
If you have 20% of something... let's say all IBM shares, and you increase your holdings by 5% = now you have 21%. But when you say you have increased it by 5 percentage POINTS you went from 20%=>25%.
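The arithmetic as a tiny sketch (the function names are just illustrative): a 5% relative increase multiplies the stake, while 5 percentage points are added to it.

```python
def relative_increase(share_pct, increase_pct):
    """Grow a share by a percentage of itself, e.g. 20% up by 5% -> 21%."""
    return share_pct * (1 + increase_pct / 100)

def point_increase(share_pct, points):
    """Grow a share by percentage points, e.g. 20% up by 5 points -> 25%."""
    return share_pct + points

by_percent = relative_increase(20, 5)
by_points = point_increase(20, 5)

print(f"+5 percent: {by_percent:.1f}%  |  +5 percentage points: {by_points:.1f}%")
```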
20:12 When you say Fred being right is 3%, but we are using a two-tailed test. I think the conclusion should be Orange version is different than the old version, it's either better or worse.
23:17 That's 45 two-sided tests so you go look for p values below 0.00056. That gives you a 5% false positive rate overall, but I can tell you that you're almost guaranteed to find a true positive unless the classes are carbon copies of one another
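Here is where those numbers appear to come from, as a sketch. Assumptions: 10 character classes (which gives 45 pairings), a Bonferroni-style correction, and independent tests for the "no correction" estimate. The 0.00056 figure matches the corrected threshold split across the two tails.

```python
from math import comb

classes = 10          # assumption: 10 classes, so every class meets 9 others
alpha = 0.05          # overall false positive rate we are willing to accept

pairs = comb(classes, 2)             # number of class-vs-class comparisons
per_test_alpha = alpha / pairs       # Bonferroni threshold per comparison
per_tail_alpha = per_test_alpha / 2  # that threshold split over two tails

# Chance of at least one false positive if every comparison is tested at
# alpha with no correction (assuming the tests are independent).
uncorrected_rate = 1 - (1 - alpha) ** pairs

print(f"pairs={pairs} per-test={per_test_alpha:.5f} per-tail={per_tail_alpha:.5f}")
print(f"uncorrected chance of a false positive: {uncorrected_rate:.2f}")
```

Without the correction, 45 comparisons at 0.05 each would flag at least one "imbalance" about nine times out of ten even if every class were identical.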
That works if you want all 1 vs 1 fights to be mostly fair. Think of something like Street Fighter where you can't change your character mid-match. A rock paper scissors relationship would be fair but then if you are playing rock and the opponent is paper then the match isn't a good test of skill, the game was over at the character select screen. Depending on your context (something like Team Fortress or StarCraft) you might need to instead find the Nash equilibrium to make sure all units have their niche. But looking purely at win rates might mislead you if your player base is not playing optimally. Even if you can trust your win rate statistics, finding the Nash equilibrium is NP complete, meaning that each new character class exponentially increases the complexity of the problem. And there's probably units like the SCV where the kill death ratio is exceedingly bad but you can't win without them because their role is non-combat. Or a unit like the carrier (maybe? I'm not a pro) that isn't resource efficient but is a way to force the game to end if you are already ahead in resources and tech. If that's true and you analyze the carrier per unit, it might look overpowered, if you look at it per resource it might look underpowered, but it still has a niche. I guess that all I'm saying is that it's a hard problem, and game theory might be useful, but could still be difficult to apply if you have a game that is interestingly complex.
Hi! Great talk thanks!! a QUICK TIP for A/B testing! (I'm an economist) You could randomly choose who goes into the experimental/control group :) That way you don't have to switch, you just have to apply the procedure to many people once, like this: 1) New player enters 2) You generate a random number (between 0 and 1 can be) 3) Is it greater than 0.5? experimental, no? control 4) Register their group and their target number :D Even if they play only once (you don't need multiple rounds), you can compare the means between those groups ;) Thanks again for the talk!
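The steps above can be sketched roughly like this. Everything here is illustrative: the player IDs, the "minutes played" metric and its simulated values are made up, and a real game would record the actual target number instead of drawing it from a random generator.

```python
import random
from statistics import mean

def assign_group(rng):
    """Steps 2-3: draw a number in [0, 1) and pick the group from it."""
    return "experimental" if rng.random() > 0.5 else "control"

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Steps 1 and 4: each new player gets a group once, and we log
# (player_id, group, target number).
records = []
for player_id in range(1000):
    group = assign_group(rng)
    # Simulated target metric (hypothetical): minutes played in session one.
    minutes = rng.gauss(32.0 if group == "experimental" else 30.0, 5.0)
    records.append((player_id, group, minutes))

def group_mean(rows, group):
    """Average target number for one group."""
    return mean(minutes for _, g, minutes in rows if g == group)

experimental_mean = group_mean(records, "experimental")
control_mean = group_mean(records, "control")
experimental_count = sum(1 for _, g, _ in records if g == "experimental")

print(f"experimental: n={experimental_count} mean={experimental_mean:.2f}")
print(f"control: n={len(records) - experimental_count} mean={control_mean:.2f}")
```

Because the coin flip decides the group, the two groups should be similar in everything except the change being tested, so the difference in means can then go into the same kind of test the talk describes.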
Very good talk, even I'm kind of screaming at the use of p-values as "the chance that Fred is right." But you clearly know that, and are simplifying because p-values are confusing and don't actually measure quite what we use them to measure. Which is a good reason to switch to subjectivist statistics, but you can hardly explain how to responsibly use priors in a 30-minute talk.
Problem with your cupcake mode example. Making the game easier may have a positive impact in the short term and may have a negative impact long term. Short term statistics can only measure short term results.
The test was simply to determine whether difficulty had an effect on time played in either direction greater than the margin of error for the sample size. These are great as backup tests to ensure the results aren't just a fluke without an unreasonably large sample size.
@@jacobb5484 Doesn't matter. If I'm a tester and only testing the game for 10-15 minutes if its too hard I'm going to report that it's too hard. If the game gets made easier and released and I pick it up and find that 30 mins in it's too easy, I'm going to get bored and quit. I think he oversimplifies the situation.
@@lushen952 It's a simple example of a T test on a paired sample. This isn't for small engaged focus groups with detailed subjective data, but rather big data statistics such as the example of a sub mode being beta tested. In this example the T test gives a percentage chance of either: A. The change had the effect of either increasing OR decreasing what's being measured by a notable amount. B. The data is probably skewed due to bad sampling and falls within the margin of error. Once you rule that out, you can make further changes and run detailed tests to actually make an improvement.
it is.. but its is not its a video on how poor ftp gamers flock because lack of money... and how to get them to spend more.... and about how bad ssd are... in 2016 but are now 40-60% cheaper per gigabyte and much much faster... bravo to skipping the description and basic computer imp in the last 5-6 years....
@@simlife445 Moron alert. You confirmed that it is indeed a good presentation then went on some personal rant of the content you didn't like? I don't give a damn sheesh.
More often than not the data you do NOT have is more important than the data you do have. For instance, I and probably millions of other people didn't buy Dead Space 3 BECAUSE it was infested with microtransactions. There is no data for that though since a lost sale literally doesn't show up on the balance sheet. Game devs who decide NOT to "leave money on the table" by making real games without microtransactions are actually leaving a great deal of money on the table in lost sales for which they don't have any data. Game devs need leadership, empathy (essential for understanding customers even if you have no moral concerns whatsoever) and common sense to make good decisions. There isn't any amount of data that can substitute for these attributes.
Damn, I'm not a game developer. I have never googled this topic. I just wrote down the idea of a some computer game that accidentally came to mind and described the game mechanics in the note app on my android smartphone and youtube immediately recommended this video to me. Coincidence? Now I do not know whether it is good or bad...
Something I'd like to add to the graph at 19:00, the blue analytics are healthier because it produced a stronger reaction. Those are the people who are willing to put money into your game.
@@tempusername-l5d A 2 tailed test splits your significance level on both tails, so it's only half as strong as a one tailed test when showing a difference between groups IN A SPECIFIC DIRECTION. Frankly, a 2-tailed test is a sloppy but acceptable way to test, but it really shouldn't be used when you have a specific direction of difference between the groups in mind. A 1 tailed test has more power at the same alpha level. It's basically weakening your hypothesis to hedge your bets by using a 2-tailed test when you should be using one. That's why I don't like this lecture. It's a computer programmer with a SINGLE statistical tool he knows, so everything looks good to apply that tool on. It's like that old adage that if you have only a hammer, everything looks like a nail. If he were a statistician, he'd know better. But he's sitting there spouting off like he does, when in fact he's dead wrong.
@@tempusername-l5d Sure. I never said that he shouldn't use a 2-tailed test in that situation. I merely said that it's foolish to say "Always use the 2-tailed value." Edit: In science, if you have a hypothesis, your hypothesis generally has directionality to it, or you've written a piss-poor hypothesis. So, frankly, I'm often using 1-tailed tests to show that X is strictly less than/strictly greater than, on some real life data, such as, "Are female babies truly smaller than male babies?" or "Did the biodiversity index for the Upper Nooksack area truly increase due to our conservation measures?" In those cases, as a scientist trying to get published in a peer reviewed paper, I'd get laughed right out of publication for trying to use a 2-tailed test in those or many other situations where I find myself relying on statistical inference. Just saying.
@@tempusername-l5d That's an out and out lie that "most papers" use a 2-tailed test. In Lombardi and Hurlburt's study (2009), about 20% of papers in the field of biology and animal research were 1-tailed, with another 20% not telling whether their p-value was 1 or 2-tailed. So, anywhere from 24 to 40% in biology. However, I would defer to you that in sociology, for example, probably no more than 5% are one-tailed. And if you want to keep on this conversation just for the sake of being a blowhard pedant, I'd invite you to casually refrain. I'm not about to argue further with some stranger as to why my research sometimes uses a one-tailed test to support a claim. Sometimes deviations are only mathematically possible in one direction. It happens. Get over it.
Statistics are a fun way to compare datasets but unfortunately sample size and methodology usually mean that whatever conclusions you draw might be completely irrelevant. And as he's saying, the more questions you ask, the more likely you are to be completely wrong.
This is probably very helpful, but just forget everything he said if you're taking a class on stats... EDIT: This does an amazing job of teaching intuition and importance, good talk
This makes me so happy! Great talk, I learned a lot. We had 100 barrels at work that were documented to have 50 kg in each. You could quickly tell none of them were empty and it looked like our written inventory was close. The accountant (not my boss) told me to measure all of them to see how accurate we were. I measured 8 and calculated the standard deviation. Joke's on you, I’m not going to break my back and work my ass off to learn something I already know. I’m sorry if you don’t understand what I’m doing, I’ll send you a Wikipedia link after I’m done.
Guy dismissed me in the first 21 seconds. Won't pretend I'm not tempted to continue watching. Statistics as a science (rather than bad statistics as a political tool) is the only kind of math I can say I greatly enjoy.
Sony is following the monocle example right now by giving 10USD credit to random accounts. Fo'sure that's Sony's ulterior motive: Measure how much more likely people are to engage with the store and (if they're lucky) top up that 10USD to buy more expensive games... =)
The only thing that made me cringe was when he said people should ignore the one-sided p-value, when his example (and most things you'd want to test in real life) is a one-sided hypothesis. It's not necessarily that we assume/know that SSDs are faster, it's that if we find that SSDs are significantly slower, we shouldn't be rejecting the test. He is actually doing a test of size 2.5% instead of 5%.
If only that was the only cringe part... His explanation of p-values is statistical illiteracy 101. I was really surprised when I heard him making that mistake, interpreting p-values as Pr(H0). I thought that this statistical concept entered pop culture (kind of like "correlation is not causation" already did)... Amazing that people like him have the confidence to give talks on statistics.
@@tempusername-l5d he always says that the p-value is the probability of the boss (can't remember the name) being right. In other words, the probability of H0 being true. So that p
@@tempusername-l5d But that's the problem, a two-tailed test is NOT the best choice here. Look at it from the decisional perspective, what evidence would lead you to take action? Only evidence of SSDs being faster than HDDs leads to an action (buying SSDs to replace HDDs). Evidence of SSDs < HDDs or even SSDs = HDDs would lead to the same (in)action, i.e. not replacing the current HDDs. Thus, H0 = {speed of SSDs ≤ speed of HDDs} and H1 = H0^C = {speed of SSDs > speed of HDDs}. This is a one-tailed test. Period. He's using two-tailed tests because obviously he lacks the statistical experience to choose a proper significance threshold, so he chooses the canonical 5% threshold and halves it with this idiotic two-tailed choice. It's not surprising that he's not able to choose a proper threshold (after all, it's a pretty complex art, definitely inaccessible to someone who can't even understand p-values) but it's pretty surprising that he feels qualified to give a talk on statistics...
@@tempusername-l5d Yes, it's unfortunate that common statistical knowledge is so poor. Basically any soft science (ab)uses statistics and as a consequence, many psychology/economics/medicine/sociology professors feel entitled to teach statistics. They usually don't know what they're talking about, but online sources are filled with their misconceptions. But trust me when I tell you that I'm not too harsh. My comments may look excessively nitpicky, but they're absolutely fundamental to trustworthy statistics. The very idea of what hypothesis testing is and what it represents, must be 100% clear before trying to use statistics to aid in the decision making process. People like the speaker in the video, who take decisions using statistical tests that they don't understand, are extremely dangerous and scary. What they're basically doing is rolling a dice and hiding behind the veneer of "science" or "statistics". But don't let them deceive you, they're still just rolling a dice. Thankfully the speaker works in a relatively "irrelevant" field like game design, but this problem is extremely widespread in more "relevant" disciplines like psychology/medicine/economics. It's not that I'm too harsh, it's that statistics can be extremely dangerous when practiced lightly.
Not that it's the point of the presentation, but this misses the other marginal benefits of working on SSDs all the time, not just in builds. Additionally, if build time doesn't change when moving to SSDs, then the bottleneck is elsewhere and could be tackled via a different component or algorithmic improvement.
Yeah and long build times is actually one of the biggest blockers to ci/cd, which the lack of is usually the best indicator of long lead time which is the best indicator for slow development trapping more resources inside the system, increasing the number of bugs, less feedback, less data, less experimentation and less revenue. Overall meaning slower delivery and lower quality product and/or requiring more resources to deliver. And in the end you should not generally build on your local machine but do it automatically on a build server.
I think 'every game developer' isn't really apt here... more like, 'games as a service' with F2P model developers. Making a single player game has nothing to do with any of these concepts.
What do you mean? Even single player games can gather data about what the players do and then test those stats for valuable insights. Paradox games or the Total War series come to mind.
This is why gaming is so terrible and toxic now. Optimizing for maximum amount of time played and money spent above all else. They used this same sort of statistical analysis to make cigarettes more addictive. Not even a consideration for the well being of their customers, and not even an acknowledgement that they're spending hundreds of millions of dollars to make their product as addictive as possible so it can be deployed against children.
Well, yeah. Welcome to capitalism. Until the underlying philosophy and society of capitalism is minimized, companies _can't_ do it any other way. It is mandatory for them to optimize profit above all else. Any other consideration is secondary at best.
Don't forget: if you buy an SSD to improve build time, make sure to put your swap on the SSD. If you don't have a lot of RAM, the extra memory used by the compiler/linker will then spill onto the SSD as well, drastically improving your build time.
P is not the probability that the null is true; p-values are the most misunderstood aspect of frequentist statistics there is. The p-value means that, if the null were true and you did an infinite number of these experiments, a fraction p of them would have values as extreme or more extreme. With a 0.05 threshold, 1 in 20 experiments under a true null will still come out "significant". I wouldn't say 1/20 is very unlikely.
@@tempusername-l5d I'm not sure, but I don't think the distribution of the p-value is affected by rejecting or assuming the null hypothesis is correct. And of course you will have values that are less extreme. The first point in the clarifications at en.wikipedia.org/wiki/Misuse_of_p-values is basically what I tried to get across: p-values aren't the probability that a hypothesis is true.
@@tempusername-l5d It depends on the data and the test; not all of them have normal distributions. If you assume a different model you get a different distribution. I guess you are right, but it is important to distinguish the probability of the hypothesis being true (which frequentist statistics does not address) from the probability of seeing a result as extreme or more extreme assuming the hypothesis is true.
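For anyone who wants to see the "1 in 20" point from this thread rather than take it on faith, here is a rough simulation sketch. It is not the speaker's method: it assumes a simple two-sided z-test with a known standard deviation of 1, and the function names, sample size and seed are made up for illustration. Every simulated experiment has a true null, so any "significant" result is a false alarm.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: the true mean is 0 (sigma assumed known)."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(trials=20000, n=10, alpha=0.05, seed=1):
    """Share of experiments with p < alpha when the null is true in every one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Data drawn with true mean 0, so H0 really holds here.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p(sample) < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"share of true-null experiments with p < 0.05: {rate:.3f}")
```

The printed share should land near 0.05, which is the frequency of "as extreme or more extreme" results under the null. It says nothing about how likely the null itself is.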
Less time per build isn't the only thing you save with faster builds. The longer a build takes, the less often you build, which increases your lead time from feature idea to working feature, trapping value inside the system and slowing down the feedback of data or, in some cases, revenue. Waiting on your project and files to open also decreases developer productivity and gets on their nerves. But in the end, all this is a false dichotomy, as you should have a build server that does all the builds automatically and not rely on local manual builds. Continuous integration and delivery are a cornerstone of all high-performing engineering cultures for a damn good reason.
I think the hard drive example falls flat. Build times alone, with no other data, don't show anything of importance. Analyzing the stages of the build would be more beneficial. It may have shown results that point to an issue in a build step as opposed to HDD vs SSD performance.
I'm proud to have worked with Elan for several years. As you can tell, he always puts a great deal of effort into preparing for his presentations. Amazingly though, this is actually his normal level of conversational speed, clarity and humor :)
I thought he was just nervous or trying to fit in in a small time lol
Wait what do you mean "normal"? Does he have a turbo mode???
@@0netwoguy54 GAS GAS GAS
13:00, this is a really good one. With Minecraft, Mojang came to a realization that very few players had ever been to the Nether (based on the percent of the population that had the achievement "We need to go deeper!" which is received upon entering the Nether). They ended up realizing that very few non-hardcore players (players that didn't consume game related content outside of the game, like videos, guides, articles, etc...) knew that the Nether existed.
This is why they added obsidian monoliths and broken portals around the Overworld to give you hints.
Where did they say this? I’m having a hard time finding it.
@@Sarmachus Sorry I can't remember exactly. I saw it in a game development video a year or so ago, and I believe he flashed a tweet from one of the MC devs on the screen.
Regardless of its authenticity it still serves as a good example and valuable lesson.
@@_lime. Thanks for clarifying
Not only is this an amazingly useful talk, it's essentially a perfect presentation. Dope shit.
@CruzZ fake news
@CruzZ what are you talking about. This guy provided a ton of real world examples where statistics could help solve a problem. That doesn’t mean people will always use statistics for good though, he even mentions that in the presentation with an example. Just because big gaming companies suck at stats doesn’t mean his presentation wasn’t phenomenal!
@@dontfk Big gaming companies most definitely do not suck at stats; if anything, that's the one thing they master above all else. It's just that most statistics are not relevant to the players' enjoyment. They are very relevant to shareholders tho.
@@ailurusfulgens1849 You're right, I used poor word choice there. What I meant by that was that they don't always use their stats for good intentions
wow, he is a fantastic speaker. charismatic, to-the-point, funny and practical.
"Any actual statisticians are totally cringing." Yep. It's not just pedantry. People will literally not know what their test means, and then they will judge whatever change they make in hindsight anyway.
funny seeing you here, love your vids
Easier to digest and more accurate statistics content on PrimerBlobs's channel
And the currently 1.7 million subscribers agree
Some of the GDC talks are very badly presented for YouTube videos. Not this one. This was great, in just about every way.
As a Biologist (MS)... i was indeed shouting at my screen when you were talking about P values....and then you called it out so im happy now.
As an educator and a grateful listener: that was bril-li-ant.
That was fantastic. Their presentation skills are off the charts.
Thanks for the talk. I had to learn this stuff on the job at Big Software, Inc. when we started measuring PC boot time impact. There were large variances between each boot.
@12:53 A thing to note is that in that example, people have been playing the "hard" puzzle before and the "easy" puzzle is a novelty, which may cause players to spend more time on it for the experiment, without it being the better solution long term.
Eiði
Excellent presentation. THIS was simplified? I am afraid of the scenic route xD I wish to know more, and more practical applications in game dev.
Extremely simplified. Statistics is, like, a whole field of mathematics.
It matters a lot for procedural generation, as statistical distribution is a huge part of random number generation. It can also be used to approximate various things... replacing physics in some cases. Sometimes called an "analytical" solution, you can see this show up in some games' oceans, etc. The ocean is based on statistical analysis of real oceans, instead of trying to actually simulate fluid dynamics. I'm sure there are more uses than that, especially outside the game.
Yeah, that thing he did was called hypothesis testing. That took me a good 20 hours during a single week to figure out how to do it by hand at school. Finding out that it could be done in a minute in excel blew my mind.
As a data scientist... it's a LOT. But really, you don't need all the math to do it practically. You really just need to know the basic definitions, and what the test does. And there you go you got analysis. If you're a game dev, assuming you got some programming experience, you can already do a lot of these things in the language R, with very little effort, and even very easily build some machine learning models.
I essentially got most of this stuff during my semester of statistics class. As he said, he pretty much blazes through it, you mostly need time to understand when what is used, why to use it, what the downsides of using it are, etc and lastly of course, HOW to use it.
Most interesting and useful presentation about statistics I've ever watched.
While I wouldn't do everything identically, I didn't have any large complaints, which is not generally what happens listening to quick statistics intros. A good talk.
Love this talk! I spent quite some time trying to derive the 8.14 confidence interval in the first example and finally had to install Excel to verify. I couldn't see it at first but the slides actually mix five and six observations. At ~7:38 there are five observations. At 8:19, the confidence interval is calculated using six observations, i.e. T_DIFFMEANS(A2:A7, ...) rather than the 2 x 5 observations shown on the left.
Absolutely fantastic presentation, would love to hear him speak more
a semester of stats in 30min. thanks guy.
This video has existed for almost 4 years and it feels like not a single game dev has ever watched it.
Their sales division has warehouses of supercomputers simulating human brain functions trying to figure out how crap a game can be before you will buy it, and just how much you will spend on DLC just to play the game at all.
Excellent, used this to explain the p-Value to some colleagues, since our data science team is not able to explain their models that well...
People are awful at five-star ratings whether that be a game, book, movie, show, item, etc. Basically, people will give 4-5 if the product was at all fun or engaging, or a 1 if there was a problem/complaint/issue or any offense taken.
Good video. Statistics are fun.
Chik-fil-a is not a five-star establishment, people >_>
Incredible talk. Thank you so much!
The first talk where I needed to decrease the playback speed instead of increasing. Great material! =)
One of the best explanations of the T test I have ever seen, read, or perceived in any medium.
God, statistics is why I can't ever tell anyone I am sure of something.
"Hey does this code work like X?"
"Well, I was there during requirements gathering, I wrote the code, deployed it, and no one has changed it since. So I think so!"
"Yes or no?" .... uhhhhhh
At 19:00, do you care about the median? I think that's a rather brazen assumption!
Sometimes it's better to have some people who are really invested and really care, and thus are willing to spend on your product, rather than a lot of people who will play for free but don't care enough to spend money or come back repeatedly.
Great talk overall though!!
This depends on assumptions; the assumption here probably is "I am optimizing my game for ability for at least most of the initial group to make it through".
That wasn't actually the point he was trying to make. Especially with a small sample size, outliers greatly skew the mean.
As for the point of a few dedicated players willing to spend money, that only works if it's a game that does not depend on having an active online community.
I know I'm really late to finding your comment, but I thought the same thing! Also, Mark Rosewater (of Magic: The Gathering) has a presentation on YouTube about Game Design and in HIS opinion, that highly polarized distribution is better. It's better to make something that SOME people love even if some other people hate it, instead of something that everyone gives a 'meh' to.
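The point above about outliers skewing the mean in a small sample is easy to check. A minimal sketch with made-up hours-played numbers (the data is purely hypothetical, not from the talk):

```python
from statistics import mean, median

hours = [1, 2, 2, 3, 3]        # hypothetical hours played by five players
with_outlier = hours + [60]    # one very dedicated player joins the sample

# The single outlier drags the mean far away; the median barely moves.
print("without outlier:", mean(hours), median(hours))
print("with outlier:   ", round(mean(with_outlier), 2), median(with_outlier))
```

Which summary you should care about still depends on the business question being argued in this thread; the sketch only shows how differently the two numbers react.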
In game design, I think it's the difference between "cult classic that some people love and play forever" and "totally forgettable game that disappears in two weeks". If at least some group loves it, it can spread by word-of-mouth and certain reviews. Provided your budget was appropriate to build a niche game, you can have a success... while some game that everyone merely tolerates probably makes no impact and loses money.
@@stuartconrod8364 exactly :)
Better to be hated by 90% ignored by 5% and loved by 5%
Than hated by 20% ignored by 80% and loved by none.
Who is going to spend money on a product they don’t love when they have so many alternatives. Plus all those haters are free press too!
I think we should lean into the fans more. Look at Dark Souls: it's brutal, unforgiving and very niche, but clearly doing fine.
League of Legends: unforgiving, brutal player interactions, but doing fantastically well. Counter-Strike, same thing.
Yes, I do think we should keep games accessible, but not at the cost of what the fans love.
I think, for example, what Halo Infinite is doing is great: bringing back bots to practice offline before going into the fray.
That allows the multiplayer to be as cutthroat and great as it always was, without unlockable weapons that give you an edge at the start of the round.
No, everyone starts with the same weapons, and you need to earn and fight over better ones, so it's a true skill matchup.
That's why it's so unforgiving to new players, but also why it's so incredibly good.
@@FreekHoekstra it really REALLY depends on how you make money off your game.
If it's some recurring revenue, then you need to retain a decent number of players.
If it's a game which has interaction between users, then you need a decent player pool.
If it's a one-time purchase, you can keep it mediocre across the board.
If it's for e-sport publicity, you better make that as balanced as possible. Make the goals easy to understand and controls simple enough to get players pouring in.
Overall, no matter the game, a larger pool of players will bring more potential spenders, and of those players only 20% of them will be providing your entire income.
Money keeps a business going. So making a game for only 2 people is a ridiculous endeavor unless each piece of content is a guaranteed buy and they cannot continue into the new 'season' without making their purchases... Though if only two people are playing they'll need to be spending hundreds of thousands each time you release content.
in a free to play game competition drives purchases. You need some fodder for the big spenders to show off their purchases/power to, or they have no reason to buy the newest released item/cosmetic the day it comes out.
Happy new year!
I took an entire statistics course in college and I can remember almost everything he said
His style reminds me of the professor that made me fall in love with stats.
"p values" aren't just complicated; they're a root cause of reproduction problems in studies with small sample sizes, and a general frequentist foible. Bayesians of the world, unite!
(Interestingly, the "pick sub-samples" illustrations could lead to an IMO much better solution!)
Bayesians can play around with their Bayes factors all they like, but at the base, they’re still operating under a frequentist model if theyre gunna do any form of null hypothesis testing.
Without a criterion to reject the null (p val), you can’t falsify a hypothesis. So collect all the data you want and build up those Bayes factors, but you’re not escaping the problem of induction. :)
Frequentists of the world, unite (and not be undermined by a single black swan)!
@@hamm8934 The belief that you can "reject" the null hypothesis based on a single yes/no measurement IS THE PROBLEM. (Sorry, got a little loud there.)
Look at the PDF.
Draw conclusions about underlying behaviors.
Make better predictions and test again.
Do not pretend that "there's a 96% probability in this case" and "there's a 94% probability in this case" are vastly different, binary outcomes.
@@jonwatte4293 what statistician or scientist worth their salt believes that a single positive or negative outcome is sufficient? That’s a bit of a straw man. Of course you either (1) directly replicate the result or (2) perform an extension with a different operationalization of the same hypothesis. If it isn’t replicating approximately 95% of the time, it’s quite safe to say the effect isn’t there (assuming adequate power). If it is replicating approximately 95% of the time, it’s quite safe to say the effect is there.
The point I (and other frequentists) make is you have to have a criterion of falsification for null hypothesis testing. If you don’t, the very logic of hypothesis testing collapses as you are no longer able to discern a success from a failure. You have to make a judgement call for null hypothesis testing to exist. This whole notion that Bayesian stats somehow avoids or overcomes this judgement call is a complete failure to acknowledge that you are still making a judgement call, just with a different threshold. (See chp 1 and 2 of The Logic of Scientific Discovery).
Get those Bayes factors as juicy as you want. It just takes 1 falsification for them to be undone. We’ll see which method is more fruitful :)
@@hamm8934 bruh I feel like you're still being incredibly disingenuous about this whole thing. The key issue with NHST is that a p-value *only* tells you p(data at least this extreme | H0 = TRUE); that's it, full stop. The far more interesting question is p(H | D), and that's entirely beyond the realm of classical frequentist methods. 'Rejecting the null' with p < .05 doesn't mean that there's a 95% chance the null is indeed false, or that the alternative is actually true. What we should be doing is systematically pitting models against each other, and this, I think, is something Bayesian methods are exquisitely well-suited for. And sure, there are some rules of thumb when you're doing Bayesian model comparison and trying to figure out how 'meaningful' the difference between models is, but it's a laughably false equivalence to say that the process of multi-model inference (literally comparing the evidence in favor of competing models) is anything close to a binary NHST decision based on differences in means or a correlation. Not to mention you can compare models based not only on the parameters you include, but on your priors, or the underlying likelihood function... Shit, you don't even need to use Bayes factors; it's super trivial to compare models via their posterior predictive densities using Bayesian cross-validation with PSIS-LOO. All of this ranting is basically just to say that 'all models are wrong, but some are useful', and I think if we really want to find the best models that explain (or even better, can *generate*) our data, you're gunna have a bad time with frequentist NHST.
@@neur0leptic782 Preach
I always tell people that basic statistics and sourcing should be taught at age 11. Would reduce the number of no-argument-freds and would reduce the fake news plausibility rate
You talk exactly like Jesse Eisenberg from the Social Network when he's coding. It's fantastic.
Actually your boss wants to know how large the probability is of being wrong: that you pay more than you save.
So you want the t-test of the SSDs compared to (HDD minus the time difference needed to pay for the SSDs). You’re not below 0.05 for that with your 4 runs, so your boss cannot be sure enough that she’ll be right.
But that’s nitpicking and I really like your video :-)
But Fred didn't hypothesise that SSDs don't make any difference to build times, he was questioning the return on investment the SSDs would bring. Or am I off the mark here?
He needed to prove SSDs had any improvement at all first. After that he had a good idea on how much it improved, and eventually he proved Fred right. It would take too many daily builds for SSDs to be worth it.
But before that, he needed to know what the difference even was, and after that he used a simple formula to see how much money it saved. Poor Fred just had some words put in his mouth to make the presentation go a little smoother at the beginning.
To be fair, that wasn't daily builds, it was total builds, since SSDs are a one-time investment. Getting even the lowball estimate of 210 builds out of the lifetime of the SSD is probably easily achievable, so SSDs would be a worthwhile investment.
He covered that briefly with the discussion of dev time cost and how many builds you'd have to do for the SSD to pay for itself.
You have to have a null hypothesis to test, and "X isn't worth it" isn't possible, IIRC. It's been a while, but I think your test *has* to basically 'touch zero'; either x=0, x>0, etc. An "even if it does save time, does it save *enough* time" hypothesis requires a test that is basically "is x >= y" (where y is the 'threshold' where SSDs pay for themselves). It's either easier to first prove that there *is* a time difference, then calculate the 'value' of the time difference, or it's not even possible to do it the other way (or at least not with 101 statistics).
@@donanderson3653 Also, SSDs can speed up OS and App boot times as well as many other tasks, so it's ignoring a lot of the other benefits they give.
Wasn’t he wrong in choosing two tailed t-test? Since he is testing whether SSDs are faster, not just that SSD load times come from a different population than HDD’s
Yes.
Fair question. His reasoning was pretty sound. He would want the one-tailed t-test if it were a safe assumption that SSDs are always either faster or the same (an assumption about the underlying distribution). Making that assumption (which is a bad assumption) is not the same as being mostly interested in finding out if they are faster (which is valid, but does allow for them being slower). His test concluded that they were different distributions, and he could also see that the difference was to SSDs' benefit.
@@ArsenicDrone The boss was specifically asking if SSDs were worth it (i.e. sufficiently faster that their mean speeds come from a different, faster, population than HDD mean load speeds). Wouldn't it be a mistake to intentionally test a broader hypothesis than you require just to verify your actual, narrower hypothesis by observation at the end?
@@mrichards Ah, one of many not-so-intuitive things about statistics. It really comes down to only making the assumptions that you can justify. What the boss was interested in doesn't determine what's possible to test or what assumptions are valid. Notice that his p-value is half as large for the one-tailed test (the result is even more significant). The test got substantially more powerful, but that power doesn't come for free, it comes by making this unjustified assumption. (It's not justified because before he runs the test, he really doesn't know which outcome will happen, and it could actually be slower.)
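The "p-value is half as large for the one-tailed test" remark can be reproduced with a small sketch. To keep it runnable with only the standard library it uses an exact permutation test on the difference in totals, not the t-test from the talk, and the build times below are invented for illustration. With equal group sizes the permutation distribution is symmetric, so the one-sided value comes out at exactly half the two-sided one.

```python
from itertools import combinations

# Hypothetical build times in seconds (not the numbers from the slides).
hdd = [612, 640, 598, 655, 630]
ssd = [571, 590, 560, 605, 580]


def permutation_p_values(a, b):
    """Exact permutation test on sum(a) - sum(b); returns (one_sided, two_sided)."""
    pooled = a + b
    total = sum(pooled)
    observed = sum(a) - sum(b)
    one_sided = two_sided = splits = 0
    # Enumerate every way of relabelling len(a) of the runs as group "a".
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a - (total - sum_a)
        splits += 1
        one_sided += diff >= observed            # only "a is slower" counts
        two_sided += abs(diff) >= abs(observed)  # a difference either way counts
    return one_sided / splits, two_sided / splits


p_one, p_two = permutation_p_values(hdd, ssd)
print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

The extra power of the one-sided version comes purely from refusing to count results in the other direction, which is the assumption being argued about in this thread.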
@@ArsenicDrone No, he really is mistaken. Whether or not it is a safe assumption that SSDs are always faster is actually irrelevant. What is relevant is that the hypothesis he's testing is a one-sided hypothesis--that SSDs are faster. If he had measured SSDs to be slower, by any magnitude, the hypothesis would have been rejected.
Hours played for different versions being radicalised is pretty normal and there are often very good reasons for that because games have lots of humps or steep curves or brick walls. There might be something _terribly_ wrong in the tutorial that makes x% of people just not get past that.
And, honestly, I prefer 20% of players go "this is amazing" and the other "bad game" than everyone saying it was "just ok".
Worth a listen.
@15:30 "As opposed to 20 to 22", doesn't he mean 21% instead of 22% or am I missing something?
If you have 20% of something... let's say all IBM shares, and you increase your holdings by 5% = now you have 22%. But when you say you have increased it by 5 percentage POINTS you went from 20%=>25%.
@@dezimal9143 I bamboozled myself. meant to say 21% sry.
How is 5% of 20 = 2 ?
@@YT775 Actually it isn't 2% I didn't check the math xD. And you are right it should be 21 vs 25%.
Thanks, so I guess there's no hidden meaning, it was just a minor error/inaccuracy of the speaker. :)
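The arithmetic this thread settles on (21% vs 25%) in a few lines, as a quick sanity check. The function names are made up for illustration:

```python
def relative_increase(rate_percent, percent):
    """Grow a rate by a relative percentage: 20% up by 5% means 20 * 1.05."""
    return rate_percent * (1 + percent / 100)


def point_increase(rate_percent, points):
    """Grow a rate by percentage points: 20% up by 5 points means 20 + 5."""
    return rate_percent + points


base = 20.0
print("relative +5%:", round(relative_increase(base, 5), 2))
print("+5 points:   ", point_increase(base, 5))
```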
the best presentation I have ever seen
20:12 You say the chance of Fred being right is 3%, but we are using a two-tailed test. I think the conclusion should be that the Orange version is different from the old version: it's either better or worse.
Wonderful talk
Great talk and amazing delivery :)
23:17 That's 45 two-sided tests so you go look for p values below 0.00056. That gives you a 5% false positive rate overall, but I can tell you that you're almost guaranteed to find a true positive unless the classes are carbon copies of one another
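One way to arrive at the numbers in the comment above, as a sketch: it assumes 10 classes (which gives the 45 pairings) and a Bonferroni correction; the comment names neither, so treat both as my guess. Under that reading, 0.05 / 45 is the cutoff for each two-sided p-value, and 0.00056 is that cutoff split across the two tails.

```python
from math import comb

classes = 10                    # assumption: 10 classes yields 45 one-vs-one pairings
alpha = 0.05                    # overall false positive rate we are willing to accept

pairings = comb(classes, 2)     # number of distinct class pairs
per_test = alpha / pairings     # Bonferroni cutoff for each two-sided p-value
per_tail = per_test / 2         # the same cutoff split across the two tails

print(pairings, round(per_test, 5), round(per_tail, 5))
```

Bonferroni keeps the chance of any false positive across all the tests at or below alpha, at the cost of making each individual test stricter.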
That works if you want all 1 vs 1 fights to be mostly fair. Think of something like Street Fighter where you can't change your character mid-match. A rock paper scissors relationship would be fair but then if you are playing rock and the opponent is paper then the match isn't a good test of skill, the game was over at the character select screen.
Depending on your context (something like Team Fortress or StarCraft) you might need to instead find the Nash equilibrium to make sure all units have their niche. But looking purely at win rates might mislead you if your player base is not playing optimally. Even if you can trust your win rate statistics, finding the Nash equilibrium is NP complete, meaning that each new character class exponentially increases the complexity of the problem.
And there's probably units like the SCV where the kill death ratio is exceedingly bad but you can't win without them because their role is non-combat. Or a unit like the carrier (maybe? I'm not a pro) that isn't resource efficient but is a way to force the game to end if you are already ahead in resources and tech. If that's true and you analyze the carrier per unit, it might look overpowered, if you look at it per resource it might look underpowered, but it still has a niche.
I guess that all I'm saying is that it's a hard problem, and game theory might be useful, but could still be difficult to apply if you have a game that is interestingly complex.
Absolutely brilliant.
Great intro, but as an artist, I WISH it took a year to be Rembrandt lol
Great talk too!
Hi! Great talk thanks!! a QUICK TIP for A/B testing! (I'm an economist) You could randomly choose who goes into the experimental/control group :) That way you don't have to switch, you just have to apply the procedure to many people once, like this: 1) New player enters 2) You generate a random number (between 0 and 1 can be) 3) Is it greater than 0.5? experimental, no? control 4) register their group and their target number :D Even if they play only once (you don't need multiple rounds), you can compare the means between those groups ;) Thanks again for the talk!
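The random-assignment steps in the comment above could look roughly like this. Everything here is illustrative: the player loop, the `minutes` metric and its distribution are placeholders, and both groups are deliberately drawn from the same distribution, so no real effect is built in.

```python
import random
from statistics import fmean


def assign_group(rng):
    """Steps 2 and 3: draw a number in [0, 1) and split at 0.5."""
    return "experimental" if rng.random() > 0.5 else "control"


rng = random.Random(42)   # fixed seed so the sketch is repeatable
records = []
for player_id in range(1000):                 # step 1: a new player enters
    group = assign_group(rng)
    minutes = rng.gauss(30.0, 10.0)           # placeholder for the measured metric
    records.append((player_id, group, minutes))   # step 4: register group and metric

# Compare the group means; a significance test on the two groups would follow.
means = {
    name: fmean(m for _, g, m in records if g == name)
    for name in ("experimental", "control")
}
counts = {name: sum(1 for _, g, _ in records if g == name) for name in means}
print(counts, {name: round(value, 2) for name, value in means.items()})
```

Because assignment is random, the two groups should be comparable on average, so a difference in means can be attributed to the change rather than to who happened to play which version.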
how he answered that first question was amazing, you can see he knows his shit.
Very good talk, even if I'm kind of screaming at the use of p-values as "the chance that Fred is right." But you clearly know that, and are simplifying because p-values are confusing and don't actually measure quite what we use them to measure. Which is a good reason to switch to subjectivist statistics, but you can hardly explain how to responsibly use priors in a 30-minute talk.
As a math dummy, this talk made my brain implode.
Watched the Spiderman talk then this one. Just, damn, passion. Awesome.
24:14
I dunno, I love my coffee black and I think that study has a point xD
Incredible talk
Problem with your cupcake mode example.
Making the game easier may have a positive impact in the short term and may have a negative impact long term. Short term statistics can only measure short term results.
The test was simply to determine whether difficulty had an effect on time played in either direction greater than the margin of error for the sample size.
These are great as backup tests to ensure the results aren't just a fluke without an unreasonably large sample size.
@@jacobb5484 Doesn't matter. If I'm a tester and only testing the game for 10-15 minutes if its too hard I'm going to report that it's too hard. If the game gets made easier and released and I pick it up and find that 30 mins in it's too easy, I'm going to get bored and quit.
I think he oversimplifies the situation.
@@lushen952 It's a simple example of a t-test on a paired sample. This isn't for small engaged focus groups with detailed subjective data, but rather big-data statistics such as the example of a sub mode being beta tested.
In this example the t-test gives a percentage chance of either:
A. The change had the effect of either increasing OR decreasing what's being measured by a notable amount.
B. The data is probably skewed due to bad sampling and falls within the margin of error.
Once you rule that out, you can make further changes and run detailed tests to actually make an improvement.
I'll throw out a question that's out of fashion these days: how many players are having fun with my game, and are thus eager to buy anything at all at my shop?
If your game has a shop you've already failed.
@@MrDavidCollins that's just objectively wrong; lots of incredibly good and fun games have shops
This kinda stuff is what game dev tycoon is missing
This is enjoyable, thanks :)
youtube: *recommends me this video*
me, who's literally never gonna use any of this: *interesting*
I didn't even watch this but scrolled through a few times and could tell this is an amazing presentation. Will watch later bravo.
It is... but it's not. It's a video on how poor F2P gamers flock because of lack of money... and how to get them to spend more... and about how bad SSDs are... in 2016, but they are now 40-60% cheaper per gigabyte and much, much faster... Bravo for skipping the description and the basic computer improvements of the last 5-6 years...
@@simlife445 Moron alert. You confirmed that it is indeed a good presentation then went on some personal rant of the content you didn't like? I don't give a damn sheesh.
More often than not the data you do NOT have is more important than the data you do have. For instance, I and probably millions of other people didn't buy Dead Space 3 BECAUSE it was infested with microtransactions. There is no data for that though since a lost sale literally doesn't show up on the balance sheet.
Game devs who decide NOT to "leave money on the table" by making real games without microtransactions are actually leaving a great deal of money on the table in lost sales for which they don't have any data.
Game devs need leadership, empathy (essential for understanding customers even if you have no moral concerns whatsoever) and common sense to make good decisions. There isn't any amount of data that can substitute for these attributes.
With moderate power comes moderate responsibility
Damn, I'm not a game developer. I have never googled this topic. I just wrote down an idea for a computer game that accidentally came to mind and described the game mechanics in the note app on my Android smartphone, and YouTube immediately recommended this video to me. Coincidence? Now I do not know whether it is good or bad...
this guy is my new idol lol.
pls tell me the name of the book where i can find all this shit in detail, specifically applied to game cases
Okay YouTube recommendations, I clicked it.
Something I'd like to add to the graph at 19:00, the blue analytics are healthier because it produced a stronger reaction. Those are the people who are willing to put money into your game.
It was great, right up to the "always use the 2-tailed value." Tons of circumstances where it's better to use a one-sided t-test.
In fact, his own first example should have been a one-tailed test.
@@tempusername-l5d A 2 tailed test splits your significance level on both tails, so it's only half as strong as a one tailed test when showing a difference between groups IN A SPECIFIC DIRECTION.
Frankly, a 2-tailed test is a sloppy but acceptable way to test, but it really shouldn't be used when you have a specific direction of difference between the groups in mind. A 1 tailed test has more power at the same alpha level. It's basically weakening your hypothesis to hedge your bets by using a 2-tailed test when you should be using one.
That's why I don't like this lecture. It's a computer programmer with a SINGLE statistical tool he knows, so everything looks good to apply that tool on. It's like that old adage that if you have only a hammer, everything looks like a nail. If he were a statistician, he'd know better. But he's sitting there spouting off like he does, when in fact he's dead wrong.
@@tempusername-l5d Sure. I never said that he shouldn't use a 2-tailed test in that situation. I merely said that it's foolish to say "Always use the 2-tailed value."
Edit: In science, if you have a hypothesis, your hypothesis generally has directionality to it, or you've written a piss-poor hypothesis. So, frankly, I'm often using 1-tailed tests to show that X is strictly less than/strictly greater than, on some real life data, such as, "Are female babies truly smaller than male babies?" or "Did the biodiversity index for the Upper Nooksack area truly increase due to our conservation measures?" In those cases, as a scientist trying to get published in a peer reviewed paper, I'd get laughed right out of publication for trying to use a 2-tailed test in those or many other situations where I find myself relying on statistical inference. Just saying.
@@tempusername-l5d That's an out and out lie that "most papers" use a 2-tailed test. In Lombardi and Hurlburt's study (2009), about 20% of papers in the field of biology and animal research were 1-tailed, with another 20% not telling whether their p-value was 1 or 2-tailed. So, anywhere from 24 to 40% in biology. However, I would defer to you that in sociology, for example, probably no more than 5% are one-tailed. And if you want to keep on this conversation just for the sake of being a blowhard pedant, I'd invite you to casually refrain. I'm not about to argue further with some stranger as to why my research sometimes uses a one-tailed test to support a claim. Sometimes deviations are only mathematically possible in one direction. It happens. Get over it.
excellent talk
Statistics are a fun way to compare datasets but unfortunately sample size and methodology usually mean that whatever conclusions you draw might be completely irrelevant. And as he's saying, the more questions you ask, the more likely you are to be completely wrong.
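The "more questions you ask" point can be made concrete with a short sketch. It assumes every test is independent and every null hypothesis is actually true, which is a simplification; the function name is just for illustration.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true (assumed for this sketch)."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>3} tests at alpha=0.05 -> {rate:.3f}")
```

With 20 independent questions at alpha = 0.05 the chance of at least one spurious "significant" result is already about 64%.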
This is probably very helpful, but just forget everything he said if you're taking a class on stats...
EDIT: This does an amazing job of teaching intuition and importance, good talk
Did anyone else look at the picture of Rembrandt that he had up there and think that it looked peculiarly similar to him?
I still have no idea how Fred is not convinced with an upgrade that generates $34 per 100 players of profit.
17:27 Watching him shit on gullible health journalists in the COVID timeline. It's like he knew.
Any presentation that's got a 538 joke in there is a good presentation
I understood only half....need to brush up on my probability
I feel educated
This makes me so happy! Great talk I learned a lot.
We had 100 barrels at work that were documented to have 50 kg in each. You could quickly tell none of them were empty and it looked like our written inventory was close. The account (not my boss) told me to measure all of them to see how accurate we were. I measured 8 and calculated the standard deviation.
Joke's on you, I’m not going to break my back and work my ass off to learn something I already know. I’m sorry if you don’t understand what I’m doing; I’ll send you a Wikipedia link after I’m done.
12:37 negative less? Wait, that's more!
14:36 - Historical and linguistic horror: Sans-culottes MEANT with pants...
that was awesome
Gold on YouTube :)
Guy dismissed me in the first 21 seconds.
Won't pretend I'm not tempted to continue watching. Statistics as a science (rather than bad statistics as a political tool) is the only kind of math I can say I greatly enjoy.
Sony is following the monocle example right now by giving 10USD credit to random accounts. Fo'sure that's Sony's ulterior motive: Measure how much more likely people are to engage with the store and (if they're lucky) top up that 10USD to buy more expensive games... =)
If I were this guy's boss I'd be like: OK, if I buy the f*ck*ng SSDs will you shut up?!!
If you're a game developer, and didn't take AP statistics, please tell me how you became a game developer?
lots of practice by making mods, level design, digital modeling, etc.?
I took statistics and didn't become a game developer (at a company, I just make it all myself now).
College costs too much
The only thing that made me cringe was when he said people should ignore the one-sided p-value, when his example (and most things you'd want to test in real life) is a one-sided hypothesis. It's not necessarily that we assume/know that SSDs are faster, it's that if we find that SSDs are significantly slower, we shouldn't be rejecting the test. He is actually doing a test of size 2.5% instead of 5%.
If only that was the only cringe part... His explanation of p-values is statistical illiteracy 101.
I was really surprised when I heard him making that mistake, interpreting p-values as Pr(H0).
I thought that this statistical concept entered pop culture (kind of like "correlation is not causation" already did)... Amazing that people like him have the confidence to give talks on statistics.
@@tempusername-l5d he always says that the p-value is the probability of the boss (can't remember the name) being right. In other words, the probability of H0 being true. So that p
@@tempusername-l5d But that's the problem, a two-tailed test is NOT the best choice here.
Look at it from the decisional perspective, what evidence would lead you to take action? Only evidence of SSDs being faster than HDDs leads to an action (buying SSDs to replace HDDs).
Evidence of SSDs < HDDs or even SSDs = HDDs would lead to the same (in)action, i.e. not replacing the current HDDs.
Thus, H0 = {speed of SSDs ≤ speed of HDDs} and H1 = H0^C = {speed of SSDs > speed of HDDs}. This is a one-tailed test. Period.
He's using two-tailed tests because obviously he lacks the statistical experience to choose a proper significance threshold, so he chooses the canonical 5% threshold and halves it with this idiotic two-tailed choice.
It's not surprising that he's not able to choose a proper threshold (after all, it's a pretty complex art, definitely inaccessible to someone who can't even understand p-values) but it's pretty surprising that he feels qualified to give a talk on statistics...
@@tempusername-l5d Yes, it's unfortunate that common statistical knowledge is so poor. Basically any soft science (ab)uses statistics and as a consequence, many psychology/economics/medicine/sociology professors feel entitled to teach statistics. They usually don't know what they're talking about, but online sources are filled with their misconceptions.
But trust me when I tell you that I'm not too harsh. My comments may look excessively nitpicky, but they're absolutely fundamental to trustworthy statistics. The very idea of what hypothesis testing is and what it represents, must be 100% clear before trying to use statistics to aid in the decision making process.
People like the speaker in the video, who take decisions using statistical tests that they don't understand, are extremely dangerous and scary. What they're basically doing is rolling a dice and hiding behind the veneer of "science" or "statistics". But don't let them deceive you, they're still just rolling a dice.
Thankfully the speaker works in a relatively "irrelevant" field like game design, but this problem is extremely widespread in more "relevant" disciplines like psychology/medicine/economics. It's not that I'm too harsh, it's that statistics can be extremely dangerous when practiced lightly.
"The relative risk of somebody in control group b buying pants.."
Relative risk of buying pants? What are you making these pants out of?
Kryptonite.
Not that it's the point of the presentation, but this misses the other marginal benefits of working on SSDs all the time, not just in builds.
Additionally, if build time doesn't change when moving to SSDs, then the bottleneck is elsewhere and could be tackled via a different component or algorithmic improvement.
Or that this is a 5-year-old GDC session (read the description), so this data is insanely old. SSDs are 60% cheaper per gig and much faster.
Yeah, and long build times are actually one of the biggest blockers to CI/CD. Lack of CI/CD is usually the best indicator of long lead time, which in turn is the best indicator of slow development: more resources trapped inside the system, more bugs, less feedback, less data, less experimentation and less revenue. Overall that means slower delivery and a lower quality product, and/or more resources needed to deliver. And in the end you generally shouldn't build on your local machine but do it automatically on a build server.
God - like speaker
Send this to chris
This could have been half as long, probably
cool math bro
I think 'every game developer' isn't really apt here... more like, 'games as a service' with F2P model developers. Making a single player game has nothing to do with any of these concepts.
What do you mean? Even single player games can gather data about what the players do and then test those stats for valuable insights. Paradox games or Total War series comes to mind.
Holy sh*$ a math talk I could ACTUALLY follow! BEST math GDC talk of all time. *awards him Golden YouTube User ClickedCookie.
I’m more confused after the video~~~
The question posed in the thumbnail says everything about why video game quality has plummeted in the past decade.
This is why gaming is so terrible and toxic now.
Optimizing for maximum amount of time played and money spent above all else.
They used this same sort of statistical analysis to make cigarettes more addictive.
Not even a consideration for the well being of their customers, and not even an acknowledgement that they're spending hundreds of millions of dollars to make their product as addictive as possible so it can be deployed against children.
Well, yeah. Welcome to capitalism. Until the underlying philosophy and society of capitalism is minimized, companies _can't_ do it any other way. It is mandatory for them to optimize profit above all else. Any other consideration is secondary at best.
Don't forget: if you buy an SSD to improve build time, make sure to put your swap on the SSD. If you don't have a lot of RAM, the extra memory used by the compiler/linker will then go on the SSD as well, drastically improving your build time.
P is not the probability that the null is true; p-values are the most misunderstood aspect of frequentist statistics there is. The p-value means that if the null were true and you did an infinite number of these experiments, a fraction p of them would have values as extreme or more extreme. In 1 out of 20 such experiments you will get a p-value below 0.05. I wouldn't say 1/20 is very unlikely.
@@tempusername-l5d I'm not sure, but I don't think the distribution of the p value is affected by rejecting or assuming the null hypothesis is correct. And of course you will have values that are less extreme.
The first point in the clarifications at en.wikipedia.org/wiki/Misuse_of_p-values is basically what I tried to get through: p-values aren't about probabilities.
@@tempusername-l5d It depends on the data and test; not all of them have normal distributions. If you assume a different model you get a different distribution. But I guess you are right. Still, it is important to distinguish the probability of the hypothesis being true (which frequentist statistics does not address) from the probability of seeing a result as extreme or more extreme assuming the hypothesis is true.
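For anyone who wants to see the "as extreme or more extreme, assuming the null is true" reading in action, here is a rough simulation sketch. It assumes normally distributed data with a known standard deviation of 1 and uses a two-tailed z-test; the function name, sample size, number of experiments and seed are all arbitrary choices for illustration.

```python
import random
from statistics import NormalDist


def simulate_null_p_values(num_experiments=10000, sample_size=20, seed=0):
    """Two-tailed z-test p-values for experiments where the null is true.

    Each experiment draws `sample_size` values from a standard normal
    (true mean 0, known sd 1) and tests H0: mean == 0.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(num_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        # z = sample mean / (sd / sqrt(n)), with sd = 1 by assumption
        z = sample_mean * sample_size ** 0.5
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values


p_values = simulate_null_p_values()
fraction_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"fraction with p < 0.05 under a true null: {fraction_significant:.3f}")
```

The fraction should land close to 0.05: roughly 1 in 20 experiments looks "significant" even though nothing is going on. Note that the simulation says nothing about how probable the null itself is, which is the distinction being argued over in this thread.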
Time isn't the only thing you save with faster build times. The longer a build takes, the less often you build, which increases your lead time from feature idea to working feature, trapping value inside the system and slowing down the feedback of data or, in some cases, revenue. Slow opening of your project and files also decreases developer productivity and gets on their nerves. But in the end, all this is a false dichotomy, as you should have a build server that does all the builds automatically and not rely on local manual builds.
Continuous integration and delivery are a cornerstone of all high performing engineering cultures for a damn good reason.
I watched this for 10 minutes before I realized he wasn't saying memes. I was waiting for a joke that never happened.
I think the hard drive example falls flat. Having no data other than total build times doesn't show anything of importance. Analyzing the stages of the build would be more beneficial. It may have shown results that point to an issue in a build step as opposed to HDD vs SSD performance.