A thought on the jump: you might be able to calculate a probabilistic fastest speed for each run. Say the AI starts a jump (during training only). 1) Spawn 100 copies of the current AI in the current location. 2) Wiggle the location of each car a sub-pixel amount according to a 2D Gaussian distribution. 3) Run all 100 copies to the next "predictable location" (e.g. the next checkpoint). 4) Continue only the fastest run and reward based on its results. The idea is that you're giving the AI the benefit of the doubt on the RNG sections of the track. Based on the RNG required, you create enough cars that the AI is likely to pass the RNG at least once (given it was on a good line). If the RNG is 1/10, you split to 15 or 20; if the RNG is 1/1000, you maybe split to 1500 cars. Obviously the splitting system would be VERY expensive computationally. I suggest a "carpool" (like a threadpool or, in some languages, workers): a bank of a few instances of the game, ready for the gamestate to be transferred in during the splitting phase. The best gamestate is passed back to the original copy so it can continue its run from there. If there isn't enough RAM for the 1/1000 case, you can run 5 or 10 cases at a time, storing the current best run until you find a better one.
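A minimal Python sketch of this "carpool" idea, assuming an entirely hypothetical simulator API (save_state, load_state, nudge_position and run_to_next_checkpoint are made-up names, not anything from the video's actual tooling):

    import random

    def best_of_carpool(sim, policy, n_copies=100, sigma=1e-4):
        # Hypothetical simulator API: save_state() -> opaque state,
        # load_state(state), nudge_position(dx, dz),
        # run_to_next_checkpoint(policy) -> elapsed time in seconds.
        base_state = sim.save_state()
        best_time, best_state = float("inf"), None
        for _ in range(n_copies):
            sim.load_state(base_state)
            # Wiggle the car a sub-pixel amount (2D Gaussian, as suggested above).
            sim.nudge_position(random.gauss(0.0, sigma), random.gauss(0.0, sigma))
            elapsed = sim.run_to_next_checkpoint(policy)
            if elapsed < best_time:
                best_time, best_state = elapsed, sim.save_state()
        sim.load_state(best_state)  # continue the episode from the luckiest copy
        return best_time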
I haven't worked with RL in a while, but this seems destined to find local minima: something like an outside line that maximizes speed would just not be found, if I understand you correctly. There is a concept called intrinsic reward, specifically for sparse-reward environments: during training there is an exploration phase in which a world model tries to predict how the environment responds, and that world model's loss gets added as intrinsic reward. The idea is to explore areas of the environment that are not yet well understood by the model. Of course, in Trackmania this would balloon compute time, but used in conjunction with your idea it would probably find optima. Later on you could group the sections you mentioned together and create larger continuous parts, to get closer to a full run
@haubiwanb769 yeah! That makes a ton of sense. I think in my suggestion, I was assuming he would do the work of narrowing down the possible range (like with the walls in the video) and then let the AI find the local maximum from there
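For reference, a bare-bones PyTorch version of the intrinsic-reward idea described in this thread; the world model, its layer sizes, and the 0.1 scale are illustrative placeholders, not anything from the video:

    import torch
    import torch.nn as nn

    class WorldModel(nn.Module):
        # Tiny forward model: predicts the next observation from (obs, action).
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 128),
                                     nn.ReLU(), nn.Linear(128, obs_dim))

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1))

    def intrinsic_reward(model, obs, act, next_obs, scale=0.1):
        # Prediction error acts as "surprise": poorly-modelled states pay a
        # bonus, nudging the agent toward unexplored lines. The model itself
        # is trained separately to minimise this same error.
        with torch.no_grad():
            err = (model(obs, act) - next_obs).pow(2).mean(dim=-1)
        return scale * err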
Link went through the middle of the two poles (11:21). You said Link is only doing one thing differently, his approach angle, but he also lands and drives through the middle of the poles and gains 0.01s by the next checkpoint. He then stays on the left, while the AI goes to the right for the long U-turn. What if the AI went through the middle of the poles, kept left, and swerved to the right before the big U-turn jump?
Wow! What a great video. Not only from the perspective of technology, but also from the perspective of script writing. What a nail-biter. Great entertainment! Thank you! 🙏
This series has to be one of the most liked ever on YouTube. Seriously, with currently 5k+ views and 3.3k likes, that's 33 likes for every 50 people who've seen it. Keep 'em coming, Yosh! Make those little AI sonovaguns earn those 🥕!!!
What an absolute turtle tease. P.S. The AI beat Link everywhere except the barrel roll; that seemed worth noting. The crossover roll creates a longer path while maintaining roughly the same average velocity. Basically, Link hit the NOS too early, by accident. Badass runs from both sides 😊
There is a reason I'm watching this... I have never seen someone with such patience and such a big brain, always able to train an AI (from scratch, as usual) on a map to beat records that were found by accident or were forced... I don't care how long you take to make these videos, I will always be here to enjoy them and see what kind of absolute masterpiece you created. Keep up the good work!
You could try multiplying the loss by a constant under 1 whenever the reward predictor over-predicts, to skew the average so it prioritizes even a small chance of a good run. Just a thought.
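A rough sketch of that idea as an asymmetric regression loss, assuming a PyTorch reward predictor; the 0.3 weight is an arbitrary placeholder for the "constant under 1":

    import torch

    def asymmetric_mse(pred, target, over_weight=0.3):
        # Down-weight the loss where the predictor over-predicts reward, so
        # the learned estimate skews optimistic and chases rare good runs.
        err = pred - target
        weight = torch.where(err > 0,
                             torch.full_like(err, over_weight),
                             torch.ones_like(err))
        return (weight * err.pow(2)).mean()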
It seems like you had 2 things to optimise: the fastest possible ramp jump, and the fastest possible finish. If you trained the AI by spawning it near the ramp in a range of positions and speeds, and placed the finish line a bit after the jump ends, you'd focus its reward system entirely on optimizing the jump. After that, you could put it back on the original map and have it find a way to incorporate that into its full-map strategy
My takes on your notes: I think two key improvements you can make to balance exploitation vs. exploration are using Advantage instead of Value, and a well-tuned SAC. Briefly, advantage does not reward the agent based on your reward function directly, but instead gives a reward proportional to the value above expectation (i.e. instead of considering Q(s, a), consider giving reward A(s, a) = Q(s, a) - V(s)). There are other ways to define a baseline, but this is an easy one. And in your SAC implementation, your agent should generate a distribution over actions: instead of generating a single output value for each button and pressing the button if it's above a threshold, it should generate a mean and variance for each button press, sample the action from that distribution, and compare that to the threshold. You can then define some minimum variance used during training, making sure the agent is never too sure which actions are best; for the actual runs you can just use the mean of the distribution as the action, since that should be optimal. TD learning might also help, but there I am less sure (it's also not clear whether you're already using it). I expect that spawning the agent on the trajectory of the WR run could have a benefit, but you have to be careful about where you insert the agent. Your goal is not to make the agent handle the jump like the WR, because it will never arrive in those situations itself (i.e. this is wasted training, imo). Instead, your goal is to make sure the agent arrives at the section ready to take the jump in the right way, so its exit out of the previous corner should match Link's. Another approach to increase willingness to take risk is rewarding the entropy of the agent's actions: evaluate how common or rare a given trajectory is and reward trajectories with high entropy. The problem is that this, too, becomes a reward tuning task. Regarding your last section about why the AI is getting confused by the wall before the wall: this could be caused by anything, and without further details I can't give an exact reason. It might be related to reward shaping, feature selection, or feature processing. Hope this helps!
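A minimal PyTorch sketch of two of the pieces described above, the advantage baseline and a Gaussian action head with a variance floor; layer sizes and min_std are illustrative, not tuned values:

    import torch
    import torch.nn as nn

    class GaussianHead(nn.Module):
        # Outputs a mean and a floored std per action channel: sample during
        # training, use the mean for actual record attempts.
        def __init__(self, hidden, act_dim, min_std=0.1):
            super().__init__()
            self.mu = nn.Linear(hidden, act_dim)
            self.log_std = nn.Linear(hidden, act_dim)
            self.min_std = min_std

        def forward(self, h, deterministic=False):
            mu = self.mu(h)
            if deterministic:
                return mu
            std = self.log_std(h).exp().clamp(min=self.min_std)  # never too sure
            return torch.distributions.Normal(mu, std).rsample()

    # The advantage baseline from the same comment: A(s, a) = Q(s, a) - V(s).
    def advantage(q_sa, v_s):
        return q_sa - v_s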
This is a nicely formatted way of expressing the same thing I was going to suggest: rewarding unexpected outcomes that deviate from the norm. Adding some randomness would probably also help, or at least be an interesting experiment.
@diyartaskiran Thanks for your comment!
- I'm already using Advantage, with the Dueling Network architecture.
- I've tried SAC several times in the past but couldn't get better performance. Maybe some hyperparameters were not properly tuned, though.
- Regarding entropy and introducing some variance into the policy: from my experience it has a negative effect on risk-taking in Trackmania. When the agent knows there is random noise in its policy, I think it tends to drive safer in some sections to account for that randomness. For example, it drives further away from walls, which is often slower.
- Regarding the WR spawns, I agree this approach isn't ideal! I should at least have added noise to the position/speed of those spawns.
@yoshtm I wonder whether an easy way to achieve all of that would be to aggregate the reward of your policy over several runs near the ramps through a max function, if there is indeed some randomness in the outcome (and if it's not really randomness but chaos, a small action noise lets you treat the chaos as randomness). This has the benefit of privileging policies that have a better chance of setting a record: you don't care if 9 out of 10 of your runs are terrible if 1 out of 10 is way above what anyone can do. I am not an RL expert, just a mathematician slightly versed in ML
@yoshtm I think the problem isn't the method; the problem is too much freedom. All players drive past the poles on the right side, except one: Link drives in between them. I think this is the key to being faster. His jump is lined up differently, and the result is passing in between the poles. So instead of restricting the track on the way to the jump, you could try blocking the path so that the AI can only drive in between the poles. This should force it to try out different jumps, since taking a jump and hitting the wall will be very slow.
From content to editing, everything is near perfection. You sound a little bit French, so thank you for these videos. The editing is incredible and the work behind them is monstrous. Honestly, you deserve more recognition; your videos are niche and at the same time so accessible and fascinating, it commands respect. Every time, it's a pleasure to watch them, and you introduced me to the world of Trackmania speedrunning in such an original way that it has surely become my favorite racing game franchise! So thanks again, and good luck!
@maestroeragon While true, perhaps it's not enough? Many behaviours contribute to lap time and thus compete for the same reward. By adding an explicit reward for less airtime, it should focus on it more.
@Mauz1 Nah, because the lowest airtime isn't always the best and could crowd out other strategies. Rewarding time is enough, because the AI will find the most ideal trajectory
The post-landing trajectory shoots across the track; you could try putting a gateway over there that rewards the AI the closer it gets to that shoot-across action.
Considering the game uses floating-point calculations, especially for collisions, you may want to consider feeding in each individual bit of the floats for the car's position, speed, and/or orientation. The randomness is still deterministic, since you can play it back; however, when you calculate floats sequentially you can end up operating on denormals and values that are off by one or a few ULPs. This can compound, especially if round, floor, or ceil is used and the ULP crosses a threshold. The AI should be able to pick up the pattern in the bit changes even though it would effectively look like noise.
What does this comment even mean... none of this makes sense... are you suggesting converting the floating point into individual bits and passing each one as an input parameter? That makes no sense; it would remove all meaningful structure for the model... What does "the AI should be able to pick up the pattern on bit changes" even mean... this is also an RL model, which is even more sensitive to noise than supervised learning, which makes this approach make even less sense... It would also make training super slow: you'd need a larger input layer with more weights and way more gradient updates, and all you'd get is slower convergence, worse policies, and tripled training time. I mean, just think about it for a second. You've got the car going at 100 mph or however Trackmania works (I don't play this game), so your input is something like:
10010000000000000000000
Then we go ever so slightly faster, 100.1, and now you've got:
10010000011001100110011
So a bunch of bits flip without any clear pattern. I mean, you can try it: train a little supervised model with your freaky input encoding (it should do better than RL would) and look at the results. Here is your version:
Epoch 10/50 | Val MSE: 0.4058
Epoch 20/50 | Val MSE: 0.3614
Epoch 30/50 | Val MSE: 0.3478
Epoch 40/50 | Val MSE: 0.3990
Epoch 50/50 | Val MSE: 0.3527
And here is a normal model:
Epoch 10/50 | Val MSE: 0.3898
Epoch 20/50 | Val MSE: 0.3488
Epoch 30/50 | Val MSE: 0.3243
Epoch 40/50 | Val MSE: 0.3137
Epoch 50/50 | Val MSE: 0.3066
What's so weird to me is that this isn't even a concept, not a theoretical thing or something even discussed; you just made this up in the moment off your noggin
@illumiyaa Go to your browser, hit F12, type 0.1+0.2 in the console and hit enter: you will not get 0.3. This is a simplified example, but basically, in floating point you get accumulating arithmetic error, which is essentially a "randomizer". Games which have continuous coordinate systems and don't use integers really do have these errors in the code! They're rare, but they still happen. ruclips.net/video/lEBQveBCtKY/video.html
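For the curious, a quick Python check of both claims in this thread, the bit patterns quoted above and the classic 0.1 + 0.2 result:

    import struct

    def float_bits(x):
        # 32-bit IEEE-754 pattern of x as a string of 0s and 1s.
        return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")

    print(float_bits(100.0))  # 01000010110010000000000000000000
    print(float_bits(100.1))  # 01000010110010000011001100110011
    print(0.1 + 0.2)          # 0.30000000000000004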
For tomorrow, I have 3 tests and a lot of homework... I had been working for an hour straight, and on my break I saw a notification from Yosh's profile. YAY !!
Having a fucking godlike machine try 700k times in the span of a few days and still barely beat you, when it never gets bored, tired, or unmotivated, never loses focus or energy, and never forgets or loses muscle memory: this is a great indication of how EPIC we are as a species.
Incredible video yet again! This AI is part of what motivated me to create my channel, and getting to watch the continued progress makes me want to get my own AI beating more WRs. 12:51 As for why the AI couldn't match the WR here: I think you're 100% on the money. I think the problem is that because the learning algorithm struggles to even learn the flip in the first place, once it finds a method that works, it becomes very hard to find a second or third method that works. Because the section is so noisy, and you've forced it to take that very noisy path, the amount of actual learning it can do is reduced. I think this is a limitation with the learning algorithm itself, and not something that can be solved with sophisticated reward functions. Good reward tuning will compensate and help to improve (I would consider blocking the normal ramp jumps part of this tuning) but that only raises the ceiling of the agent; Improving the algorithm raises both the floor and the ceiling simultaneously. I've been looking into DQN extensions I can add to Linesight (although not with much luck, I'm not exactly an RL scientist, moreso a python hobbyist) and UPER (Uncertainty Prioritized Experience Replay) caught my attention as something that could help to reduce noise in the network when dealing with heavy Pseudo-RNG sections. One idea for reward functions though: Directly comparing to whatever the agent's current PB is. A moving target that gives a bonus reward to the AI when it is closer to the current fastest run. In theory this would solve the problem of the AI finding an 'ok' solution and never deviating away from it, because if it can drive significantly faster in a certain section (with some tuning to make sure it's not taking a bad line to do so) then it should prefer to beat its 'ok' solution in that section, learning to take more risk. It might be a bit more unstable, as the reward function changes dynamically, but it's something I'm going to be testing with Linesight soon™ to see if it can be effective for learning difficult tricks.
Hi, thanks for this nice comment :) I watched your Mario Kart videos, very cool! I've experimented with PER, but I hadn't heard about UPER; I'll check that out. I've been thinking about using a moving target for the reward like you said. But when you are in the middle of the track, it's hard to know if your current run is better than the fastest run: sometimes you can be ahead at some point, but with a lower speed, and thus have a strategy that will result in a slower finish time. So it might be hard to tune this kind of reward system for intermediate sections of the track. Unless it's a sparse reward system where only the final finish step is rewarded?
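One possible shape for that moving-target bonus, sketched in Python with the caveat from the reply baked in as a comment; the per-checkpoint indexing and the bonus scale are assumptions, not the video's actual reward code:

    class PBShapedReward:
        # Moving target: compare the current run's checkpoint split against
        # the agent's personal-best split at the same checkpoint.
        def __init__(self, n_checkpoints, bonus_scale=1.0):
            self.pb_splits = [float("inf")] * n_checkpoints
            self.bonus_scale = bonus_scale

        def on_checkpoint(self, idx, split_time):
            best = self.pb_splits[idx]
            bonus = 0.0
            if best != float("inf") and split_time < best:
                bonus = self.bonus_scale * (best - split_time)  # ahead of PB
            self.pb_splits[idx] = min(best, split_time)  # the target moves
            # Caveat from the reply above: being ahead on time but low on
            # speed can still mean a slower finish, so this should stay a
            # bonus on top of the base reward, not a replacement for it.
            return bonus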
Gran Turismo time trial GOAT here. 10:22 look where the red car is. I predicted the red car was the one who landed properly because of its weight transfer. At speed, tiny changes in momentum add up to major changes in results, especially in the case of a brief traction loss. The red car starts further to the left, so it has to throw its body to the right at a slightly higher rate, leading to a slightly higher momentum and greater loading. They all cut back to the left at your "indistinguishable" moment, but with the visually hidden variable of weight transfer to the right side, it clips the ramp more gently with less-loaded tires, retaining the forward vector momentum more and increasing axial rotation on that vector. Nothing mysterious at all if you know what you're looking at. AWESOME video though, as usual, not to take away, you're just wrong. Nothing random is happening.
12:28 you are gating where you think the key moment is, but the key moment is an eighth of a mile back, setting up the weight transfer. You're hand-selecting based on improper inference.
I'll give you my two cents on why the fastest line isn't learned. Source: I'm a PhD student in a research group on "AI" (as so many are these days). RL is not exactly my area of expertise (I'm originally in formal methods; the group is mostly graph learning plus some NLP, 'cause that stuff gets funding), but I helped teach a master's course on it for a while.
I think your explanation is correct that your current reward structure incentivises consistency over single great runs. All runs that crash still get their gradients added to the Q-table. To be more precise, default DQN optimises the expected value of the reward of a single run, which is not what you are after.
I think many of the people suggesting randomness are flat-out wrong. Adding more randomness in RL is known to frequently have the opposite effect, where you train to get runs that are robust against random inputs, as the preceding inputs end up being penalised for them. (The expected value of the reward of a single run that might have random inputs is higher for safer runs.) Q-learning improves on this by taking its random actions off-policy. There is a really simple school exercise where you train a reward table to find a path from A to B, where the fastest line walks past some "cliffs" that end the run with negative reward. Q-learning, which you are already using, will find the path walking next to the cliff, but on-policy approaches like SARSA do not. Q-learning's hyperparameter epsilon (or group of hyperparameters, if using epsilon decay) is what you need to play with to get different randomness, and I presume you already did.
These kinds of approaches, as well as other suggestions like increasing the reward for single great runs (however that is defined), will only get you so far. The one thing I did not see suggested that you could still try is multiplying all gradients by 0.01 or so for failing runs, but, still, if your reward structure is over a single run you are going to see somewhat "safe" behaviour. This means the trivial thing to try, if you have not already, is to restructure the rewards to go over multiple runs: instead of training on the expected value of the reward, recalculate the loss so that it trains on the expected value of the max of ten, or a hundred, rewards. The generalisation of this is called quantile regression, where you learn the whole distribution of returns instead of only the mean, after which you can disproportionately act on the policies that can end up in the higher quantiles. There are several algorithms for this you might want to play with, none of which I am familiar with, though I know they work quite well for video games and such. Not sure if you are already doing these things, though. I did not look at your code, and you have spent a lot of time on this problem already. ^.^
Those seem like good suggestions, but I can't help thinking that RL is just the wrong algorithm for this job. Wouldn't something "genetic" like NEAT be a better fit for the problem? Selection of the next generation could be based on fractions of the track, i.e. the most improved position from this bridge to the next earns you a place in the next gen, as well as overall performance, to bubble up local "great moves"
@tommurphy1153 NEAT learns way slower and would likely not work; it is effectively surpassed by DQN in all but the simplest tasks. The sentiment you and others are expressing, that an evolutionary optimisation algorithm should beat the gradient descent of deep learning, is very old. It's intuitive, but it doesn't hold experimentally. Things like this depend a lot on the task, but the closest personal example I can give is from an MSc course I helped teach for four semesters (twice as a student one year, twice as a PhD student the next) on learning behaviour of real robots using a mix of lab training and simulation. Students were free to choose their own training algorithms, as the course was about the lab/simulation interaction. We held a competition for bonus points at the end with the final and hardest task, and all the winning groups I saw (including when I took the course myself) used DQN or Actor-Critic. There were always people who tried something more evolutionary, but it never scaled to complex tasks, which is also corroborated by the literature.
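For concreteness, the quantile (pinball) loss behind the quantile-regression suggestion a few comments up, in PyTorch; the quantile levels are placeholders:

    import torch

    def quantile_loss(pred, target, taus):
        # pred: (batch, n_quantiles), one column per level in taus.
        # Learning the whole return distribution lets you act on the upper
        # quantiles (rare great runs) instead of only the mean.
        diff = target.unsqueeze(-1) - pred
        return torch.maximum(taus * diff, (taus - 1.0) * diff).mean()

    taus = torch.tensor([0.1, 0.5, 0.9])  # levels of the return distribution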
I played this map for about 6 hours myself without any flips being even close to good, until I suddenly got a flip ahead of 2nd place. Sadly I failed it in the last part of the map, as I didn't do the ending properly and landed in a bug right in front of the finish. Either way, this goes to show just how insanely random this flip is, and how incredible Link's WR is.
You could've just added a variable for max inputs per ms to avoid cheat detection and taken every world record. But you gave us something great to watch for the past few years, and a few more to come.
Hey Yosh huge fan! RL expert here. Been watching since the first one. Have you seen any of the Robotics RL work adjacent to “Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion?” Any of the recent work in robots from Aviv Tamar might be a good place to look for inspiration. Also might be worth looking into the new C-JEPA paper from Yann LeCun.
12:28 If you wanted it to imitate Link, shouldn't you be blocking off the right side of the landing pad, forcing it to find a line that lands the flip in the center? You are forcing a start condition, not an intended and partially known end result. As long as it hugs the right it will never find his line. As for deviations in driving, maybe you need some sort of mutators that encourage wider or tighter lines. As for "this is an RNG jump": I kind of don't buy that. It's more that the entry angles it picks cause wildly fluctuating outcomes based on the most subtle game conditions, collision tick checks, accel values, etc. Changing the entry angle might well make the output side more contiguous and controllable. Force it into sharper angles, force it into wider angles, then evaluate those outcomes. It's stuck on an "optimum angle" that is likely limiting what future maneuvers are possible.
I think it would also be possible to work on rewards based on the rotation speed, or on a longer contact between the left wheels and the ramp (that seems to be what's happening, at least).
@SkopOneFour I like this idea, but maybe one step further is just rewarding landings that carry good speed and an acceptable exit angle? Then it can do whatever voodoo it likes, like clipping the back wheels on the edge of the crash barrier to soften the twist of the landing, like it did in one of the replays. Something I don't think any of the players even did.
My flip isn't even the fastest in the top 10; Logic's was slightly faster, but his landing was bad on the next jump. The flip is functionally random: Trackmania is a deterministic game, but it's deterministically chaotic. Floating-point differences in position can radically change the outcome. Hell, starting to accelerate later than instantly will radically change what happens even if you do nothing but press forward; his "AI Learns to Drive on Pipes" video is a good demonstration of that. Despite how it looks, everyone who has hunted this flip knows how disgustingly random it is. It's most likely possible for the AI to get the flip somewhat more consistently, but it will never be the fastest if consistency is what it's going for, as the fastest flips are always outliers.
My two cents:
As you mentioned, the RL agent goes for the local optimum and avoids bad attempts. Another agent could pick the macro strategy so different valleys get found.
Let the AI imagine the result: humans usually know what the desired result looks like, not just how fast it will be (to elaborate, the AI is focused on the timer, not on the desired replay, which is the opposite of what humans focus on).
You already segment some attempts by hard cutting/starting runs, but perhaps: limit how far the AI sees and let it focus on what's right ahead (let the AI figure out how its impact on the result increases the closer it gets to the jump).
P.S. My pet peeve is that you already know what the record looks like and aim for that; possibly look at the same data the AI is looking at (i.e. how high the rewards are), not the line comparison :) *sorry for wall of text*
AI PhD student here: to get the AI to be more innovative, an evolutionary approach is often used: 1. Train a set of AI nets (e.g. 40) at the same time, let them perform some runs while optimizing via Q-learning, and compare their rewards at the end. 2. Take the top 20 nets and mix the weights together by taking half of the weights from one and half from the other. Make 40 new nets from this. 3. Add a bit of randomness to a few percent of the new weights. 4. Go to step 1. The last step "encourages" the existing net to innovate by "mutating" the behaviour. This setup emulates nature's survival of the fittest, and the mutation helps it move out of a local minimum.
I commented the same: evolutionary algorithms FTW! My idea was more to let the different AIs drive their runs without training, and then keep only the best runs and train a new generation of models on combinations of those runs. It probably ends up very similar, but a different approach nonetheless.
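A compact sketch of the crossover-plus-mutation step from this thread, for PyTorch networks; the 50/50 mixing, the mutation fraction, and the noise scale are arbitrary choices:

    import copy
    import random
    import torch

    def crossover_and_mutate(parent_a, parent_b, mutate_frac=0.02, sigma=0.05):
        # The child takes each weight tensor from one parent at random, then a
        # few percent of entries get Gaussian noise added: the "mutation" step.
        child = copy.deepcopy(parent_a)
        with torch.no_grad():
            for (_, w_c), (_, w_b) in zip(child.named_parameters(),
                                          parent_b.named_parameters()):
                if random.random() < 0.5:  # take this tensor from parent B
                    w_c.copy_(w_b)
                mask = torch.rand_like(w_c) < mutate_frac
                w_c.add_(mask * torch.randn_like(w_c) * sigma)
        return child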
I don't see why this would be the case. Why couldn't it come up with new ways? In this specific setup there is no high variance in approaches once a certain level is met, but that is setup-specific, not AI-specific.
And it's also why calling this an intelligence is misleading. It's an optimization algorithm. Everything this AI "learns" will be useless on a different track.
I mean, it has to be herded into copying the human strats after running a million attempts one after the other. It literally cannot "come up" with anything; there is no thought process, creativity, or skill whatsoever.
This is exactly why AI physically cannot, and will not, replace humans with our current understanding. It can't innovate on its own, only "achieve the objective." This is also why it can't make art, only copy. Not until it can draw of its own volition, without any input.
AlphaGo showed that it certainly can invent and innovate. It found new strategies no one had found before, and top players felt like they were playing against an alien. This is more an example of overfitting and of too-narrow loss/fitness functions.
This is the greatest video I have ever seen in my life, and I feel like it has a deeper meaning: showing that AI is not human, and never will be. Side note: your French accent is perfection.
Thanks for watching!! To support this work: patreon.com/yoshtm
thanks for letting us watch
Does the AI have precise control? Can it steer by 0.2% or 1% like last time?
did you try making a "mold" of the WR?
Set the reward for beating Link's time to 1000000 times what you have it right now.
When I saw there was a new Yosh video I creamed my pants!!
(Sorry I am too poor to support the patreon, but if my career continues its current trajectory I should be able to donate by this time next year 😢😢)
Link must feel like a god after watching this
Why? He said it was an accident and luck
@Seno06 Because it means it's basically impossible to beat his record: it would take millions, if not billions, of attempts for a human, who can't get the jump consistent every time, to beat it
@Leoiles-o8z It would likely take less than a couple of million attempts, nowhere close to a billion.
@dablob4491 so why has nobody done it yet?
The bucketloads of cars pouring across the track will never not be comical
truly, it's like a fluid sim
I saw the track branch out before me like a thousand shimmering roads on A06, and from the turn of every wheel, like a bright possible future, a different run beckoned and winked.
One run was a perfect line, fast and smooth, carrying speed so cleanly it seemed almost impossible. Another was bold and reckless, gaining time for a moment before striking the wall and dying. Another was careful and safe, but too slow to matter. Another found a beautiful angle through the corner, only to lose everything on the landing. And beyond and above these runs were many more runs I could not quite make out.
I saw the AI sitting in the middle of these branching paths, sending itself again and again into the same few seconds, searching for the one true way through.
It studied each and every one of them, but choosing one input meant losing all the others, and as it watched, unable to know at first which tiny movement would lead to greatness, many of the runs began to fail and fall away.
Some plopped uselessly into the wall.
Some spun out and vanished.
Some came heartbreakingly close.
And one by one, through all the failures, a few bright runs remained - leaner, faster, cleaner than the rest - until at last the best path stood alone.
And the "Hall of the mountain king" music 😂😂
This is how Dr. Strange saw all the timelines
car vomit
The AI is incapable of taking one step back in order to potentially take two steps forward
That’s a hard quote
You know it will read you...
There are hill-climbing variants, like simulated annealing, that allow a model to take a few steps back to experiment
It is kinda like a microbe
his model, yes.
Maybe next time you should try the epsilon-greedy approach, with epsilon decaying only to 0.5 or even 0.6, so the agent never stops finding new ways of doing things.
Q-learning already performs an epsilon decay in most implementations, no? To the best of my knowledge, most implementations have hyperparameters for the initial epsilon, the epsilon decay, and a minimum epsilon that it will never go under, at which point the decay stops.
It's a pretty standard extension, anyway, and I would expect hyperparameters like this to have been played with. By the way, instead of playing with the epsilon decay, it makes more sense to me to set a higher minimum epsilon, so that the chance of random actions never drops too far, while not messing with the initial training that gets the agent "close enough" to begin with.
@EgeauTM Yeah, you're right. Then he could add a noisy network layer as an add-on; that helps with smarter exploration, while an epsilon floor of 0.2 or 0.1 is more stable.
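For reference, the standard schedule this thread is describing, with the "higher minimum epsilon" tweak; every number here is a placeholder to be tuned:

    def epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.9999):
        # Exponential decay with a floor: exploration never drops below
        # eps_min, so some fraction of actions stays random forever.
        return max(eps_min, eps_start * decay ** step)

    # epsilon(0) == 1.0; by step 100_000 it has hit the 0.05 floor.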
Where did you guys learn this stuff?
@lovelessissimo I'm an electrical engineer by trade, but I did some courses in machine learning. I also follow papers published by researchers, which helps
@lovelessissimo I'm a PhD student in a somewhat related area. My LinkedIn is not that hard to find from my Trackmania name. :P
There’s something kind of cute about all these cars trying their best
« AI Learns to Uberbug » video next
AI beats hockalicious wr next
The title of Wirtual's next video
@therisingf1champion459 Reminds me of a Fallout: New Vegas custom campaign in the L4D2 workshop
just wait until the end of the video
?
Pre-liking the video 'cause I've yet to see a disappointing video from this guy lol
Same here
Same here
Ironically, I feel like this video is a tiny bit disappointing compared to his other videos... not only did he fold and start forcing the AI to take specific lines by limiting it, rather than guiding it via rewards and shit as he usually does, but the video also didn't reach a satisfying goal... although don't get me wrong, this video is still better than most of the garbage I get recommended on YouTube nowadays lol
@gigabytegeforce Yeah, I agree. I thought he was gonna go after the TAS like he did in the A01 video, and then I saw there were 2 minutes left in the video xD
@gigabytegeforce He actually ends up forcing the line in most videos, and compares his times to human records instead of TAS records (even though an AI has no consistency issues...). Still waiting for a video about an AI beating the random map challenge, or learning fast, but it would take decades at this point
You need to add mutations to get out of local maxima. It's the same concept as evolution: genes mutate randomly, most of the time into something with no benefit, but every once in a while you get something with a huge advantage, like another cone cell so you can distinguish between green and red. Give the AI random inputs at a range of different points and eventually one will stumble into something more optimal.
Yeah, he needs to add randomness to the gradient descent algorithm so that it finds the global minimum instead of a local minimum
You could run the AI in groups within groups. Say you have a group with 15 groups in it; those 15 groups vary quite a lot from each other. Within each of those 15 groups you have 100 cars that don't vary as much. You then pick the group with the best individual performance to run the next generation on.
Kind of like different sub-species
Genetic-algorithm reinforcement learning FTW
yeah and this would make it so you don't have to force the conditions with walls
12:50 Link's WR run at that jump is low, so his wheels touch down early. He also hooks left more than the AI does, giving him more room to gun the gas rather than worrying about hitting a wall.
I would suggest putting walls in the air along the jump to force the AI to keep low and left. To me, this would mean having to swing more right leading into the jump and then turning left while on the jump, or doing what Link did: cutting in early and then hooking hard left later to make sure he made the landing.
14:26 I mean, damn, just run it for another week of training and see if it does even better
What seems to happen is that it's not the fastest AI that gets built upon, but the fastest AI THAT FINISHES. Once they find a consistent line that gets them to the finish, they try to improve upon that line, but they fail to consider different lines that could be faster; in the end you get the fastest line that finishes, not the fastest line that could have been. You could try working with more checkpoints in specific locations, or insert more randomness so it keeps experimenting with different lines. Cool video though, looking forward to the other videos :)
While I think this makes sense in theory, selecting the fastest run at some checkpoint is also quite difficult, and it's what the reward strategy is trying to avoid. Things like position, speed, and orientation may not really line up with hitting the local maximum speed when optimizing a run.
That said, I think playing with reward penalties instead of barriers may have helped here. For example, penalizing airtime might have nudged it into looking for lower jumps, but that probably doesn't help with making a general-purpose training algorithm for Trackmania.
Would make sense.
@Patchypatchypatchy I think that's a good idea, but I have built RL on driving simulations like this (albeit simpler than TM), and over-penalizing tends to hurt, causing it to fall into a lot of bad local optima. I believe you may be correct that this is one of the only solutions, if there is a viable one with RL.
As someone who has worked with RL agents significantly: this is incorrect. The AI learns from every race, whether it finishes or not. When the AI successfully does the flip, it gets more reward and learns that that trajectory is good. While learning, it found one specific trajectory that gave it a consistent flip and allowed it to make further progress, but that line never gave it WR speed. Once it has learned a line that is pretty good, it searches for the local optimum, but the global optimum could be Link's line, or another line like the one mentioned in the video. It's risk vs. reward all over again, which is a very deep topic in RL and is what the text splash at 12:50 is all about.
What about most rewarded
5 hours till release is an unbearable amount of time to wait
Versus the 6,000-hour training? Fair's fair
An idea for a different way to reward the AI (a toy sketch follows below):
Reward for time with wheels on the ground
Reward for speed slides
Reward for uberbugs, etc.
That way you reward it for exploring the techniques the pros use when hunting a map
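A toy version of such a composite reward; the terms and weights are invented for illustration and would have to be tuned against the plain time-based reward:

    def shaped_reward(speed, wheels_on_ground, is_sliding, dt):
        # Weighted mix of the technique terms listed above.
        r = 0.10 * speed * dt                 # base progress term
        r += 0.02 * wheels_on_ground * dt     # time with wheels on the ground
        r += 0.05 * speed * is_sliding * dt   # reward carrying speed in a slide
        return r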
Uberbugs aren't allowed in tmx records btw
@nullbind Every Uber bug? even A12?
@sheepox573 Oh my, am I a right numpty. I was thinking of nose bugs; no, Uber bugs are fine 😅
Perhaps the AI could be rewarded depending on how close it is to the world-record path; then, purely by chance, it should eventually defeat him
@nullbind nose bugs are fine on the TMX boards too, there's just very few opportunities to do them humanly
your editing is immaculate, dude. just a joy to witness
Reinforcement learning to try to beat the WR is the best way to attract folks who like to watch all the pretty colors while the problem gets solved in the most inefficient way possible.
The fact that a literal AI didn't beat link after 2000 simulated hours is a true testament to how insanely good link is. Amazing video as always, excited to see what'll happen on the rest of the campaign.
It's pretty clear that it was a fluke and had nothing to do with their skill. He just got really lucky on that one run.
He’s very good, but on that WR run he himself admitted he got very lucky.
@abebuckingham8198 The flip was a fluke, sure, but the rest of the run was still insane enough that he ended up with a lead of 15 hundredths. Saying it has nothing to do with skill is crazy, considering he's literally the person with the most world records in the campaign
well the AI isn't very good
AIs are very slow learners. We can't compare the time an AI needs to learn with the time a human needs. AI is effective because in most situations it can make an attempt in milliseconds; it doesn't eat, doesn't sleep, and doesn't need rest.
A human who has never played the game will be able to complete the track in a few tries, while it took the AI hundreds or thousands.
And of course, Link is very good.
Pure cinema: the way he makes the videos, the editing, the storytelling, and the way he explains hard things in an easy way. It's just special
I'm not sure, but I feel like you could get a better route by (rough sketch after this list):
1. making the punishment for a bad run lower
2. reducing the reward for an average run
3. greatly increasing the reward for a really good run
This would punish the AI less for trying out new routes, and it could get higher rewards if it finds a really good route even once.
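One way to express point 3 in code: a convex bonus that leaves average runs nearly untouched but grows steeply for exceptional ones. The baseline and exponent are placeholders:

    def amplified_reward(reward, baseline, exponent=3.0):
        # Runs barely above the baseline gain little; runs far above it gain
        # disproportionately, encouraging risky attempts at great routes.
        gain = max(reward - baseline, 0.0)
        return reward + gain ** exponent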
I think that would still get it stuck in a local optimum a lot of the time. If any change reduces the fitness, then it will just try not to change.
My guess is that adding some noise to the inputs would be a better option, or just letting it run for a bunch more attempts and praying to the chaos gods, like the video showed
@taylorolson6228 Yeah, I kind of agree, but if you make the increase in rewards exponentially better, then it will almost always be encouraged to try to get a better run
@gogo23166 Not really. The problem with a local optimum is that the AI is punished for any change, and it has no bridge to those point gains
@gogo23166 He's right; the AI isn't smart enough to realize it needs to try new stuff for the bigger rewards, because it's punished for doing badly
@taylorolson6228 You have a fair point. However, I'd argue that manually inserting obstacles to block the local optimum is analogous to implementing a punishment when the AI has stopped improving but a more optimal path is known to exist.
We know there is a more optimal path than what the AI is doing until the AI at least matches the world record, and in many instances it is reasonable to believe the world record has room for minor improvements, so it wouldn't be too unreasonable to use the logic "there is a more optimal path until the AI has beaten the world record"
@yoshtm: I think the next leap here isn't a better single bot, it's a whole society of bots with shared memory. One class maps the track from observation ("Bonkers"), one hunts crazy high-upside glitch tech ("Glitchers"), one refines stable sectors ("Soldiers"), and a controller combines all of it into actual WR attempts ("Queens"). Every run would leave notes and points of interest on an internal 3D map, so the system learns the level instead of just converging on one safe line. That feels way bigger than "AI learns a racing line"... it starts to look like an actual racing research team.
This kind of AI isn't like LLMs. You can't really have agents like that
I am working on this sort of approach...
@mattymerr701 trust me, you really can.
This illustrates some of the real pitfalls of AI favouring incremental improvement over radical new approaches.
On the last attempt to change the AI’s line, you blocked the line into the flip. That doesn’t necessarily encourage it to change how it approaches the flip, just where on the ramp it will be. If the AI is blocked from landing on the right side, it will have to adjust its line in some way to compensate, if it’s going to make it across. IOW, put the block in front of the landing zone, not the take-off zone.
That also seemed to me like a way to force the landing onto the correct path.
an AI that plays videogames better than humans??
and I thought that being unemployed would save me from AIs stealing jobs :'(
Wait: an AI trained to maximize its score on only this map is far from "playing". It's a pure deterministic function, exactly the same as using a TAS. Both are optimized just to perform well on this one case, and both are 100% reproducible.
If you train an AI on one map, it learns the core mechanics: steering, speed control, drifting, and how the car physics work. When you move it to a new map, it doesn't have to relearn those fundamentals; it just has to learn the layout. That drastically reduces training time.
And if the AI learned advanced techniques like speed drifting or air control, those skills transfer too. It can apply them on new maps whenever similar situations appear, only needing a bit of fine-tuning for the layout. At that point it's technically playing the game much like a human would: using learned skills and adapting them to a new track rather than memorizing a single map.
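A minimal sketch of that transfer idea in PyTorch, assuming the map-A policy was saved as a whole module and exposes a layout-specific "head" submodule (both names are hypothetical, not anything confirmed about the video's setup):

import torch

policy = torch.load("policy_map_A.pt")   # hypothetical checkpoint from the first map
for p in policy.parameters():
    p.requires_grad = False               # freeze the shared driving fundamentals
for p in policy.head.parameters():
    p.requires_grad = True                # relearn only the layout-specific part
optimizer = torch.optim.Adam(
    (p for p in policy.parameters() if p.requires_grad), lr=1e-4)
# ...then train as usual on the new map, updating only the head.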
@FakedPvp AI doesn't learn anything.
What AI is, is just a function that takes input and produces output.
There's no awareness of anything,
just a bunch of multiplication and addition.
If you plop an AI trained strictly on one map onto another map, it won't suddenly be better there, due to overfitting.
And mind you, AIs crushing humans at video games is nothing new at all.
Years ago we already got AIs that destroyed world champions at Starcraft 2 and Dota 2.
And since these games are PvP, their victories weren't a fluke from pouring in thousands of attempts until threading the needle.
@RedstonekPL Yes, awareness isn't a thing, but it does reduce training time, because it's not relearning what it already knows.
The song at the 2:00 mark is called Sumerian Paradise, BTW.
I love that no one asked yet you decide to be a goat
10:48 But Trackmania is deterministic; it might be because the approach angle is different beforehand.
The physics is deterministic but *chaotic*; this fact is covered in this creator's "AI Learns to Drive on Pipes" video.
I don't know why he opted to say "random" instead of chaotic this time.
butterfly effect
@rotli2189 exactly, one tiny difference in the angle coming in and the outcome will be quite different.
@Vaaaaadim I mean, Wirtual said over and over again that it's deterministic, meaning that you could copy inputs and you get the exact same run, that's how some cheated runs were found in the early days. Also, press forward maps are based on this too.
@JojoChannel-o4g I agree it's deterministic. But the physics system is chaotic. Even waiting a random amount of time at the starting line before going affects what happens, because the physics engine still runs even when you're completely still and will imperceptibly affect your physics state (this is noted in his pipes video). A deterministic outcome does not mean a predictable outcome, unless you have perfect knowledge of the state of your car.
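The deterministic-but-unpredictable distinction is easy to demo outside Trackmania. The logistic map below is fully deterministic (same start, same output, every time), yet two starts differing by 1e-10 end up nowhere near each other:

x, y = 0.4, 0.4 + 1e-10    # two almost identical initial states
for _ in range(60):
    x = 3.9 * x * (1 - x)  # chaotic regime of the logistic map
    y = 3.9 * y * (1 - y)
print(abs(x - y))          # typically of order 0.1-1.0, not 1e-10: fully diverged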
As an engineer, I want you to consider the undeniably great opportunity of partnering with educational institutions that make learning fun, as this is amazing content that could be used to teach the fundamentals of reinforcement learning, machine learning, and the components that actually make artificial intelligence possible. I know that value exists because you made this video without those intentions in mind, but if you catered it a little bit more to the educational side of things, I think your scope of audience increases multifold. I personally have never heard of this game in my life, but this video helped me visualize and understand many fundamental concepts that are used to improve the results of artificial intelligence, and I think that much of your viewership is watching for somewhat the same reasons as well. All I would warn against is making it overly educational, as your current videos are already amazing quality, and I would say it would only require 10 to 15% tweaking to really align with what I'm talking about, which is why I feel there is such a great opportunity lying at your fingertips to further your life. And I don't even think these partnerships that I mentioned have to be necessarily RUclips related, although they definitely could be.
now imagine if he had RAM
He does tho
Why RAM? It's more about CPU I expect.
this isnt that kind of AI
@artnaz GPU and VRAM, actually, for computing power on AI.
@ATTILA0769 I do ML on CPU; it's pain stacked on double pain, whopper combo with pain fries.
A thought on the jump: you might be able to calculate a probabilistic fastest speed for each run.
Say the AI starts a jump (during training only).
1) Spawn 100 copies of the current AI in the current location.
2) Wiggle the location of the car a sub-pixel amount according to a 2D gaussian distribution.
3) Run all 100 copies to the next "predictable location" (e.g. the next checkpoint).
4) Continue only the fastest run and reward based on its results.
The idea is that you're giving the AI the benefit of the doubt on the RNG sections of the track. Based on the RNG required, you create a "carpool" of enough cars that the AI is likely to pass the RNG a single time (given it was on a good line). If the RNG is 1/10, you split to 15 or 20; if the RNG is 1/1000, you maybe split to 1500 cars.
Obviously the splitting system would be VERY expensive computationally. I suggest a "carpool" (like a threadpool or, in some languages, workers): a bank of a few instances of the game ready for the gamestate to be transferred to during the splitting phase. The best gamestate result is passed back to the original copy so it can continue its run from there.
If there isn't enough RAM for the 1/1000 case, you can obviously run 5 or 10 cases at a time, storing the current best run until you find a better one.
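Roughly, the splitting step could look like the Python sketch below; the savestate interface (save_state, load_state, nudge, run_to_checkpoint) is entirely hypothetical, since nothing in the video confirms such an API exists:

import random

def carpool_split(env, agent, n_copies=100, sigma=0.001):
    snapshot = env.save_state()              # snapshot just as the AI starts the jump
    best = None
    for _ in range(n_copies):
        env.load_state(snapshot)
        # sub-pixel 2D gaussian wiggle of the car's position
        env.nudge(random.gauss(0, sigma), random.gauss(0, sigma))
        result = env.run_to_checkpoint(agent)  # roll to the next predictable spot
        if best is None or result.time < best.time:
            best = result
    env.load_state(best.end_state)           # continue only the fastest copy
    return best.time                         # and reward based on its result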
Yeah I’m no expert at all but I was thinking something like this would make sense
I haven't worked with RL in a while, but this seems destined to find local minima; something like "an outside line maximizes speed" would just not be found, if I understand you correctly.
There is a concept called intrinsic reward, specifically for sparse-reward environments. Basically, during training there is an exploration phase in which a world model tries to predict how the environment responds, and that world model's loss gets added as intrinsic reward. The idea is to explore areas of the environment that are not as well understood by the model. Of course, in Trackmania this would balloon compute time, but if used in conjunction with your idea it would probably find optima. Later on you could group the sections you mentioned together and create larger continuous parts to get closer to a full run.
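The core of that intrinsic-reward idea fits in a few lines; this is a hedged PyTorch sketch with assumed observation/action sizes, not a full curiosity module:

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 4   # assumed sizes, purely for illustration
world_model = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 128),
                            nn.ReLU(),
                            nn.Linear(128, OBS_DIM))

def reward_with_curiosity(obs, action, next_obs, extrinsic, beta=0.1):
    pred = world_model(torch.cat([obs, action]))
    surprise = nn.functional.mse_loss(pred, next_obs)  # big where the model is wrong
    # the same loss would also train world_model, shrinking the bonus over time
    return extrinsic + beta * surprise.item()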
@haubiwanb769 yeah! That makes a ton of sense. I think in my suggestion, I was assuming he would do the work of narrowing down the possible range (like with the walls in the video) and then let the AI find the local maximum from there
@abraxas2658 Oh yeah, if there is an optimal strategy already discovered, that should work. AFAIK Trackmania is deterministic.
A06 saga is iconic. Great video
Link went through the middle of the two poles (11:21). You said Link is only doing one thing differently, which was his approach angle, but he also lands and drives through the middle of the poles, and he gains 0.01s by the next checkpoint. He stays on the left while the AI goes to the right for the long U-turn. What if the AI went through the middle of the poles, stayed on the left, but swerved to the right before the big U-turn jump?
Wow! What a great video. Not only from the perspective of technology, but also from the perspective of script writing. What a nail-biter. Great entertainment! Thank you! 🙏
This series has to be one of the most liked ever on RUclips. Seriously, with currently 5k+ views and 3.3k likes, that’s 33 likes to every 50 people who’ve seen it. Keep em coming Yoshi! Make those little AI sonovaguns earn those 🥕!!!
Even better now: at 5.6k with 4.3k likes.
View counts on recently uploaded videos are notoriously error prone
It's at 11k likes for 33k views, but that's still an insane ratio.
I've also discovered the hype feature on this video by accident.
THE HYPE IS REAL
Very interesting 14:34 can't wait to see the full results on this :)
What an absolute turtle tease.
P.S. The AI beat Link everywhere except the barrel roll. Seemed worth noting. The crossover roll creates a longer path while maintaining roughly the same average velocity. Basically, Link hit the NOS too early, by accident. Badass runs from both sides 😊
@brettgrindel9017 let's go gambling!
I’ve never heard of this game in my life
There is a reason im watching this...
I have never seen someone with such patience and such a big brain, always able to train an AI (from scratch, as usual) on a map to beat records that were found by accident or were forced...
I dont care how long you take to make those videos, I will always be here to enjoy the video and see what kind of absolute masterpiece you created
Keep up the good work!
I can't wait. I was just scrolling when this got to my recommended
You could try multiplying the loss by a constant under 1 whenever the reward predictor over-predicts, to skew the average so it prioritizes even a chance of a good run. Just a thought.
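If I read the suggestion right, it amounts to an asymmetric loss like the sketch below (0.5 is an arbitrary example for the constant under 1):

import torch

def skewed_mse(predicted, target, over_weight=0.5):
    err = predicted - target   # err > 0 means the predictor over-predicted
    w = torch.where(err > 0,
                    torch.full_like(err, over_weight),
                    torch.ones_like(err))
    # over-predictions are punished less, so estimates drift optimistic,
    # favoring actions that have even a small chance of a great outcome
    return (w * err ** 2).mean()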
It seems like you had two things to optimise: the fastest possible ramp jump, and the fastest possible finish. If you trained the AI by spawning it near the ramp in a range of positions and speeds and placed the finish line a bit after the jump ends, you'd focus its reward system entirely on optimizing the jump; after that you could put it back on the original map and have it find a way to incorporate that into its full-map strategy.
Yes, great intuition. Not sure how easy this is to actually implement though.
14:42 I like how you're speaking as you write the script. It's a creative touch.
that's so fascinating, I love this analysis on rng
One of the only, if not the only, youtubers whose videos I instaclick without looking at what the video is.
My takes on your notes: I think two key improvements you can make to ensure correct exploitation vs exploration are using Advantage instead of Value and a well-tuned SAC. Briefly, advantage does not consider agent reward based on your reward function directly, but instead gives a reward proportional to the value above expectation (i.e. instead of considering Q(s, a), consider giving reward A(s, a) = Q(s, a) - V(s)). There are other ways to define a baseline, but this is an easy one. And in your SAC implementation, your agent should generate a distribution over actions (i.e. instead of generating a single output value for each button and then pressing the button if it's above a threshold, it should generate a mean and variance for each button press and then sample the action from that distribution, and compare that to the threshold). You can then define some minimum variance used during training, making sure the agent is never too sure which actions are best; for the actual runs you can then just use the mean value of the distribution as the action, since that should be optimal. TD learning might also help, but there I am less sure (also not clear whether you're already using it).
I expect that spawning the agent in the trajectory of the WR run could have a benefit, but you have to be careful about how you choose where to insert the agent. Your goal is not to make the agent handle the jump like the WR, because it will never arrive in those situations itself (i.e. this is wasted training, imo). Instead, your goal is to make sure the agent arrives at the section ready to take the jump in the right way, so its exit out of the previous corner should match Link's.
Another approach you can take to increase willingness to take risk is rewarding the entropy in agent actions: evaluate how common/rare a given trajectory is and give reward for trajectories with high entropy. The problem is that this too becomes a reward tuning task.
Regarding your last section about why the AI is getting confused by the wall before the wall: this could be caused by anything, and without further details I can't give an exact reason. It might be related to reward shaping, feature selection or feature processing.
Hope this helps!
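For concreteness, the mean/variance sampling suggested above might look like this (the threshold and the variance floor are arbitrary assumptions):

import torch

def select_buttons(mean, log_var, training, min_var=0.05, threshold=0.0):
    # mean / log_var: one value per button, produced by the policy net
    if training:
        var = log_var.exp().clamp(min=min_var)  # never let the agent get too sure
        sample = mean + var.sqrt() * torch.randn_like(mean)
        return sample > threshold               # stochastic button presses
    return mean > threshold                     # deterministic mean for real runs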
This is a nicely formatted way of expressing the same thing I was going to suggest - rewarding unexpected outcomes that deviate from the norm. Adding some randomness would probably also help, or at least be an interesting experiment.
@diyartaskiran Thanks for your comment!
-I'm already using Advantage, with the Dueling Network architecture.
-I've tried SAC several times in the past but couldn't get better performance. Maybe some hyperparameters were not properly tuned, though.
-Regarding entropy and introducing some variance in the policy: from my experience it has a negative effect on risk-taking in Trackmania. When the agent knows there is some random noise in its policy, I think it tends to drive safer in some sections to take this randomness into account. For example, it drives further away from walls, which is often slower.
-Regarding the WR spawns, I agree this approach isn't ideal! I should have at least added noise in the position/speed of these spawns I think
@yoshtm I wonder whether an easy way to achieve all of that would be to aggregate the reward of your policy over several runs near the ramps through a max function, if there indeed is some randomness in the outcome (if it's not really randomness but chaos => small action noise allows you to say the chaos = randomness). This has the benefit of privileging policies that have a better chance of establishing a record (you don't care if 9 out of 10 of your runs are terrible if 1 out of 10 is way above what anyone can do).
I am not an RL expert, just a mathematician slightly versed in ML
@yoshtm I think the problem isn't the method. The problem is too much freedom.
All players drive past the poles on the right side, except for one: Link drives in between them. I think this is the key to being faster; his jump is lined up differently, and the result is passing in between the poles.
So instead of restricting the track on the way to the jump, I think you could try to block the path so that the AI can only run in-between the poles.
This should force it to try out different jumps, since getting a jump and hitting the wall will be very slow.
From content to editing, everything is near perfection. You sound a little bit French, so thank you for these videos; the editing is incredible, and the work behind them monstrous. Honestly, you deserve more recognition: your videos are niche and at the same time so approachable and fascinating, it commands respect. Every time, it's a pleasure to watch them, and you introduced me to the world of Trackmania speedrunning in such an original way that it has probably become my favorite racing game franchise! So thank you again, and good luck!
The best RL videos on RUclips, which explain the intuition behind tuning. I wish there were similar videos for robots/humanoids too 😅
Wow, I loved every moment of this. I couldn't do what you do with programming, but it really does interest me.
6:37 - Is it possible to reward it for having less air time?
Less air time = it goes faster, so it's implicitly being rewarded for less air time.
@maestroeragon While true, perhaps it's not enough? Many behaviours account for laptime, and thus compete for the same reward. By adding an explicit reward for less airtime, it should focus more on it.
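As a one-knob experiment, that explicit version could be as small as this; lambda_air is a made-up coefficient that would have to be tuned against lap time:

def shaped_step_reward(base_reward, airborne_dt, lambda_air=0.5):
    # subtract a penalty proportional to the time spent airborne this step
    return base_reward - lambda_air * airborne_dt

Too large a lambda_air would punish jumps the track actually requires, which is roughly the objection raised in the replies below.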
I think the correct approach is to give rewards for quick rotation speed. The AI never did the kind of aggressive flip people do in hunt sessions.
@Mauz1 I think the real issue here is not enough randomness in each generation. So it is locking in on a single solution, then not varying much.
@Mauz1 Nah, because too little airtime isn't always best and could eat up other strategies. Rewarding for time is enough, because the AI will find the most ideal trajectory.
The post-landing trajectory shoots across the track; you could try putting a gateway there that rewards the AI the closer it gets to that shoot-across action.
13:21 is that a nose boost
lmfao
Nose spin
I saw the same
very interesting to watch this, ty for your work
I really really liked this video. I've not played this game in years but I found this video so absolutely fascinating and well made.
Considering the game uses floating-point calculations, especially for collisions, you may want to consider feeding in each individual bit of the floats for the car's position, speed, and/or orientation. The randomness is still deterministic, since you can play it back; however, when you calculate floats sequentially you can end up operating on denormals and values that are off by one or a few ULPs. This can be compounded, especially if round, floor, or ceil are used, when the ULP crosses a threshold. The AI should be able to pick up the pattern in bit changes even though it would effectively look like noise.
just reward it going between the two columns after the flip
What does this comment even mean? None of this makes sense. Are you suggesting converting the floating point into individual bits and passing each one as an input parameter? That makes no sense; it would remove all meaningful structure for the model. What does "AI should be able to pick up the pattern on bit changes" even mean? This is also an RL model, which is even more sensitive to noise than supervised learning, which makes this approach make even less sense. It would also make the training super slow: you'd need a larger input layer with more weights and way more gradient updates, and all you'd get is slower convergence, worse policies and tripled training time. I mean, just think about it for a second: you've got the car going at 100 mph or however Trackmania works (I don't play this game), so your input is like:
10010000000000000000000
Then we go ever so slightly faster, 100.1, now you got:
10010000011001100110011
So a bunch of things change without the ability to really see a clear pattern. This entire comment makes no sense. I mean, you can try it: train a little supervised learning model with your freaky bit architecture and look at the results. Here is your thing:
Epoch 10/50 | Val MSE: 0.4058
Epoch 20/50 | Val MSE: 0.3614
Epoch 30/50 | Val MSE: 0.3478
Epoch 40/50 | Val MSE: 0.3990
Epoch 50/50 | Val MSE: 0.3527
And here is a normal model:
Epoch 10/50 | Val MSE: 0.3898
Epoch 20/50 | Val MSE: 0.3488
Epoch 30/50 | Val MSE: 0.3243
Epoch 40/50 | Val MSE: 0.3137
Epoch 50/50 | Val MSE: 0.3066
What's so weird about this to me is that this isn't even a concept, not a theoretical thing or something even discussed; you just, like, made this up in the moment off your noggin.
@illumiyaa go to your browser, hit f12, type 0.1+0.2 in the console and hit enter
you will not get 0.3
This is a simplified example, but basically, in floating point, you get accumulating arithmetic error which is essentially a "randomizer"
Games which have continuous coordinate systems + don't use integers = there are actually errors in the code! They are rare, but they still happen.
ruclips.net/video/lEBQveBCtKY/video.html
@MichaelMikeMigos I feel like guiding the AI to the solution defeats the purpose.
For tomorrow I have 3 tests and a lot of homework... I had been working for an hour straight, and on my break I just saw a notification from Yosh's profile. YAY!!
Beautiful video, can't wait to see the rest of the tracks your AI masters!
thanks for this amazing journey!
your visualizations of machine learning data are really top notch. Well done!
Port the AI to Trackmania 2020 and make it play Deep Dip!
Better to train how to slide in mayonessie
Having a fucking god-like machine try 700k times in the span of a few days and still only barely beat you, when it never gets bored, tired or unmotivated, never loses focus or energy, and never forgets or loses muscle memory: this is a great indication of how EPIC we are as a species.
This man just bruteforces a map with AI every few months and makes bank.
infinite money glitch
LMAO!!!
Gotta pay the bills somehow
Man, never played this game but your content is pure gold, keep it up!
The man has done it again. Great job!! 👑
Incredible video yet again! This AI is part of what motivated me to create my channel, and getting to watch the continued progress makes me want to get my own AI beating more WRs.
12:51 As for why the AI couldn't match the WR here: I think you're 100% on the money. I think the problem is that because the learning algorithm struggles to even learn the flip in the first place, once it finds a method that works, it becomes very hard to find a second or third method that works. Because the section is so noisy, and you've forced it to take that very noisy path, the amount of actual learning it can do is reduced. I think this is a limitation of the learning algorithm itself, and not something that can be solved with sophisticated reward functions. Good reward tuning will compensate and help to improve (I would consider blocking the normal ramp jumps part of this tuning) but that only raises the ceiling of the agent; improving the algorithm raises both the floor and the ceiling simultaneously. I've been looking into DQN extensions I can add to Linesight (although not with much luck; I'm not exactly an RL scientist, more so a Python hobbyist) and UPER (Uncertainty Prioritized Experience Replay) caught my attention as something that could help to reduce noise in the network when dealing with heavy pseudo-RNG sections.
One idea for reward functions though: Directly comparing to whatever the agent's current PB is. A moving target that gives a bonus reward to the AI when it is closer to the current fastest run. In theory this would solve the problem of the AI finding an 'ok' solution and never deviating away from it, because if it can drive significantly faster in a certain section (with some tuning to make sure it's not taking a bad line to do so) then it should prefer to beat its 'ok' solution in that section, learning to take more risk. It might be a bit more unstable, as the reward function changes dynamically, but it's something I'm going to be testing with Linesight soon™ to see if it can be effective for learning difficult tricks.
Hi, thanks for this nice comment :) I watched your mario kart videos, very cool!
I've experimented with PER, but I haven't heard about UPER, I'll check that
I've been thinking about using a moving target for the reward like you said. But when you are in the middle of the track, it's hard to know if your current run is better than the fastest run. Sometimes you can be ahead at some point, but with a lower speed, and thus have a strategy that will result in a slower finish time. So it might be hard to tune this kind of reward system for intermediate sections of the track. Unless it's a sparse reward system where only the final finish step is rewarded?
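A checkpoint-level sketch of that moving target, with the caveat above in mind (names assumed; pb_split_times would only update when a full run sets a new PB finish):

def pb_checkpoint_bonus(cp_index, cp_time, pb_split_times, scale=1.0):
    delta = pb_split_times[cp_index] - cp_time  # positive = ahead of the PB run
    return scale * max(delta, 0.0)              # bonus only when beating the target

As the reply notes, being ahead at a split with less speed can still be the slower strategy overall, so the bonus likely has to stay small relative to the finish reward.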
Gran Turismo time trial GOAT here. 10:22 look where the red car is. I predicted the red car was the one who landed properly because of its weight transfer. At speed, tiny changes in momentum add up to major changes in results, especially in the case of a brief traction loss. The red car starts further to the left, so it has to throw its body to the right at a slightly higher rate, leading to a slightly higher momentum and greater loading. They all cut back to the left at your "indistinguishable" moment, but with the visually hidden variable of weight transfer to the right side, it clips the ramp more gently with less-loaded tires, retaining the forward vector momentum more and increasing axial rotation on that vector. Nothing mysterious at all if you know what you're looking at. AWESOME video though, as usual, not to take away, you're just wrong. Nothing random is happening.
Again at 11:18 look at Link's blue line. It is obvious he is shifting the weight to the right side more dramatically.
12:28 you are gating where you think the key moment is, but the key moment is an 1/8th of a mile back setting up the weight transfer. You're hand selecting based on improper inference.
What
Cool video. Did you use some kind of temperature / creativity parameter?
I just really enjoy your vids; they're like my mental floss. It's just what you're doing, your presentation, and I just generally like Trackmania.
I'll give you my two cents on why the fastest line isn't learned.
Source: I'm a PhD student in a research group on "AI" (as so many are these days.) RL is not exactly my area of expertise (I'm originally in formal methods, the group is mostly graph-learning as well as some NLP 'cause that stuff gets funding), but I helped teach a master course on it for a while.
I think your explanation is correct that your current reward structure incentivises consistency over single great runs. All runs that crash still get their gradients added to the Q-table. To be more precise, default DQN optimises the expected value of the reward of a single run, which is not what you are after.
I think many of the people suggesting randomness are flat-out wrong. Adding more randomness in RL is known to frequently end up with the opposite effect, where you train to get runs that are robust against random inputs, as the previous inputs end up being penalised for them. (The expected value of the reward of a single run that might have random inputs is higher for safer runs.) Q-learning innovates on this by performing random actions off-policy. There is a really simple school exercise you can do where you train a reward table to find a path from A to B where the fastest line walks past some "cliffs" that end the run with negative reward. Q-learning, which you are already using, will find the path walking next to the cliff, but other random approaches like SARSA do not. Q-learning's hyperparameter epsilon (or group of hyperparameters, if epsilon decay is used) is what you need to play with to get different randomness, and I presume you already did.
These kinds of approaches, as well as other suggested ones like increasing the reward for single great runs (however that is defined), will only go so far. The one thing I did not see suggested that you could still try is to multiply all gradients by 0.01 or something for failing runs, but, still, if your reward structure is over a single run you are going to see somewhat "safe" behaviour. This means the trivial thing to try, if you have not already, is to re-structure the rewards to go over multiple runs: instead of training over the expected value of the reward, try re-calculating the loss such that it trains over the expected value of the max of ten, or a hundred, rewards. The generalisation of this is called quantile regression, where you learn the whole distribution of returns instead of only the mean, after which you can disproportionately act upon the policies that have the possibility to end up in the higher quantiles. There are several algorithms for this you might want to play with, none of which I am familiar with, though I know these are known to work quite well for video games and such.
Not sure if you are already doing these things, though. I did not look at your code, and you have spent a lot of time on this problem already. ^.^
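For reference, the pinball loss at the heart of that quantile-regression direction is only a couple of lines; this is the textbook form, not anything from the video's code:

import torch

def pinball_loss(pred_quantiles, target, taus):
    # pred_quantiles, taus: one entry per learned quantile level in (0, 1)
    u = target - pred_quantiles   # positive where the quantile under-predicts
    return (torch.where(u >= 0, taus, taus - 1.0) * u).mean()

Acting greedily on the upper quantiles, instead of the mean, is what would let the agent chase policies whose return distribution has a high-upside tail.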
Those seem like good suggestions, but I can't help thinking that RL is just the wrong algorithm for this job. Wouldn't something "genetic" like NEAT be a better fit for the problem?
Selection of the next generation could be based on fractions of the track, i.e. most improved position from this bridge to the next earns you a place in the next gen, as well as overall performance, to bubble up local "great moves".
@tommurphy1153 NEAT learns way slower and would likely not work. It is effectively surpassed by DQN in all but the simplest tasks.
The sentiment you and others are expressing about the gradient descent of deep learning and how an evolutionary optimisation algorithm should be better is very old. It's intuitive, but it experimentally doesn't hold.
Stuff like this depends a lot on the task, but the closest personal example I can give is from an MSc course I helped teach for four semesters (twice as a student one year, twice as a PhD the next) on learning behavior of real robots using a mix of lab training and simulation. Students were free to choose their own training algorithms, as the course was on the lab/simulation interaction. We held a competition for bonus points at the end of it with the final and hardest task, and all winning groups I was there to see (including when I took the course myself) used DQN or Actor-Critic. There were always people who tried something more evolutionary, but it never scales to complex tasks, which is also corroborated by the literature.
I played this map for about 6 hours myself without any flips being even close to good, until I suddenly got a flip ahead of 2nd place. Sadly I failed it in the last part of the map, as I didn't do the ending properly and landed in a bug right in front of the finish. Either way, this goes to show just how insanely random this flip is, and how incredible Link's WR is.
What's the game called?
I like how, after 5 years, every single reinforcement learning video still explains what reinforcement learning is at the beginning.
reinforcement learning has been around a lot longer than 5 years.
Beautifully made content. Thank you for your hard work!
It's always a treat to watch your videos. Thank you!
You could've just added a variable for max inputs per ms to avoid cheat detection and taken every world record. But you gave us something great to watch for the past few years, and a few more to come.
I'm pretty sure the AI input system works through mods, so it doesn't matter; it won't be playable online.
1:34 So did most of us humans also have trouble with that part! You have to hit that jump _perfectly_ to get the author medal.
I ragequit the A levels after getting +0.03 behind AT 5 times in a row. I hate that track so much.
Hey Yosh huge fan! RL expert here.
Been watching since the first one. Have you seen any of the Robotics RL work adjacent to “Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion?”
Any of the recent work in robots from Aviv Tamar might be a good place to look for inspiration.
Also might be worth looking into the new C-JEPA paper from Yann LeCun.
Hi thanks for your comment, I'll check that!
This was an amazing video. This deserves a subscribe.
great work! thanks for the great content and insight of your project
12:28 If you wanted it to imitate Link, you should be blocking off the right side of the landing pad and forcing it to find a line that lands a flip in the center. You are forcing a start condition, not an intended and partially known end result. As long as it hugs the right, it will never find his line. As for deviations in driving, maybe you need some sort of mutators that encourage wider or tighter lines. As for the "this is an RNG jump" claim, I kind of don't buy that; it's more that the entry angles it's picking cause a wildly fluctuating output based on the most subtle game conditions, collision tick checks, accel values, etc. Changes to the line entry angles might very well make the output side more contiguous and controllable. Force it into sharper angles, force it into wider angles, then evaluate those outcomes. It's stuck on an "optimum angle" that is likely limiting what future maneuvers are possible.
I think it would also be possible to work on rewards based on the rotation speed, or on longer contact between the left wheels and the ramp (that's what it seems to be, at least).
@SkopOneFour I like this idea, but maybe one step further is just rewarding landings that carry good speed and an acceptable exit angle. Then it can do whatever voodoo it likes, like clipping the back wheels on the edge of the crash barrier to soften the twist of the landing, like it did in one of the replays. Something I don't think any of the players even did.
My flip isn't even the fastest in the top 10; logic's was slightly faster, but his landing was bad on the next jump. The flip is actually functionally random. Trackmania is a deterministic game, but it's more deterministically chaotic: floating-point differences in position can radically change the outcome. Hell, starting to accelerate later than instantly will radically change what happens even if you do nothing but press forward; his "AI Learns to Drive on Pipes" video is a good demonstration of that. Despite how it looks, everyone who has hunted this flip will know how disgustingly random it is. It's most likely possible for the AI to get the flip somewhat more consistently, but it will never be the fastest if consistency is what it is going for, as the fastest flips are always outliers.
My two cents
As you mentioned, the RL agent goes for the local optimum and avoids bad attempts.
Another agent could pick the macro strategy so that different valleys are found.
Let the AI imagine the result; humans usually know what the desired result looks like, not just how fast it will be (to elaborate, the AI is focused on the timer, not the desired replay, which is the opposite of what humans focus on).
You already segment some attempts by hard cutting/starting runs, but perhaps: limit how far the AI sees and let it focus on what's right ahead (let the AI figure out how its impact on the result increases the closer it gets to the jump).
P.S. My pet peeve is that you already know what the record looks like and try to aim for that. Possibly look at the same data the AI is looking at (i.e. how high the rewards are), not the line comparison :)
*sorry for wall of text*
7:16 prxper, you say?
Fun video. Beyond the technical part, I love the drama in the story!
incredible video as always, pls dont stop
2:29 Wirtual being top 150 is crazy
How so?
AI PhD student here: to get the AI to be more innovative, an evolutionary approach is often used:
1. Train a set of AI nets (e.g. 40) at the same time, let them perform some runs while optimizing via Q-Learning and compare their reward at the end.
2. Take the top 20 of the nets and mix the weights together by taking half of the weights from one and half from the other. Make 40 new nets from this
3. Add a bit of randomness to a few percent of new weights.
4. Go to step 1
The last step "encourages" the existing net to innovate by "mutating" the behaviour.
This setup tries to emulate nature's survival-of-the-fittest process, and mutation helps it move out of local minima.
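A toy numpy version of steps 2 and 3, operating on flattened weight vectors (selection and the Q-learning inner loop are omitted, and all constants are illustrative):

import numpy as np

def breed(top_weights, pop_size=40, mut_rate=0.02, mut_scale=0.1):
    # top_weights: list of flat weight vectors from the surviving nets
    rng = np.random.default_rng()
    children = []
    for _ in range(pop_size):
        i, j = rng.choice(len(top_weights), size=2, replace=False)
        mask = rng.random(top_weights[i].shape) < 0.5   # half from each parent
        child = np.where(mask, top_weights[i], top_weights[j])
        hit = rng.random(child.shape) < mut_rate        # mutate a few percent of weights
        children.append(child + hit * rng.normal(0.0, mut_scale, child.shape))
    return children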
I commented the same: evolutionary algorithms ftw! My idea was more to let the different AIs drive their runs without training, and then only keep the best runs and train a new generation of models from combinations of those runs. Probably ends up very similar, but a different approach nonetheless.
4:23 This is exactly how human ingenuity will defeat AI in the future. Once it hits its limits, it can't come up with creative plans.
I don't see why this would be the case. Why couldn't it come up with new ways? In this specific setup there is no high variance in approaches once a certain level is met. But this is setup-specific, not AI-specific.
It's also why calling this an intelligence is misleading. It's an optimization algorithm. Everything this AI "learns" will be useless on a different track.
This is essentially an important premise to the entire Expeditionary Force sci-fi book series
Defeat it in what? AI is a human-made thing lol
I mean, it has to be herded into copying the human strats after running like a million attempts one after the other. It literally cannot "come up" with anything; there is no thought process, creativity or skill whatsoever.
This is really well produced, great work
Incredible combination of technical knowledge, scientific thinking and great video editing, combined with an amazing game. Peak content, kudos!!
Frenchman here, we can tell.
No way, I find he has an accent you could cut with a knife; he's French for sure 😂
This is exactly why AI physically cannot, and will not ever, replace humans with our current understanding.
It can't innovate on its own, only "achieve the objective." This is also why it can't make art, only copy. Not until it can draw of its own volition, without any input.
AlphaGo showed that it certainly can invent and innovate. It found new strategies no one had found before, and top players felt like they were playing against an alien. This is more an example of overfitting and too-narrow loss/fitness functions.
7:30 what song?
Cold War Games by Gabriel Lewis, I'm pretty sure
Idfk Shazam it or sum
That's really impressive. Great work
crazy videos as always, thanks for the quality
That was awesome keep up the excellent work
This video is amazing! The editing and the explanation were spot on! I can't wait for future videos!
Incredible work as always! Can't wait the next video 😊
This was awesome! I can’t wait till the next videos
Awesome! Keep up the good training!
Phenomenal content, great job.
I love these AI videos. You're one of the only notifications I let ring.
This is the greatest video I have ever seen in my life, and I feel like it has a deeper meaning: showing that AI is not human, and never will be. Side note: your French accent is perfection.
10:20 Bro, we talked about it already in the previous video 😂 CHAOS ohhh scarrrwwy 😅
You should let it go for a month. I know the difference would be subtle, but 1 second quicker would look insane.