Yeah, I agree. Pareidolia is a sort of decompression, or hallucination of what you want to see in randomness, and is definitely connected to generative A.I. It's crazy how learning about generative A.I. has so many insights into how the human mind works, and possible implications about reality itself.
I'm 50yo now. I played this game for the first time around 1995 with a Sinclair ZX Spectrum 128k in Portugal. And I still have it today! The game was amazing at the time. Hours and hours playing... I remember it as well as if it were yesterday!
ChatGPT's rough explanation for image diffusion models: an intuitive analogy. Think of it like sculpting a statue from a block of marble. Initially, the marble is just a rough shape (like noise). With each chisel stroke (denoising step), the sculptor refines the shape until a clear statue (image) emerges. The diffusion model is like a sculptor that knows how to chisel away noise to reveal the image underneath.
Premises: 1. The Loab is a product of several signals dropping off into a common, latent, noisy space, but this space is the "outer bullseye" of 100% pure noise, more like 80%. The weak signals produce a messy gestalt. 2. This works from the image end, and produces more noise, 100% pure noise, and then asks for a particular image, cuing an iterative process that *can* produce the image requested. 3. Requesting the same still image continuously causes something like a "reversion to the mean", where the noise gradually creeps in, as no *new*, stronger signal is being generated. Only producing a NEW image renews the signal strength. 4. As these images were formed from noise, they're inherently resistant to not only image noise, but are more intensely integrated into temporal sequences. Halt the temporal sequence, the noise starts to creep in. The momentum is everything. You've got yourself a hyperdimensional flywheel.
Excellent video. I appreciate the work you put into this one. Especially the South Park reference, and the other little gems, that raise the entertainment levels.
This is fascinating. At present it doesn't seem to have a "memory" of the game state: you go through a door, shoot an enemy to leave a corpse, go away and come back to the door, the corpse is gone and the zombie is back? (video clip from 13:04 onwards)
John Carmack should get some kind of evolutionary-genius, made-a-huge-difference award or something. I STILL play DOOM, kids are STILL playing Doom, either today's Doom or yesterday's Doom... lol.
7:00 It's more like taking apart the iPhone, piece by piece. In the AI's case, pixel by pixel. If you do that a few thousand times, you'd be able to build an iPhone from scratch parts. Likewise, after taking apart enough images pixel by pixel, it can rebuild them from pixels.
It's honestly kind of poetic that the first game this technology is being used on is Doom. Doom and Quake are pretty much responsible for, and inspired, most of the games we play today.
If you watch 1000 videos of a stickperson being drawn, guaranteed you will learn exactly what it takes to draw a stick person, even if they are in different poses or multiple types of them, now imagine the same thing for everything else you see. This is why diffusion models work
That wouldn’t seem to work for something like playing an instrument. You could “look” like someone who is actually playing an instrument, but you wouldn’t without actually playing and “learning” the instrument. Unless, you were able to simulate lessons and “playing/learning” an instrument. (Perhaps) that’s what it is doing(?). Is this the general idea?
@@milesgrooms7343 if you think of it from the perspective of simulating the output, you’ll have the idea. Music generation models “listen” to the sounds being played for them in the same way image models “look” at images. They then try to simulate the sounds based on the prompt they receive. Their goal isn’t to actually play the instrument but to generate a sound that sounds like the output of someone playing the instrument/singing/rapping etc.
@@NA18NA makes sense, I follow. I have not used or played around with any of the generative AIs. Amazing to me to hear/see what is possible….wondered how it’s doing it though.
I would be very surprised to find out that the doom that was generated actually represented a consistent world with real physical laws. Although the randomness of it might add something to the game. But for example, the enemies might have somewhat random hit points
@WesRoth Generative image diffusion with noise basically works in the same way that humans look at clouds and imagine they see familiar shapes. In real life, clouds are really just a bunch of random noise, but humans look for patterns, so when it starts to resemble something similar to what a human has seen before, the human starts to think they see the clouds shaped like that object. Diffusion models work in the same way, where computers are taught to try to see shapes in random noise, and basically try to imagine that they see something in the noise.
Negative reinforcement = punishment for an incorrect action
Positive reinforcement = praise for a correct action
The noise is a reference point. You have the canvas within which to generate the image, some width by some height. It has two extremes: completely empty/blank, where no pixel is filled in, and full noise, where every single pixel is filled in randomly. In between those two are a whole range of possibilities with pixels filled in with specific values. Filling specific pixels with the right values can give you an image of anything. Scanning over and over again different images of a given type (say cat) and looking at every level of noise for each image, you can map out the rules/patterns for how to alter individual pixels in order to produce that type of image….
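A rough numpy sketch of that "every level of noise" idea. The random array stands in for one cat image, and the real models learn the noisy-to-less-noisy mapping with a neural network, which is left out here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one training image (e.g. a 64x64 grayscale "cat"), values in [0, 1].
clean_image = rng.random((64, 64))

# A schedule of noise levels from "barely noisy" to "pure noise".
noise_levels = np.linspace(0.05, 1.0, 20)

training_pairs = []
for level in noise_levels:
    noise = rng.normal(0.0, 1.0, clean_image.shape)
    # Blend the image with noise: at level=1.0 nothing of the cat is left, only noise.
    noisy_image = (1.0 - level) * clean_image + level * noise
    # The model's job during training: given (noisy_image, level) and the label "cat",
    # predict the noise (or equivalently the clean image). That mapping is what gets learned.
    training_pairs.append((noisy_image, level, noise))

print(f"built {len(training_pairs)} noisy versions of one image")
```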
5:00 Think of the image as a signal or wave, represented by sampling the signal into 0s and 1s. Add noise, meaning randomly placed samples. Now add a filter. Because you know what you are filtering for, you clean out the noise, leaving a higher-resolution image. You have effectively upscaled, meaning there are more points to represent the wave or image.
Just to get this out there - "positive reinforcement" is when you add something nice after the person being reinforced does something desirable. "Here's your chocolate, Penny." "Negative reinforcement" is when you remove something bad after the subject does something desirable. Like that annoying beeping before you buckle your seatbelt; it stops (is removed) when you do the desired thing, buckling your seatbelt. The term for poking someone with a stick when they do something undesirable is "punishment," not "negative reinforcement." Almost everyone gets this wrong. 🤪
6:50 I'm not an expert in any respect regarding how a neural net works on images, but I do know that for each data point that must be learned, there must be a corresponding weight in the model in at least one layer. This doesn't take into account clever methods employed by scientists to reduce computation time. So, using this schema, the output determines the weights required, and by the process of backwards propagation, a neural net during learning adjusts its weights so they match the input. In the case of an image with a resolution of 1024 x 768, in terms of the crude schema that I have proposed, the model would require 1024 x 768 weights. Once the weights are established across all the inputs for a given phenomenon, e.g., "cat", then the model can be put to work generating a cat from the noise, from grosser to increasingly finer gradations at each point in the available matrix, according to the available weights. The choice of gradations would be based on the input words, e.g., "a cat sat on Wesley's laptop computer keyboard while he's filming another shocking YouTube video", such that the various phenomena would be integrated together into the weights by, I imagine, some sort of vector numerical operation. That's my best guess for what it's worth.
Denoise is common in photography applications. How it works is pretty simple to describe, in very high level terms...The original image was not random. It had a very specific pattern that humans (and the AI) can recognize as a dog or a cat or whatever. Then you add random noise to it until we can no longer recognize the original pattern. That does not mean the pattern is gone. It is simply hidden in the randomness of the noise. However, the pattern is not completely gone. It is only hidden from our limited human ability to recognize. Not so for an AI. It simply removes the random parts of the image and what is left is the original pattern. Yes, there is quite a bit of image degradation, but it works.
The bit about noise is simply a pattern recognition process. Like blurring a camera lens: as you unblur the lens, what the image could be becomes more obvious. A filter, basically.
Ok, here is your explanation: one necessary condition for a backpropagation model to learn is that the vectors in the input layer must be separable by hyperplanes. Noise is not separable. So it delays things and makes the network stabilise only in the best minimum. Minimum in the high-dimensional space of the input vectors. Easy 😮
What's wild is I could see how someone training an AI model to play Doom could probably collect a large number of demo files storing playthroughs and use those to teach the model. This kind of took a different turn than I expected.
Regarding how the denoising in the diffusion model works, I think of it similar to an extreme form of compression. You can take a song in .wav format, and with a good compression algorithm, compress it way down to an .mp3, then in real time, as it is playing, the algorithm tries to recreate the original. If instead, you took input from many songs and compressed them, then added some flexibility in the decompression, you have a music generator. The brain of human artists and musicians works in a similar way. A great artist has seen tens of thousands of drawings, paintings, and reference images, and have arranged their neurons to be able to hallucinate various images, so when you ask them to draw a horse, they can hallucinate an image of a horse in their mind and put it on paper. Same with a musician writing musical notes. They hallucinate a song in their mind using decompression of the compressed imprint of all the music they've heard.
Of all the possible ways to remove noise from an image, some are more consistent than others with those arrangements of pixels that the network has learned are associated with a label of “cat”. At each step, the network prefers that pathway of maximum consistency. Noise removal in the direction of “cat”.
Some people say taking human brain cells and forcing them to play doom or do other tasks is a form of slavery because you're taking cells from a sentient being and forcing them to do something that they didn't volunteer to do.
You can write 3D graphics to run on a CPU w/o a graphics card and before OpenGL; they taught courses on this at my university during my CompSci degree. It's not easy though. The class was all about using math to do 3D graphics. You had to write your own code to draw basic primitives like lines, circles, etc., and then write the code to draw a 3D scene and then rotate it, move through it, etc. That is all linear algebra math for transforming and projecting through a viewport. Not impossible, but not a walk in the park either. Carmack is definitely a smart dude to pull this off. Particularly b/c getting this to run in a performant way w/o a graphics card would be a feat just in and of itself.
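For the curious, here's a tiny numpy sketch of the kind of linear algebra such a course covers: rotate a few 3D points and project them through a viewport onto 2D screen coordinates. Toy numbers only, nothing like an actual software renderer from that era:

```python
import numpy as np

# Eight corners of a unit cube centered at the origin (our "scene").
points = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float)

# Rotate the scene 30 degrees around the Y axis.
theta = np.radians(30)
rot_y = np.array([[ np.cos(theta), 0, np.sin(theta)],
                  [ 0,             1, 0            ],
                  [-np.sin(theta), 0, np.cos(theta)]])
rotated = points @ rot_y.T

# Move the camera back so the cube sits in front of it, then do the perspective divide.
camera_distance = 4.0
focal_length = 2.0
z = rotated[:, 2] + camera_distance
screen_x = focal_length * rotated[:, 0] / z
screen_y = focal_length * rotated[:, 1] / z

for sx, sy in zip(screen_x, screen_y):
    print(f"screen point: ({sx:+.2f}, {sy:+.2f})")
```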
Game devs will now have to develop AIs that can simulate the environment they intend, while still having to manage the game mechanics and story. Must be really cool to learn all that :)
I've been telling folks for about 2 years now that AI-genned games are coming. This is an example of it. We're going to see more games like this soon. Perhaps a prompt will guide the story and game mechanics to a degree, but AI-genned games may be fairly unique during each playthrough. The story and mechanics will develop around how the player plays the game, and the AI engine will be a kind of Dungeon Master or Game Master who develops the game as the player interacts with the game. Interestingly enough, this might lower requirements for hard drive space on folks' computers. It might also provide higher detail audiovisuals for roughly the same processing power because it's all being made or buffered by the AI server. Maybe two bad things about this are that we will need solid internet connections and we will likely get charged monthly subs for the games. Until folks can easily store their own AIs at home, I guess we will be connected to servers where the AI generates the games. Unfortunately, I can see game devs charging monthly subscriptions for games like this. I hope small, powerful AI game engines become a thing soon so we can play the game w/o internet connection or monthly subs.
The part @6:00 really has me stumped. Not in the sense that I can't understand it, but that I can't explain it, and I'm usually pretty good at that. The best I can say is that the AI first learns how to convert an image to a distorted image. Then, it learns how to read that distorted image and convert it back to the original image. It is then fed an extremely large database of images so that it can find a pattern and use it to learn to 'create' an image on its own. I hope that makes sense. 😂
Ok, as a neuroscientist, I think that what happens is similar to biological nets: if you tell the computer to noise the picture a million times and then you take the "noised" pics to teach it how to reconstruct the pictures, the neurons will learn each little step of denoising. You may not see the picture intuitively, but WE ACTUALLY DO THIS. Want to see? Look at a toddler learning how to walk: he has a lot of problems, his arms and legs do not respond adequately at all, they have too much noise, and the brain needs to learn how to make the connections more stable and precise. We repeat this thousands, maybe millions of times through life. I just don't know if the non-biological networks can "regenerate" while working, or if they need a stage of learning like us (we do the learning in our sleep; it is physical reconstruction of the brain).
That's just a matter of semantics and a control system, really. It's only a clip because nobody is controlling it. I think it was nVidia who relatively recently showed that you could 'hint' a similar engine, though designed to simulate driving, with controlling inputs via a 'plainclothes' prop. In other words, the prop it's drawing its images _over_ (think image-to-image) steers its wheels to the right and the AI full-on hallucinates an appropriate right turn, replete with terrain and road-markers moving across the field of view. It's really just about keying in the concept of turning, shooting, etc... with some sort of indicator or placeholder, so the engine has something to work with. Look, I'm not trying to put too fine a point on it, but I'm in my 40s. I've _never_ seen any technology, including internet and cell phones, advance at such a rapid pace. AI is actualizing innovations which _should,_ by all rights, be _decades_ away, but in mere _months._ Furthermore, the naysayers who claim the bubble will pop haven't actually done any real research as to what's being experimented with in academic circles. Reading the papers, there's a real claim to be made that we crossed the Singularity with GPT-3. Not full-on AGI or ASI, but that point of inability to predict the future. We have, in labs, all the pieces to produce an ASI. Full stop. It's just a matter of integration hell, now.
@@eSKAone-how is it interactive? From what I am seeing, it seems to be a system that can convincingly choose next frame outputs based off of current context and learned patterns, but it does this in raw chunks. You couldn’t just take over for the ai and do any particular action, because it doesn’t actually understand what that action is, it just knows what the next frame should look like. I think in theory there might be a way to train a model to understand how any particular user input would change game state and then output this as the new world, but that doesn’t seem to be what they are doing here. Disclaimer: I am going off of this video and did not read the paper, so if you have more info on this lmk.
As impressive as this is, it's still totally intuitive to anyone who has sunk hours (days) into these games. You develop an inner sense of the rules of the game. You know exactly how long an Imp's fireball will take to reach you, and can look at a room full of a hundred spawns of demons and instantly assess the relative threat level and form an attack plan that gives you the best chance of survival. If we accept that generative models can learn anything that the average person can through repetition, it makes sense that it can not only learn to play the game, but to *become the game*
When entire games can be created from a simple prompt, with extra prompts to tweak the gameplay, design and setting, the industry will die, but the true metaverse / oasis from Ready Player One will be born. Imagine: Create a Witcher 3 style game set in a 1:1 scale Middle Earth where you play as a young Aragorn learning to become a legendary warrior and experienced Ranger. Or: Create a Gears of War style World War 2 game set in the Battle of Stalingrad with both a German and a Russian campaign. The narratives are true to historical records. Make it epic, horrific and tragic, with smooth gameplay and ultra realistic graphics.
6:59 I think at every iteration it is told to make it more cat-like. It starts with complete noise, for the sake of originality. Humans would start with a blank paper.
Humans work with a white sheet of paper, slowly adding colors; AI works from random noise, slowly adding and removing colors. But here's the thing: the AI can work from a white page also, adding colors till it gets an image too. Humans will say a cat is this shape, so usually start with an outline, then work on the next feature, one or two at a time. But AI will often think a cat is this shape or idea and shade in the entire cat roughly all at once, kind of outlining and filling at the same time, not concentrating on one detail at a time but on a tiny bit of the whole image. Like our minds: if we are given 1 second to imagine a cat vs 20 seconds to imagine a cat, it'd get more detailed instantly; we are just slower at putting it from mind to paper.
@@any1alive Also... the AI has no frame of reference, while humans can simply find a cat picture. Which is why every new iteration of AI has to begin from the ground up with a clean database, as not all iterations can use the same database. While humans can always access a cat picture. Memory loss is a big issue with LLMs. They can create new art and solve problems, but they cannot replicate those solutions, always having to start from step 1, then step 2.
Another key idea here might be divide et impera. I guess telling the AI to turn complete noise into a cat will fail; it needs too much to focus on all at once. Telling it to do it by iteration, just make it slightly more cat-like for now, can make it happen. That's why sometimes it ends up with five paws: the fifth one was intended to be a shadow or something in the background, but the next iteration redefines its purpose; this happens repeatedly, and at some point it looks too paw-like to change it back into a shadow. Why noise: turning something into something else seems to be more feasible for AI than creating from nothing. It's easy to predict what you'll type when there is a previous word. It's easier to predict what changes we should make to a noisy picture to correct it rather than adding things to an empty area. A cloudy sky helps you imagine shapes and animals, and a bit of photoshopping can turn it into actual animals. But no clouds -> no idea. AI probably watched how humans draw, but that wasn't helping it to learn image composition. Turning a more noisy image into a less noisy one is a good approach.
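A bare-bones numpy sketch of that "just make it slightly more cat-like for now" loop. The target array and the nudge function are stand-ins for what a trained diffusion model encodes in its weights:

```python
import numpy as np

rng = np.random.default_rng(0)

cat_target = rng.random((64, 64))   # stand-in for "what a cat looks like" (baked into real model weights)
image = rng.normal(size=(64, 64))   # start from clouds / pure noise

def slightly_more_cat_like(img):
    """Stand-in for one denoising step: move a small fraction of the way toward 'cat'."""
    return img + 0.1 * (cat_target - img)

for _ in range(50):
    image = slightly_more_cat_like(image)

# Early steps could have nudged a blob toward a paw or a shadow either way;
# by the end the picture has settled close to the learned "cat" pattern.
print("distance from target:", np.linalg.norm(image - cat_target))
```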
@@absolstoryoffiction6615 Yep, pretty much. It's easier to feed a human than an AI: if we see a weird-looking cat, it'll seem weird for a few seconds, then yep, that's a cat. But if the AI sees a weird cat, it's not just a few seconds to learn and adapt its net, as it's usually pretty fixed, and you can only store so much data in so many points/nodes/neurons or branches. Data cramming, that's where data programming and other scientific fields come in. Like, how many numbers can you count with 4 fingers? Well, 4, but you can also do 8 if you use them to count in binary, or 81 if you count in trinary, or 3 per digit... or in 4's, 3 digits plus it being down as 0, you could count to 256, all that with just 4 fingers. But it's a matter of trying to remember or store more than that, or you can't, kind of, without cramming or stuff. I went off on a tangent there and forgot the point I was gonna make lol. But yeah, it's hard to store more information in a certain amount of points; that's why larger models can fit more information, and it's a big deal when a small model is efficiently trained and works well.
@@Slaci-vl2io Yeah, that's usually the steps or iterations in some image generators: how many times to do the loop vs 1 big hard step; cut it up into many small ones.
I usually don't click your videos anymore because I do not want to reward clickbait titles. I am glad I made an exception here. I am sure you are using tons of AI tools to help write the script and for the video footage. But the production value of this video - apart from the interesting content - is still insane! Well done.
It's a good start. It means people can design games without spending a fortune for the next 5 years. Unless of course... Companies and Governments begin to rot it, as always.
Well, it's almost exactly like doing that, only the laws of thermodynamics still hold, so you have to do work to get it to happen. AI consumes massive amounts of CPU cycles.
imagine staring at the clouds or at the texturing on the wall for hours on end. you start to see things... shapes and patterns in the randomness. this is kinda like what's going on: multiple iterations of denoising that eventually look like the thing that was suggested (the prompt), the way you see shapes in the clouds.
diffusion works by noticing features inside images as noise is added to them: patterns of colors and shapes. Think of a feature as a pixel made of pixels; like paint, by layering these features together it is possible to construct any image. The magic is in the labels on the input images, such as "man with a beard" and then "man with a cat". With enough samples, man, beard, and cat all get their own features extracted, so that "cat with a beard" is possible to layer out of features.
6:59: This method won't work for us because we don't have perfect memory. Think about how drawing works for us: you need to remember the shape of the object, where a line goes, how long it is, and so on.
Watch this become a DRM method. Instead of allowing downloads of your actual code and assets, you could train a model on your game and only release access to the AI model. Now no one can hack, mod, backwards engineer, etc.
Since everything is entropy and noise is an analogue for entropy, all you need to do is reverse entropy by expending energy and that's how I comprehend it with my limited understanding. With an advanced enough (atomic/molecular) AI it could theoretically transform any matter into any matter with sufficient energy/time.
You film yourself disassembling phones for thousands of hours. Then you reverse all the videos, and you train an AI on how to assemble an iPhone with that reversed footage. That would be a closer analogy, since they train the AI on the reverse process, from noise to image. They add noise, then they reverse it, then they train on the reversed "de-noising" process.
This is so meta. AI that recreates the world in real time as you are playing the game. The more we advance AI every day, the more I think that we are in a simulation. Whoever or whatever overlords put us in this sim, I think they're having a pretty good time watching us lmao
I'm world building for a game and I have an idea. I was recently using Talkie, Character AI, and some other app, and I was thinking why not implement that tech in games? So my idea is: 1. Build a world with deep lore. 2. Come up with stories in said universe with really no real main quest. 3. Make an AI (Game Master). 4. Let said Game Master make changes on the fly based off of what the player does. This includes dialogue, choices from all NPCs, and so on. I was also thinking, why not make a lot of different emotional animations and put them in a folder for the Game Master to pull from, to make the game feel more lively. I'm building a SciFi universe set in Sol, and the AI's name will be Sol as well. I believe this tech could work well. It does in Talkie and Character AI, so why not an actual game? I got this idea when I had already built this universe within Character AI. I have not made it public because it's for me only, but that universe I made in Character AI is pretty fun. I made a smaller version in Talkie called Sol RPG or something like that. It's smaller because I had less space to work with in Talkie. I'm building the full world in Character AI and getting quest ideas as I go through it over and over again, so if I ever make the actual game I have a solid foundation.
When I was a child I used to draw squiggly lines all over a piece of paper, just to 'see' what shape or drawing I saw in the chaos. I figure AI learns or works like that too in this field.
I like your coverage. Literally HATE your clickbait titles. Nothing is broken or a game changer every week. It's just a demo.
Agreed. Instant thumbs down and unsubscribe (if applicable) and I turn on ad-block when they do this clickbait foolishness.
We're all adults here. No need for childish titles.
The clickbait is why I clicked, so I see nothing wrong with it
I'm seeing this more and more where I might love a channel, love their videos, but I swear to God I cannot unsubscribe fast enough if they do that clickbait crap. I can't stand it. You know how many videos say "oh my God, scientists find asteroid, has life, guaranteed, announcement to come soon" and then you watch the video and it's nothing about that. Nothing. Oh, they should be able to be put in jail.
Bad bot
Well it's a complex problem. If you want views you have to play the algorithm game.
That's not the kind of AI Doom scenario i expected.
Superb comment 🤘🏼
Haha!
@@OscarTheStrategist Read, as always, in Shao Kahn's voice
Nice lol Cheers
It's a real doom scenario. For all humanity. Taking all of our jobs and replacing human connection
I like how DeepMind doesn't promise anything. They just publish stuff.
'Hey! Look! We can now fold proteins. We can now create new synthetic materials. We can now do videogames.'
Can, could, might, may ... yep, that's all I've heard from the AI field in general recently. Wake me up when these things are actually available to the general public; until then, it's all vaporware.
💯
Demis Hassabis > Sam Altman?
Diffusion models are a true "black box" problem! Excellent breakdown but I don’t think music in the background is necessary IMHO. I Love the channel though.
@@pvanukoff Oh it's not all vaporware. It's just that the good stuff is secured deep within the vaults of the hyper-wealthy. It's real, it's just not for you.
@@pvanukoff you're so extremely dense bro
Commander Keen!
We’re old 😂
That's back when i programmed in GWBASIC
On 3.5 lol
I remember playing that game with this weird joystick where you could screw a little peg in the middle of the D-Pad.
I lost that piece almost immediately xD
Lol yes! 486….shareware….Doom!
This will mean that most of the in-game architecture and objects don't need to be built. They only need text-based descriptions. Any actions on those objects can be saved as text. The AI interprets it all. The NPC's will be the same. Games are going to change in a big way.
I only have one question... How much Memory Data do you think these games will require?
Algorithm Generation on a dynamic level into the game requires a lot of memory over time unless the player is able to delete those files/data in-game. Without deleting the Save File.
Astroneer and many Sand Box games have a cap on how much data the game can handle before crashing.
@@absolstoryoffiction6615 Probably not that much data for the actual gameplay, anything created during the game would probably only need to be saved as text files, it doesn't need to track separate objects since it's just a picture. It's like an LLM for video games, though the generator would then be far too large for anyone to store locally so I'd assume online play only if you want real time generation. There are AI models that can generate images already so this is "just" asking it to "predict the next frame", but with player input and fast enough frame generation.
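A toy Python loop showing the shape of that "predict the next frame from recent frames plus player input" idea. `predict_next_frame` here is a made-up stand-in; the real system would run a large trained diffusion model every frame:

```python
import numpy as np

rng = np.random.default_rng(0)
HEIGHT, WIDTH = 240, 320   # toy frame size

def predict_next_frame(recent_frames, action):
    """Stand-in for the trained model: in reality this would be a diffusion model
    conditioned on the last few frames and the player's input."""
    drift = {"forward": 0.01, "left": -0.02, "right": 0.02, "shoot": 0.05}[action]
    return np.clip(recent_frames[-1] + drift + rng.normal(0, 0.01, (HEIGHT, WIDTH)), 0, 1)

# Roll the "game" forward: the only state is a short window of recent frames.
frames = [np.zeros((HEIGHT, WIDTH))]
for action in ["forward", "forward", "left", "shoot", "right"]:
    context = frames[-4:]   # short memory window of past frames
    frames.append(predict_next_frame(context, action))
    print(f"action={action:8s} mean pixel value={frames[-1].mean():.3f}")
```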
We could basically give it a novel or script from a movie/series and have it create a game universe in that world
@@absolstoryoffiction6615 It is pretty big. Tesla has already been doing this type of thing, training FSD for its fleet. They can simulate the roads and their conditions using all the data they have, and generate scenarios and situations on the road.
in a weird sense, The Sims actually did something slightly similar 20 years ago. The key is that in The Sims, it's not the characters that look for objects, it's the objects that advertise their descriptions to the characters and this is for performance reasons.
Simulating different pi worlds:
We could create virtual environments where pi has different values, allowing us to observe and study the effects.
Scientific breakthroughs:
New perspectives on the anthropic principle and fine-tuning in cosmology
Insights into the role of fundamental constants in complex systems
Potential discovery of hidden mathematical relationships
Think of the process like sculpting from a block of stone. The random noise is like the raw stone block, and the model learns how to chisel away the noise (the excess stone) to reveal the final image (the sculpture). By learning how to remove noise in this controlled manner, the model can generate complex images that look like they were naturally created, even though they started from nothing but noise.
Noise has everything. It selects what is required for a given frame.
An infinite ad hoc FPS is just entertainment, but an infinite ad hoc RPG is an early version of the Matrix.
The best analogy we have at the moment is 'we live in a simulation'.
Code creates the mountains, the rivers, the trees, the birds in the wind...the wind itself.
In this simulation, our code is consciousness.
There are innumerable 'games' being played at the same time, in the same space. We call these dimensions.
Consciousness can and does transcend space, time, and dimensions.
Edit: and it is our unconscious/subconscious aspects that are reflected to us
""In the province of the mind what one believes to be true, either is true or becomes true within certain limits. These limits are to be found experimentally and experientially. When so found these limits turn out to be further beliefs to be transcended." John C Lilly
"The universe is an engine of narrative" Terrence McKenna
"This is my real country! I belong here. This is the land I have been looking for all my life, though I never knew it till now... Come further up, come further in!" CS Lewis
I used to paint a bit and have no technical AI chops, but my naïve intuition is diffusion models act a bit like artistic imagination to focus ideas onto images. Similar to looking at a bunch of dots or clouds then conceiving there might be a face in there somewhere, abstract thought processes bias the brain's expectation to actually imagine faces in the noise or clouds. Seems to me the denoising training is just a way to train associations in the latent space of concepts to bias the generative process post training. ie prompting a diffusion model is just priming the salient connections in the latent space concepts prompts allude to, to preferentially manifest the related image representations...
I have thought about this a lot previously, and I think you're exactly right.
it's the closest to robot dreams. People think it's just colorful wording, but it really is the closest layman description.
Imagine getting born just to exist inside a DOOM game for all eternity as a research subject
I don't have to imagine it, I was born to just respond to youtube comments
@@potat0-c7q the future is here 😞
Imagine an AI singularity being born into a fully realized virtual reality? Then who's to say it hasn't already happened? If we can do that, then it's probably already happened to us. Like in the plot of The 13th Floor. 2nd best sci fi concept since The Matrix IMHO. We become god at that point. I'm not even sure if True AI is possible because it leads. I don't think we will see it in our lifetime if it is even possible.
If life is but an AI-genned dream or the imagination of GaWd or a nightmare of a butterfly or whatever else, then I accept my position in the dream or imagination or nightmare or whatever. I will laugh and cheerfully go on about my absurd life, like nothing ever mattered. 😂
@@Paraselene_Tao You will gladly eat the steak knowing you're trapped in a VR prison... as long as you're rich! Ha
I believe that image diffusion models work by stretching vectors and tweaking pixel color values in latent space based on the training data, weights, conditioning (prompt, ControlNet, etc.), CLIP (how it identifies objects/text in the image), and the current noise. If you tell it to make a cat, it will start with noise or an image with noise added. It will stretch image vectors and adjust the pixel RGB values to more closely match the cats in the training data that had the most similar noise pattern with the most similar prompt/terms being used at that specific noise level. This will usually be a blend between multiple images in the training data that have a similar noise profile; this is why it makes new cats. It starts with the most commonly seen part, usually a face. It will then add a very specific calculated amount of noise each step, and subtract it based on the CFG scale (more CFG will denoise faster, in effect following the prompt more by setting more groundwork early on) and the denoise amount.
It's much more complicated, but this is how I've come to understand it over the last couple of years.
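For reference, the usual way a CFG scale is applied is as a mix of a prompt-conditioned prediction and an unconditioned one. A toy numpy sketch, with random arrays standing in for the model's two outputs at one denoising step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the model's two noise predictions at one denoising step.
noise_pred_unconditional = rng.normal(size=(64, 64))   # "what noise is here?" with no prompt
noise_pred_conditional = rng.normal(size=(64, 64))     # same question, prompted with "a cat"

cfg_scale = 7.5   # a typical default in Stable Diffusion front-ends
# Classifier-free guidance: push the prediction away from "no prompt" and toward "cat".
guided = noise_pred_unconditional + cfg_scale * (noise_pred_conditional - noise_pred_unconditional)

# A higher cfg_scale exaggerates whatever the prompt changed, so the result follows
# the prompt more strongly (at the cost of variety and sometimes image quality).
print("guided prediction shape:", guided.shape)
```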
It's an interesting technology but not market viable yet.
Imagine an 8 Ball which can auto generate images on the fly. That's novelty.
But selling pictures that were generated?... ... ... I'd rather use Google Image Search for the same product, but better & free.
so more token prediction but with image vectors?
@@dj007twk Yeah, basically. Similar to an LLM, just with vectors and RGB values as output predicted off the noise/conditioning input rather than text input. For video, they add a temporal prediction component to predict the change between frames, given the previous frames. This requires another model, such as AnimateDiff. I think they basically trained their own AnimateDiff model on Doom with added parameters for keyboard input. This isn't doing anything new. This is why it's able to keep track of ammo accurately, it's temporally consistent, but not as good with less predictable values that it has less training on.
Thanks for sharing your understanding
@@BlakeEM
True
6:15 - Intuitive Diffusion Explanation - I think diffusion models will start to make sense to you once you introduce the idea of blending as a step-by-step process, which learns the patterns in each step and associates them with words in the prompt ("iPhone"). Let's revisit the blending example with this in mind. In the first step of blending, the components of the iPhone might still be mostly intact, just slightly damaged or bent. Now just go backwards: it's relatively straightforward to imagine reconstructing the original iPhone from this slightly altered state. In the next step, the phone might be cut a bit; again, you could think about just gluing the parts back together. With each successive step, the damage increases, but simultaneously, the model learns how to reconstruct from a slightly more degraded state back to a more complete form. Each iteration teaches the model both a pattern of "deconstruction" and a pattern of "reconstruction."
I think the main roadblock you were having had to do with trying to imagine how you could reconstruct a full phone from its fully blended components, but this mental model doesn't include within it the learning that occurs at each step to go from fully functioning phone to blended pulp of a phone and back again. If you include that "intelligence" in the process and look at each step as learning representations to go back and forth between the more blended and less blended phone iterations, you can imagine how you could take a phone pulp and reconstruct a phone out of it (at least visually, not materially).
Think of the "noise" or initial state like a set of scattered puzzle pieces that are random enough and numerous enough to form any image we might want if we modify them slightly. We can dip the puzzle pieces in paint to change their color or cut their edges a little more if we really want to. The prompt acts as a guide for assembling, cutting, and coloring these pieces. When the model is effectively trained, it learns how to use the prompt to modify and organize the pieces together, step-by-step, to form a coherent picture. It learned this from going backwards thousands of times and learning "in general" what to do each step.
How was that for an intuitive explanation? I tried to keep all the math out and just give the concept. Does it make more sense how it is possible and how it works?
@Wes Roth - good explanation here
An AI never ending map of GTA would be great.
Wolfenstein 3D was the first 1st person shooter, released in 1992 2:11
There were loads of 3D games for PC before Doom and Wolfenstein. For instance, 1983's Star Wars, or 1989's MechWarrior. In fact, id Software made a game in 1991 that was the predecessor to Wolfenstein 3D, called Catacomb 3-D. One of my favorites was the D&D game by SSI called Eye of the Beholder. I think it might have been the first 3D PC game that I played. People often mistakenly say that Doom was the first 3D game; Doom was the first popular 3D game, and Wolfenstein 3D was the runner-up to that. There weren't enough people into PCs in the '80s for the early 3D titles to shine. Some of the best PC games were made from the late '80s to the mid '90s. Hands down.
Imagine playing a game with Flux or Midjourney type detailed graphics. When the graphics shift from geometry-based polygons to highly detailed 2D images that appear 3D, complete with fake volumetric particles, ray tracing, and super smooth angles in every frame. Achieving that level of detail at 60 fps is a long journey ahead. But once it's achieved, the graphics will be more beautiful than real life.
It will be Gumball, with different animation styles and art styles being dynamic instead of static to the player.
Add some extra years for the VR versions. What a time to be alive !
Maybe the simulation we live in is more beautiful than real life
VR adult movies generating on the fly depending on your arousal state 🤤
we'll soon be playing with electric sheep
Commander Keen! ⛑️
You beat me to it..
@@ChrisSherlock good times, I loved those games
Commander Keen is a series of video games developed by id Software, starting in 1990. The main character, Billy Blaze, also known as Commander Keen, is an eight-year-old genius who travels through space and battles enemies using his raygun and pogo stick. The first set of episodes, "Invasion of the Vorticons," was released for MS-DOS in 1990. A standalone game called "Keen Dreams" was developed as a prototype for new systems and ideas for future games. It was completed in less than a month while simultaneously working on another game. The series has since gained popularity, with over 80,000 owners of the Keen Dreams release and 200,000 owners of the Commander Keen Complete Pack on Steam.
This is actually a crazy development. I didn't think we'd get this so soon. It doesn't stop at video games. Think about the simple things like applications. If a model learns how to just generate UX/UI without code, AND personalized... then programming probably really is cooked
Don't be silly, a UX is only the interface to the underlying logic.
LLMs are much better at writing code than this technique, which is hard to verify for correctness; hence they are trying it on games.
@TheReferrer72
In other words... a work in progress and not yet market viable.
Good for devs, but a bad game is still a bad game.
It certainly cuts costs, but I would still hire someone who knows C# or C++ (etc.) really well, or who knows the game engine, just in case the game breaks.
@@absolstoryoffiction6615 No way. Devs' jobs are secure for at least a couple of years; these tools may enhance their toolset, not take their jobs.
@TheReferrer72
Correct... because keeping experienced employees long term (5+ years) is not common in the gaming industry. It happens, but rarely.
@TheReferrer72 No, the AI is the UI AND the logic. Wes doesn't mention it, but in Matt Berman's vid he points out that the AI does visually keep up with stats like the amount of ammo. Maybe not 100% accurate, but this is just the beginning. LLMs will create "world models" of applications, generate your UI, and translate your usage/intent directly into compute and storage.
Diffusion models start with a noisy image because it's a random blank slate. You could train it to start with a white background or something, but you get better results with noise, similar to having the weights and biases random in the model when you start training. I think of the slow diffusion as similar to chain of thought reasoning. It produces a better image if it slowly steps towards it from the random beginning rather than taking a giant leap. It has to slowly push each pixel to the right RGB values.
That's how I think of them anyway.
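To make the "small steps instead of a giant leap" point concrete, here is a simplified sampling loop in the same hypothetical PyTorch style; the exact update rule in real DDPM papers has a few more terms, so treat this as a sketch of the shape of the process, not the precise math.

    import torch

    @torch.no_grad()
    def sample(model, shape, T=1000):
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alphas_bar = torch.cumprod(alphas, dim=0)
        x = torch.randn(shape)                           # start from pure random noise
        for t in reversed(range(T)):                     # many small steps, not one leap
            predicted_noise = model(x, torch.full((shape[0],), t))
            # nudge every pixel a little toward the predicted clean image
            x = (x - betas[t] / torch.sqrt(1 - alphas_bar[t]) * predicted_noise) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # keep a bit of randomness until the end
        return x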
About intuitively understanding diffusion models, here are a couple of ideas; the second one is more general and metaphoric, but also way more relatable:
-1 A Glorified (Intelligent) Denoising Algorithm: a traditional denoising algorithm, knowing nothing about what kind of picture/subject it is trying to rebuild, may work by deducing the values of random pixels from the pixels around them. The diffusion algorithm, instead, knows what kind of picture it is trying to rebuild. It will accommodate the random pixel values to a pattern that matches the prompted picture/subject (on which it has been previously, thoroughly trained). If you can imagine this process successfully removing the noise from a picture with, say, just 10% noise, in order to get a perfectly clear picture (much more efficiently than the "dumb" traditional denoising algorithm), then remember that the model has also been trained with labelled pairs of pictures for the previous, intermediate steps for the prompted subject: from 20% noise to 10%, from 30% to 20%... up to 100% (I made up the exact values, but you get the idea).
-2 An Acute (Artificial) Case of Pareidolia: which is that thing that happens to us humans when we "see" something very familiar in a random configuration, like seeing animal shapes in clouds, or Jesus Christ's face on a toasted slice of bread. The diffusion algorithm has been trained on lots of patterns (they're imprinted in the weights of the neural network, kind of like the patterns of faces or familiar animals are imprinted in our brains), and by prompting we're asking the AI to extract those very well-known patterns from a totally random configuration.
-
I'm a graphic designer who knows nothing about mathematics or programming (plus, I'm not a native English speaker, in case anything sounds funny). These ideas are made up from watching some technical demonstrations and from my own experience working with Stable Diffusion, but I think they may be roughly right, since they've helped me understand the limitations and nuances of generative art. If @Wes Roth himself or anyone tech-savvy can elaborate on them (or correct/criticize them), that would be great. Cheers.
Yeah, I agree. Pareidolia is a sort of decompression, or hallucination of what you want to see in randomness, and is definitely connected to generative A.I. It's crazy how learning about generative A.I. has so many insights into how the human mind works, and possible implications about reality itself.
It's a great day when you see a Wes Roth video :))
I'm 50 years old now. I played this game for the first time around 1995 on a Sinclair ZX Spectrum 128K in Portugal. And I still have it today!
The game was amazing at the time. Hours and hours of playing... I remember it so well, as if it were yesterday!
A diffusion model is essentially reversing entropy; if that's possible in code, it would be fascinating to explore how it relates to physical reality.
ChatGPT's rough explanation for image diffusion models: An intuitive analogy: think of it like sculpting a statue from a block of marble. Initially, the marble is just a rough shape (like noise). With each chisel stroke (denoising step), the sculptor refines the shape until a clear statue (image) emerges. The diffusion model is like a sculptor that knows how to chisel away noise to reveal the image underneath.
Premises:
1. The Loab is a product of several signals dropping off into a common, latent, noisy space, but this space isn't the "outer bullseye" of 100% pure noise; it's more like 80%. The weak signals produce a messy gestalt.
2. This works from the image end, and produces more noise, 100% pure noise, and then asks for a particular image, cuing an iterative process that *can* produce the image requested.
3. Requesting the same still image continuously causes something like a "reversion to the mean," where the noise gradually creeps in, as no *new*, stronger signal is being generated. Only producing a NEW image renews the signal strength.
4. As these images were formed from noise, they're inherently resistant not only to image noise, but are also more intensely integrated into temporal sequences. Halt the temporal sequence and the noise starts to creep in. The momentum is everything. You've got yourself a hyperdimensional flywheel.
Excellent video. I appreciate the work you put into this one. Especially the South Park reference, and the other little gems, that raise the entertainment levels.
This is fascinating. At present it doesn't seem to have a "memory" of the game state: you go through a door, shoot an enemy to leave a corpse, go away and come back to the door, the corpse is gone and the zombie is back? (video clip from 13:04 onwards)
I'm surprised we don't yet have an AI making WADs/levels; that's what I'd like to see (and I don't mean procedural like SLIGE, I mean deep learning).
John Carmack should get some kind of evolutionary-genius, made-a-huge-difference award or something. I STILL play DOOM, kids are STILL playing Doom, either today's Doom or yesterday's Doom... lol.
7:00 It's more like taking apart the iPhone, piece by piece. In the AI's case, pixel by pixel. If you do that a few thousand times, you'd be able to build an iPhone from scratch parts. Likewise, after taking apart enough images pixel by pixel, it can rebuild them from pixels.
FASCINATING!!!! I'm not even a gamer and this video fascinates, also love the side story about putting Doom on tiny screens - I had no idea
Brilliant! One of the best youtube videos I have seen in a while.
Nice voice man, nice presentation.... Noice.
It's honestly kind of poetic that the first game this technology is being used on is Doom. Doom and Quake are pretty much responsible for, and inspired, most of the games we play today.
This is AMAZING and this is TODAY!!! Next year it will be 10x better easily!!! great video!!!
If you watch 1000 videos of a stick person being drawn, guaranteed you will learn exactly what it takes to draw a stick person, even if they are in different poses or there are multiple types of them. Now imagine the same thing for everything else you see. This is why diffusion models work.
That wouldn't seem to work for something like playing an instrument. You could "look" like someone who is actually playing an instrument, but you wouldn't really be playing without actually playing and "learning" the instrument.
Unless, you were able to simulate lessons and “playing/learning” an instrument. (Perhaps) that’s what it is doing(?).
Is this the general idea?
@@milesgrooms7343 if you think of it from the perspective of simulating the output, you’ll have the idea. Music generation models “listen” to the sounds being played for them in the same way image models “look” at images. They then try to simulate the sounds based on the prompt they receive. Their goal isn’t to actually play the instrument but to generate a sound that sounds like the output of someone playing the instrument/singing/rapping etc.
@@NA18NA makes sense, I follow. I have not used or played around with any of the generative AIs.
It's amazing to me to hear/see what is possible…. I wondered how it's doing it, though.
I would be very surprised to find out that the Doom that was generated actually represented a consistent world with real physical laws, although the randomness of it might add something to the game. For example, the enemies might have somewhat random hit points.
Can't believe number theory has led to this. Amazing 🔥
Maybe whatever traumatized the AI agents in that building is what Ilya saw.
You had me at "positive reinforcement", but the "ASMR voice" was pure joy!
@WesRoth Generative image diffusion with noise basically works in the same way that humans look at clouds and imagine they see familiar shapes. In real life, clouds are really just a bunch of random noise, but humans look for patterns, so when it starts to resemble something similar to what a human has seen before, the human starts to think they see the clouds shaped like that object. Diffusion models work in the same way, where computers are taught to try to see shapes in random noise, and basically try to imagine that they see something in the noise.
Negative reinforcement = punishment for an incorrect action
Positive reinforcement = praise for a correct action
The noise is a reference point. You have the canvas within which to generate the image, some width by some height. It has two extremes: completely empty/blank, where no pixel is filled in, and full noise, where every single pixel is filled in randomly. In between those two is a whole range of possibilities, with pixels filled in with specific values. Filling specific pixels with the right values can give you an image of anything. By scanning over and over again different images of a given type (say, cat) and looking at every level of noise for each image, you can map out the rules/patterns for how to alter individual pixels in order to produce that type of image….
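One detail worth adding to this: the model doesn't have to blend an image step by step to get a training example at some noise level; any level can be produced directly in closed form. A tiny sketch, with a made-up schedule (the names and values are illustrative only):

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    def noisy_version(image, t):
        # Level t is just a fixed mix of the original image and fresh random noise,
        # so training can jump straight to any point between "clean" and "full noise".
        noise = torch.randn_like(image)
        return torch.sqrt(alphas_bar[t]) * image + torch.sqrt(1 - alphas_bar[t]) * noise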
5:00 Think of the image as a signal or wave, represented by sampling the signal in 0s and 1s. Add noise, meaning samples placed randomly. Now add a filter. Because you know what you are filtering for, you can clean out the noise, leaving a higher-resolution image. You have effectively upscaled, meaning there are more points to represent the wave or image.
Just to get this out there - "positive reinforcement" is when you add something nice after the person being reinforced does something desirable. "Here's your chocolate, Penny."
"Negative reinforcement" is when you remove something bad after the subject does something desirable. Like that annoying beeping before you buckle your seatbelt; it stops (is removed) when you do the desired thing, buckling your seatbelt. The term for poking someone with a stick when they do something undesirable is "punishment," not "negative reinforcement."
Almost everyone gets this wrong. 🤪
Computer games are the modern-day puzzle to be read/listened to and enjoyed, thank you dude :)
0:42 The most underrated statement of the entire video. You're basically hijacking someone else's dreams at that point. That's kinda cool tbh
So... It's like the Matrix?
Bro, that quick shot of Commander Keen!! I totally remember playing that game. Haven't thought about that in years 😂
This type of research is how the UAC was born
@Wes Roth Commander Keen!! I actually had that on a 3.5" floppy waaaay back in the day lol
6:50 I'm not an expert in any respect regarding how a neural net works on images, but I do know that for each data point that must be learned, there must be a corresponding weight in the model in at least one layer. This doesn't take into account clever methods employed by scientists to reduce computation time. So, using this schema, the output determines the weights required, and by the process of backpropagation, a neural net during learning adjusts its weights so they match the input. In the case of an image with a resolution of 1024 x 768, in terms of the crude schema that I have proposed, the model would require 1024 x 768 weights. Once the weights are established across all the inputs for a given phenomenon, e.g., "cat," then the model can be put to work generating a cat from the noise, moving from grosser to increasingly finer gradations at each point in the available matrix according to the available weights. The choice of gradations would be based on the input words, e.g., "a cat sat on Wesley's laptop computer keyboard while he's filming another shocking YouTube video," such that the various phenomena would be integrated together into the weights via, I imagine, some sort of vector numerical operation. That's my best guess, for what it's worth.
Denoising is common in photography applications. How it works is pretty simple to describe in very high-level terms... The original image was not random; it had a very specific pattern that humans (and the AI) can recognize as a dog or a cat or whatever. Then you add random noise to it until we can no longer recognize the original pattern. That does not mean the pattern is gone; it is simply hidden in the randomness of the noise, hidden from our limited human ability to recognize it. Not so for an AI: it simply removes the random parts of the image, and what is left is the original pattern. Yes, there is quite a bit of image degradation, but it works.
The bit about noise is simply a pattern-recognition process, like blurring a camera lens: as you unblur the lens, what the image could be becomes more obvious. A filter, basically.
Ok, here is your explanation: one necessary condition for a backpropagation model to learn is that the vectors in the input layer must be separable by hyperplanes. Noise is not separable. So it delays things and makes the network stabilise only in the best minimum, meaning the minimum in the high-dimensional space of the input vectors. Easy 😮
What's wild is I could see how someone training an AI model to play Doom could probably collect a large number of demo files storing playthroughs and use those to teach the model. This kind of took a different turn than I expected.
Imagine reincarnating and you're in Doom on an internet server somewhere unknown, and your only purpose is to wander around endlessly to train AI.
Regarding how the denoising in the diffusion model works, I think of it similar to an extreme form of compression. You can take a song in .wav format, and with a good compression algorithm, compress it way down to an .mp3, then in real time, as it is playing, the algorithm tries to recreate the original. If instead, you took input from many songs and compressed them, then added some flexibility in the decompression, you have a music generator. The brain of human artists and musicians works in a similar way. A great artist has seen tens of thousands of drawings, paintings, and reference images, and have arranged their neurons to be able to hallucinate various images, so when you ask them to draw a horse, they can hallucinate an image of a horse in their mind and put it on paper. Same with a musician writing musical notes. They hallucinate a song in their mind using decompression of the compressed imprint of all the music they've heard.
Of all the possible ways to remove noise from an image, some are more consistent than others with those arrangements of pixels that the network has learned are associated with a label of “cat”. At each step, the network prefers that pathway of maximum consistency.
Noise removal in the direction of “cat”.
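In current text-to-image systems, that "direction of cat" is usually implemented as classifier-free guidance: the network predicts the noise twice, once with the prompt and once without, and the step is pushed toward the prompted prediction. A hedged sketch (the function names, signature, and scale value are placeholders, not a specific library's API):

    def guided_noise_prediction(model, x, t, cat_embedding, empty_embedding, guidance_scale=7.5):
        eps_unconditional = model(x, t, empty_embedding)   # what it would remove with no prompt
        eps_conditional = model(x, t, cat_embedding)       # what it would remove if this is a "cat"
        # exaggerate the difference so each denoising step leans toward "cat"
        return eps_unconditional + guidance_scale * (eps_conditional - eps_unconditional)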
Yes, I'm old enough and nerdy enough to remember Commander Keen. Great game.
Some people say taking human brain cells and forcing them to play doom or do other tasks is a form of slavery because you're taking cells from a sentient being and forcing them to do something that they didn't volunteer to do.
You can write 3D graphics to run on a CPU w/o a graphics card and before OpenGL; they taught courses on this at my university during my CompSci degree. It's not easy though. The class was all about using math to do 3D graphics. You had to write your own code to draw basic primitives like lines, circles, etc., and then write the code to draw a 3D scene and then rotate it, move through it, etc. That is all linear algebra math for transforming and projecting through a viewport. Not impossible, but not a walk in the park either. Carmack is definitely a smart dude to pull this off. Particularly b/c getting this to run in a performant way w/o a graphics card would be a feat just in and of itself.
Game devs will now have to develop AIs that can simulate the environment they intend, while still having to manage the game mechanics and story. Must be really cool to learn all that :)
Wolfenstein 3D, released one year prior, directly led to the development of Doom.
Wasn't it the same guy?
@@andrasbiro3007 Yep lol. And Doom is a better game in every way, so I don't know why this was even brought up.
@@delphicdescant
Of course Doom is better, it's the next iteration of the game engine.
@@andrasbiro3007 Right, that's what I'm saying.
I've been telling folks for about 2 years now that AI-genned games are coming. This is an example of it. We're going to see more games like this soon. Perhaps a prompt will guide the story and game mechanics to a degree, but AI-genned games may be fairly unique during each playthrough. The story and mechanics will develop around how the player plays the game, and the AI engine will be a kind of Dungeon Master or Game Master who develops the game as the player interacts with the game.
Interestingly enough, this might lower requirements for hard drive space on folks' computers. It might also provide higher-detail audiovisuals for roughly the same processing power, because it's all being made or buffered by the AI server. Maybe two bad things about this are that we will need solid internet connections and we will likely get charged monthly subs for the games. Until folks can easily store their own AIs at home, I guess we will be connected to servers where the AI generates the games. Unfortunately, I can see game devs charging monthly subscriptions for games like this. I hope small, powerful AI game engines become a thing soon so we can play the game w/o an internet connection or monthly subs.
The response time of the AI Doom to your key presses would be insanely high. If the reviewers had actually played the game, there'd be no confusion.
The part @6:00 really has me stumped. Not in the sense that I can't understand it, but that I can't explain it, and I'm usually pretty good at that. The best I can say is that the AI first learns how to convert an image to a distorted image. Then, it learns how to read that distorted image and convert it back to the original image. It is then fed an extremely large database of images so that it can find a pattern and use it to learn to 'create' an image on its own. I hope that makes sense. 😂
Ok, as a neuroscientist, I think that what happens is similar to biological nets:
If you tell the computer to noise the picture a million times and then take the "noised" pics to teach it how to reconstruct the pictures, the neurons will learn each little step of denoising. You may not see the picture intuitively, but WE ACTUALLY DO THIS. Want to see? Look at a toddler learning how to walk: he has a lot of problems, his arms and legs don't respond adequately at all, they have too much noise, and the brain needs to learn how to make the connections more stable and precise. We repeat this thousands, maybe millions of times through life. I just don't know if non-biological networks can "regenerate" while working, or whether they need a separate learning stage like us (we do the learning in our sleep; it is physical reconstruction of the brain).
ASMR voice threatening 😂👍
I am old. Played Commander Keen as a kid. Liked it.
This is not a video game generation model. It's a video game clip generation model.
That's just a matter of semantics and a control system, really. It's only a clip because nobody is controlling it. I think it was Nvidia who relatively recently showed that you could 'hint' a similar engine, though designed to simulate driving, with controlling inputs via a 'plainclothes' prop. In other words, the prop it's drawing its images _over_ (think image-to-image) steers its wheels to the right, and the AI full-on hallucinates an appropriate right turn, replete with terrain and road markers moving across the field of view. It's really just about keying in the concept of turning, shooting, etc. with some sort of indicator or placeholder, so the engine has something to work with.
Look, I'm not trying to put too fine a point on it, but I'm in my 40s. I've _never_ seen any technology, including the internet and cell phones, advance at such a rapid pace. AI is actualizing, in mere _months,_ innovations which _should,_ by all rights, be _decades_ away. Furthermore, the naysayers who claim the bubble will pop haven't actually done any real research into what's being experimented with in academic circles. Reading the papers, there's a real claim to be made that we crossed the Singularity with GPT-3. Not full-on AGI or ASI, but that point of inability to predict the future. We have, in labs, all the pieces to produce an ASI. Full stop. It's just a matter of integration hell now.
But it's interactive
@@eSKAone-how is it interactive? From what I am seeing, it seems to be a system that can convincingly choose next frame outputs based off of current context and learned patterns, but it does this in raw chunks. You couldn’t just take over for the ai and do any particular action, because it doesn’t actually understand what that action is, it just knows what the next frame should look like. I think in theory there might be a way to train a model to understand how any particular user input would change game state and then output this as the new world, but that doesn’t seem to be what they are doing here. Disclaimer: I am going off of this video and did not read the paper, so if you have more info on this lmk.
@@jswew12 "GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU"
it's also flat out said in the video at 0:36 to 0:40
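For anyone wondering what "interactive" looks like in code terms, here is a toy sketch of the loop: the player's keypress and the last few generated frames are fed to the diffusion model as conditioning, and the next frame is denoised from scratch. This is only the interface idea; GameNGen's actual architecture, training, and denoising schedule are more involved, and every name below is made up for illustration.

    import torch

    def next_frame(denoiser, past_frames, action_id, steps=20):
        x = torch.randn_like(past_frames[-1])              # the new frame starts as pure noise
        for t in reversed(range(steps)):
            t_batch = torch.tensor([t])
            # the denoiser sees the noisy frame, the recent history, and the key being pressed
            predicted_noise = denoiser(x, t_batch, past_frames, action_id)
            x = x - predicted_noise / steps                # crude denoising update, for illustration only
        return x

    # game loop sketch: the keypress goes straight into generation
    # frame = next_frame(denoiser, history, action_id=KEY_FORWARD)
    # history.append(frame)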
I was there. I played Commander Keen when it was out for the first time. Duke Nukem too. And Lemmings.
As impressive as this is, it's still totally intuitive to anyone who has sunk hours (days) into these games. You develop an inner sense of the rules of the game. You know exactly how long an Imp's fireball will take to reach you, and can look at a room full of a hundred spawns of demons and instantly assess the relative threat level and form an attack plan that gives you the best chance of survival. If we accept that generative models can learn anything the average person can through repetition, it makes sense that they can not only learn to play the game, but *become the game*.
AI Doom could have also been the name of the Terminator movie.
When entire games can be created from a simple prompt, with extra prompts to tweak the gameplay, design and setting, the industry will die, but the true metaverse / oasis from Ready Player One will be born.
Imagine:
Create a Witcher 3 style game set in a 1:1 scale Middle Earth where you play as a young Aragorn learning to become a legendary warrior and experienced Ranger.
Or
Create a Gears of War style World War 2 game set in the Battle of Stalingrad, with both a German and a Russian campaign. The narratives are true to historical records. Make it epic, horrific, and tragic, with smooth gameplay and ultra-realistic graphics.
Loved Commander Keen's game-in-a-game on his wristwatch ❤
6:59 I think at every iteration it is told to make it more cat-like. It starts with complete noise, for the sake of originality. Humans would start with a blank paper.
Humans work with a white sheet of paper, slowly adding colors; the AI works from random noise, slowly adding and removing colors.
But here's the thing: the AI could work from a white page too, adding colors until it gets an image.
Humans will say "a cat is this shape," so they usually do an outline first, then work on the next feature, one or two at a time.
But the AI will often think "a cat is this shape or idea" and shade in the entire cat roughly all at once, outlining and filling at the same time, not concentrating on one detail at a time but on a tiny bit of the whole image.
It's like our minds: given 1 second to imagine a cat vs. 20 seconds, the picture gets more detailed almost instantly; we're just slower at putting it from mind to paper.
@@any1alive
Also... the AI has no frame of reference, while humans can simply find a cat picture.
Which is why every new iteration of AI has to begin from the ground up with a clean database, as not all iterations can use the same database, while humans can always go look at a cat picture.
Memory loss is a big issue with LLMs. They can create new art and solve problems, but they cannot replicate those solutions, always having to start from step 1, then step 2.
Another key idea here might be divide et impera. I guess telling the AI to turn complete noise into a cat will fail; it needs to focus on too much all at once. Telling it to do it by iteration, just making it slightly more cat-like for now, can make it happen. That's why it sometimes ends up with five paws: the fifth one was intended to be a shadow or something in the background, but the next iteration redefines its purpose, this happens repeatedly, and at some point it looks too paw-like to change it back into a shadow.
Why noise: Turning something into something else seems to be more feasible for AI than create from nothing. It's easy to predict what you'll type when there is a previous word. It's easier to predict what changes we should make to a noisy picture to correct it rather than adding things to an empty area. A cloudy sky helps you imagine shapes and animals, and a bit of photoshopping can turn it into actual animals. But no clouds -> no idea.
AI probably watched how humans draw but that wasn't helping it to learn image composition. Turning a more noisy image into a less noisy one is a good approach.
@@absolstoryoffiction6615 Yep, pretty much. It's easier to feed a human than an AI: if we see a weird-looking cat, it'll seem weird for a few seconds and then, yep, that's a cat. But if the AI sees a weird cat, it's not just a few seconds to learn and adapt its net, as it's usually pretty fixed.
And you can only store so much data in so many points/nodes/neurons or branches; that's where data programming and other scientific fields come in. Like, how many numbers can you count with 4 fingers? Well, 4, but you can also do 16 if you use them to count in binary, 81 if you count in ternary (3 states per digit), or 256 with 4 states per digit (counting the finger being down as 0). All that with just 4 fingers. But it's a matter of trying to remember or store more than that, or you can't without cramming... I went off on a tangent there and forgot the point I was gonna make lol.
But yeah, it's hard to store more information in a certain number of points; that's why larger models can fit more information, and it's a big deal when a small model is efficiently trained and works well.
@@Slaci-vl2io Yeah, that's usually the steps or iterations setting in some image generators: how many times to run the loop. Instead of one big hard step, you cut it up into many small ones.
When you asked to like the video I realized I unconsciously already had done that. Great channel.
I usually don't click your videos anymore because I do not want to reward clickbait titles. I am glad I made an exception here. I am sure you are using tons of AI tools to help write the script and for the video footage. But the production value of this video - apart from the interesting content - is still insane! Well done.
Ah the timeless benchmark for electronic devices. Amazing
Eventually AI will be able to generate classic games with today's graphics and functionality
It's a good start. It means people can design games without spending a fortune for the next 5 years.
Unless of course... Companies and Governments begin to rot it, as always.
@@absolstoryoffiction6615 They gotta protect their monopolies.
They'll probably make the barriers to entry expensive.
The backwards diffuser is like changing the direction of entropy. I know it's not actually doing that, but conceptually that's what it's like.
Well, it's almost exactly like doing that, only the laws of thermodynamics still hold, so you have to do work to get it to happen. AI consumes massive amounts of CPU cycles.
Building your own Doom maps is tons of fun and a great skill to learn. AI needs to stand back a bit and stop presuming what humans 'need'.
Imagine staring at the clouds or at the texturing on the wall for hours on end. You start to see things... shapes and patterns in the randomness. This is kinda like what's going on: multiple iterations of denoising that eventually look like the thing that was suggested (the prompt), the way you see things in the clouds.
Diffusion works by noticing features inside images as noise is added to them: patterns of colors and shapes. Think of a feature as a pixel made of pixels; like paint, by layering these features together it is possible to construct any image. The magic is in the labels on the input images, such as "man with a beard" and then "man with a cat": with enough samples, "man," "beard," and "cat" can each get their own features extracted, so that "cat with a beard" is possible to layer out of features.
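The way those labels usually get in is through a text encoder: the caption is turned into a set of feature vectors and the denoiser attends to them at every step, which is why features learned from "man with a beard" and "man with a cat" can be recombined into "cat with a beard". A generic sketch of that interface (all names are placeholders, not any specific library's API):

    def denoise_with_prompt(denoiser, text_encoder, prompt_tokens, x, t):
        text_features = text_encoder(prompt_tokens)       # caption -> per-token feature vectors
        return denoiser(x, t, condition=text_features)    # denoiser cross-attends to those features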
6:59: This method wouldn't work for us because we don't have perfect memory. The AI doesn't think about drawing the way we do: we need to remember the shape of the object, where a line goes, how long it is, and so on.
Amazing that Doom is being used once again in pioneering tech.
-What is my purpose?
-You play Doom.
-Oh my god.
Watch this become a DRM method. Instead of allowing downloads of your actual code and assets, you could train a model on your game and only release access to the AI model. Then no one can hack, mod, reverse engineer, etc.
Since everything is entropy and noise is an analogue for entropy,
all you need to do is reverse entropy by expending energy and that's how I comprehend it with my limited understanding.
With an advanced enough (atomic/molecular) AI it could theoretically transform any matter into any matter with sufficient energy/time.
You film yourself disassembling phones for thousands of hours. Then you reverse all the videos, and you train an AI on how to assemble an iPhone with that reversed footage.
That would be a closer analogy since they train the AI on the reverse process of noise to image. They add noise, then they reverse, then they train on the reversed "de-noising" process.
True... Do the same work motion over and over until you get it down to a tee, instead of throwing random data at it, which messes up the learning process.
Computerphile has great videos explaining diffusion! Can recommend!
The most interesting part of this is how the AI naturally learned to avoid something it should do because it's punished so heavily for it.
video game emulation is about to be insane.
Carefully curated, clearly classic
The most impressive thing is Doomguy's hand doesn't appear to contain extra digits. THAT is frightening progress.
This is so meta. AI that recreates the world in real-time as you are playing the game.
The more we advance in AI every day, the more I think that we are in a simulation. Whoever or whatever overlords put us in this sim, I think they're having a pretty good time watching us lmao.
I'm world building for a game and I have an idea.
I was recently using Talkie, Character AI, and some other app, and I was thinking: why not implement that tech in games?
So my idea is:
1. Build a world with deep lore.
2. Come up with stories in said universe, with no real main quest.
3. Make an AI (Game Master).
4. Let said Game Master make changes on the fly based on what the player does. This includes dialogue, choices from all NPCs, and so on.
I was also thinking, why not make a lot of different emotional animations and put them in a folder for the Game Master to pull from, to make the game feel more lively.
Building a sci-fi universe set in Sol, and the AI's name will be Sol as well. I believe this tech could work well. It does in Talkie and Character AI, so why not an actual game?
I got this idea when I had already built this universe within Character AI. I have not made it public because it's for me only, but that universe I made in Character AI is pretty fun. I made a smaller version in Talkie called Sol RPG or something like that; it's smaller because I had less space to work with in Talkie. I'm building the full world in Character AI and getting quest ideas as I go through it over and over again, so if I ever make the actual game I'll have a solid foundation.
Love the bullet holes appearing half way up every wall. Assuming training involved a lot of firing at walls :p
Keen is a reference to Commander Keen, a jump-and-run from 1990. The first game I watched my dad play.
When I was a child I used to draw squiggly lines all over a piece of paper, just to 'see' what shape or drawing I saw in the chaos. I figure AI learns or works like that too in this field.
Commander Keen was a hero of my childhood 😄
Holy shit... this is insanely accurate...
Very interesting video. Thanks for the content!