They never solved the hallucination problem with the current models. So do you really think they checked the correctness of 20 million prompts with humans before feeding the outputs back into the models? Right now it seems like they are throwing dice to see what happens, as they don't really know.
Good overview, but it feels a bit sparse. To make it perfect I'd want a super clear takeaway at the end :) You did cover: 1. Use of synthetic data. You called it fake in the thumbnail, which does not mean it's fake; it could be distilled. 2. You showed a Microsoft paper that is a counterargument to "human data good, synthetic data bad". The problem is that human data is pretty bad, so it should be defined, filtered, cleaned, and distilled to become good raw material for LLM training. Synthetic data isn't random BS; it's real data curated through existing LLMs. Does that make the data better? Not always, but Microsoft and others working on synthetic data are building pipelines that do indeed produce better-quality data this way than the original human data. There was also a paper from Meta researchers on making an LLM create a dataset for the next LLM, doing it three times in a row so that each previous model trains the next one, and each next model performed better on evals (a toy sketch of that loop follows after this thread). What this tells me is that human data is not good enough for this, and we are refining it this way. 3. You did mention concerns about synthetic data degrading model quality. But that is just a concern; there isn't really research that confirms it. In fact, there was research showing that collapse does not happen if you have a mix of synthetic and user data, and what the Meta paper showed is that there is no collapse even if it's 100% synthetic. The devil is in the details of how that synthetic data is created. If we give an LLM a dataset and ask it to filter it, but otherwise keep things verbatim, is that synthetic data? Not really, it's a subset of real data. Also, you look tired here. Take care of yourself! You are not an AI that can churn out such good videos without sleep :D
"1. Use of synthetic data, you called it fake in Thumbnail which does not mean it's fake. It could be distilled." that just means its concentrated slop
"Meta researches on making LLM create a Dataset for the next LLM and doing it 3 times in a row where the previous model trains the next model. And each next model had better performance on Evals." Evals do not = accuracy or quality. Thi si just a desperate attempt to keep the stock line going up. Its an utter fantasy.
@@piccalillipit9211 I struggled with this, but if it had said 70% false, that would be a new level. Fake does imply a copy passed off as the original; since it is synthesized from the original, I guess it's fine as long as it is not passed off as the original.
@@ben_kendall Yes, but it HAS to degrade. It has no intelligence and it has no creativity; it cannot take a bad painting and make a great one using it as inspiration the way an artist can. Until we get general AI it will always be a lower-quality rendition of the original data set WITH added hallucinations.
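A toy sketch of the iterative setup mentioned earlier in this thread (the Meta-style loop where one model builds the dataset that trains the next) might look like the following. Every name and function here is a stand-in I invented purely to show the shape of the loop, not anything from the paper:

```python
# Toy sketch of the "model N builds the dataset for model N+1" loop. There is no
# real LLM training or eval suite here; the functions are stand-ins.

def generate_dataset(model, n: int):
    # Stand-in: the current model rewrites/filters seed text into training samples.
    return [model(f"seed document {i}") for i in range(n)]

def train(dataset, generation: int):
    # Stand-in "training": returns a new model closed over the curated data.
    def next_model(prompt: str) -> str:
        return f"gen{generation} answer (trained on {len(dataset)} samples) to: {prompt}"
    return next_model

model = lambda prompt: f"gen0 answer to: {prompt}"   # generation 0 baseline
for generation in range(1, 4):                       # three rounds, as in the claim
    dataset = generate_dataset(model, n=5)
    model = train(dataset, generation)

print(model("What is synthetic data?"))
```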
This video could be like 2 mins long. Here's the whole video: 1. Synthetic data can now be used effectively in AI development. 2. Curators of the data control the model's bias. The end. Why did you just waste 10 mins?
@godago ok cool, thanks, because I am hoping one day AI will make video games, aka RPGs, and I always thought video game data from wikis was missing. That, and ChatGPT would get things wrong about games by guessing. So this will be needed.
This guy overestimates the value of human-generated data. As for “this isn’t the end all be all model” - well, that’s blatantly obvious and I haven’t heard anyone suggest it is. 🙄
@@godago yeah, it's a fine line I guess, and the line is somewhere else for everyone. I don't know what motivated me to comment that, but I'm sure it has little to do with your video. I apologize.
Hey Goda, really nice video! I was wondering if I could help you with more quality editing in your videos, make highly engaging thumbnails, and also help you with your overall YouTube strategy and growth! Please let me know what you think.
👉 To try everything Brilliant has to offer for free for a full 30 days, visit brilliant.org/godago. You'll also get 20% off an annual premium subscription.
The reason why 70% of the data is synthetic is obvious: it's data they didn't have the rights to use verbatim, so they used an LLM to summarise or rewrite it in a way where they can't get sued for using it.
It's also able to be formatted in a way that likely is better for training.
Over the past two years, I imagine they have done small-scale training tests to see how the density and style of text/tokens impacts the model.
Once they know what works best, synthesising ALL the data makes sense.
Hell, Claude 3.5 articulates many subjects better than many humans, even those who write books. If that's the case, then synthetic data is clearly the best way to train.
For example: fill a context window with a scientific paper, summarise the text, articulate it in layman's terms, and compare and contrast it against other similar papers to create a literature review on the topic. Synthetic data like that will be pretty useful.
Way to inject your bias without having a clue why synthetic data is actually useful
@geekswithfeet9137 can you explain then? In my understanding synthetic data can be more dense and structured.
Am I way wrong?
@@SearchingForSounds I was replying to OP, you’re fine, you at least somewhat get it.
It’s called distillation. You can train expert models specifically on that topic, and use it to “teach” other models in a concise, but dense way. Like a professor to a university student.
You can even have an intermediate model that specialises in translating to general models, like a high school teacher, that doesn’t even necessarily understand the subject entirely, but is good at rewording in a way to bridge the gap.
@geekswithfeet9137 gotcha. Yeah like Orca did with textbooks.
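A minimal sketch of the distillation idea described in this thread, assuming made-up stand-in functions for the teacher and the training pairs (no real model API is used):

```python
# Hypothetical sketch of distillation: an expert "teacher" produces dense,
# well-structured explanations, and a "student" model is then trained to
# reproduce them. teacher_explain() is a stand-in, not a real model call.

def teacher_explain(topic: str) -> str:
    # Stand-in for an expert model producing a concise, dense explanation.
    return f"Core ideas of {topic}, stated compactly with a worked example."

def build_distillation_pairs(topics):
    # Each (prompt, target) pair becomes supervised training data for the student.
    return [(f"Explain {t} to a student.", teacher_explain(t)) for t in topics]

pairs = build_distillation_pairs(["gradient descent", "tokenization"])
for prompt, target in pairs:
    print(prompt, "->", target)
```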
It doesn't matter whether the data is synthetic or not; what matters is the quality of the data. Only a human can truly check the quality with a degree of certainty, so the big question is not whether the data is synthetic but whether it has been verified and validated by a human.
I think the more humans put their fingers in data, the more corrupt and inaccurate it becomes.
A simple method to reach artificial superintelligence would probably be to let AI "clean up" the human-produced data, with all of its emotions and stupid human bias.
That's not true at all. For most tasks it is easier to check an answer than to produce a correct one.
So the prophecy will come true. They have created a cannibal going: "Give me data, I need data."
I think "accuracy" of the Data, whether it is natural or synthetic, is far more important than how much Data you have. One accurate Formula beats a million inaccurate Formulas in the same Domain.
Yes, that is for sure a problem based on my tests (mostly physics questions). The internet is full of incorrect info and that gets integrated into the LLMs.
Maybe the problem could be solved by passing all the data through a chain-of-reasoning filter (sketched below, after this thread).
Yeah, and when dealing with AI, it tends to try to find the lowest common denominator. In many cases, this lowest common denominator tends to be incorrect info: it learns that it can just create information it can't link to a valid enough source, and models like Claude have a habit of generating sources to validate themselves, since they can't double-check information. This is the core of hallucinations, that and link rot. When you make AI a primary source, this lowest common denominator carries even more weight in your future iterations: it confirms that, yes, this thing that showed up only sometimes in the primary dataset also shows up even more in the dataset as a whole, which must mean it's the simplest path to the solution.
Synthetic data is sure to kill the model.
thank you for parsing this information and providing more quality information. Glad you are part of the few still making the active effort to add quality information to the Internet.
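A toy sketch of the chain-of-reasoning filter idea suggested earlier in this thread, for simple numeric questions only. The records and the recompute step are invented for illustration; a real pipeline would use a verifier model or a symbolic checker instead:

```python
# Toy "reasoning filter" over candidate training data: keep a record only if its
# claimed answer survives an independent recomputation. All records are made up.

records = [
    {"question": "What is 12 * 7?", "claimed": 84, "recompute": lambda: 12 * 7},
    {"question": "What is 9 + 10?", "claimed": 21, "recompute": lambda: 9 + 10},
]

def passes_filter(record) -> bool:
    # The claimed answer must match what we can independently recompute.
    return record["recompute"]() == record["claimed"]

clean = [r for r in records if passes_filter(r)]
print([r["question"] for r in clean])  # only the verified record survives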
It's not fake, it's synthetic; that's like saying that when you write things from your brain in your notebook, they're fake.
Synthetic becomes fake when the data created refers to something that is either fact or not fact, e.g. "Who is the 3rd president of the United States?" The factual answer is Thomas Jefferson. If an AI is used to create wrong data, such that the 3rd president is hallucinated as "Tony Blair", then the outcome of the AI's inference later would be wrong. This makes the synthetic data fake. If it is used to create art, or write stories, etc., then we can call it "synthetic" and move on. However, imagine using hallucinated data to train a medical diagnosis AI. That is dangerous. The data was fake. It should be factual, researched, verified and expertly labeled data.
I know, it’s a thumbnail :)
The thoughts you wrote in a notebook from your brain were real thoughts; the data in GPT-5 is synthetic because it wasn't made by an organic being.
@@godago clickbait
The notebook notes are from comedians?
If biases can be removed from the data and, with that, LLMs can excel, then synthetic data will be much better than the data we currently have. / According to ChatGPT: Based on these estimates, it could be calculated that approximately 60-80% of the information available on the internet may be influenced by some type of bias, whether conscious, unconscious, intentional, or derived from structural factors such as algorithms or economic pressures.
Has ai read every copyrighted book and article? Has it been given senses to perceive the entire universe? I think that there is still a ways to go.
The more things change, the more they stay the same. In this case.... garbage in/garbage out.
The Habsburg AI era.
Through user interactions they do fine-tune and tweak the data over time, and the internet always has more data, so it can be updated.
I think it's a silly idea - bastardised data, rather than synthetic. I think this is a technological dead end. But we'll find out soon enough.
Even human-generated data is terrible if we're talking about ethics, politics and areas of science where checking the results and the progress of research is very hard. It is much easier with engineering. Studies are flawed, there's no unbiased way to 'curate' say reddit and forum discussions and decide what is logical/correct, and now we'll get 'synthetic data'. IMO it is unlikely they'll use these models for code generation, engineering and similar tasks; it would become apparent very quickly what a bunch of nonsense it is. What they may end up doing is using, say, mixture of experts or similar to switch to GPT-4 or specialized models on the fly for engineering/coding tasks, then use the synthetic nonsense for fiction and propaganda. Maybe it could work for things like medicine too, because a good chunk of it is 'theories' anyway, which are basically pure speculation and educated guesses based on chemistry and our very limited understanding of human physiology.
Not sure why ethics would make data bad. There is a lot of data that is unethical, that doesn't mean it's fake. Jailbroken models that can answer any questions without restrictions prove this.
@@Leonhart_93 'Ethics' will not make data bad. You and I have 100% different views and understandings of what's ethical, what's not, and what ethics even is. It's a philosophical subject that has been discussed and debated since the dawn of humanity. You can't just 'objectively' program ethics and logic into 'AI'. Math has axioms; logic generally is way more complicated. In ethics you have terms like true or false, right or wrong, but to precisely define these you would need a framework and something like axioms where 'everyone' agrees on rules, terminology etc. Not going to happen, unless one single world view was enforced on everyone (so you would need an emperor, and a very crazy one; even within a single religion or philosophical doctrine that's normally not possible).
@@denissorn People these days have whole meltdowns over what's "ethical", and they make up their own definitions on the fly. Each and every one of them.
That's why I don't care about most of them; it's not relevant. The only valid approach is to answer with just the information the model has, ignoring any subjective ethics.
@@Leonhart_93 You seem to be confused about the way LLMs work. They just regurgitate training data (books, articles, forum/reddit discussions etc.) based on the data alone, plus weights that are adjusted by whoever is controlling the model. A model cannot 'objectively' provide information, unless you mean simply repeating the info (e.g. how does the first chapter of Moby Dick start).
@@denissorn I am not confused about anything. I locally run models with every single type of artificial alignment removed. They won't refuse to answer anything, as long as they have an answer in the data. But of course it doesn't influence hallucinations, just that they don't care about subjective ethics and artificial restrictions.
But this also proves that the data is not pre-selected to contain only ethical stuff, most likely because humans are too limited to filter it manually. They just add easily removable restrictions afterwards.
Do you think voice data that will come from people chatting to various voice bots or call-center automation may be useful?
The data set also includes a segment for user communication data, so I think text, but yeah, audio is key for emotional and sentiment analysis. These models are already trained on voice, and in future they will be trained on synthetic voice data. So yes.
@@godago thanks for taking the time to answer.
PS: once again, your videos are superb!
Calls may be recorded for TRAINING and quality purposes..... They tell you right at the start.
People don't want a tool that isn't doing its job or that is making decisions for them. Ideological limitations and using synthetic data are two of the biggest reasons why people avoid tools like GPT and why models like sus-column-r become so popular so fast despite their obvious downsides.
I still wonder if synthetic data is better or not. I am guessing it is. I could see an argument for how it could be structured in ways that increase the model's ability to learn. On the other hand, what if it's hard to learn any new insights from all that fake data?
They should fire that Murati for lying publicly.
They'd have to fire everyone who's spoken publicly, then.
When I ask GPT to summarize your video, do you think they'll give me an honest summary if it's biased against them?
It's not quite obvious to me whether this synthetic data is used in pre-training or in post-training. If it is used in post-training, then it not only makes perfect sense, it also solves the issue of having enough data to get the model to answer questions in a specific fashion. Synthetic data in pre-training, on the other hand, might decrease the quality of the overall system, since too much redundant data might boost the importance of certain neural paths too early in the model, even if much of that data was actually produced with a different LLM in the first place.
That's amazing. The quality of synthetic data and its impact on models concerns me quite a bit.
Synthetic data sounds a lot like it's just making things up.
That as well as simply rewriting stolen data to dodge lawsuits.
Learned SO much! Wow what an eye opener. Thank you for once again bringing the most important information to us, and explaining it in a way that is easy to understand. ❤
AI generated data is not FAKE.
Funny, I noticed she is from Germany almost immediately. I am using Ollama to learn German. So far it's a beautiful language, I love it so much. ❤😊
How does Alan know this? Is he working at OpenAI, or is that speculated data? If the latter, then relying on that as a data analyst is a no-go.
He provides all the sources and links to whether it was confirmed by a primary source (e.g. OpenAI) or is secondary (e.g. someone working there, or research papers). And if it is not confirmed but all signs point to it, he also transparently points that out.
@@godago Thank you for the clarification. I appreciate that.
@@godago thanks for that. Btw I realized that it’s best for me to delete my responses to the original comment because what I was saying was actually inaccurate
I think that with the recent large release trends, we have not yet reached diminishing returns with large data. We must go as far as we can.
We have been in a situation where a 10x increase in the training data is barely noticeable for 2 years. So we're definitely well into diminishing returns.
It's not necessarily a problem that data originates from AI. It may have been put in training sets intended for orchestration and "self"-correction.
This makes sense to me. Most LLMs are trained for one epoch: they only train on each chunk of data one time. Going through most of the data multiple times tends to result in overfitting. However, if you get an LLM to translate or summarize a bunch of human data in ten ways, then you can likely get the equivalent of 10 epochs' worth of training without running into the overfitting problem. Presumably LLMs are already being used to help curate the original human data too.
However, this is still technically synthetic data. One can further enhance the synthetic data by doing stuff like whatever Strawberry is doing, likely critiquing the synthetic data multiple times and improving it. Maybe you can use a slow/accurate LLM to train a faster LLM. Using large slow LLMs to fine-tune small fast LLMs for specific use cases works well; problems happen when you get stupid LLMs to train other stupid LLMs.
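A rough sketch of the "paraphrase instead of repeat" idea above. paraphrase() is a stand-in for an LLM rewrite call, and the styles are arbitrary examples, not a real recipe:

```python
# Toy sketch: instead of showing the same document for many epochs, generate several
# reworded variants and train on each once. paraphrase() is a stand-in, not an API.

def paraphrase(text: str, style: str) -> str:
    # Stand-in: a real pipeline would prompt an LLM to rewrite `text` in `style`.
    return f"[{style} rewrite] {text}"

STYLES = ["summary", "plain-language", "Q&A", "bullet notes", "formal"]

def expand_corpus(documents):
    # Each document yields the original plus len(STYLES) synthetic variants,
    # giving roughly that many "epochs" of signal from a single pass.
    for doc in documents:
        yield doc
        for style in STYLES:
            yield paraphrase(doc, style)

corpus = ["Gravity accelerates objects at about 9.8 m/s^2 near Earth's surface."]
for sample in expand_corpus(corpus):
    print(sample)
```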
If OpenAI is training itself on its own output that would mean it is being trained on hallucinations. What good is that?
Think about it as processing steps.
It has downsides and positives, but you can squeeze information in more efficiently.
Unfortunately, the noise equivalent for AI models is hallucinations, so training on compressed (partly pre-trained) data would obviously have that negative outcome, which they would have to deal with.
Really interesting video. With model collapse taking place with synthetic data, I always remember that when humans use synthetic data that data is now changed due to its use context. A poem posted on a LinkedIn profile means something quite different than a poem posted on a Facebook profile. The perceived intentions, reactions and environment of a piece of data affect its overall meaning, which means even two pieces of identical synthetic data might be captured with different associations in a training set. Alongside this, humans selecting to use synthetic data is functionally a form of reinforcement with human feedback, as humans are selecting from multiple pieces of synthetic data when deciding which one to use in their own work, which in effect creates a new layer of curation. That in-situ curatorial layer might also benefit these models in a similar way to standard RLHF, but in arguably a higher-quality way, as it takes place in a real-world environment.
I think synthetic data is needed especially for logical thinking. I think it is not necessarily used for information but for teaching thinking patterns, so that AI can better use the given context data. Maybe we are heading toward information-free AI, which is only trained on problem solving and logical thinking, so you can inject any information for it to work with.
If I understand correctly, it seems to me GPT-5 would be worse than GPT-4o. I assume it will have some strong hallucinations.
I already see weirder outputs with GPT-4o, especially if I get very specific on a topic.
I will know the result when I see GPT-5, but I won't hold my breath. They might not even want to release it if it's subpar.
Hello Goda, I enjoy your channel very much. At almost 65 years old, I've seen many technology transformations: from rotary telephones to FaceTime on an iPhone, from "don't talk to strangers" to hopping into a stranger's car with Uber. LOL. I'm fascinated with AI, perhaps more than other people in my age group. If I'm understanding your argument correctly, with all the electrical power requirements, expensive computer chips for faster processing, etc., very few companies (NVIDIA, META, GOOGLE, etc.) will have "control" of this original raw human data, and as you say, they have already produced regenerated "synthetic" data in GPT-5. This equates to the saying "garbage in, garbage out". I believe this synthetic data is useless. Even with ChatGPT-4, we should not blindly accept whatever it spits out as absolute truth, especially with critical data/information that many people will make life-altering decisions with. I'm not a programmer or engineer, but my gut tells me that with all these different models churning/mining this data, it will become more diluted as time goes by. Looking forward to your thoughts and those of your community on this topic. Thank you for sharing.
Synthetic data can be much better than human data because with synthetic data the training direction can be steered much more precisely. For example, for training on code, special ways of generating synthetic code might give the model much more diverse and clear data than human code. It's possible to generate indefinitely much code with variations, and these variations are not random but precise.
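A toy sketch of that kind of controlled synthetic code generation: small function variants where the operator and constant vary systematically, each paired with an input/output example that is grounded by actually executing the generated snippet. The whole setup is illustrative, not a real data pipeline:

```python
# Toy sketch of controlled synthetic code data: emit many small function variants
# and verify each example by running the generated code.

import itertools

OPS = {"add": "+", "sub": "-", "mul": "*"}

def make_examples():
    for (name, op), k in itertools.product(OPS.items(), range(1, 4)):
        src = f"def {name}_{k}(x):\n    return x {op} {k}\n"
        namespace = {}
        exec(src, namespace)                    # execute the generated snippet
        output = namespace[f"{name}_{k}"](10)   # ground the example by running it
        yield {"code": src, "input": 10, "output": output}

for ex in make_examples():
    print(ex["code"].strip(), "->", ex["output"])
```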
There is no upper limit to LLMs yet. There's a misconception about the "synthetic" data being used to train the next-gen LLMs. It's not like taking a photocopy of a photocopy of a photocopy... they're feeding them logic problems, and the LLM has to deduce the answer. Like the game Clue, where you figure out who the killer was.
Q* (Strawberry) likely uses Self-Taught Reasoner (STaR) and Monte Carlo Tree Search and is slow as shit sliding off a glacier...
However, it's REALLY good at figuring things out. It can find answers to things it never trained on. That's intelligence.
My understanding is that they decided to have Strawberry create a massive amount of logic problems to train GPT-5 (Orion). If you know the question, the answer, and the steps to get to the correct answer, then they can turn GPT-5 into one heck of a smart machine.
I've used Grok 2, and you can tell it uses something similar to STaR by reading its logic steps. I'm amazed @elon was able to put together the xAI cluster so quickly. Pretty sure he and his shareholders want a return on investment. As these things pay for themselves, it seems logical that you're going to see superintelligence within a few years. Clearly I'm bullish on AI for years to come. Buckle up.
A lot of the comment section doesn't understand this, but the video doesn't mention it either. I would have never found out if I hadn't read the comments that do.
Would've been much better off if all this wasn't delayed for half a century .
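A toy sketch of the "question + answer + reasoning steps" data described a couple of comments above: problems generated programmatically so that the answer and the derivation steps are known by construction and therefore checkable. The template is invented for illustration and has nothing to do with OpenAI's actual pipeline:

```python
# Toy sketch of self-checkable reasoning data: each item carries a question, the
# derivation steps, and the correct answer, all known by construction.

import random

def make_problem(seed: int):
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    question = f"Alice has {a} boxes with {b} apples each. How many apples in total?"
    steps = [f"Each box holds {b} apples.",
             f"There are {a} boxes, so total = {a} * {b}.",
             f"{a} * {b} = {a * b}."]
    return {"question": question, "steps": steps, "answer": a * b}

dataset = [make_problem(i) for i in range(3)]
for p in dataset:
    print(p["question"], "=>", p["answer"])
```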
Great video!
Thanks for the video, it might become a serious issue.
How about Scholar AI, since it claims that everything is based on peer-reviewed articles? Would it be better or more reliable compared to ChatGPT, or is it just another branch of AI where the accuracy of responses cannot be verified without additional investment, so it would be just an additional "service" doing the same job that ChatGPT does? And in relation to the topic of this video, I would agree that the quality of the data matters more; the question is really about verification of its quality.
Some kind of information law of thermodynamics?
But not all synthetic data is the same; some is specifically tailored for a training set, which can be better than natural diagnostic data due to the scientists' knowledge of the scenario at work.
Black gold is appropriate when the data is stolen without consent; when that data is bought from individuals, the epoch of golden data will begin.
Could they translate human data from other languages and use that? Of course that would eventually be finite too. It would also make the AI less biased towards specific cultures and prevailing viewpoints, which would be cool.
Garbage In > Garbage Out
Anyone else worried about the ethical implications of curating synthetic data? Feels like we're giving too much power to a select few to shape AI's 'knowledge'.
So how do they fight against model collapse if they use more and more synthetic data? And also if more and more of the internet, books, blogs, journals, etc. is itself AI-generated?
I believe in Alan Thompson; he is pretty conservative overall (and much less biased than most others). I like watching the AGI meter monthly to see if it gets that average 1% boost towards it lol
I believe the bubble is going to burst soon enough. As AIs start breeding AIs.
Hopefully GPT-5 won't work much better than GPT-4. But if it does, the military will be involved deeply.
Are you certain it is appropriate to use the words 'control' and 'influence'? Bias typically pertains to opinion and censorship. For instance, GPT-4 can perform numerical calculations with a precision of up to three digits. If you want GPT-5 to handle calculations with more digits by generating synthetic data, that wouldn't necessarily introduce bias.
The implicit biases present in research and history are not changed just because synthetic data generates alternative scenarios; no new interpretations of the existing data are introduced. In other words, synthetic data does not create new interpretations of existing data that were not already possible. Consider market data as an example. When we include synthetic data for options, we are not introducing a new bias that diverges from the underlying market movements. Instead, we are learning how risks or insurance are managed. Therefore, synthetic data generated from existing data does not create a new form of control or influence beyond what already exists.
well done fam. well done.
How about the Library of Congress, or the British Library - that's 300m titles.
I mean, I think it's cool: when you ask the chat what 2+2 is, it doesn't give you an answer based on what it finds on the internet, it calculates it itself. Same with some crazy complex things. I like it XD
What does it even mean for the data to be "synthetic"? You needed another AI to produce the data to feed to your real AI?
I asked the original ChatGPT how it would avoid disappearing up its own profundity, and it told me it had a GPT challenge partner to allow it to learn to recognize AI content, but it said in the same response that it hoped to be able to produce content that could not be recognized. AI has been good on hard data for years, but generative LLMs cannot follow this pattern. I think the synthetic data direction is BS, as is the degeneration into nonsense. What a probabilistic model cannot avoid is norming, i.e. tending toward a central response. That is bad enough with broadly generated AI content, but if you have the ability to flood the data input with curated synthetic content, you can direct LLM generation. That is very dangerous, as you can shift the norm in the way you want.
Seems like a big GIGO machine
Will it be any good? That's the question.
Great video, I am listening to it a few times and getting something new each time. I have been listening to David Shapiro for a while; he demonstrates why he is an expert on synthetic data. His recent video, "How OpenAI Strawberry Works ― "ONE TEXTBOOK TO RULE THEM ALL" ― My Educated Guess", is more about connecting the dots, like the human brain might, and writing the textbook of human knowledge, i.e. synthesizing it. So synthesizing, yes; some of it will be false, but I think it will be fine-tuned over time, like human-synthesized knowledge.
I had to revisit your video "In the age of AI, this is the new Black Gold" and I'm seeing things in a new light. Glad I came upon your video today!
Weird, since we have seen how bad things get when they use synthetic data?
This is probably just a marketing stunt. You can adjust the degree of importance of the data you input. I'd bet the 30% of human data is doing over 90% of the lifting in the actual model.
*CAMBRIDGE UNIVERSITY* on AI: Δp → 0 as $ → ∞
The change in performance tends towards zero as the cost tends towards infinity. The exact opposite of what the tech bros hoped for: they thought they would get exponential improvements, and instead they get the inverse of that. Exponentially declining improvements for exponentially increasing cost.
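One way to write down that kind of diminishing return is a power-law scaling curve; the form and symbols below are an illustrative assumption on my part, not figures from the Cambridge work:

```latex
% Illustrative assumption: performance P approaches a ceiling P_max as compute
% cost C grows, following a power law with constants a > 0 and alpha > 0.
P(C) = P_{\max} - a\,C^{-\alpha}
% The marginal gain per extra unit of cost then shrinks toward zero:
\frac{dP}{dC} = a\,\alpha\,C^{-\alpha-1} \longrightarrow 0 \quad \text{as } C \to \infty
```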
Hah, I wrote in my journal two days ago after watching an apparently old video regarding us running out of data, so I wrote about how we could proceed easily with Synthetic Data, well easily being, I have no f(((*****ng idea, lol but, a very LAY idea would be to have it create its data, have that data be scrutinized by teams of relevant researchers, provide feedback removing the poor synthetic scripts/dataz entirely, and yeah, that's kinda where my pups interrupted me so yeah, anywho. I mean, like, having all the data that it has, there's no reason for it not to be able to recreate variations of that which it already has and then check that output and make sure it makes sense, there's no hallucination or perhaps take the one with hallucination, clean it up with again relevant researchers hopefully paid a proper wage O.o ahem, and then train it on the new, perfected synthetic data source...
I just said the same thing twice, didn't I? I'm struggling with some severe brain fog, it's been 2 years now, I cannot wait until this long COVID ends... or my doctor takes me seriously and figures out what is really going on vs just using the new catchall...(unless it is just FOREVER-VID but HOLY CRAP I am incapable of NOT rambling... I just can't, NOT-ramble.... jfc lololol..) so yeah I apologize for writing like shih-tzu's.
Hate clickbait but hi you got me.
I think the pure scaling approach is just heading toward its climax.
Synthetic data is a solved problem; it is known and well understood by model makers. It's just the public that keeps incorrectly bringing up model collapse or degradation when it's irrelevant.
The ONLY two reasons why synthetic data could hurt a model are the shrinking distribution problem and the ungrounded data problem. The problem isn't synthetic data, but ungrounded synthetic data that's not representative of reality.
You can make infinite synthetic data if it's grounded and has a defined distribution that reflects reality. Humans almost exclusively train on our own grounded synthetic data. It's simply not a problem for models. Grounding a model will always make it better. It needs active ways to interact with reality and learn rather than just static data. It can even train on its own output if you ground the output and feed it back in. If the model is wrong or doesn't know something, tell it; it's literally just that simple. This is why math models are getting better so fast: you can easily check if the output was correct and feed that information back into the model (a rough sketch of that check-and-feed-back loop is below). Data is simply not a bottleneck we will ever have. Data is the hardest part of making a model, and we will always have problems curating it, but we will never run out of it.
The reason image model collapse happens is that an image model's output is a representation of reality, and is unaffected and ungrounded by reality. So you are trying to train an accurate representation of reality from an inaccurate representation of reality. You should have no reason to expect the image model's representation of reality to get better from doing this.
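A minimal sketch of the kind of grounding loop described above, using arithmetic as the easy-to-check domain (the function names and the fake model are my own stand-ins, not anyone's actual pipeline):

```python
# Sketch: "grounding" synthetic math data by checking every sample against an
# independent source of truth before it enters the next training set.
import random

def fake_model_answer(a: int, b: int) -> int:
    """Stand-in for a model's answer to 'a + b'. Sometimes wrong on purpose."""
    answer = a + b
    if random.random() < 0.2:          # simulate a 20% hallucination rate
        answer += random.choice([-1, 1])
    return answer

def ground_truth(a: int, b: int) -> int:
    """The independent check: here it's just real arithmetic."""
    return a + b

def make_grounded_sample(a: int, b: int) -> dict:
    """Keep a sample only if it matches reality; otherwise correct it."""
    proposed = fake_model_answer(a, b)
    truth = ground_truth(a, b)
    if proposed != truth:
        # One option is to drop the sample entirely; here we replace the answer
        # with the verified one, so the training pair stays grounded in truth.
        proposed = truth
    return {"prompt": f"What is {a} + {b}?", "answer": str(proposed)}

if __name__ == "__main__":
    random.seed(0)
    dataset = [make_grounded_sample(random.randint(0, 99), random.randint(0, 99))
               for _ in range(1000)]
    # Every pair in `dataset` now reflects reality: synthetic, but grounded.
    print(dataset[:3])
```

The only point of the sketch is the verify-before-training step; a real pipeline would swap the stub for a model call and a proper checker (a calculator, a unit test, a theorem prover, and so on).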
I'm trying to understand the concept of grounded synthetic data. If grounded means data that is representative of reality, ungrounded data would be hallucinations? Saying something is representative of reality is also vague because does that mean material facts or does it mean possible based on reality? I keep thinking of how individuals experience something (grounded data) but previous experiences impact both the present experience, what is learned, and how it is interpreted. Or maybe I'm making this harder than it is.
@@rileylavonne8863
"If grounded means data that is representative of reality, ungrounded data would be hallucinations?"
All human and model outputs are hallucinations. Some output is correct (is true), other output isn't. And we don't know what's true until we check.
Humans, since we live in the real world and get constant feedback, can check ourselves to some extent. Models experience no reality; they have no senses and don't interact with the real world at all, only with what is in their context. So they have no way to check their outputs against reality, and no way to check whether what they say, or the model of the world they build, is reflective of reality.
"Saying something is representative of reality is also vague because does that mean material facts or does it mean possible based on reality?"
I mean data that is true or accurate, and reflective of reality. With grounded data we really want as few interpretation steps as possible.
Interpretation steps are bad because a model never learns reality; like humans, it learns an internal abstraction of reality.
So if you make data with model A and train model B on that, model B learns an abstraction of model A's abstraction. If you keep doing that you get model collapse. But if you make data with model A, discard what's false, and change the output to be true and accurate, it's no longer just model A's abstraction of reality. It's now grounded in what's real and no different from human-generated data: it's representative of reality, not an abstraction of abstractions. (A toy simulation of this difference follows.)
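Since the shrinking distribution problem keeps coming up, here is a tiny toy simulation of the two loops described above (entirely my own illustration: the "models" are just Gaussians fitted to their training data). Retraining only on the previous model's samples slowly collapses the spread; mixing real data back in each generation keeps it anchored.

```python
# Toy model-collapse demo: ungrounded vs. grounded retraining loops.
import random
import statistics

TRUE_MEAN, TRUE_STD = 0.0, 1.0
N, GENERATIONS = 50, 500

def real_samples(n):
    """Fresh draws from the 'real world' distribution."""
    return [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(n)]

def train(data):
    """'Training' = fitting a mean and spread to the data."""
    return statistics.fmean(data), statistics.pstdev(data)

def run(grounded: bool) -> float:
    mean, std = train(real_samples(N))            # generation 0 sees real data
    for _ in range(GENERATIONS):
        synthetic = [random.gauss(mean, std) for _ in range(N)]
        if grounded:
            # anchor half of each generation's training data in reality
            data = synthetic[: N // 2] + real_samples(N // 2)
        else:
            data = synthetic                      # abstraction of an abstraction
        mean, std = train(data)
    return std

if __name__ == "__main__":
    random.seed(42)
    print("ungrounded final spread:", round(run(grounded=False), 4))  # shrinks toward 0
    print("grounded   final spread:", round(run(grounded=True), 4))   # stays near 1
```

Nothing about the toy is specific to Gaussians; it just makes the drift easy to see in a few lines.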
@@rileylavonne8863 "If grounded means data that is representative of reality, ungrounded data would be hallucinations?"
All human and model outputs are hallucinations. Some output is correct (is true) other output isn't. And we don't know what's true until we check.
Humans since we live in the real world and get constant feedback can check ourselves to some extent. Models experience no reality, they have no senses and don't interact with the real world at all, only what is in their context. So they have no way to check their outputs against reality. They have no way to check if what they say or the model of the world they build is reflective of reality.
"Saying something is representative of reality is also vague because does that mean material facts or does it mean possible based on reality?"
I mean data that is true or accurate, and reflective of reality. With grounded data we really want as little interpretation steps as possible.
Interpretation steps are bad because a model never learns reality, like humans it learns an internal abstraction of reality.
So if you make data with model A and train model B on that. Model B learns an abstraction of model A's abstraction. If you keep doing that you get model collapse. But if you make data with model A and discard the false and change the output to be true and accurate, it's no longer just model A's abstraction of reality. It's now grounded in what's real and no different than human generated data. its representative of reality, not an abstraction of abstractions.
They never solved the hallucination problem with the current models. So do you really think they checked the correctness of 20 million prompts with humans before feeding the outputs back into the models? Right now it seems like they are throwing dice to see what happens, as they don't really know.
It's called synthetic data fyi❤
AI: Advanced Ignorance.
do you know dani daniel
Good overview, but it feels a bit sparse. To make it perfect I'd want some super clear takeaway at the end :)
You did cover:
1. The use of synthetic data. You called it fake in the thumbnail, which does not mean it's fake; it could be distilled.
2. You showed a Microsoft paper as a counterargument to the idea that human data is good and synthetic data is bad. The problem is that human data is pretty bad, so it should be defined, filtered, cleaned, and distilled to become good raw material for LLM training. Synthetic data isn't random BS; it's real data curated through existing LLMs. Does that make the data better? Not always, but Microsoft and others working on synthetic data are building pipelines that do indeed produce better-quality data this way than the original human data.
There was also a paper from Meta researchers on having an LLM create a dataset for the next LLM, doing it three times in a row so the previous model trains the next one, and each successive model performed better on evals. What this tells me is that raw human data is not good enough on its own, and we are refining it this way.
3. You did mention concerns about synthetic data causing the degradation of model quality. But that is just a concern; there isn't really research confirming it. In fact, there was research showing that collapse does not happen if you have a mix of synthetic data and user data, and what the Meta paper showed is that there is no collapse even if it's 100% synthetic. The devil is in the details of how that synthetic data is created. If we give an LLM a dataset and ask it to filter it, but keep things verbatim otherwise, is that synthetic data? Well, not really; it's a subset of real data. (A rough sketch of that kind of filter-and-distill pipeline follows.)
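A minimal sketch of what such a curation pipeline could look like (the names and the trivial judge function are my own stand-ins, not Microsoft's or Meta's actual code): filter raw documents, keep the survivors verbatim, and optionally add a distilled rewrite next to them.

```python
# Sketch: curate raw human data by filtering, keeping survivors verbatim,
# and adding a "distilled" rewrite alongside them.
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    source: str          # "verbatim" or "distilled"

def judge_quality(text: str) -> bool:
    """Stand-in for an LLM-as-judge call; here just a trivial length/content check."""
    return len(text.split()) >= 5 and "lorem ipsum" not in text.lower()

def distill(text: str) -> str:
    """Stand-in for an LLM rewrite; here it only normalizes whitespace."""
    return " ".join(text.split())

def curate(raw_docs: list[str]) -> list[Sample]:
    curated: list[Sample] = []
    for doc in raw_docs:
        if not judge_quality(doc):
            continue                                   # filtered out entirely
        curated.append(Sample(doc, "verbatim"))        # still a subset of real data
        curated.append(Sample(distill(doc), "distilled"))  # the "synthetic" part
    return curated

if __name__ == "__main__":
    docs = ["lorem ipsum filler",
            "A short but real explanation of gradient descent in plain words."]
    for s in curate(docs):
        print(s.source, "->", s.text)
```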
Also, you look tired here. Take care of yourself! You are not an AI that can churn out such good videos without sleep :D
"1. Use of synthetic data, you called it fake in Thumbnail which does not mean it's fake. It could be distilled." that just means its concentrated slop
"Meta researches on making LLM create a Dataset for the next LLM and doing it 3 times in a row where the previous model trains the next model. And each next model had better performance on Evals."
Evals do not = accuracy or quality. Thi si just a desperate attempt to keep the stock line going up. Its an utter fantasy.
@@piccalillipit9211 I struggled with this too, but if it had said 70% false, that would be a new level. Fake does imply a copy passed off as the original; since it is synthesizing from the original, I guess it's fine as long as it is not passed off as the original.
@@ben_kendall Yes, but it HAS to degrade. It has no intelligence and no creativity; it cannot take a bad painting and make a great one using it as inspiration, as an artist can. Until we get general AI it will always be a lower-quality rendition of the original data set WITH added hallucinations.
how could synthetic data be good? What do we define as synthetic data? It sounds like it would be awful
Garbage in, garbage out
This video could be like 2 mins long.
Here's the whole video.
1. Synthetic data can now be used effectively in AI development.
2. Curators of the data control the model's bias.
The end.
Why did you just waste 10 mins?
Sure, thanks for the tip, next time I will make a video just like you said :)
To the person who made this video: are you saying that it has the whole internet in its training data?
Yes
@godago OK cool, thanks, because I am hoping one day AI will make video games, aka RPGs, and I always thought video game data from wikis was missing. That, and ChatGPT would get things wrong about games by guessing. So this will be needed.
Recursive JPEG 🖼️
You are calling synthetic data fake data? Can you prove that to be a true statement of fact, or is it only your opinion?
It's a thumbnail. To convey the difference between real and synthetic data, this was the easiest way. In the video I never called it fake once :)
@@godago ok my bad then. maybe you need to fix that
there's something about Eastern European women that is so nice. Even with big noses. they're so hot. screw yall. you agree.
This guy overestimates the value of human-generated data. As for “this isn’t the end all be all model” - well, that’s blatantly obvious and I haven’t heard anyone suggest it is. 🙄
Synthetic data is not exactly fake data. Oversimplifying for clicks is unethical, so I find this all hypocritical before even watching it.
I used a synonym in the thumbnail so the idea is easier to grasp.
@@godago Yeah, it's a fine line I guess, and the line is somewhere else for everyone. I don't know what motivated me to comment that, but I'm sure it has little to do with your video. I apologize.
GPT-5 will be like a schizophrenic patient mumbling a bunch of nonsense.
Oh my, what is he even saying?
*SYNTHETIC = SLOP* GPT-5 has got brain prions like computer BSE
Propaganda
First!
Congratulations. Take this 🏅
Hey Goda, really nice video! I was wondering if I could help you with more quality editing on your videos, make highly engaging thumbnails, and also help with your overall YouTube strategy and growth! Please let me know what you think?