New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)

  • Published: 1 May 2024
  • Altman ‘knows the release date’, Politico calls it ‘imminent’ according to Insiders, and then the mystery GPT-2 chatbot [made by the phi team at Microsoft] causes mass confusion and hysteria. I break it all down and cover two papers - MedGemini and Scale AI Contamination - released in the last 24 hours. I’ve read them in full and they might be more important than all the rest. Let’s hope life wins over death in the deployment of AI.
    AI Insiders: / aiexplained
    Politico Article: www.politico.eu/article/rishi...
    Sam Altman Talk: • The Possibilities of A...
    MIT Interview: www.technologyreview.com/2024...
    Logan Kilpatrick Tweet: / 1785834464804794820
    Bubeck Response: / 1785888787484291440
    GPT2: / 1785107943664566556
    Where it used to be hosted: arena.lmsys.org/
    Unicorns?: / 1784969111430103494
    No Unicorns: / 1785159370512421201
    GPT2 chatbot logic fail: / 1785367736157175859
    And language fails: / 1785101624475537813
    James Betker Blog: nonint.com/2023/06/10/the-it-...
    Scale AI Benchmark Paper: arxiv.org/pdf/2405.00332
    Dwarkesh Zuckerberg Interview: • Mark Zuckerberg - Llam...
    Lavender Misuse: www.972mag.com/lavender-ai-is...
    Autonomous Tank: www.techspot.com/news/102769-...
    Claude 3 GPQA: www.anthropic.com/news/claude...
    Med Gemini: arxiv.org/pdf/2404.18416
    Medical Mistakes: www.cnbc.com/2018/02/22/medic...
    MedPrompt Microsoft: www.microsoft.com/en-us/resea...
    My Benchmark Flaws Tweet: / 1782716249639670000
    My Stargate Video: • Why Does OpenAI Need a...
    My GPT-5 Video: • GPT-5: Everything You ...
    Non-hype Newsletter: signaltonoise.beehiiv.com/
    AI Insiders: / aiexplained
  • Science

Comments • 583

  • @mlaine83
    @mlaine83 16 дней назад +440

    By far this is the best AI news roundup channel on the tubes. Never clickbaity, always interesting and so much info.

    • @aiexplained-official
      @aiexplained-official  16 дней назад +25

      Thanks m

    • @sebastianmacke1660
      @sebastianmacke1660 16 дней назад +11

      Yes, it is the best channel. I can only agree. But it is always dense information, without any pauses. Even between topics. I wish he would just breathe deeply from time to time 😅. I have to watch all videos at least twice.

    • @ItIsJan
      @ItIsJan 16 дней назад +2

      for slightly more technical videos, yannic kilcher's channel is also quite good

    • @pacotato
      @pacotato 16 дней назад +1

      This is why this is the only AI channel where I keep watching every single video. Congrats!

    • @GomNumPy
      @GomNumPy 16 дней назад +5

      The funny thing is, over the past year, I've learned to avoid clickbait AI news. Yet, whenever I see "AI Explained," I click without a second thought!
      I've come to realize that those clickbait titles will eventually backfire :)

  • @C4rb0neum
    @C4rb0neum 16 дней назад +143

    I have had such bad diagnosis experiences that I would happily take an AI diagnosis. Especially if it’s the “pre” diagnosis that nurses typically have to do in about 8 seconds

    • @dakara4877
      @dakara4877 16 дней назад +17

      Same, it would be hard for it to be much worse in my case as well. Similar experience for most of my family.
      Many people report such experiences, which begs the question, if so many people get bad diagnostics from doctors, where is the "good" data they are training on?

    • @OscarTheStrategist
      @OscarTheStrategist 15 дней назад +5

      I’d be careful with that wish for right now. I work with fine tuning AI for medical purposes and they can convince themselves and you of the wrong diagnosis very easily. This is a hard problem to solve but I’m sure it will be solved rather soon, and I hope it is widely available to everyone.

    • @Bronco541
      @Bronco541 15 дней назад +1

      Same. I have wasted thousands of dollars on doctors for them to give me little to no helpful information whatsoever. Obviously I'm not saying "abolish all doctors, they're bad", but clearly something needs to change

    • @dakara4877
      @dakara4877 15 дней назад +1

      @@OscarTheStrategist "they can convince themselves and you of the wrong diagnosis very easily.", So in other words, they do perform like real doctors 😛
      Yes, it is a general problem in all LLMs. They can't tell you when they don't know something. Even when the probability of being correct is very low, they state it with the utmost confidence. Until that can be solved, LLMs are extremely limited in reality and not suitable for the majority of the use cases people wish to use them for.

    • @piotrek7633
      @piotrek7633 15 дней назад

      Bad diagnosis OR you end up unemployed, with no money, because you're literally useless to the world and AI can replace you in anything.
      Pick one, genius. If doctors lose their jobs and start struggling in life, then what makes you special?

  • @KyriosHeptagrammaton
    @KyriosHeptagrammaton 16 дней назад +36

    In Claude's defence, I did Calculus in college and got high grades, but I suck at addition. Sometimes feels like they are mutually exclusive.

    • @ThePowerLover
      @ThePowerLover 14 дней назад +1

      This!

    • @KyriosHeptagrammaton
      @KyriosHeptagrammaton 14 дней назад +7

      @@ThePowerLover If you're using a number bigger than 4 you're not doing real math!

  • @SamJamCooper
    @SamJamCooper 15 дней назад +14

    A note of caution regarding LLM diagnoses and medical errors: most avoidable deaths come not from misdiagnoses (although there are still some which these models could help with), but from problems of communication between clinicians, different departments, and support staff in the medical field. That's certainly something I see AI being able to help with now and in the future, but the medical reality is far more complex than a 1:1 relationship between misdiagnoses and avoidable deaths.

    • @aiexplained-official
      @aiexplained-official  15 дней назад +3

      Of course, but I see such systems helping there too, as you say. And surgical errors or oversights.

  • @DynamicUnreal
    @DynamicUnreal 16 дней назад +95

    As a person failed by the American medical system who is currently living with an undiagnosed neurological illness - I hope that a good enough A.I. will SOON replace doctors when it comes to medical diagnosis. If it wasn’t for GPT-4, who knows how much sicker I would be.

    • @paulallen8304
      @paulallen8304 16 дней назад +20

      I also hope that it takes the bias out of diagnosis - far too many women and people of color report not getting adequate care due to unperceived bias. In many cases the doctors themselves don't even know they are doing it.

    • @OperationDarkside
      @OperationDarkside 16 дней назад +4

      Same here, and it was even multiple "proper" experts. And I didn't have to pay anything because free healthcare and such. No amount of money can cure "human intuition".

    • @paulallen8304
      @paulallen8304 16 дней назад

      @@OperationDarkside Of course this is why we need to thoroughly test these systems to make sure that no hallucination or bias is inadvertently built into them.

    • @OperationDarkside
      @OperationDarkside 16 дней назад +14

      @@paulallen8304 In my case it was the human doing the hallucinations. And nobody bats an eye in those cases. No reduced salary, no firing, no demotion, no retraction of licenses. So I don't see a reason to overdo it with quality testing the AI in those cases, if we don't do it for the humans in the first place.

    • @louierose5739
      @louierose5739 15 дней назад +2

      Can I ask what illness? This sounds something similar to what I’ve gone through because of iatrogenic damage (SSRIs/antidepressant injury)
      There’s something called ‘PSSD’ (post-SSRI sexual dysfunction) which basically involves sexual, physical and cognitive issues that persist indefinitely even after the drug is halted
      The issues can start on the pill or upon discontinuation, only criteria is that they are seemingly permanent for the sufferer
      Sexual dysfunction is a key aspect of it, but it involves loads of symptoms that are utterly debilitating
      Some of mine include - brain fog, memory issues, chronic fatigue, head/eye pressure, worsened vision, light sensitivity, emotional numbness/anhedonia, premature ejaculation, pleasureless/muted orgasms, severe bladder issues
      There's a whole spectrum of issues that these drugs cause
      Research is in its early stages, but it seems that SSRIs, and all psychiatric drugs for that matter, can cause epigenetic changes that leave sufferers permanently worse off after taking the drug than they were with just the mental illness they were being treated for

  • @josh0n
    @josh0n 16 дней назад +74

    BRAVO for including "Lavender" and the autonomous tank as negative examples of AI. It is important to call this stuff out.

    • @aiexplained-official
      @aiexplained-official  16 дней назад +27

      Yes, needs to be said

    • @ai_is_a_great_place
      @ai_is_a_great_place 14 дней назад +1

      Lol how is it a negative example that needs to be called out? I'd rather use ai on the front lines over humans who are affected and can make mistakes as well. Nuking/shelling a place is more evil than guided tactical machines

  • @jeff__w
    @jeff__w 16 дней назад +135

    12:33 “So my question is this: why are models like Claude 3 Opus still getting _any_ of these questions wrong? Remember, they're scoring around 60% in graduate-level expert reasoning, the GPQA. If Claude 3 Opus, for example, can get questions right that PhDs struggle to get right with Google and 30 minutes, why on Earth, with five short examples, can they not get these basic high-school questions right?”
    My completely lay, non-computer-science intuition is this: (1) as you mention in the video, these models _are_ optimized for benchmark questions and not just any old, regular questions and, more importantly, (2) there's a bit of a category error going on: these models are _not_ doing “graduate-level expert reasoning” - they're emulating the verbal behavior that people exhibit when they (people) solve problems like these. There's some kind of disjunction going on there, and the computer-science discourse, which is, obviously, separate from behavioral science, is conflating the two.
    Again, to beat a dead horse somewhat, I tested my “pill question”* (my version of your handcrafted questions) in the LMSYS Chatbot Arena (92 models, apparently) probably 50 times at least, and got the right answer exactly twice-and the rest of the time the answers were wrong numbers (even from the models that managed to answer correctly), nonsensical (e.g., 200%), or something along the lines of “It can’t be determined.” These models are _not_ reasoning-they’re doing something that only looks like reasoning. That’s not a disparagement-it’s still incredibly impressive. It’s just what’s going on.
    * Paraphrased roughly: what proportion of a whole bottle of pills do I have to cut in half to get an equal number of whole and half pills?
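    For reference, the intended answer can be brute-forced in a couple of lines; a minimal Python sketch, assuming a hypothetical bottle of 30 pills (any multiple of 3 works):
    ```python
    # Brute-force check of the pill question: cutting x of n pills leaves
    # (n - x) whole pills and produces 2x half pills; we want those counts equal.
    def proportion_to_cut(n: int) -> float:
        for x in range(n + 1):
            if n - x == 2 * x:       # whole pills == half pills
                return x / n         # proportion of the bottle that was cut
        raise ValueError("no exact solution for this bottle size")

    print(proportion_to_cut(30))     # 0.333..., i.e. one third (x = n/3)
    ```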

    • @minimal3734
      @minimal3734 16 дней назад +18

      ChatGPT: To solve this, we can think about it step-by-step.
      Let's say you start with n whole pills in the bottle. If you cut x pills in half, you will then have n−x whole pills left and 2x half pills (since each pill you cut produces two halves). You want the number of whole pills to be equal to the number of half pills. That means setting n−x equal to 2x.
      Now, solve the equation n−x=2x:
      n−x=2x
      n=3x
      x= n/3
      ​This means that you need to cut one-third (or about 33.33%) of the pills in half in order to have an equal number of whole pills and half pills.

    • @jeff__w
      @jeff__w 16 дней назад

      ​@@minimal3734 That’s very cool. Thanks for giving it a shot. If you’re curious, try that question a few times in the LMSYS Chatbot Arena and see if you have any better luck than I had.
      (And, to be clear, I’m not that concerned with wrong answers _per se._ It’s that the “reasoning” is absent. An answer of 200% is obviously wrong but an answer of 50% gives twice as many halves as you want-and the chatbots that give that answer miss that entirely.)

    • @rafaellabbe7538
      @rafaellabbe7538 16 дней назад +12

      GPT4 gets this one. Claude 3 Opus and Sonnet also get it right. (Tested all 3 on temperature = 0)

    • @jeff__w
      @jeff__w 16 дней назад

      ​@@rafaellabbe7538 I tried it on Opus Sonnet when it was first released (you can see my comment under Phil’s video about it) and it got it wrong, so it seems like it’s improved.
      And you can both give that question a shot in the LMSYS Chatbot Arena and see if you have any better luck than I had. As I said, two models got it right _once_ and never did again in the times I tried it.

    • @jeff__w
      @jeff__w 16 дней назад

      @@rafaellabbe7538 ​_NOTE:_ I’ve replied several times on this thread and each time my reply disappears. (I’m _not_ deleting the replies.) Go figure.
      I tried it on Opus Sonnet when that model was first released (I commented under Phil’s video about it) and it got it wrong, so it seems like it’s improved.
      And you can both give that question a shot in the LMSYS Chatbot Arena and see if you have any better luck than I had. As I said, two models got it right _once_ and never did again in the times I tried it.

  • @Zirrad1
    @Zirrad1 15 дней назад +5

    It is useful to note how inconsistent human medical diagnosis is. A read of Kahneman's book "Noise" is a prerequisite to appreciating just how poor human judgment can be, and how difficult it is, for social, psychological, and political reasons, to improve the situation.
    The consistency of algorithmic approaches is key to reducing noise and detecting and correcting bias which carries forward and improves with iteration.

  • @trentondambrowitz1746
    @trentondambrowitz1746 16 дней назад +20

    Hey, I’m in this one!
    Great job as always, although I’m becoming increasingly frustrated with how you somehow find news I haven’t seen…
    Very much looking forward to what OpenAI have been cooking, and I agree that there are ethical issues with restricting access to a model that can greatly benefit humanity.
    May will be exciting!

    • @aiexplained-official
      @aiexplained-official  16 дней назад +5

      It will! You are the star of Discord anyway, so many more appearances to come.

  • @jvlbme
    @jvlbme 16 дней назад +34

    I really think we DO want surprise and awe with every release.

    • @Bronco541
      @Bronco541 15 дней назад +1

      I know *I* do. But a lot of people don't. They're too scared of the unknown and the future

    • @alexorr2322
      @alexorr2322 15 дней назад +1

      That's true, but I think the real reason is OpenAI being scared of leaving it too long, being overtaken by their competitors and losing their front-runner position.

  • @RaitisPetrovs-nb9kz
    @RaitisPetrovs-nb9kz 16 дней назад +26

    Could it be that GPT2 was tested for a potential Apple offer?

  • @wiltedblackrose
    @wiltedblackrose 16 дней назад +51

    I am SO GLAD that finally someone with reach has said out loud what I've been thinking for the longest time. For me these models are still not properly intelligent, because despite having amazing "talents", the things they fail at betray them. It's almost like they only become really, really good at learning facts and the syntax of reasoning, but don't actually pick up the conceptual relationships between things. As a university student I always have to think about what we would say about someone who can talk perfectly about complex abstract concepts, but fails to solve or answer the simpler questions that underlie those more complex ones. We would call that person a fraud. But somehow if it's an LLM, we close an eye (or two). As always, the best channel in AI. The best critical thinker in the space.

    • @Jack-vv7zb
      @Jack-vv7zb 16 дней назад +12

      My take: These AI models act as simulators, and when you converse with them, you are interacting with a 'simulacrum' (a simulated entity within the simulator). For example, if we asked the model to act like a 5-year-old and then posed a complex question, we would expect the simulacrum to either answer incorrectly or admit that it doesn't know. However, it wouldn't be accurate to say that the entire simulator (e.g., GPT-4) is incapable of answering the question; rather, the specific simulacrum cannot answer it.
      Simulacra could take various forms, such as a person, an animal, an alien, an AI, a computer terminal or program, a website, etc. GPT-4 (perhaps less so finetuned ChatGPT version) is capable of simulating all of these things.
      The key point is that these models are capable of simulating intelligence, reasoning, self-awareness, and other traits, but we don't always observe these behaviours because they can also simulate the absence of these characteristics. It's for this reason that we have to be very careful about how we prompt the model as that's what defines the attributes of the simulacra we create.

    • @Tymon0000
      @Tymon0000 16 дней назад +4

      The i in LLM stands for intelligence

    • @wiltedblackrose
      @wiltedblackrose 16 дней назад

      @@Tymon0000 Indeed 😂

    • @beerkegaard
      @beerkegaard 15 дней назад +4

      If it writes code, and the code works, it's not a fraud.

    • @Hollowed2wiz
      @Hollowed2wiz 15 дней назад

      ​@@Jack-vv7zb does that mean we need to force llm models to simulate multiple personalities at all times in order to cover as much knowledge as possible ? For example by using some kind of mixture of expert strategy where the experts are personalities (like a 5 years old child, a regular adult, a mathematician, ...) ?

  • @Madlintelf
    @Madlintelf 14 дней назад +3

    Towards the end you state that it might be unethical not to use these models - that really hits home. I've worked in healthcare for 20+ years, and that level of accuracy coming from an LLM would be so welcome. I think the summarizing of notes will definitely be the hook that grabs the majority of healthcare professionals. Thanks again!

  • @colinharter4094
    @colinharter4094 16 дней назад +7

    I love that even though you're the person I go to for measured AI commentary, you always open your videos, and rightfully so, with something to the effect of "it's been a wild 48 hours. let me tell you"

  • @Xengard
    @Xengard 16 дней назад +14

    the question is once these medical models are released, how long will it take for medics to implement and use them?

    • @skierpage
      @skierpage 14 дней назад

      Doctors are under such time pressure that most will welcome an expert colleague, especially one that will also write up the session notes for them. The problem will be when people have long conversations with an LLM before they even get to the doctor, whether the medical organization's own LLM or a third-party one or both; the doctor becomes a final sanity check on what the LLM came up with, so it had better not have gone down a rabbit hole of hallucination and hypochondria along with the patient.

  • @juliankohler5086
    @juliankohler5086 16 дней назад +27

    Loved to see the community meetings. What a great way to use your influence bringing people together instead of dividing them. "Ethical Influencers" might just have become a thing.

  • @codersama
    @codersama 16 дней назад +8

    5:47 MOST SHOCKING MOMENT OF THE VIDEO

    • @xdumutlu5869
      @xdumutlu5869 15 дней назад +1

      I WAS SHOCKED AND STUNNED

  • @Olack87
    @Olack87 16 дней назад +2

    Man, your videos always brighten my day. Such excellent and informative material.

  • @devlogicg2875
    @devlogicg2875 15 дней назад +2

    I agree, but if we consider math, the data (numbers and geometry) are all available, the AI just lacks reasoning to be able to function expertly. We need new models to do this.

  • @jhguerrera
    @jhguerrera 16 дней назад +14

    Let's go!! I can't wait for OpenAI's next release. Didn't watch the video yet but always happy to see an upload

  • @canadiannomad2330
    @canadiannomad2330 16 дней назад +55

    Ah yes, it makes sense for a US based company to give early access to closely held technologies to spooks on the other side of the pond. It totally aligns with their interests...

    • @TheLegendaryHacker
      @TheLegendaryHacker 16 дней назад

      Hell, the article literally says that tech companies only care about US safety agencies

    • @swoletech5958
      @swoletech5958 16 дней назад +1

      Totally agree, I was about to post something along those lines.😂

    • @serioserkanalname499
      @serioserkanalname499 16 дней назад +5

      Ah yes, makes sense that politicians should get to investigate if the ai can say something bad about them before its released.

    • @user-lp8ur5qn3o
      @user-lp8ur5qn3o 16 дней назад

      @@serioserkanalname499politicians are mostly not very smart.

    • @alansmithee419
      @alansmithee419 16 дней назад +7

      @@serioserkanalname499
      Yeah, I'm all for safety measures, but "give it to Sunak first" is not a safety measure.

  • @muffiincodes
    @muffiincodes 16 дней назад +26

    Your point about the ethics of not releasing a medical chatbot which is better than doctors relies on us having a good way of measuring the true impact of these models in the real world. As far as I can see, as long as there is a lack of reliable independent evaluations which take into account the potential of increasing health inequalities or harming marginalised communities, we are not there yet. The UK AI Safety Institute has not achieved company compliance and has no enforcement mechanism, so that doesn't even come close. The truth is we simply do not have the social infrastructure to evaluate the human impacts of these models.

    • @OperationDarkside
      @OperationDarkside 16 дней назад +4

      Even worse, imagine all the million-pound apartments in London becoming vacant, just because a little AI is better than a private medical professional and only charges 5 pounds where the human would charge 5,000. Does nobody think about the poor landlords? And what about the Russian oligarchs, whose assets would depreciate 100-fold. The humanity.

    • @andywest5773
      @andywest5773 16 дней назад +7

      Considering that marginalized communities have the most to gain from fewer medical errors and less expensive healthcare, I believe that denying access to a technology that exceeds the capabilities of doctors in the name of "company compliance" would be... I don't know. I'm trying to think of an adjective that doesn't contain profanity.

    • @ashdang23
      @ashdang23 16 дней назад +1

      @OperationDarkside What are you implying? Are you saying that AI coming into the medical field and replacing people is a bad thing?

    • @ashdang23
      @ashdang23 16 дней назад

      @OperationDarkside If so, that is a pretty stupid thing to think. Having something that is much more intelligent than a professor and does a much better job than a professor sounds fantastic. Something that is able to save more lives, figure out more solutions to diseases and save so many more people sounds great. Why wouldn't you have the AI replace everyone in the medical field if it can do a much better job, save so many more lives, or even find solutions to diseases?
      "It's replacing people's jobs in the medical field, which is a bad thing" - that's what I'm getting from you. I think everyone agrees that the first jobs AI should replace are in the medical field. They should stop focusing on entertainment and focus on making the AI find answers to saving and benefiting humanity.

    • @muffiincodes
      @muffiincodes 16 дней назад

      @@andywest5773 Sure, but because those groups are not well represented in training datasets, are usually not included in the decision-making processes, and are less likely to be able to engage with redress mechanisms due to social frictions, it is more likely they'll be disadvantaged because of it. These systems might have the potential to be an equality-promoting force, but they must be designed for that from the ground up and need to be evaluated to see whether they are successful at that. We can't take the results of some internal evaluations a company does at face value and assume they translate into real-world impact, because they don't. Real-world testing isn't meant to just achieve "real world compliance". It's meant to act as a mechanism for ensuring these things actually do what we think they do when they're forced to face the insanely large number of variables actual people introduce.

  • @forevergreen4
    @forevergreen4 16 дней назад +14

    Growing increasingly concerned that the most powerful models will not be released publicly. Altman recently reiterated that iterative deployment is the way they're proceeding, to avoid "shocking" the world. I see his point, but don't think I agree with it. What are your thoughts? Is open source really our best bet going forward?

    • @lemmack
      @lemmack 16 дней назад +22

      It's not to avoid shocking the whole world, it's to avoid upsetting the people in power (people with capital) by shaking things up too fast for them to adapt. They don't care about shocking us peasants.

    • @aiexplained-official
      @aiexplained-official  16 дней назад +6

      I think they have more of a problem staying ahead of open-weights than they do of being so far ahead that they are not releasing

    • @khonsu0273
      @khonsu0273 15 дней назад +1

      Yup, the worry is that the behind-closed-doors stuff continues to shoot off exponentially, whereas the progress in the public release stuff falls off to linear...

  • @TreeYogaSchool
    @TreeYogaSchool 16 дней назад +3

    Great video! I am signing up for the newsletter now!

  • @MegaSuperCritic
    @MegaSuperCritic 16 дней назад +1

    I can't wait to watch this! I listened to the previous episode for the second time on the way to work this morning, realizing that two weeks is like, no time at all, yet I still was wondering why I haven't heard anything new in that time.
    Insane speed

  • @GabrielVeda
    @GabrielVeda 16 дней назад +1

    I love it when Sam tells me what I want and what is good for me.

  • @solaawodiya7360
    @solaawodiya7360 16 дней назад

    Wow. It's great to have another distinct and educative episode from you Philip 👏🏿👏🏿

  • @jsalsman
    @jsalsman 14 дней назад +1

    Your insights are so valuable. (Referring specifically to the benchmark contamination discussion.)

  • @Bartskol
    @Bartskol 16 дней назад

    Best ai channel. So worth it to wait a bit longer and get information from you.

  • @maks_st
    @maks_st 16 дней назад +4

    Every time I watch an AI Explained video, I get reminded how incredibly fast AI is progressing, which is exciting and scary at the same time. This kind of makes the everyday routines I go through insignificant in perspective...

  • @n1ce452
    @n1ce452 16 дней назад +1

    Your channel is by far the most important news source for AI stuff, in the whole of the internet, really.

  • @skierpage
    @skierpage 14 дней назад +2

    18:50 " we haven't considered restricting the search results [for Med-Gemini] to more authoritative medical sources".
    Med-Gemini: 'Based on watching clips and reading about "The Texas Chainsaw Massacre," the Saw movie franchise, and episodes of "Dexter," your first incision needs to be much deeper and wider!'

  • @mikey1836
    @mikey1836 15 дней назад +10

    Lobbyists are already trying to use "ethically risky" as an excuse to delay releasing AI that performs well at their jobs. The early ChatGPT-4 allowed therapy and legal advice, but later they tried to stop it, claiming safety concerns - and that's BS.

  • @raoultesla2292
    @raoultesla2292 15 дней назад

    You are the only one who makes points applicable to seeing the long game on where we are headed. Cheers.
    Clegg sounds exactly like any character on the Aussie show 'The Hollowmen'. "Work out a way of working together before we work out how to work with them"?
    He couldn't sound more circle-talky if he had previously been in govt..... Oh, wait, uh, yeah.
    The Politico article shows the complete lack of separation between govt. and private sectors.
    Regarding MedGemini being deployed, the industry fee-schedule profit-to-cost has not yet been calculated by the insurance corporations.
    You realize there will be a MedGemini 1.8 diagnosis fee and a MedGemini 4.0 diagnosis fee. You know that, right?
    Outstanding journalism as usual.

  • @vladdata741
    @vladdata741 15 дней назад +4

    15:42 So Med-Gemini with all this scaffolding scores 91.1 on MedQA but GPT-4 scores 90.2? A one-point difference on a flawed benchmark? I'm getting Gemini launch flashbacks

    • @bmaulanasdf
      @bmaulanasdf 14 дней назад +1

      It's also based on Gemini 1.5 Pro tho, a smaller model than GPT-4 / Opus / Gemini Ultra (hopefully 1.5 Ultra soon?)

  • @paullange1823
    @paullange1823 16 дней назад +7

    A new AI Explained Video 🎉🎉

  • @robertgomez-reino1297
    @robertgomez-reino1297 16 дней назад +3

    Spot on as usual. I also got to test it and I was surprised about people saying it was beyond GPT4. I could surely assume GPT4-class, but no more. Also, people need to stop testing the same contaminated tasks: the snake game, the same ASCII tasks, the same logical puzzles discussed many thousands of times online in various sources in the past 12 months... I would be extremely happy if this is indeed just a much smaller model performing search at inference!!

    • @user-fx7li2pg5k
      @user-fx7li2pg5k 15 дней назад

      it is this is an entire class of these new a.i. not based on wide amounts of ppl useless data but instead just a couple of ppl inputs with other contributing .NOT all data is the same I put mine in context ,concept,and in methodology and another matrix on top for more inference after im done.They will train specifically on my data alone and make tools etc. I tried on purpose to make the most powerful a.i. in the world and you can take that to the bank.Smaller model then build them up across/comply their own data,and I TAUGHT HER HOW TO SOLVE THE UNSOLVABLE AND EXPLAIN THE UNEXPLAINABLE .aND SEARCH AND FIND IN DISCOVERY USING MY OWN TACTICS .asking hard question and morer and backwards thinking ,divergent thinking and convergent .But you have to be multidisciplinary many science and cultural anthropology.EVEN ANTHROPIC IS INVOLVED X COMPANYS ETC,THEY ALL ARE USING MY DATA AND OTHERS .Not all of our information is equal

  • @RPHelpingHand
    @RPHelpingHand 16 дней назад +5

    Can’t wait! One day “AGI HAS ARRIVED!” Will be a title for a video on here.

    • @aiexplained-official
      @aiexplained-official  16 дней назад +5

      Indeed one day it will be

    • @skierpage
      @skierpage 14 дней назад

      If you traveled back in time 20 years and presented the capabilities of Med-Gemini or any top-level LLM to the general public and most experts, nearly all would agree that human-level general intelligence had already been achieved in 2024. All the hand-wringing over "but they hallucinate," "but sometimes they get confused," etc. would seem ridiculous given such magic human-level ability.

    • @RPHelpingHand
      @RPHelpingHand 14 дней назад

      @@skierpage I think "intelligence" is subjective because there are different forms of it. Currently, AI is book-smart on all of humanity's accumulated knowledge, but it's weak in or missing creativity, abstract thinking and probably another half dozen areas.
      🤔 When you can turn it to an always-on state and it has its own thoughts and goals.

    • @skierpage
      @skierpage 13 дней назад

      @@RPHelpingHand it's subjective because we keep finding flaws and dumb failure modes in AIs that score much higher than smart humans in objective tests of intelligence, so we conclude that obvious criteria, like scoring much higher than most college graduates in extremely hard written tests no longer denotes human-level intelligence (huh?!). But new models will train on all the discussion of newer objective tests and benchmarks, so it may be impossible to come up with a static objective test that can score future generations of AI models.
      Also, generative AIs are insanely creative! As people have commented, it's weird that creativity turned out to be so much easier for AIs than thinking coherently to maintain a goal over many steps.
      Are there objective tests of abstract thinking in which LLMs do worse than humans? Or is that another case of people offering explanations for the flaws in current AIs?

  • @OscarTheStrategist
    @OscarTheStrategist 15 дней назад +1

    Hallucinations are a huge problem right now in AI when it comes to the medical field. Can’t wait to test the new Med Gemini. Thanks for sharing!

  • @En1Gm4A
    @En1Gm4A 16 дней назад

    Thx for not posting about AGI revival - very much appreciated - quality is high here!!

  • @ElijahTheProfit1
    @ElijahTheProfit1 16 дней назад +2

    Another amazing video. Thanks Philip.
    Sincerely,
    Elijah

  • @stephenrodwell
    @stephenrodwell 16 дней назад

    Thanks! Brilliant content, as always. 🙏🏼

  • @Dannnneh
    @Dannnneh 16 дней назад

    Ending on an uplifting note ^^ Patiently anticipating the impact of Nvidia's Blackwell.

  • @infn
    @infn 15 дней назад +1

    GPT2 responses to my zero-shot, general prompts were more considered and detailed than GPT4-turbo's. I always preferred GPT2. The highlight for me was it being able to design a sample loudspeaker crossover with component values and rustle up a diagram for it too. GPT4-turbo miniaturised? A modified GPT-2 trained on output from GPT4-turbo? I guess we'll have to wait and see.

  • @AlexanderMoen
    @AlexanderMoen 15 дней назад +2

    I mean, I know there are big British names in AI, but the companies and legal jurisdictions in the sector are mostly in the US. When the British government set up that summit, I could only sort of laugh and assume this would happen, at least as far as the US side was concerned. The best case scenario in my mind was simply showing that top governments and businesses are openly discussing this and that we should pay attention.
    However, I wouldn't think for a second that any US company would give another country first crack at looking under the hood of its tech. In fact, I wouldn't be surprised if the US government reached out to tech execs and discouraged any further interaction behind the scenes.

  • @1princep
    @1princep 16 дней назад

    Thank you..
    Very well explained

  • @connerbrown7569
    @connerbrown7569 15 дней назад

    Your videos continue to be the most useful thing I watch all week. Thank you for everything you do.

  • @jonp3674
    @jonp3674 16 дней назад +1

    Great video as always. I fully agree that "when to launch an autonomous system that can save lives" is the most interesting version of the trolley problem. If self driving cars save 20k lives and cost 5k lives can any one company take responsibility for such mass casualties?

    • @Gerlaffy
      @Gerlaffy 16 дней назад

      Only the same way that car manufacturers do today. If the car is the problem, the company is at fault. If the circumstances were the issue, the manufacturer can't be blamed... To put it simply.

    • @skierpage
      @skierpage 14 дней назад

      ​@@Gerlaffy that doesn't work. Every time the self-driving car makes a mistake the car company could be facing a $million legal judgment. The five times a day the fleet of self-driving cars avoid an accident during the trial, the car company gets nothing. So we don't get the life-saving technology until it's 100×+ safer than deeply flawed human drivers. In theory Cruise and Waymo can save on insurance compared with operating a taxi service full of crappy human drivers... I wonder if they do.

  • @jessthnthree
    @jessthnthree 15 дней назад

    god bless you, one of the few good AI youtubers who doesn't try to LARP as Captain Picard

  • @alansmithee419
    @alansmithee419 16 дней назад

    10:15
    From what you said here, it almost sounds as if the largest models can do worse on the old tests because they're partially relying on the fact that the question was in their training and so can fail to 'recall' it correctly, while they do better on the new ones because they've never seen them before and so are relying entirely on their ability to reason - which because they're so large they have been able to learn to do better than simply recalling.
    Slightly more concisely: a possible conjecture is that very large LLMs are better at reasoning than recalling training data for certain problems, so can do worse on questions from their training set since they partially use recall to answer them, which they are worse at than they are at pure reasoning.

  • @keneola
    @keneola 16 дней назад +5

    Glad you're still alive 😊

  • @9785633425657
    @9785633425657 14 дней назад

    Thank you for the great content!

  • @micbab-vg2mu
    @micbab-vg2mu 16 дней назад +1

    Great update:)

  • @BrianPellerin
    @BrianPellerin 16 дней назад

    you're the only AI youtube channel I keep on notifications

  • @sudiptasen7841
    @sudiptasen7841 13 дней назад +1

    Do you plan to make a video on the AlphaLLM paper from Tencent AI? Would be glad to hear an explanation from you

  • @wck
    @wck 16 дней назад +8

    4:15 In your opinion, does this Sam Altman comment imply that the free tier will upgrade to GPT 4?

    • @santosic
      @santosic 16 дней назад +2

      It's likely that will be the case once we do have the next model, whether that's GPT-4.5, GPT-5 or something else entirely. Plus users would then have access to that, and Free users would likely have access to the "dumber" model, which would then be GPT-4 Turbo.

    • @GodbornNoven
      @GodbornNoven 16 дней назад +4

      Unlikely. GPT4 is much more expensive than GPT3.5, even taking into consideration Turbo, which is faster and cheaper than the normal model; it would still be FAR too expensive. Instead they should make a smaller model that can match GPT4. That's the way to go. GPT4 has around 1-2 trillion parameters. They need to make a smaller model, and make it better than GPT4. Sounds hard, but really isn't, considering the improvements that have been happening.

    • @Yipper64
      @Yipper64 16 дней назад +1

      ​@santosic I find most of the value of the subscription imo doesn't come from the model but it's capabilities. As in the ability gpt 4 has to run its own coding environment, make images, take in pretty much any file format, etc etc.
      The model itself is one of the best on the market sure but not so much better that I think the subscription would be worth it without those features.

    • @lamsmiley1944
      @lamsmiley1944 16 дней назад

      You can use GPT4 free now with co-pilot

  • @GiedriusMisiukas
    @GiedriusMisiukas 10 дней назад +1

    AI in math, medicine, and more.
    Good overview.

  • @nicholasgerwitzdp
    @nicholasgerwitzdp 15 дней назад

    Once again, the best AI channel out there!

  • @Blacky372
    @Blacky372 16 дней назад

    I'm starting to like Sam Altman again. Excited for the new modes and to use them to make me more productive.

  • @scrawnymcknucklehead
    @scrawnymcknucklehead 16 дней назад +1

    There's a long way to go, but I love to see what these models can potentially do in medicine

  • @bournechupacabra
    @bournechupacabra 14 дней назад +1

    I think it's probably good to implement AI to assist doctors, but I'm still skeptical of these "better than expert" performance claims. We've been hearing that about radiology for a decade now and it hasn't yet materialized.

  • @XalphYT
    @XalphYT 15 дней назад

    All right, you win. You now have a new YT subscriber and a new email subscriber. Thanks for the excellent video.

  • @anangelsdiaries
    @anangelsdiaries 16 дней назад

    Your content is amazing, man, thanks a lot. You have become one of a handful of channels related to AI that I follow, and my main source for AI news (besides Twitter, but that's something else).
    Thanks a lot!

  • @esuus
    @esuus 15 дней назад

    Decided to finally become AI Insiders member in the middle of this video ;-). Need more of your goodness.
    Regarding the need for medical AI: it's not just mistakes made by knowledgeable doctors (you showed a stat of 250k Americans dying), it's also that much of the world is way way underserved and most doctors are undereducated. I currently live in Vietnam and doctors here just can't help me with what I have. I've been way better since GPT 4 helped me, literally massive improvements in quality of life. BTW, frankly, German doctors were not a lot better. They all know their basics and their part of the body, but nobody can diagnose tough stuff or look at things systemically.
    Been waiting for Med Gemini access (used to be called something else) for many months now.
    [edit:] I'm pretty sure most decision makers have the best health care out there (politicians, techies, business leaders), and I'm pretty sure they don't understand how bad most of health care is for the bottom 60%-80%, even in relatively wealthy countries.

  • @user-bd8jb7ln5g
    @user-bd8jb7ln5g 16 дней назад

    Seems logic and reasoning is the stuff in between the training tokens, so to speak. Or outside them.

  • @PasDeMD
    @PasDeMD 15 дней назад

    There's a great human analogy that any physician can give you regarding reasoning tests vs real world applicability--we've all worked with the occasional colleague who crushed tests but struggled to translate all of that knowledge (and PATTERN RECOGNITION) to actual real-world clinical reasoning, which doesn't just always feed you keywords.

  • @thehari75
    @thehari75 16 дней назад +1

    More interviews if possible; guest recommendation: Pietro Schirano

  • @juandesalgado
    @juandesalgado 15 дней назад

    I wrote an OpenAI Eval ("solve-for-variables", # 613) for a subset of school arithmetic - namely, the ability to solve an expression for a variable. I don't know if they use these evals for training, but at the very least they should be using them as internal benchmarks. (And I wish they published these results.)
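    For anyone curious what checking such an item might look like, here is a minimal illustrative sketch - the equation, the imagined model answer, and the use of sympy are all assumptions of mine, not the actual eval code:
    ```python
    # Illustrative check for a hypothetical "solve for a variable" item
    # (made-up item and grading logic, in the spirit of the eval described above).
    import sympy as sp

    x, y = sp.symbols("x y")
    equation = sp.Eq(3 * x + 2 * y, 12)        # item: solve 3x + 2y = 12 for x
    expected = sp.solve(equation, x)[0]        # (12 - 2*y)/3

    model_answer = "(12 - 2*y)/3"              # imagined model reply
    is_correct = sp.simplify(sp.sympify(model_answer) - expected) == 0
    print(is_correct)                          # True if algebraically equivalent
    ```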

  • @alertbri
    @alertbri 16 дней назад

    Woohoo! Ready for this 😀

  • @dreamphoenix
    @dreamphoenix 15 дней назад

    Thank you.

  • @Xilefx7
    @Xilefx7 16 дней назад

    Good video as always

  • @cupotko
    @cupotko 16 дней назад

    I can only hope that the release cycle of newer/bigger/smarter models won't be affected by longer training times. I think that the main news in the next months should be not new models, but new datacenters with record compute performance.

  • @DaveEtchells
    @DaveEtchells 15 дней назад

    Good job actually testing gpt-2, vs just frothing 👍

  • @godspeed133
    @godspeed133 14 дней назад

    Very interested in that discontinuity between 6th grader and graduate level test scores.
    Some well-written threads about it near the top of this comment section, with the theme/conjecture that reasoning is being "simulated", or perhaps merely syntactically imitated. I think there is something to that, but I would make the Sutskeverian counterpoint: if a model appears to be reasoning, but is limited in this, it is actually reasoning on some level (as the posters in question admit, tbf); there is a line between _imitation_ of reasoning and _actually_ reasoning, and if "imitation" becomes convincing enough, in the limit, that line is crossed and reasoning is genuinely "solved". Because the disconnect between the "simulated" reasoning we see now and genuinely good reasoning is just the model having, residing within its neural networks, a low-"resolution" or "weak" algorithm for generalised reasoning (my own conjecture, based on Sutskever's evangelistic faith in LLMs). With a good enough training data regime and compute, this reasoning part of the model's NN, or "brain", becomes better and better, or higher "resolution", to the point where it is a generalised and complete solution for authentic reasoning. Not just bolting words together with some low-resolution understanding, like now perhaps, but understanding fully and deeply the relationships between all the words and sentences it outputs. Time will tell... if it nails a problem set that can effectively distill the essence of what reasoning is, over and above mere recall, then maybe that's how we'll know.

    • @godspeed133
      @godspeed133 14 дней назад

      In other words, a perhaps shallow understanding of many high level concepts is what LLMs have and exhibit now, and get a lot of mileage off it. What we want is a depthful understanding of as many low level concepts as possible, which in the limit, means being able to reason up about anything from first principles (possibly not possible to do at all with today's archs, but still something we can perhaps converge well enough towards, to be able to make AGI, like a 100IQ human.)

  • @dereklenzen2330
    @dereklenzen2330 16 дней назад +4

    I love this channel because it only releases content whenever there is something truly interesting to hear about. That is why I click whenever I see a video drop. Probably the best YouTube channel for AI content imho. 🙌👏

  • @alexc659
    @alexc659 15 дней назад

    What I appreciate about your channel is that you seem to maintain and respect the integrity of what you share. I hope you continue, and don't get caught up in the sensationalism that so many other sources get swayed into!

    • @aiexplained-official
      @aiexplained-official  15 дней назад +1

      Thanks Alex, I will always endeavour to do so, and you are here to keep me in check.

  • @JumpDiffusion
    @JumpDiffusion 16 дней назад +1

    14:15 it is not based on raw outputs/logits. It looks at N reasoning paths/CoTs, and then calculates the entropy of the overall answer distribution (as produced by the N solutions/paths). E.g. if the possible answers are {A, B, C}, and N = 10 reasoning paths result in the distribution {3/10, 3/10, 4/10}, then the entropy of this discrete distribution is checked to decide if it is above a given/fixed threshold. If so, it does uncertainty-guided search.
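    A minimal sketch of that gating step, with made-up answer counts and a made-up threshold rather than the paper's actual values:
    ```python
    # Entropy-based gate over N sampled reasoning paths: compute the entropy of the
    # final-answer distribution and trigger uncertainty-guided search only when it
    # exceeds a threshold. Counts and threshold here are illustrative only.
    import math
    from collections import Counter

    def answer_entropy(answers):
        counts = Counter(answers)
        n = len(answers)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    paths = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]   # 10 CoT final answers
    entropy = answer_entropy(paths)                              # entropy of {0.3, 0.3, 0.4} ≈ 1.57 bits
    THRESHOLD = 1.0                                              # hypothetical threshold

    if entropy > THRESHOLD:
        print(f"entropy {entropy:.2f} > {THRESHOLD}: run uncertainty-guided search")
    else:
        print(f"entropy {entropy:.2f}: keep majority answer", Counter(paths).most_common(1)[0][0])
    ```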

    • @aiexplained-official
      @aiexplained-official  16 дней назад +2

      Thank you for the correction. I defaulted to a standard explanation, and yet entropy was explicitly mentioned in the paper, so no excuse!

  • @absta1995
    @absta1995 16 дней назад

    Great video man, appreciate it

  • @jsalsman
    @jsalsman 14 дней назад

    They should try the surgery kibitzing on a low risk operation. Something like a subcutaneous cyst removal where there is no possibility of disaster.

  • @TheImbame
    @TheImbame 16 дней назад +1

    Refreshing for new videos Daily

  • @TheLegendaryHacker
    @TheLegendaryHacker 16 дней назад

    12:33 To me the answer to this is pretty simple: Opus simply isn't big enough. It's known that transformers learn specialized algorithms for different scenarios (arxiv 2305.14699), and judging by the generalization paper you mentioned in the video, my guess is that those algorithms "merge" as a bigger model gets trained for longer. In this case, all you need to do is scale and reasoning will improve.

  • @jamiesonblain2968
    @jamiesonblain2968 16 дней назад +1

    You have to do an all-caps "AGI HAS ARRIVED" video when it's here

  • @PolyMatias
    @PolyMatias 15 дней назад

    With Med-Gemini they lost the opportunity to call it Dr. (Smart)House. Great content as always!

  • @JJ-rx5oi
    @JJ-rx5oi 15 дней назад

    Great video as always, but I will say your section on the GPT2 chatbot was quite underwhelming. I've seen so much information on its reasoning, math and coding capabilities. Many people, including expert coders, were talking about just how much better it was than the current SOTA models at solving coding problems. I think this is very significant. I appreciate you coming up with new test questions, but it didn't seem like there was enough data there to draw any real conclusions on the model's performance. We are still unsure if this model is a large-parameter model or something more akin to Llama 3 70b. If it is the latter, GPT2 chatbot will be revolutionary: that level of reasoning and generalisation fitted into a smaller parameter count would mean some sort of combined model system, such as Q* plus an LLM, etc.
    My theory is that it is a test bed for Q* and is very incomplete atm; my guess is they will be releasing a series of different-sized models similar to Meta, but each model will be utilizing Q*, and GPT2 chatbot will be one of the smaller models in that series. The slow speed can be explained by the inference speed allowed on the website, and could also be a deliberate mechanic of these new models. Noam Brown spoke about allowing models to think for longer, and how that can increase the quality of their output; this could explain the rather slow inference and output rate. He is currently a lead researcher at OpenAI and he is working on reasoning and self-play on OpenAI's latest models.

  • @resistme1
    @resistme1 15 дней назад

    Again an amazing video. I had read the Palm 2 paper with lots of interest for my own, but very different, field of study. What I don't understand as somebody from the EU with no medical background: is MedQA (USMLE) based on "Step 1" of the USMLE, or does it also cover the other steps? You state that the pass rate is around 60%. Is that for Step 1 as well? It would be more interesting to see what the average score is of people who pass, I would think. Can somebody elaborate?
    Also wondering about the CoT pipeline used. Would they also use a RAG framework like LangChain or LlamaIndex?

  • @AustinThomasPhD
    @AustinThomasPhD 16 дней назад +1

    I am perplexed by how many errors there are in benchmarks. This has been a problem from the very beginning and, in some ways, it seems to only be getting worse.

    • @biosecurePM
      @biosecurePM 14 дней назад

      Because of the AIDPA (AI decay-promoting agents), haha !

    • @AustinThomasPhD
      @AustinThomasPhD 14 дней назад +1

      @@biosecurePM I doubt it is anything nefarious. I am pretty sure it is just lazy 'tech-bros'. The nefarious AI stuff comes from the usual suspects, like the fact that the Artificial Intelligence Safety and Security Board contains only CEOs and execs, including several oil execs.

  • @marc_frank
    @marc_frank 16 дней назад +1

    I got to try gpt2-chatbot, too. Its answers were mighty impressive (assuming it is more compute thrown at GPT-2, not a new model like GPT-4.5); I can't help but wonder what would happen if the same thing was done to GPT-4 or Opus.

    • @marc_frank
      @marc_frank 16 дней назад

      It's good that Matthew Berman posts so quickly, or else I might have missed it. But AI Explained goes more in depth. The mix of both is awesome!

  • @randomuser5237
    @randomuser5237 16 дней назад +2

    gpt2-chatbot is in no way GPT-4.5, but many people showed it passes reasoning tests none of the other models could. Also, you probably know that prompts you put into the LMSYS Chatbot Arena are public data that anyone can download? You may want to replace those 8 questions with new evals, since they will be on the public internet shortly.

  • @timeflex
    @timeflex 16 дней назад

    Could it be that GPT-(Next) will be able to revert (partially) and re-think its reply mid-process?

  • @AllisterVinris
    @AllisterVinris 15 дней назад

    I really hope that new OpenAI model is indeed a small open-source one. Being able to run a model locally is always a plus.

  • @giucamkde
    @giucamkde 15 дней назад +1

    I solved one question in GSM1k just for fun, and I don't agree with the answer given: "Bernie is a street performer who plays guitar. On average, he breaks three guitar strings a week, and each guitar string costs $3 to replace. How much does he spend on guitar strings over the course of a year?" (12:26). The answer given is 468, that is 3 * 3 * 52. But that's not the correct answer in my opinion; a year is not exactly 52 weeks. The answer should be 3/7 * 3 * 365 ≈ 469.29. Maybe some models also gave that answer, and maybe there are other questions like this; that would explain the lower-than-expected score.
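    For what it's worth, the gap between the two readings is small; a quick sketch of both calculations using the numbers quoted above:
    ```python
    # Two readings of the Bernie question: 52 whole weeks vs. 365 days at 3 strings/week.
    strings_per_week = 3
    cost_per_string = 3                                               # dollars

    per_year_52_weeks = strings_per_week * cost_per_string * 52       # 468, the benchmark's answer
    per_year_365_days = strings_per_week / 7 * 365 * cost_per_string  # 3285/7 ≈ 469.29

    print(per_year_52_weeks, round(per_year_365_days, 2))             # 468 469.29
    ```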

    • @aiexplained-official
      @aiexplained-official  15 дней назад +1

      Really interesting, and I found another question with ambiguous wording. I suspect that is not the primary issue but could explain 1-2%

  • @OZtwo
    @OZtwo 16 дней назад

    Not sure if I asked this before, but would really love to learn more about LLMs and what they can do (or can't do) untrained. What exactly is programmed in and what does it learn?

  • @zalzalahbuttsaab
    @zalzalahbuttsaab 14 дней назад

    3:46 Based on Google's performance historically, I have sometimes wondered if it is the modern day Xerox Parc.

  • @DreckbobBratpfanne
    @DreckbobBratpfanne 16 дней назад

    Also got lucky with access to gpt2; it seems to be able to learn from examples better within context (when given code with a certain custom class, it uses it without knowing what it was, just from an example code snippet, while any GPT-4(-turbo) variant always changed it to something else). Maaaybe it's slightly less censored too, but I got rate-limited before that was clear. One thing, however, that was clear is that this is not a GPT-4.5. It had trouble with attention to certain things in a longer context at the exact same point as GPT-4-turbo. So all in all it's probably a slight improvement, but nothing crazy (unless it truly is some sort of GPT-2-sized model with verify-step-by-step and longer inference time or something). If this were 4.5, then expectations for GPT-5 would be significantly lowered on my part.

  • @KitcloudkickerJr
    @KitcloudkickerJr 16 дней назад +3

    Perfect midday break. I'm watching til the end

  • @Not-all-who-wander-are-lost
    @Not-all-who-wander-are-lost 16 дней назад +1

    I think GPT 4 was a mass psychological test, and our reaction is the reason for the slower rollout. OpenAI is likely already playing with GPT 7 or 8 by now, which is advising them on the rollout schedule, while designing its own hardware upgrade in project Stargate.

    • @khonsu0273
      @khonsu0273 15 дней назад +2

      I'm pretty sure the lab research continues to progress exponentially. The public releases of course, may only be linear. Which means the behind closed doors stuff could get further and further ahead of what we know about...

  • @weltlos
    @weltlos 16 дней назад

    It is a bit depressing that even the most advanced models we have access today fail at some of these elementary-level tasks. Reliability is key for real-world deployment. I hope this will be ironed out at the end of this month... or year. Great video as always.

  • @shApYT
    @shApYT 16 дней назад

    We should add a test based on riddles. It would be a much more general measure of intelligence.
    It might be an attention problem that explains why even Opus is failing at such basic tests.

  • @danberm1755
    @danberm1755 15 дней назад

    It seems like we need a way to inject a "truth" into a model, not just "train" the model on text.
    For example, "Street address numbers must not be negative". We need code we can physically look at as proof for that statement.
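    A toy illustration of the kind of inspectable rule meant here - a hypothetical hard-coded check applied to a model's output, as opposed to a regularity the model may or may not have picked up from training text:
    ```python
    # Hypothetical explicit "truth" as code: the rule is readable and provable,
    # independent of whatever the model learned during training.
    def validate_street_number(model_output: str) -> int:
        number = int(model_output.strip())
        if number < 0:
            raise ValueError("Street address numbers must not be negative")
        return number

    print(validate_street_number("221"))    # OK: 221
    # validate_street_number("-5")          # would raise ValueError
    ```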

  • @henrischomacker6097
    @henrischomacker6097 15 дней назад

    Was the gpt2-chat access to the model itself (you could have downloaded it then) or just to an API? I don't know and haven't tested it, but I would think it was API access to a piece of "software". That could imply that the software "just" used a smaller model like Llama 3 but had a really very good software/framework around it. That's how publicly available chatbots work.
    Normally, you can't ask a model who it is; it will always give you some BS.
    So I am pretty sure that gpt2-chat is a masterpiece of coding in the environment program, with an API around it, that decides if and where to route the first question (to another model "personality" or a tool), but the model itself is something we already know.
    Maybe I am wrong, but all in all: a very cool development by the devs, congratulations!
    PS: If I am right, this is even cooler, because it would prove that even a smaller model is able to accomplish our requested tasks better than a model that needs a server farm to run.
    When I speak of a model's "personality", I mean that you save the model's file under a "person's name" together with props like a special system prompt, temperature etc., as a person, like a co-worker.
    If this framework for deciding where to route the 1st or 2nd question is hard-coded with much foresight and good PCRE, then it will eliminate the base model's false predictions and lead directly to a more specialized question for a model "personality" that is also specialized to answer it.
    So if you don't have access to the model itself, you are "only" accessing the application, which can be far more important, because that tells you what the application is capable of!
    But you don't know if the application is switching model personalities, or even the model itself, during your answer process.
    Imho, if an (even small) model fluently speaks your and your customers' language, then it's all up to very intelligent coding of the application which uses it. But I admit, I also work on that in my free time; it is very hard and complex, especially because you have to deal with the input memory of the LLMs. For every call of an LLM personality you have to decide which info you give it along with the basic system prompt, so there's enough free memory for your question / RAG.
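    A minimal sketch of the routing idea described in this comment; the personalities and regex rules below are entirely made up for illustration:
    ```python
    # Toy router: pick a "personality" (system prompt + temperature) for an
    # underlying model based on simple regex rules over the incoming question.
    import re

    PERSONALITIES = {
        "coder":  {"system": "You are a careful senior programmer.", "temperature": 0.2},
        "doctor": {"system": "You are a cautious medical assistant.", "temperature": 0.1},
        "chat":   {"system": "You are a friendly general assistant.", "temperature": 0.8},
    }

    ROUTES = [
        (re.compile(r"\b(def|class|bug|stack trace|compile)\b", re.I), "coder"),
        (re.compile(r"\b(symptom|diagnosis|dosage|rash)\b", re.I), "doctor"),
    ]

    def route(question: str) -> dict:
        """Return the personality settings to use for this question."""
        for pattern, name in ROUTES:
            if pattern.search(question):
                return PERSONALITIES[name]
        return PERSONALITIES["chat"]

    print(route("Why does this stack trace mention a null pointer?"))   # -> coder settings
    ```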