Accurately counting letters is architecturally impossible. The best the model can do is guess/estimate, and a correct guess does not make a model better or worse. Tokenization means the model cannot "see" any specific letters. It's like asking a blind person how many fingers you are holding up. But people seem not to know how current LLM architecture functions, so they are easily fooled by fitted responses.
Fair. I get the architectural limitations and how tokenisation also influences results. How do you think LLMs should address these types of tasks and other issues, or do you think it's simply not feasible with current architectures?
This is actually an extremely simple problem: the model just has to produce an output first in a scratchpad or working memory, then review that first response before replying to you, kind of like how humans say "uhh" before talking.
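To make that concrete, here is a minimal sketch of the draft-then-review idea, assuming a hypothetical call_llm(prompt) helper (a placeholder, not any particular vendor's API):

def call_llm(prompt: str) -> str:
    # Hypothetical helper: send a prompt to whatever model/API you use and return its text reply.
    raise NotImplementedError("wire this up to your model of choice")

def answer_with_scratchpad(question: str) -> str:
    # Pass 1: let the model think out loud in a scratchpad the user never sees.
    draft = call_llm(
        "Work through this step by step in a scratchpad, then give a tentative answer.\n\n" + question
    )
    # Pass 2: ask the model to review its own draft and correct any mistakes before answering.
    return call_llm(
        "Here is a draft answer:\n" + draft +
        "\n\nReview it for errors and reply with only the corrected final answer to: " + question
    )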
Impossible no, much more difficult yes. A blind person can't see, but they can feel how many fingers you're holding up. If you have two or more different parsed tokenizations of the same word, you can figure it out. Practical? Not really, unless you find a better implementation, but not impossible.
Guys, individual letters ARE tokens. You give it a word, it should have learnt somewhere which individual letters it is made of...
@@VinMan-ql1yu It could learn it for sure, but there is a different problem I try to highlight in the video: tokenization depends on context, and the understanding of those tokens may differ depending on that tokenization, the context, and the model's internal representations.
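For anyone curious what the model actually receives, here is a quick sketch using the tiktoken library with the cl100k_base encoding (the GPT-4-family tokenizer). The exact splits vary by tokenizer, but the point is that the input arrives as multi-character chunks rather than letters, and the chunks can change with surrounding context such as a leading space:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["strawberry", " strawberry", "How many r's are in strawberry?"]:
    token_ids = enc.encode(text)
    # Decode each token id individually to see the chunks the model "sees".
    chunks = [enc.decode([t]) for t in token_ids]
    print(repr(text), "->", chunks)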
Does Claude 3.5 Sonnet still hold the crown?
For my use cases, yes! Mostly doing stuff with code generation and reasoning, along with some vision capabilities.
Isn't the strawberry problem related to tokenization...? How could this be solved...?
I believe it is. I mention it later in the video.
@@elvissaravia what could be the possible fix?? preference optimization?
@@ritvikrastogi4912 A potential solution is the model being aware of its architecture and using tools, or just spelling words out every time it gets asked a question like that. If you ask it to spell out the word, it gets it right every time.
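As a rough illustration of the tool-use route, the counting itself can be handed to ordinary code so the model only has to decide to call it (the function name here is just an example, not any framework's actual API):

def count_letter(word: str, letter: str) -> int:
    # Deterministic letter counter a model could invoke as a tool instead of guessing.
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # prints 3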
It's very simple: they start with a scratchpad, which is not something you get from the API, like a working memory, and then the model is capable of reviewing it and getting the correct answer @@ritvikrastogi4912
@@ritvikrastogi4912 Hard to tell without actually running a robust set of experiments. I think eventually it will be fixed, either through brute-force preference optimization or maybe architectural novelties. Overall, I think this is an interesting area of research, in addition to understanding other quantitative tasks.
Weird, tomorrow they are going to announce GPT-4o-large
Is that confirmed or rumoured?
@@elvissaravia It is confirmed by that strawberry account, he said that it's going to happen on Thursday
@@Cine95 Where can I find the account you're referring to?
That's still a rumour, that's not "confirmation".
Don't trust him @@Cine95
How many r's are there in the sentence "how many r are there in the word strawberry"?
The answer is being hacked across all LLM vendors.
gpt-4o answered correctly when the prompt is phrased:
"How many r are there in the sentence 'how many r are there in the word strawberry'?
Perform step-by-step reasoning leading to the final answer."
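For the ground truth on the quoted sentence, a one-liner is enough (counting lowercase r's):

sentence = "how many r are there in the word strawberry"
print(sentence.count("r"))  # 7: one each in "r", "are", "there", "word", plus three in "strawberry"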
I assume you are not a coding engineer, because what kind of coder wants to see comments in code 😅
Haha, I am, and I do think commenting is important in large codebases. It depends on what kind of code you are referring to and what it is used for.
@@elvissaravia I get it, but I think overly commented code is not good. When ChatGPT first came out, it used to add too many comments to the code.
@@gerkim62 Agreed, overcommenting is a problem.
Only the latest gpt-4o and sonnet-3.5 answer the complex 5-candles riddle correctly with the following system prompt:
You are a logic and reasoning expert. Reason step by step leading to the final answer.