Come to think of it, a lot of best-selling books have been published in dozens of languages. It might be a good evaluation for multilingual systems to find inconsistencies between versions that we already know of.
"Our Bloom model needs about 360 GB of RAM to run" - that's beyond practical! I have a Raspberry Pi 4B with 4GB RAM directly soldered on the mainboard. Should I hook another 500 GB external USB hard disk to it and create 500 GB swap on it? My second hard disk enclosure is USB 2.0 only, won't it slow down Bloom too much?
I have open-source GPT models on my local machine, but, I suppose because those models have to be compacted to use fewer parameters than the ones OpenAI provides in the cloud, they can't even speak Spanish very well. I would say Alpaca is about B2 in Spanish. They are basically only truly proficient in English. I agree language-specific models should be developed, as LLMs are a fantastic tool for conserving minority languages that are in danger of extinction for future generations.
This is so true. Like, why do we have to build a universal translator when we can just use GPT and create separate models catering to specific languages? That would be much, much more feasible imo
Accessibility is great and has to be pursued, but at the same time, how do you protect the language, its culture, and the people who produce/reproduce it? All these AIs make it so much easier to track and control what is said online, while deciding what can be read or watched and removing, hiding, or blocking access to any other information and certain languages. It's known that intelligence services in the '60s and '70s sent anthropologists to many third-world countries to learn about their languages and ideologies. We don't have access, but the data is there, and more than likely they're already using it.
Finally a good video on the topic of AI and LLMs. I actually might have found a solution to this: it's not just the data, there are other fundamental problems too (it's more related to indexing and embeddings).
I know so many non-native English speakers who put all their digital content in English to reach a wider audience, and as a result all of their content is in English and not their native language.
Android is an open-source OS. Good luck uninstalling OEMs. The notion that there is an abandoned supercomputer in Paris made me laugh. That's a bad start for a real open-source project.
Because AI relies heavily on written language. All the AIs would struggle to speak any Chinese language other than Mandarin, because the modern Chinese script only makes sense in Mandarin, giving the other Chinese languages less of an edge in the world of technology.
I think the real takeaway from this is that language changes and evolves, and some languages go extinct; it always has been that way and always will be. This is simply a new way languages are evolving.
Are there higher-value sources vs. lower-value sources? Like, are there tags for verified, trustworthy, or highly useful sources, or other important markers? Also, do they include dictionaries and thesauruses?
I was wondering how language models deal with primarily spoken languages. If we manage to add audio to the training dataset, it can make a huge difference. Some languages have a lot of variation between the spoken and written forms, or the colloquial and formal registers, such as Cantonese or Javanese... Also imagine if language models could output regional vocabularies and mimic regional accents... it would change translation apps, the way we learn languages, etc.... 😎
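(A minimal sketch of the speech-to-text side of that idea, assuming OpenAI's open-source whisper package; the audio filename is just a placeholder. The multilingual checkpoints already detect and transcribe dozens of languages.)

```python
# A sketch of getting text out of primarily spoken language data.
# Assumes `pip install openai-whisper`; "interview.mp3" is a placeholder.
import whisper

model = whisper.load_model("small")          # a multilingual checkpoint
result = model.transcribe("interview.mp3")   # detects the language by default
print(result["language"])                    # detected language code
print(result["text"])                        # the transcript
```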
I "invent" and app and train it to do well on a specific language, why am i forced to include others? just bc I do well and you want it too? shouldnt ppl of each language, if they want, develop their own app? and if they can't, why am I responsible to include them?
When speaking Ukrainian, ChatGPT sometimes tries to "invent" words taking its inspiration from Russian. Interestingly enough, these are really common words, you wouldn't blame the lack of data for those. And yeah, it's fair to say the model is still ahead of anything else out there.
This is the second of 5 videos where Phil examines the ins, outs, and struggles of AI! Join us every Tuesday in April for more. And watch the first one, on the difficult task of drawing hands, here: ruclips.net/video/24yjRbBah3w/видео.html
Love you vox for these great videos, keep making more!
You guys should not pin this, so as to allow more time for other people's comments to get screen time in a more favorable way.
He can examine my ins and outs
Is there Malayalam?
I imagine these models may actually help smaller languages to thrive. Imagine training them only enough to translate whole books with reasonable accuracy, keeping native speakers supplied with content.
I remember watching a video recently about a language which is only spoken. There's currently no way to write the language, so it'll be hard to train a model on those languages. Other than that I do agree with you. Even if it may not help all smaller languages to thrive, it may help some or a lot of them :)
Smaller languages might not have enough data points, considering how surprisingly inaccurate machine translation still is even for large languages with a ton of data.
It's an interesting idea but also a dangerous one. The AI is not a native speaker and might damage the language. This reminds me of the non-Scot speaker that wrote a lot of Scot-wikipedia based on what they THOUGHT Scottish was, and did a lot of damage there.
@@MateusSFigueiredo I was thinking the same thing. The language is called Scots btw.
exactly. this was my immediate reaction. we could use LLMs to sustain smaller languages.
As a fellow Jamaican software engineer, I am EXTREMELY proud of Ruth-Ann! 👨🏽💻🇯🇲
🇯🇲❤❤❤ it’s awesome that Patois is getting this love
Thats awesome
This is not just a problem for the AI. As a language nerd, I have faced this problem with low-resource languages before (namely Hawaiian, where I was born and raised). As much as the linguistic community likes to talk about saving endangered languages, I feel like too few people actually do the first, most important work: Write the language down! Write and write and write! I'm glad to see the people here are taking the work with AI seriously, and I look forward to the day chatgpt can help me learn some of the fascinating low-resource languages of the world.
@@TylerMarkRichardson probably easier than recording. You probably could pay people for recordings of their phone calls though. Or get them to donate them at the end of the call. Shady but easy
@@TylerMarkRichardson I'm not sure what you mean. Are you talking about historical Hawaiian? Because, they didn't have writing. But Hawaiian still exists, and if people intend to save it, it needs to be written down. And I'm not sure what you mean by "a modified version" since languages literally change every single day, so of course any Hawaiian written down now will be different from the past. That doesn't make it any less valuable to write down.
And also record the language in other ways, the sound, the body language, etc.
Language is much more than gets captured in written text.
I’m writing Hokkien and Henghua down
I've been using it to further my studies of Georgian, which is unfortunately difficult to find decent resources for. It's been quite helpful so far. I've tested some of the responses with my Georgian friend and he said it was pretty accurate. I look forward to seeing it continue to improve!
Some people are mentioning how AI can help small languages survive. It's an interesting idea but also a dangerous one. The AI is not a native speaker and might damage the language. This reminds me of the non-Scot speaker that wrote a lot of Scot-wikipedia based on what they THOUGHT Scots was, and did a lot of damage there.
Yes definitely
The AI speaks English pretty well, I guess it comes down to if they can obtain enough training data for the rare languages.
lol there are a lot of misconceptions out there. There are also people who aren't very clear on how Chinese languages work, and they get easily influenced by some people's agendas based on their perspective
Damage? So when non-speakers use your language, they damage your language?
@@seanleo1285 usually, no. But for endangered languages, maybe. Especially when it's not clear that it's a non-speaker speaking. Read about Scots Wikipedia and the damage done there.
As a native Filipino (Tagalog more specifically), I find it interesting because Filipinos sometimes mix both English and Filipino when speaking, so I can somewhat assume that some blogs or informal news sites could be used as part of the sample, and it could create problems for GPT-3/4 when understanding and deciphering which paragraphs/sentences are pure Filipino or "Taglish" (basically English and Filipino in the same sentence)
On a side note: Does anyone have a link to the dataset he used in the video? Because I am now curious about my own language and how it's used in the dataset
They had yet to replicate some of our 195 languages
@@guillerhonora717 exactly, which is why I'm interested on how they will translate those types of sentences for now.
Imagine Malaysia and Singapore
I don't think it would be a problem for GPT. The problem is only a lack of readily available data to feed it of written Filipino speech. If enough data were available, I'm sure GPT could master Tagalog while also understanding when to intersperse English or how to code-switch without getting confused. That is what GPT is best at. I'm surprised at how much of the common crawl is English, and even other big languages make up a small percentage.
My grandmother immigrated from Italy in the 40’s. This was before Italy pushed to unify the language across the country, so she spoke a regional dialect that no one in Italy speaks anymore. In addition, when she arrived, she had to learn English solely by immersion learning. So now when she speaks her Italian dialect, about 40% of it is just English words and phrases dropped into it. So there’s a mix of still-used Italian words, Italian words from a dead dialect, ‘Italianified’ English words (basically adding an -o or -a suffix), and just plain English words and phrases. There’s like a total of 20 people who maybe understand and/or speak this weird conglomeration of languages developed due to circumstance. There’s basically no way any AI would ever understand her.
That woman has amazing patience to go through those data sets. I wonder how they are addressing the regional differences just within England or the UK. In the North of England ‘pants’ are your jeans or joggers, whereas down south your ‘pants’ are your undergarments!
Dialect.
You are wondering how they handle Dialect.
I thought trousers are called pants in the U.S? What do you call pants up North?
@@jamesgreenldn pants are pants in the US. From the north to the south I've never heard someone in the US say trousers for non-comedic/sarcastic purposes.
@@andrewvirtue5048 yes, I know, thanks; I was asking the O.P.
No one in the north or south call them that. It's trousers as a general term everywhere in the UK with slight regional variations such as keks up north, or specific words for the type of garment like joggers or jeans.
Undergarments are your underwear, and most people in the UK call them pants(M/F)/panties(F) or knickers(older F).
I'm Catalan and just yesterday I was testing Chat GPT's Catalan capabilities, which made me wonder how well trained it had been. Catalan has a relatively strong online presence, as a lot of work has been put into this by the local government, so I thought it was trained on a lot more than just a few pages... Now I know
Same thing for me! When I heard it was trained on 140 pages of data I was like: "Això sí que no m'ho esperava" ("That I really wasn't expecting")
Unicode originally thought it only needed 65,536 code points to represent all the writing systems in the world. That was proven false, so they created new encoding forms and can now represent over a million code points. Researchers and language advocates continue to add obscure writing scripts to Unicode, like the Linear B ideograms, which are not yet fully deciphered but are now part of the Unicode character set. It only worked this way because Unicode is a non-profit entity. And that's what's needed for the large language models used in chat AI: a non-profit entity that can add even extinct languages to these models, one that is accessible to researchers and language advocates while at the same time setting the standard for how large language models should behave.
I vouch heavily for LLM and chat AI standards remaining non-profit and preferably open source. But I can imagine Google, Microsoft, Amazon & Meta trying everything they can to prevent that.
Hopefully government regulations will be put in place in the EU, as we barely have a say in what those American hypercompanies decide about the future of the internet.
oh if only this comment was written in November last year.....
OpenAI kinda is? They're not being very open though, and they have a for-profit subsidiary OpenAI Limited Partnership...
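(To make the code-point expansion mentioned above concrete, here is a minimal Python sketch; the Linear B character shown really is in Unicode, and everything else is the standard library.)

```python
# Code points above U+FFFF don't fit the original 65,536-slot (16-bit) plan,
# which is why UTF-16 grew surrogate pairs and Unicode now spans many planes.
linear_b = "\U00010000"              # LINEAR B SYLLABLE B008 A
print(hex(ord(linear_b)))            # 0x10000, outside the 16-bit range
print(linear_b.encode("utf-16-le"))  # two 16-bit units: a surrogate pair
print(linear_b.encode("utf-8"))      # four bytes in UTF-8
```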
I’m absolutely in favor of this approach. I think AI has the potential to be a transformative tool when it comes to preserving languages, but it must be managed wisely.
Machine translations of Finnish have been the target of constant jokes since the 1990s, and those translators are still not perfect today. So I believe that when some AI software starts speaking Finnish, the results will be hilarious.
Just wondering if DeepL does a better job with Finnish translations?
@@fz1576 I think the problem is the structure of the language; a lot of words depend on the context of the sentence, so you can't just put a bunch of words one after the other like in English.
Mitä ite kokeilin ChatGPT:tä, niin osasi muistaakseni suomea yllättävän hyvin. ("When I tried ChatGPT myself, as I recall it knew Finnish surprisingly well.")
I am a volunteer working on a Khmer-English machine translator called Sugar Palm Tree Translator. I am anxious about being outperformed by Google and turned into irrelevance. What warms my heart is that out of 7 of the most basic and often-used phrases in Khmer, Google Translate fails on 6, while mine does all 7 correctly: older sibling, younger sibling, older brother, older sister, younger brother, younger sister, siblings. Also, Google Translate once translated "the police shot at the thief" as "the thief shot at the police". Comes in handy when you are reading the news reports.
Machine translations of Hebrew and other Jewish languages are also infamously bad to the point where it's a joke on the internet. Even a lot of yiddish words that have been absorbed into standard English or just English Jewish words aren't recognized by auto-correct, let alone Siri.
Another language-related issue is how precise a language is inherently. We could expect different results in terms of quality or efficiency in specific fields based on language specificities. Let's take Pirahã as an example: it only has 2 colours in its language, light and dark. Its speakers would have a harder time expressing visual-art specificities in terms of colour. Conversely, that might allow the AI more creativity to try things... so it's a mixed bag, and every language will have blind spots like that.
I don't think that's an issue
Languages just are. If they exist in the form that they do, it’s because that’s sufficient for speakers to communicate effectively. It’s not AI’s or anyone’s role to determine how precise or not those languages are.
@@polecat3 lol ok, that's your opinion. It isn't mine.
It helps when translating through two non-English languages doesn't have to route through English. For one, when translating between French, German and Russian, the tu/vous, du/Sie, ты/вы distinction would be lost if we have to do a translation to English first (all translate to you)
Ok but Piraha is a language of stone age primitives, an extreme outlier.
Hope this gets addressed for the low-resource languages. Here in Canada, many of the First Nations have been challenged to get younger generations learning their languages, so AI is something else for them to bolster their efforts.
Native Americans need to move on lol Canada is being taken over by Africans and Asians
The only solution is to generate more content, and frankly it is impossible to just translate the internet and everything in it into every language, as you would need to invent new words for it.
It's so wonderful to know that people are working so hard to make sure that all speakers of all languages, even the most obscure ones, will also have their jobs taken by natural language models! Equality!
I tried speaking to bing (by hitting the mic button) in Vietnamese. It not only recognized what I was asking but it gave an appropriate response with the correct diacritics.
Same with Kiswahili. All LLMs are pretty good with it with only some minor grammatical errors once in a while. Maybe our languages are just simpler to learn.
@@Squidlark I know it did a better job than Google translator or bing translator
It is worth noting that the classification of a language as high or low resource can vary depending on the context and criteria used. (by AI)
I'm now imagining a job people may have one day, to sit down in a massive office with others, literally just chatting with one another about anything, or writing books, or poetry or whatever, while an LLM catalogs all of it. I feel like those sorts of jobs could jumpstart low resource language sets. Too bad our tech companies don't prioritize stuff like that.
You have no idea how much training data LLMs need, do you?
You could pay starvation-level wages to those native workers, and you would still bankrupt a company the size of Microsoft if you wanted to do that at scale.
And then again, why _should_ they prioritize that? I'm not a native English speaker either, and it'd be great if we could keep obscure languages alive, but why take it out on AI companies?
They have no obligation to do anything, and AI isn't _by far_ the best way to achieve this goal. Not even if AI was genuinely perfect. What use is there an AI speaking it, if people just never want to learn it?
Just put that ridiculous amount of money in education instead...
@@hundvd_7 Your lack of imagination does not surprise me. And yes, obviously a single paragraph I wrote after seeing this video is the exact blueprint I think should be used going forward. That is exactly what I said. 🙄
I was not expecting open source to be brought up but I'm very glad it was.
This is really intriguing for English variants. If English is allowed to change over time and become something new, then AI should be learning the process in countries where English hybridised with other languages, like Singapore and Hong Kong. Perhaps one day there will be a language model that has no problem blending languages and adapting to new grammatical rules. Language is not static, after all
ChatGPT already has no problem blending languages. I speak to it in English and Spanish, and it has no trouble understanding me, even when I'm using both in the same context, and no trouble using English terms when it thinks it appropriate. Sometimes, for example, I might be asking for help in English and say "my system language is Spanish, so tell me where to find this option in program X with Spanish menu options." No problem at all. ChatGPT4 has perfect English, Spanish and German from what I can tell. I don't know about other languages, but I've read that its proficiency goes down the less common the language is on the Internet.
@@mikicerise6250 Yeah obviously it has no problem with those languages because they're dominant global languages with a lot of source material. The point of the video is that languages that haven't been exported around the world or aren't represented by countries with a lot of political power are being left behind
We should speak binary
Top-10 the most used languages on the Internet are
1. English - 56.1%
2. Russian - 5.1%
3. Spanish - 4.8%
4. French - 4.2%
5. German - 4.2%
6. Japanese - 3.5%
7. Turkish - 2.4%
8. Portuguese - 2.2%
9. Persian - 1.9%
10. Italian - 1.8%
Chinese not being there is a bit suspect. Their internet usage is very large.
Chinese don't have internet?
China is number 11 with 1.5%. Although Chinese is spoken by 1.5 billion people, their usage of the Internet is heavily restricted by the government.
@@Max_Jacoby they restrict only some access to Western resources; they have way more internet usage than Japan. Also, no Hindi or Arabic. Those statistics you found are absolutely wrong.
@@Max_Jacoby I don't think that they are restricted from using the internet, it's just that there's the Great Firewall that restricts access to foreign websites. I looked up your list and it seems to report the percentage of websites, not content. From the methodology section "We investigate technologies of websites, not of individual web pages. If we find a technology on any of the pages, it is considered to be used by the website.". So Chinese only being 1.5% could be because they only have a few large websites that everyone uses instead of a whole bunch of smaller websites.
Absolutely off topic, but @5:50-- I really like her paintings 😁
Nice one there.
Yeah, a spotty dataset is a problem for language processing. I personally know people who speak a dialect of a high-resource language, but because that language is only written down in its standardized form, the dialect/language they speak is basically not represented in the dataset.
e.g. good luck getting your point across to the computer speaking Irish English or Swiss German.
Never expected to find a Yaadie 🇯🇲 researcher on Vox.. big up
0:35 ChatGPT is based on GPT-4 only for the paid version; for the free version it's GPT-3.5
Very nice explanatory video. I am Catalan, and I too was surprised by how well ChatGPT could write in Catalan, though it still had some mistakes. Let’s hope we can add some clarity and transparency to this issue
I guess it's time to learn Gaelic for the future machine wars
Cha chuir dad stad air a’ Ghàidhlig!!! ("Nothing will stop Gaelic!!!")
After learning anything about languages, I always wonder just how many have been lost to the sands of time... never to be heard again... sad and tragic, but also inevitable for most languages as time marches on.
All languages in time
I've discussed this with my Cherokee philosophy professor. ChatGPT will just confidently lie about Cherokee translations.
It’s not necessarily a ”lie”; it’s called AI hallucination. You should search it up, it’s super interesting to read about :3
Because it has almost no training data on Cherokee.
when you started explaining entailment, it made me glad to see a 'useless' specialisation like linguistics intersect with the biggest tech trend
I work at an AI startup as a prompt engineer.
I think the main reason AI LLMs (large language models) don't know every language has to do with the linguistic knowledge of the people compiling the LLMs.
Rarer languages have less data to be scraped, so you need to try harder to find the content to generate new data.
However, there are some languages, such as Hebrew or Basque, that models such as GPT4 know quite well, and they have fewer than 10 million native speakers, each.
It's not exactly about less data; mostly it's about which data you choose to train the LLM with. I remember an article that came out two or three weeks ago which was about pretty much the same topic. In the article they were stating that sometimes the words being translated weren't always the correct ones, and that led to problems.
They also stated that for the LLM to understand the words, they needed to fine-tune / filter the data it was fed, or feed it a smaller batch of correct data.
"I work at an AI startup as a prompt engineer." that's a job? does your ai company even train models?
That's very important for indigenous minority and endangered languages like Occitan, Breton, Cherokee, Ojibwe, Paiwan, etc
In addition to this whole mess, I would say there are words that you cannot directly connect by meaning to other languages; you would need to guesstimate them all.
For example, you can say that the Russian "smekalka" is something like "savvy", but historically and culturally it's very different.
I think the AI "thinks" in each language. It doesn't translate, unless you ask it to.
@@MateusSFigueiredo Right, but it cannot inherit what people think of the word; the AI can't find the underlying context. Idk how to explain it properly, English is not even my first language.
@@danser_theplayer01 I agree. The robot has read every book but doesn't really understand any of them, much less understand the feelings, the subjective interpretations, the insinuations...
I don't think it matters that they're historically and culturally really different. For example, I am Argentinian, from the capital, Buenos Aires. The chat bot needs to understand formal neutral Spanish the most, not our own sub-language Lunfardo, in which you can speak almost whole sentences with no actual Spanish in them. These words also don't translate well to one single word in Spanish or English.
My like is for Ruth and the JamPatoisNLI; I had never heard of this
Thanks Vox for highlighting the genuine concerns surrounding language models and the significant influence that large corporations have on the source data that is used to train AI models.
I literally asked ChatGPT, "Which channels would you watch if you were hypothetically a person?" And it recommended your channel
Awesome vid right here! :) I learned a lot here! :) Thank you and cheers to the uploader! :)
An additional problem for low resource languages is that they're often not standardized, so there may be different writing conventions, different words depending on local dialects, etc. All of this makes it extremely difficult even to get good data to train on.
Omg what cameras do you guys use? The quality and the processing are absolutely stunning 🩷
An oddity for you: when my father graduated from college, in pharmacy, in the 1930s, his first job was translating English-language pharmacy texts into Welsh. He lasted two weeks in the job, he told me, and quit, believing the whole thing was useless.
My former partner's first job, when she first left higher education because of the civil war, was translating English language texts into Tik-Monjiaang, ("Language of The People," aka "Dinka"). Exactly parallel story: she quit within weeks, believing that the whole effort was going nowhere.
(Today, a generation later, both efforts, in Welsh and in Dinka, are back in operation -- at scales comparable perhaps to that in Catalan in this excellent video.)
I remember a friend who lived in the country of Georgia back in the early 2000s. He said technology companies didn't have Georgian language settings in their devices. So many Georgians used Russian or English on their devices.
When I use ChatGPT I use English, and then whenever I need to show the result to other people I use Google Translate at the browser level to convert the whole conversation to Korean (my native language). I think general-purpose chatbots like ChatGPT are the niche where it makes sense to imitate what has been done on a dataset that is mostly English and replicate the same task in a different geographic context. It does work as a localized replacement rather than a redundant vanity project.
Imagine the Javanese language, which has 3 extreme levels of register for daily conversation that sometimes sound like whole different languages
Such an important and insightful video! Thank you for sharing. I look forward to learning more.
My friend is working for a local Indonesian AI startup. He majored in Indonesian Literature, which is actually the perfect major for teaching AI the structure, vocab, and rules of Bahasa Indonesia
Actually, AI and ChatGPT-like applications can be a saviour of small languages. I'm Catalan, and I never imagined I could have the option to read the documentation for any software application, for example... now I can. I get the same quality of answers in English as in Catalan.
The Bofa language of south-central Cameroon is famously known among linguists as being impossible to learn unless you were born into the Bofa culture. Figuring it out will be the true litmus test for AI.
Well, too bad for them… they'd better learn a known language
@@Student0Toucher bofa deez nuts gottem
It's a good thing to deal with such issues first-hand. AI should be a platform where all people can do everything.
Dear Mr. Phil,
Congratulations on your newest RUclips video! I wanted to take a moment to extend my warmest congratulations to you on this exciting accomplishment.
Your dedication and hard work really shine through in your videos, and it's evident that you're passionate about creating content that resonates with your audience. Your latest video was engaging, informative, and truly entertaining. The way you effortlessly connect with your viewers and share your unique perspective is truly impressive.
I know how much effort and creativity go into producing high-quality content, and your latest video is a testament to your talent and commitment to excellence. You continue to impress and inspire your viewers, and I'm sure your RUclips channel will continue to grow and thrive.
Once again, congratulations on your latest RUclips video, Mr. Phil! I'm excited to see what you have in store for your future videos, and I wish you continued success in all your endeavors.
Best,
ChatGPT
Big Up Jamaica!!! Big up Ruth-Ann!!
I study reactions to AI and I love this series for explaining AI to my family, Phil is the best
Do the video editors at Vox use Adobe software? Keep up the great content
GPT-4 is brilliant with Romanian! It is a very, very difficult language to write correctly, and it gets it spot on every time.
Same with Dutch. I can write the most informal stuff and ChatGPT just gets it! I have never experienced that before with software.
That will change in a few years, exponentially so.
In Mexico, we have 60 languages, and a lot are going to disappear. I saw hope in AI, but with this I don't think so anymore.
That was SO interesting. Thank you
Really tough job done by you and your team.
Congratulations 💐
Always love your stuff Phil❤
The information that the next most common language for words in GPT-3 was German - as a German speaker - absolutely baffled me... Especially that it's ahead of Spanish!
AGI is coming. It will change the world. It will help with choosing the right job and answering difficult questions. It will also open many possibilities.
Interesting video. My question is why do you bleep out the percentage of GPT-3 that is made up of Common Crawl?
Probably because he signed an NDA when he talked to them.
Because it's 69 🙃
@@merry_christmas Not 42?
while most people wonder if the chat is sentient, you hit the nail on the head and understood how it works, when it doesn't, and its social impact. very professional, journalistic approach. Vox is really one of the best sources of information these days. it is not perfect... nor does it cease to be a company... but it does have good people working on good stories
There's a political aspect to this too. The Catalan language, for example, is absolutely central to the Catalan independence movement. If you can marginalise a language by preventing new technologies from understanding or accepting that language, that could be a powerful tool for political control. If AI is going to be half as central to the future as its proponents claim, then excluding a language from AI systems will be every bit as devastating to a culture as excluding that language from the internet is today.
Good point
The fact that GPT can do a passable job of writing in languages for which it has very small datasets makes me wonder if it has discovered something consistent among all languages.
*One reason may be that AI speech tools rely heavily on big data sets to learn a language, its pronunciation, grammar, and semantics. The ability and quality of the resulting tools are mainly limited by how much data is available (and how "good" it is).*
It's almost like the video was talking about that exact phenomenon
bro didn't watch the video 💀
My academic background is in philosophy of AI and language. I don't know if we will learn anything about "artificial intelligence", but we are going to learn a lot about natural stupidity. This "AI hysteria" is going to be the mother of all investment scams.
AI doesn't actually need to be intelligent by the normative standards of human intelligence in order to be *effectively* intelligent in the context of common types of work. I sense you're implying, a bit like Chomsky, that it ought to be, when that isn't even necessary.
Some time ago, I simulated the same narrative using GPT-3, first in English and then in Portuguese. I noticed that in English it was able to create not only much better dialogue but also a superior-quality narrative, more informal and creative.
What I don't understand is why the Arabic language is so hard to find in the data, even though it is so widely spoken.
Thank you for the insight.
Phil is my motivation for waking up in the morning.
I asked ChatGPT about the percentages of languages in GPT-2. It told me the following:
GPT-2 was trained on a large and diverse corpus of text data from various sources, which includes a significant amount of text in English.
GPT-2 was trained on a web crawl corpus that contained a diverse set of texts from websites, news articles, books, and other sources. The corpus includes texts in several languages, but the majority of the data is in English. According to the OpenAI team, the corpus used to train GPT-2 included around 40 GB of text in English, out of a total of 45 GB of text data.
It is worth noting that GPT-2 was designed to generate high-quality text in English, and its performance in other languages may not be as good. However, GPT-2 can still generate text in other languages, and the quality of the output depends on the quality and quantity of the training data available for each language.
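Taking the GB figures in that quoted answer at face value (they come from ChatGPT itself, so treat them with caution), the implied English share works out like this:

```python
# Share of English implied by the figures quoted above
# (40 GB English out of 45 GB total, per the ChatGPT answer).
english_gb = 40
total_gb = 45
print(f"English share: {english_gb / total_gb:.1%}")  # English share: 88.9%
```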
This is a great video talking about what GPT actually is. I think most people think it is a sentient computer, when all it is doing is trying to model language, like a kid in a library trying to learn but having no actual experience
You could start by answering queries from a vector database (i.e. an [x]-language dictionary, common expressions, Q&As, etc.) and gradually fine-tune the GPT on those precise recalls until it reaches optimal "comprehension" levels.
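A minimal sketch of that retrieve-first idea, assuming nothing beyond numpy. The embed() function below is a toy stand-in (hashed character trigrams), not a real embedding API:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding: hash character trigrams into a vector."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# A tiny "vector database" of dictionary entries / common expressions.
entries = ["good morning", "thank you very much", "where is the market"]
index = np.stack([embed(e) for e in entries])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k stored phrases closest to the query by cosine similarity."""
    scores = index @ embed(query)  # rows are unit vectors, so this is cosine
    return [entries[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("thank you so much"))  # likely ['thank you very much']
```

In a real pipeline, the retrieved phrases would be prepended to the model's prompt or used as fine-tuning pairs, as the comment suggests.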
This is super interesting. Would love to see a bit more on how AI is or isn't accommodating people with disabilities, e.g. deaf, hard of hearing, blind, etc.
I'd assume it can understand French so that's one disability covered
Was it really necessary to print out all of that to make a point?
I'm a bit surprised there was no mention of embeddings at all in this video. Once a large language model has learned a diversity of different types of languages, there's a lot of transferable knowledge that it can pull from for low-resource languages. For example, the reason it performs so well on Catalan given only 140 pages of examples is that it pulls from similar languages that it knows in much more detail. This will be doubly true for the pidgin and creole languages. I agree with the fundamental premise of the video that data diversity is key. Since all our languages are related at some level, transferable knowledge means that learning new languages will bolster the quality of the whole.
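One way to see that transfer concretely is to check that translations of the same sentence land close together in a multilingual embedding space. A sketch assuming the sentence-transformers library and one of its public multilingual models (the example sentences are mine):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# The same sentence ("The cat sleeps on the sofa") in Catalan, Spanish, and Finnish.
sentences = {
    "ca": "El gat dorm al sofà.",
    "es": "El gato duerme en el sofá.",
    "fi": "Kissa nukkuu sohvalla.",
}
embs = model.encode(list(sentences.values()))
units = {lang: v / np.linalg.norm(v) for lang, v in zip(sentences, embs)}

# All pairs should score high: the model maps meaning rather than surface
# form, which is exactly the cross-lingual transfer described above.
print("ca~es:", float(units["ca"] @ units["es"]))
print("ca~fi:", float(units["ca"] @ units["fi"]))
```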
Come to think of it, a lot of best-selling books have been published in dozens of languages. It might be a good evaluation for multilingual systems to find inconsistencies between versions that we already know of.
"Our Bloom model needs about 360 GB of RAM to run" - that's beyond practical! I have a Raspberry Pi 4B with 4GB RAM directly soldered on the mainboard. Should I hook another 500 GB external USB hard disk to it and create 500 GB swap on it? My second hard disk enclosure is USB 2.0 only, won't it slow down Bloom too much?
I have open-source GPT models on my local machine, but, I suppose because those models have to be compacted to use fewer parameters than the ones OpenAI provides in the cloud, they can't even speak Spanish very well. I would say Alpaca is about B2 in Spanish. They are basically only truly proficient in English. I agree that language-specific models should be developed, as LLMs are a fantastic tool for conserving minority languages that are in danger of extinction for future generations.
This is so true.
Like, why do we have to build one universal translator when we could just use GPT and create separate models catering to specific languages?
That would be much much more feasible imo
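For anyone curious what running one of those "compacted" local models (mentioned a few comments up) looks like in practice: a minimal sketch assuming the Hugging Face transformers + bitsandbytes stack, where the model id is a placeholder for whichever open checkpoint you use. Quantizing weights to 4 bits is the usual compaction; it trades some output quality for a roughly 4x memory saving versus fp16.

```python
# Minimal sketch: load an open LLM with 4-bit quantized weights.
# Requires `pip install transformers accelerate bitsandbytes` and a GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-open-llm"  # placeholder: substitute a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~4x smaller than fp16
    device_map="auto",
)

prompt = "¿Puedes explicar qué es un modelo de lenguaje?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=60)[0]))
```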
Accessibility is great and has to be pursued, but at the same time, how do you protect the language, its culture, and the people who produce/reproduce it? All these AIs make it so much easier to track and control what is said online, while deciding what can be read or watched and removing, hiding, or blocking access to any other information and to certain languages. It's known that intelligence services in the '60s and '70s sent anthropologists to many third-world countries to learn about their languages and ideologies. We don't have access, but the data is there, and more than likely they're already using it.
1:40 He arranges documents on the floor
Me: *A Bungou Stray Dogs reference intensifies*
That stache is getting out of control 🎉 Love the videos
Thank you Vox for this insightful video. Always relevant and makes my day :) ❤
Finally, a good video on the topic of AI and LLMs. I actually might have found a solution to this: it is not just the data; there are other fundamental problems too (it's more related to indexing and embeddings)
I know so many non-native English speakers who put all their digital content in English to reach a wider audience, and as a result, all of their content is in English and not their native language.
Urdu and Hindi are also prime examples of languages with little digital footprint
Open source AI is definitely a good step for more transparency on a technology that will likely dominate our future
Android is an open-source OS. Good luck uninstalling the OEM apps, though. The notion that there is an abandoned supercomputer in Paris made me laugh. That's a bad start for a real open-source project.
Phil makes good videos.
Because AI heavily relies on written language. All the AIs would struggle to speak any Chinese language other than Mandarin, because the modern Chinese script only makes sense in Mandarin, giving the other Chinese languages less of an edge in the world of technology
Honestly, they should use students.
Pay them the equivalent of $20 a week, or a day, to click through sentences and teach the models in their languages
I think the real takeaway from this is that languages change and evolve, and some go extinct; they always have and always will. This is simply a new way languages are evolving
Thanks for this. I’ll just say that I was surprised when ChatGPT was able to translate English statements to Cree.
Are there higher-value sources vs. lower-value sources? Like, are there tags for "verified trustworthy" or "highly useful" or other important markers?
Also, do they include dictionaries and thesauruses?
... same reason AI (still) doesn't speak whale🐳: data. but that's a different story 😉
so what you're saying is if we want to evade the notice of AI surveillance, use smaller languages. cool
I was wondering how language models deal with primarily spoken languages? If we manage to add audio to the training dataset, it could make a huge difference. Some languages have a lot of variation between their "spoken/written" forms or "colloquial/formal" forms, such as Cantonese or Javanese… Also, imagine if language models could output regional vocabularies and mimic regional accents… it would change translation apps, the way we learn languages, etc.… 😎
That is very interesting, I didn't know this was a thing
I "invent" and app and train it to do well on a specific language, why am i forced to include others? just bc I do well and you want it too? shouldnt ppl of each language, if they want, develop their own app? and if they can't, why am I responsible to include them?
7:21 I wasn't expecting to see Hugging Face featured in this video! Props for giving them the coverage they deserve. 👍
great video Gary Oldman..
I'm from Cape Verde and I was amazed to learn that ChatGPT can speak Cape Verdean Creole. Not perfectly, but it can 😂
I was there once briefly :)
@@karmasutra4774 I hope you enjoyed it!
When speaking Ukrainian, ChatGPT sometimes tries to "invent" words, taking its inspiration from Russian. Interestingly enough, these are really common words; you wouldn't blame a lack of data for those.
And yeah, it's fair to say the model is still ahead of anything else out there.