There is no rest when you're in this industry. There's always some part of the tech stack being developed, some new feature. Thanks for covering the best bits!
Yeah, I agree with you on the whole, though I'm finding the increments of improvement with each of these models lately are getting smaller for a lot of them. There have been a few where I decided not to make a video because I felt there wasn't enough value in the change. The interesting stuff is moving away from the model itself now, I feel.
Wow! How exciting! Man you're my hero Sam. You are literally 8 steps ahead of the curve.
I wish they would release a mixture-of-agents option for people to use natively through their API. I have my own setup I can use, but I see a lot of people using LLMs who don't have the ability to do that.
Function calling has great utility, but any model can do this. If you give it the tool list with definitions and the schema to use, and include a few examples in your messages array (a back-and-forth of user and assistant messages showing the assistant using the tools in various scenarios), most decent models will do really well with them. In places where you're 100% sure it should be using at least one tool, you simply pair this with a function that re-asks the same question recursively until you parse the response you know you're looking for.
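A minimal sketch of that approach, assuming a generic chat-completion client passed in as `call_model` (the tool definition, few-shot examples, and prompt wording here are all illustrative, not from any specific API):

```python
import json

# Hypothetical tool definition in a JSON-Schema-like shape.
TOOLS = [{
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]

# Few-shot examples showing the assistant emitting a tool call as JSON.
FEW_SHOT = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant",
     "content": '{"tool": "get_weather", "arguments": {"city": "Paris"}}'},
]

def extract_tool_call(text):
    """Return a dict if `text` is a well-formed call to a known tool, else None."""
    try:
        call = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    names = {t["name"] for t in TOOLS}
    if isinstance(call, dict) and call.get("tool") in names and "arguments" in call:
        return call
    return None

def ask_until_tool_call(call_model, question, max_tries=5):
    """Re-ask the same question until the response parses as a tool call."""
    messages = (
        [{"role": "system",
          "content": "Use the tools below, replying only with JSON:\n"
                     + json.dumps(TOOLS)}]
        + FEW_SHOT
        + [{"role": "user", "content": question}]
    )
    for _ in range(max_tries):
        call = extract_tool_call(call_model(messages))
        if call is not None:
            return call
    raise RuntimeError("model never produced a parseable tool call")
```

The retry loop is the key part: because you know a tool call must appear, any unparseable reply is simply discarded and the question asked again.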
From my limited testing, it's significantly more prone to hallucinations than the GPT family of models I've been using (it makes up argument values out of thin air and even invents new functions). For my use case, even gpt-3.5-turbo and the vanilla version of llama3 that they're hosting are doing better on my custom evals than this new one, which is honestly kind of disappointing. I'm starting to feel those benchmarks are not as good a source of evaluation as they want us to believe.
This is amazing
I don't think they'll release the dataset, as Groq wants to keep it as a competitive advantage to increase their developer base. Anyway, you mentioned query rewriting, so let me share something. You know, from my actual production experience, it's too bold to release software with function calling without query rewriting. Recently, in a project where we needed function calling and tried many models, we faced unpredictability. Instead of fine-tuning those models, we fine-tuned GPT-2 specifically for query rewriting using synthetic data tailored to our case. And voila! Once we implemented that, all the nuances and unpredictability were gone. Query rewriting, either using a strong model or our approach, allows for effective use of many language models supporting function calling without fine-tuning the entire model. Like in your last example, with or without the keyword "search," query rewriting is definitely one of the best steps in the pipeline.
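The pipeline shape described above can be sketched roughly like this, with both the rewriter and the function-calling model abstracted as callables. The toy rule-based rewriter and its canonical phrasing are illustrative stand-ins, not the commenter's actual fine-tuned GPT-2 setup:

```python
def with_query_rewriting(rewrite, call_functions, user_query):
    """Normalize the query first, then hand the canonical form to the
    function-calling model, so it only ever sees predictable phrasing."""
    canonical = rewrite(user_query)
    return call_functions(canonical)

# Toy stand-in for a fine-tuned rewriter: map free-form questions onto
# a canonical form the downstream model handles reliably.
def toy_rewrite(query):
    q = query.strip().rstrip("?").lower()
    if "olympics" in q and ("when" in q or "start" in q):
        return "search: Olympics start date"
    return q
```

The point is the separation of concerns: a small, cheap model absorbs the phrasing variance, so the function-calling model never has to.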
Thanks for the video, an interesting model. Am I right in thinking that what this model is good at is extracting data from text to build properly formatted inputs for tool calls, but that it's weaker at deciding whether to call a tool at all? Like you showed with your "(search) when do the olympics start" example, I was a bit surprised that a 70b model couldn't get that one. I see they also mention this in their blog post, a hybrid/routing approach. It would be interesting to see the benchmarks/performance if the models were allowed such a "reasoning layer" on top.
In my local testing, Llama 3 8B already seems pretty good at function calling (I couldn't find cases where it fails).
It would be interesting to see in which function-calling cases these high-performing FC models succeed while Meta's Llama 3 fails.
Agreed, I hope they release the dataset so we can see what they added. I'm still testing it; I just got the Ollama version going and it seems a bit hit-and-miss there.
That was fast!
Noice!
sick
great:)
I think phidata does the best open-source function calling
We can still fine-tune it further, right?
Would that make a difference?
I really don't understand why we need this. Can't you just send a prompt to the LLM, "calculate this formula and return the result in json format
[ {
"formula": "",
"result": ""
} ]
"? Why do we complicate things with a lot of text where you're 100% going to have a typo somewhere and spend hours finding it? To achieve what, exactly?
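The "just ask for JSON" approach the comment describes might look like this, with `call_model` as a stand-in for any completion client. Note that even this path needs parsing and validation of the reply, which is essentially what function-calling schemas formalize:

```python
import json

def calculate_via_prompt(call_model, formula):
    """Ask for JSON directly in the prompt, then parse and sanity-check it."""
    prompt = ('calculate this formula and return the result in json format '
              '[ { "formula": "", "result": "" } ]\n\nFormula: ' + formula)
    reply = call_model(prompt)
    data = json.loads(reply)  # raises if the model strays from the format
    if not (isinstance(data, list) and data and "result" in data[0]):
        raise ValueError("reply did not match the requested shape")
    return data[0]["result"]
```

So the prompt-only version works, but the moment the model strays from the format you are back to parsing and retrying by hand.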
This model is trash. I'm sorry, but whoever did the benchmarking needs to be fired. It fails roughly every 3-4 calls, quite regularly. It's OK for super simple function calls, but it's no better than the base Llama 3 model. Thumbs down on this model for me.