Just love your whole approach to AI and coding in general
I've been curious about this topic. I really appreciate how you approached the evaluation. I would have liked to see an n of 5 for each example to limit errors related to model entropy.
Great comparison.
Something to consider is to break down the scores by model. Why?
To see if there are preferences of format by model.
E.g. we know that Anthropic likes XML and that format might be the best for their models. That does not mean that this holds true for other models.
True
Thanks for all your hard work! You do such a great job brother. Appreciate you very much.
Shouldn't it be possible to layer a deterministic MD-to-XML converter in your prompting process? Then you, as a human, could still work in MD while your LLMs get the XML they crave.
Absolutely possible, but not as easy as you'd think at first blush. For example, the XML tags you choose have information in them, telling them "what the thing is" that you're wrapping in the tag, whereas in markdown all you really have is "sections" and various types of divisions. I can say this as an experienced programmer who tried to create a Markdown-based parser for exactly this purpose. It's *way* harder to cleanly interpret semantic divisions when all you have to work with is stuff like blank lines.
@BTFranklin I don't think XML *has* to have more information, and for this particular test I assume it doesn't. But if the XML prompts he's using do indeed provide more information than the markdown ones, doesn't that invalidate these results as a measure of format (and only format) effectiveness?
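A minimal sketch of the deterministic MD-to-XML layering idea above, assuming headings delimit sections. Using the (slugified) heading text as the tag name is one simple way to keep a semantic label rather than ending up with anonymous "sections"; all names here are illustrative, not from the video.

```python
import re

def md_headings_to_xml(md: str) -> str:
    # Split on '# Heading' lines; the heading text becomes the tag name,
    # so the semantic label survives the conversion.
    sections = []  # list of (tag_name, body_lines)
    for line in md.splitlines():
        m = re.match(r"#+\s+(.+)", line)
        if m:
            tag = re.sub(r"[^a-z0-9]+", "_", m.group(1).lower()).strip("_")
            sections.append((tag, []))
        elif sections:
            sections[-1][1].append(line)
    parts = []
    for tag, body in sections:
        text = "\n".join(body).strip()
        parts.append(f"<{tag}>\n{text}\n</{tag}>")
    return "\n\n".join(parts)
```

So `# Role` followed by body text becomes `<role>…</role>`, and you keep authoring in markdown.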
Incredible value, please more of this type of content
Great content! Would have been nice to also compare YAML.
YAML is nice for toying around but is an awful format once you start using it seriously; do a Google search for "yaml sucks" and you'll see. I regret having adopted it in some projects.
I’m with you! I started with YAML and then moved to some mix of that and TOML/XML. That would be fun to have a central leaderboard for prompt format performance tracking based on different metrics like here!
This is an excellent, detailed analysis. Highly appreciated, sir. Subbed.
Always great insights, need to give promptfoo a shot!
One of the best videos I have seen regarding all things LLMs. Do you think the results from 4o-mini replicate with 4o, 4-turbo and gpt4?
Your videos always really help, great work.
What a great video and unexpected outcome. I've been using MD but am swapping to XML for complex persona instructions. Great video!
Fascinating. I've been using raw text with small JSON elements where structure was needed in AutoGen-based flows. Works really well. JSON does get brittle when there's too much of it, though. I'm not shocked that the whole prompt in JSON wasn't great.
That being said, definitely going to try some xml.
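The "mostly raw text, small JSON islands" pattern described above can look something like this (a hypothetical example; the constraint names and values are made up, not from the video):

```python
import json

# JSON only where structure genuinely helps; everything else stays raw text.
constraints = {"max_words": 120, "tone": "neutral", "format": "bullet_list"}

prompt = (
    "Summarize the meeting notes below for an executive audience.\n"
    "Follow these constraints exactly:\n"
    + json.dumps(constraints, indent=2)
    + "\n\nNotes:\n(notes go here)"
)
```

The small JSON block stays easy for the model to parse, while the brittleness of a fully JSON-encoded prompt is avoided.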
XML is what I've been using since day 1. 😊
I started using markdown, but after looking over the Anthropic workbench I started using XML. Haven't looked back.
Great tests. Which open 8B or 9B model is the best with long context? In my tests Gemma2 q4_k_m performs quite well.
Great setup. Please evaluate the Gemini Flash. Capabilities of these low cost workhorse models are the most important edge cases to understand.
On top of that, it could be interesting to provide an XSD (XML Schema Definition) so that the response format is fully predictable.
This is what I’ve been looking to test myself. I suspected Markdown wasn’t performing well. I asked llama3.1 what it prefers, and it gave me XML.
Good use of markdown-to-XML converters, so we can conveniently write the prompt in markdown, then send it as XML to the LLM.
Please, a basic video about LLMs: how to deploy them, expected uses of local LLMs... I think it would be interesting for building a small company run by them.
do you have this code on github? would love to play around with it myself
Amazing content as always thank you.
When JSON is the worst performing format. Feels bad, man. I will keep this in mind... I never would have guessed that it handles XML so well, but then again most of the data is raw text and HTML, which looks like XML because of the tags, so I see why LLMs would be good at understanding and generating with it.
Have you thought about mixing XML tags into your markdown prompt? Like Claude Sonnet does in the prompt generator?
Really useful video!
Dan, is there a way to get access to the files you used in this video? I don't have coding knowledge and am learning about prompt scripting. From the video and the files you ran, it comes across as if you have a methodology for writing your scripts that could help me develop my own scripts following your examples.
Do tab indentation and newlines really matter when using XML tags? 🤔
In my testing of Llama 3.1 8B for instruction following, I find it severely lacking compared with Codestral. Llama 3.1 8B was unable to return a simple yes or no response. It always included a fluffy explanatory response (which was correct but not requested). YMMV.
There are a couple of things you missed. To make this video actually useful, you need to experiment more.
- 1. You missed YAML; it's a dark horse and I've had stellar results with it.
- 2. Use something harder, like tool calling.
- 3. Try instructions that are system-prompt heavy.
- 4. Try prompts that put the instructions as the very last thing the model sees.
- 5. Use the seed param.
- 6. Use an automation that changes the temp by 0.1 for each call.
I have to say I'm a bit disappointed with the video. I mean, I kind of get it, but I want to see these models tested on the bleeding edge of what they can do; I want to see you dialling in that last couple of percent of performance. They're so much more powerful than the examples in the video.
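The seed and temperature-sweep suggestions above can be sketched as a small harness. `call_model` here is a hypothetical stand-in, not a real client; actual providers expose `seed` and `temperature` parameters under their own API names.

```python
def call_model(prompt: str, seed: int, temperature: float) -> str:
    # Hypothetical stand-in; replace with a real provider API call.
    return f"response(seed={seed}, temp={temperature})"

def temperature_sweep(prompt: str, seed: int = 42, n: int = 5, step: float = 0.1):
    # Hold the seed fixed and step temperature by 0.1 per call, so score
    # differences come from sampling temperature rather than run-to-run noise.
    results = []
    for i in range(n):
        temp = round(i * step, 1)  # 0.0, 0.1, 0.2, ...
        results.append((temp, call_model(prompt, seed, temp)))
    return results
```

Combined with scoring each response, this is the kind of automation the comment is asking for.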
Do you share the results in any other format?
My approach is XML for the title tags, and inside I write in markdown.
It works and it's still really human-readable.
Full XML isn't the best to read.
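The hybrid style described above (XML tags as section delimiters, markdown for the human-readable bodies) might look like this hypothetical example:

```python
# XML delimits the sections; markdown keeps the bodies readable.
prompt = """<instructions>
- Answer in markdown bullets.
- **Bold** any figures you cite.
</instructions>

<document>
## Background
(document text goes here)
</document>"""
```

You get XML's unambiguous section boundaries without giving up markdown's readability inside each section.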
best prompt format is l337 sp33ch
Markdown and XML, hands down, for reports. Markdown converted to vectors.