THIS IS GOLD.
I've been curious about this topic. I really appreciate how you approached the evaluation. I would have liked to see an n of 5 for each example to limit errors related to model entropy.
Just love your whole approach to AI and coding in general
Great comparison.
Something to consider is to break down the scores by model. Why?
To see if there are preferences of format by model.
E.g. we know that Anthropic likes XML and that format might be the best for their models. That does not mean that this holds true for other models.
True
I started with Markdown, but after looking over the Anthropic workbench I switched to XML. Haven't looked back.
Shouldn't it be possible to layer a deterministic MD-to-XML converter into your prompting process? Then you, as a human, could still work in MD while your LLMs get the XML they crave.
Absolutely possible, but not as easy as you'd think at first blush. For example, the XML tags you choose have information in them, telling the model "what the thing is" that you're wrapping in the tag, whereas in markdown all you really have is "sections" and various types of divisions. I can say this as an experienced programmer who tried to create a Markdown-based parser for exactly this purpose. It's *way* harder to cleanly interpret semantic divisions when all you have to work with is stuff like blank lines.
@BTFranklin I don't think XML *has* to have more information, and for this particular test I assume it doesn't. If the XML prompts he's using do indeed provide more information than the markdown ones do, doesn't that invalidate these results as a measure of format (and only format) effectiveness?
Use ai to convert it😊
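For what it's worth, here is a minimal sketch of the deterministic converter being discussed, assuming the simple case where `#` headings mark the sections. As noted above, plain Markdown only gives you generic divisions, so the best a pass like this can do is a generic `<section name="...">` wrapper rather than a semantically meaningful tag; all names below are illustrative.

```python
import re

def md_to_xml(md: str) -> str:
    """Turn '# heading' sections into <section name="..."> blocks.
    Anything before the first heading is dropped in this sketch."""
    sections, name, body = [], None, []
    for line in md.splitlines() + ["# _end_"]:  # sentinel flushes the last section
        m = re.match(r"#+\s+(.*)", line)
        if m:
            if name is not None:
                sections.append(
                    f'<section name="{name}">\n' + "\n".join(body).strip() + "\n</section>"
                )
            name, body = m.group(1).strip(), []
        else:
            body.append(line)
    return "\n\n".join(sections)

print(md_to_xml("# Task\nSummarize the report.\n\n# Constraints\nUnder 100 words."))
```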
Amazing video! One could argue that there's no real difference between XML and RAW formats, but the power of XML is having a bunch of well pre-structured prompts where you only have to fill in certain areas. Writing a good pre-formatted raw prompt can be more annoying, while with XML you can just add a few more tags here and there and refine the desired output as much as needed in a rather simpler way. Even Perplexity works well with XML, and it's easier to restrict the kinds of outputs or searches with it.
Thanks for all your hard work! You do such a great job brother. Appreciate you very much.
This is an excellent, detailed analysis. Highly appreciated, sir. Subbed.
Incredible value, please more of this type of content
One of the best videos I have seen regarding all things LLMs. Do you think the results from 4o-mini replicate with 4o, 4-turbo and gpt4?
Always great insights, need to give promptfoo a shot!
There are a couple of things that you missed. To make this video actually useful, you need to experiment more.
- 1. You missed YAML; it's a dark horse and I've had stellar results with it.
- 2. Use something harder, like tool calling.
- 3. Try instructions that are system-prompt heavy.
- 4. Try prompts that put the instructions as the very last thing the model sees.
- Use the seed param.
- Use an automation that changes the temperature by 0.1 for each call (see the sketch after this comment).
I have to say I'm a bit disappointed with the video. I mean, I kind of get it, but I want to see these models tested on the bleeding edge of what they can do; I want to see you dialling in that last couple of percent of performance. They're so much more powerful than the examples in the video.
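As a rough illustration of the last two suggestions above (the model name and prompt are placeholders, not from the video), a loop that pins the seed and steps the temperature by 0.1 per call with the OpenAI Python client might look like this:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

for step in range(11):                 # temperature 0.0 .. 1.0 in 0.1 increments
    temperature = round(step * 0.1, 1)
    response = client.chat.completions.create(
        model="gpt-4o-mini",           # placeholder model
        seed=42,                       # best-effort reproducibility across calls
        temperature=temperature,
        messages=[{"role": "user", "content": "Answer with one word, yes or no: is 7 prime?"}],
    )
    print(temperature, response.choices[0].message.content)
```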
Your videos are always a real help, great work.
What a great video and an unexpected outcome. I've been using MD but am swapping to XML for complex persona instructions. Great video!
Fascinating. I've been using raw with small JSON elements where structure was needed in AutoGen-based flows. Works really well. JSON does get brittle when there's too much of it, though. I'm not shocked that the whole prompt in JSON wasn't great.
That being said, definitely going to try some xml.
Great setup. Please evaluate Gemini Flash. The capabilities of these low-cost workhorse models are the most important edge cases to understand.
Great tests. Which open 8B or 9B model is the best with long context? In my tests Gemma 2 q4_k_m performs quite well.
Great content! Would have been nice to also compare YAML.
YAML is nice for toying around but is an awful format once you start using it seriously. Do a Google search for "yaml sucks" and you'll see. I regret having adopted it in some projects.
I'm with you! I started with YAML and then moved to some mix of that and TOML/XML. It would be fun to have a central leaderboard for prompt format performance, tracked on different metrics like the ones here!
This is what I’ve been looking to test myself. I suspected Markdown wasn’t performing well. I asked llama3.1 what it prefers, and it gave me XML.
Good use case for Markdown-to-XML converters, so we can conveniently write the prompt in Markdown and then send it as XML to the LLM.
Great content!! I was breaking my head over how to structure instructions, especially for meta prompting. First I was thinking about JSON because of its unlimited nesting nature, then I realized that XML might be better because of the problem with closing brackets... and then I realized the reason why XML is the best format is because LLMs are trained on websites - tudum tudum tudum tada - XML-formatted content :D
I kind of realized all those things on my own and was thinking, why is nobody talking about this? And then 2 days later - boom - this video :D
Thanks for the references - I'll study what others came up with, since I kinda reinvented the wheel on my own :D
Thx
Do you have this code on GitHub? Would love to play around with it myself.
XML is what I've been using since day 1. 😊
Impressive!
Is this also true for RAG documents?
I read in at least one place that fine-tuning is best, or even requires JSONL.
My use cases are RAG, maybe eventually JSONL, but it seems formatting RAG docs right is even more important than fine-tuning.
Dan, is there a way to get access to the files you used in this video? I don't have coding knowledge and am learning about prompt scripting. From the video and the files you ran, it comes across as if you have a methodology for writing your scripts that could help me develop my own scripts following your examples.
Have you thought about mixing XML tags into your Markdown prompt, like Claude Sonnet does in the prompt generator?
On top of that, it could be interesting to provide an XSD (XML Schema Definition) so that the response format is fully predictable.
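A small sketch of that idea, assuming lxml is available; the schema and tag names here are made up for illustration, not taken from the video:

```python
from lxml import etree

XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="answer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="summary" type="xs:string"/>
        <xs:element name="confidence" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

def response_is_valid(xml_text: str) -> bool:
    """True if the model's XML reply parses and conforms to the schema."""
    try:
        doc = etree.fromstring(xml_text.encode())
    except etree.XMLSyntaxError:
        return False
    return schema.validate(doc)

print(response_is_valid("<answer><summary>ok</summary><confidence>0.9</confidence></answer>"))
```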
Please, a basic video on LLMs: how to deploy them, expected uses of local LLMs... I think it would be interesting for creating a small company run by them.
Do tab indentations and newlines really matter when using XML tags? 🤔
Would it make sense to put your RAG files in XML format as well?
In my testing of Llama 3.1 8B for instruction following, I find it severely lacking compared with Codestral. Llama 3.1 8B was unable to return a simple yes or no response. It always included a fluffy explanatory response (which was correct but not requested). YMMV.
Amazing content as always thank you.
Really useful video!
Do you share the results in any other format?
Markdown expresses only a subset of the structure XML can, which is why it performs worse than XML. Think about it: Markdown will give the LLM a clue as to how the information is structured, but it doesn't include as much metadata as XML. A shopping list of ingredients in Markdown would look like an unordered list of list items, but in XML it could be represented as a shopping-list of ingredient-items. I didn't know XML would perform this well, but after having watched your video, I'll be switching. Great stuff.
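To make that shopping-list example concrete (the tag names are just illustrative), the same content in the two formats might look like this:

```python
md_prompt = """\
## Shopping list
- eggs
- flour
"""

xml_prompt = """\
<shopping-list>
  <ingredient-item>eggs</ingredient-item>
  <ingredient-item>flour</ingredient-item>
</shopping-list>
"""
```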
When JSON is the worst-performing format. Feels bad, man. I will keep this in mind... I never would have guessed that it handles XML so well, but then again most of the data is raw text and HTML, which looks like XML because of the tags, so I see why LLMs would be good at understanding and generating with it.
HOW HARD IS IT TO COPY PASTE A PROMPT INTO THE VIDEO DESCRIPTION :(
My approach is XML for the title tags, and inside I write in Markdown.
It works and it's still really human-readable.
Full XML is not the best to read.
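A tiny sketch of that hybrid layout (the contents are placeholders): XML tags as the outer delimiters, Markdown inside them for readability.

```python
prompt = """\
<instructions>
Summarize the report below.

- Keep it under 100 words.
- Use bullet points.
</instructions>

<report>
...paste the report text here...
</report>
"""
```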
best prompt format is l337 sp33ch
Markdown and XML, hands down, for reports. Markdown converted to vectors.