I'm not sure I caught the question at 24:03, but it looked like the comparison was between vLLM and TRT-LLM. A few months ago, when we worked with the Llama 2 70B model for RAG-based systems, we found that vLLM's KV caching mechanism handled memory usage better. Since our inference server now supports vLLM, we generally default to vLLM engines. A benchmark by Ray (a few months old, so it might be outdated) also showed vLLM outperforming TRT-LLM in quite a few scenarios.
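For context, here is a rough sketch of how we spin up a vLLM engine for this kind of workload; the model id, parallelism degree, and memory fraction are illustrative assumptions rather than our exact production settings:

```python
from vllm import LLM, SamplingParams

# Illustrative settings only; actual values depend on hardware and serving stack.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed HF model id
    tensor_parallel_size=4,                  # split 70B weights across 4 GPUs
    gpu_memory_utilization=0.90,             # GPU memory fraction for weights + paged KV cache
    max_model_len=4096,                      # cap context length to bound KV cache growth
)

sampling = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize the retrieved passages: ..."], sampling)
print(outputs[0].outputs[0].text)
```

The `gpu_memory_utilization` knob is what makes the paged KV cache behaviour so predictable for us on long RAG contexts.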
This is a great video, and the presenters clearly know what they are talking about. Could you expand on which options people generally leave at their defaults (but shouldn't) when trying to optimise LLMs with TRT-LLM?