Thanks! I found the ability to update the TTL very interesting. Imagine building an assistant application for answering questions or customer service. On the server side, we could extend the TTL by, let's say, another 5 minutes. When a new user sends a question, we extend it again. When no new users come in, the cache expires and is gone. Five minutes is just an example, but it's a great way to keep your cache ready and clear it when you don't need it.
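To make the idea concrete, here is a minimal sketch of that sliding-TTL pattern. It is a toy in-memory stand-in, not the provider's actual caching API: `SlidingTTLCache`, `touch`, and the injected `clock` are hypothetical names I made up for illustration, and the 300-second default is just the 5-minute example from above.

```python
import time
from typing import Callable, Optional


class SlidingTTLCache:
    """Toy stand-in for a server-side context cache with a refreshable TTL.

    Hypothetical class for illustration only; a real service would expose
    its own call for updating the TTL of a cached context.
    """

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._content: Optional[str] = None
        self._expires_at = 0.0

    def put(self, content: str) -> None:
        """Store the content and start a fresh TTL window."""
        self._content = content
        self._expires_at = self._clock() + self.ttl_seconds

    def is_alive(self) -> bool:
        """True while content is stored and the TTL has not run out."""
        return self._content is not None and self._clock() < self._expires_at

    def touch(self) -> bool:
        """Extend the TTL on user activity; return False if already expired."""
        if not self.is_alive():
            self._content = None  # drop the stale content
            return False
        self._expires_at = self._clock() + self.ttl_seconds
        return True

    def get(self) -> Optional[str]:
        """Return the cached content, or None once it has expired."""
        return self._content if self.is_alive() else None
```

On each incoming question the server would call `touch()`; if it returns `False`, the cache has lapsed and the document would need to be cached again with `put()` before answering.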
I think the minimum token requirement is likely about profit. They need a minimum number of tokens to offer the service economically. Below that threshold, it wouldn't be cost-effective for them. That's my guess.
Dynamically controlling the TTL can be really helpful, and I agree the token limit is probably related to cost. I hope they implement the latency reduction soon, since caching will make more sense then.
this is very helpful buddy, very time saving and quickly updating my own biological cache without searching for it explicitly. Thanks!
Great video - thanks!
thank you.
Seems similar to vector storage, but more expensive. What am I missing?
A couple of things differentiate it from vector storage. When you retrieve info with vector-based search, you only get some "chunks", so the LLM doesn't have the whole context of the document. An approach like this provides the complete context to the LLM. Caching can also be really useful alongside RAG. I agree it is going to be more expensive than vector stores, but it could save on infrastructure. It will be interesting to see how it evolves.
@engineerprompt Yeah, chunking would have to be perfect to match the context. But if the vector representation and chunking are accurate, it should match in context quality. Time will tell, eh?
Great news! Thanks!!
thank you.
So much lazy voice 😴😴😴