"380% Lower Latency." A percentage above 100% in this context is incorrect because latency cannot be reduced by more than 100%. A latency reduction of 100% is a latency of 0 ms.
This changed my entire approach to a project. I wish they held the cache longer than 5 minutes! Even 10 or 15 would be nice, but how about an hour!? Love that it will cache images, too.
Script execution time is different to latency. Latency is effectively how long it takes to return the first token, and and then the time to the next token and so on. This will always be a very low number, with or without caching, so for short responses such as in your second example the script execution will always be fast. For longer responses such as book summary the latency makes a difference as it accumulates for each token, and there are many more tokens. I didn't look at your code and haven't used the anthropic API, but I guess you weren't streaming the tokens so you couldn't actually measure the latency by this method. Still really appreciate the video as I was curious about this caching and this explained a lot to me, thank you!
The part you didn't understand happened because the meaning of you added sentence was far away in the terms of the book understanding tokens, so it got it pretty easily
Thank you for the video. Second time you are asking a question, I am curious why you are passing the context again in the system prompt if its already cached? Can we just ask the question without calling the system prompt to send the context?
We are using anthropic claude v3.5 sonnet on amazon bedrock. Since the prompt caching feature is in beta, I wanted to clarify if it is available for bedrock. I tried reaching out to Anthropic support for the same but could not get through. It will be great if someone could answer this for me?
Example: My AI companion uses facts about myself when answering. 5 facts are pulled based on the average vector of the latest input. this is done after each message. But I can dump all in facts into cache and forgo this system entirely. will it be better? damn if I know. probably? requires a lot of testing. it would be awesome if anthropic wasn't this censored. I'm not sure I can even use their models in my companion without it getting triggered. But it's definitely not a replacement for RAG... something different, but really cool
I’m thinking of trying the Anthropic cache will a local pgvector store or neo4j. Might make things better… or weird. Kris could do it better. Is this a good idea?
Damn, I completely forgot about Google's caching. Looked at the prices. It seems like Google caching is 4 times cheaper than normal. In contrast Anthropic is 10 times cheaper BUT it's more expensive to create Cache by 25%... So I have no idea what the math is here someone help me out.
Nice, but still expensive, $15 per MTok output is rough. Hopefully we will see this decrease in the future, specially since OpenAI probably has something similar on the works.
5 minutes is very short, worse when you are programming. You can easly have more than 5 min. between 2 prompts. It shoud be 30 minimum, hope they will update this.
It's as good as your implementation of it is. Use crappy embedding models and crappy text organization, and get crappy output. The inverse is true as well.
"380% Lower Latency." A percentage above 100% in this context is incorrect because latency cannot be reduced by more than 100%. A latency reduction of 100% is a latency of 0 ms.
AI is making YouTubers dumber. That's the only explanation; otherwise I don't know how this can happen.
We have just identified the non-LLM entity here... unless you are Grok 10 "watermelon"
Wow man, go out, do something, but this is a bad look 🤣
This changed my entire approach to a project. I wish they held the cache longer than 5 minutes! Even 10 or 15 would be nice, but how about an hour!? Love that it will cache images, too.
Script execution time is different from latency. Latency is effectively how long it takes to return the first token, then the time to the next token, and so on. Time to first token will always be very low, with or without caching, so for short responses such as your second example the script execution will always be fast. For longer responses such as the book summary, the latency accumulates for each token, and there are many more tokens, so it makes a real difference. I didn't look at your code and haven't used the Anthropic API, but I guess you weren't streaming the tokens, so you couldn't actually measure the latency this way. Still, I really appreciate the video, as I was curious about this caching and it explained a lot to me, thank you!
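For anyone who wants to check this themselves, here's a minimal sketch of measuring time-to-first-token with streaming. It assumes the official anthropic Python SDK; the model name and prompt are placeholders, not the video's code:

```python
# Sketch: time-to-first-token (latency) vs. total generation time,
# assuming the `anthropic` Python SDK with streaming enabled.
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

start = time.perf_counter()
first_token_at = None

with client.messages.stream(
    model="claude-3-5-sonnet-20240620",  # placeholder model id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the book in one paragraph."}],
) as stream:
    for text in stream.text_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # latency = time to first token

end = time.perf_counter()
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"total generation time: {end - start:.2f}s")
```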
The part you didn't understand happened because the meaning of the sentence you added was far away from the book's content in terms of its tokens, so the model spotted it pretty easily.
It would be very useful if you showed us the cost of your entire testing, thanks.
Your videos are very helpful. I don't know when we get the desktop app.
Thank you for the video. The second time you ask a question, I am curious why you are passing the context again in the system prompt if it's already cached? Can we just ask the question without sending the context in the system prompt?
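For context, a rough sketch of why the context still gets sent (assuming the Anthropic prompt caching beta; book.txt and the model name are placeholders): the API matches the request prefix against the cache, so the same system block has to be included on every call, but when it matches it is billed at the much cheaper cache-read rate instead of being re-processed.

```python
# Sketch: the cached system block is resent each call; a matching prefix
# is served from the cache (cheaper, faster) rather than re-processed.
import anthropic

client = anthropic.Anthropic()
book_text = open("book.txt").read()  # placeholder context

def ask(question: str):
    return client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta header at the time
        system=[
            {"type": "text", "text": "You answer questions about the book below."},
            {"type": "text", "text": book_text, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Who is the main character?")   # cache write: the book prefix is stored
second = ask("What happens in chapter 3?")  # same prefix within 5 minutes -> cache read
```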
Do you write the code yourself, or do you generate it with an LLM?
Thanks, Kris!
Sorry for posting a question randomly, but do you have any tutorial for an AI voicebot for Discord?
It would be really nice to see how you talk to a book that wasn't in the training dataset (I am pretty sure that Harry Potter was there).
thanks for the update :)
We are using Anthropic Claude 3.5 Sonnet on Amazon Bedrock. Since the prompt caching feature is in beta, I wanted to clarify whether it is available for Bedrock. I tried reaching out to Anthropic support about this but could not get through. It would be great if someone could answer this for me.
How is this replacing RAG?
Example: my AI companion uses facts about myself when answering. Five facts are pulled based on the average vector of the latest input; this is done after each message. But I could dump all the facts into the cache and forgo this system entirely.
Will it be better? Damned if I know. Probably? It requires a lot of testing. It would be awesome if Anthropic wasn't this censored; I'm not sure I can even use their models in my companion without it getting triggered.
But it's definitely not a replacement for RAG... something different, but really cool
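Roughly what the "dump all facts into the cache" version could look like, for comparison with per-message vector retrieval (just a sketch assuming the Anthropic prompt caching beta; the facts and names are made up):

```python
# Sketch: put the whole fact store in a cached system block instead of
# retrieving the top-5 facts per message with a vector lookup.
import anthropic

client = anthropic.Anthropic()

ALL_FACTS = "\n".join([
    "- Likes hiking on weekends.",
    "- Works as a data engineer.",
    "- Allergic to peanuts.",
    # ...the entire fact store goes here, once, instead of per-message retrieval
])

def companion_reply(user_message: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {"type": "text", "text": "You are a personal companion. Known facts about the user:"},
            {"type": "text", "text": ALL_FACTS, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```

Whether this beats retrieval depends on how big the fact store is and how often messages arrive within the 5-minute cache window, so it really does come down to testing.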
Where's the link to the code you used?
Also, I recommend putting timing in the script!
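A tiny helper along those lines, purely illustrative:

```python
# Context manager you can wrap around each API call to log wall-clock time.
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Usage:
# with timed("cached request"):
#     response = client.messages.create(...)
```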
I'm thinking of trying the Anthropic cache with a local pgvector store or neo4j. Might make things better… or weird. Kris could do it better. Is this a good idea?
You can cache the “500 page book” context you find in Claude projects, btw
What if I ask another question after caching the entire book?
Damn, I completely forgot about Google's caching. I looked at the prices. It seems like Google's cached tokens are 4 times cheaper than normal input. In contrast, Anthropic's cache reads are 10 times cheaper, BUT creating the cache costs 25% more... So I have no idea what the math is here; someone help me out.
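A rough back-of-the-envelope using Anthropic's published ratios (cache writes cost 25% more than normal input, cache reads 90% less) and Claude 3.5 Sonnet's $3/MTok input price; the Google side is only as good as the numbers quoted in this thread:

```python
# Break-even math for Anthropic prompt caching on a reused prefix.
BASE = 3.00          # $ / MTok, normal input (Claude 3.5 Sonnet at the time)
WRITE = BASE * 1.25  # first request stores the prefix (+25%)
READ = BASE * 0.10   # later requests within the 5-minute window (-90%)

def cost_with_cache(prefix_mtok: float, calls: int) -> float:
    return prefix_mtok * (WRITE + READ * (calls - 1))

def cost_without_cache(prefix_mtok: float, calls: int) -> float:
    return prefix_mtok * BASE * calls

for calls in (1, 2, 5, 20):
    print(calls,
          round(cost_with_cache(0.1, calls), 3),
          round(cost_without_cache(0.1, calls), 3))
# One call: caching costs more (the +25% write with no reuse).
# From the second call on it is already cheaper; at 20 calls it is roughly
# 6x cheaper, approaching the full 10x with very heavy reuse.
```

If Google's scheme really is a flat ~4x discount on cached reads (plus, as I understand it, an hourly storage fee), then which provider wins depends mostly on how many hits you get within the cache window.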
Nice, but still expensive; $15 per MTok output is rough. Hopefully we will see this decrease in the future, especially since OpenAI probably has something similar in the works.
I do not understand why they don't just cache everything automatically when you set this flag... Why the cache points?
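As I understand it, the cache points mark where a cacheable prefix ends, so the stable part of the prompt and the growing conversation history can be reused independently instead of the API guessing a boundary. A sketch with placeholder content, assuming the prompt caching beta:

```python
# Sketch: two cache breakpoints - one after the stable instructions,
# one after the latest turn so the whole history can be reused next call.
import anthropic

client = anthropic.Anthropic()

LONG_INSTRUCTIONS = "...several thousand tokens of stable instructions..."
history = [
    {"role": "user", "content": "earlier turn"},
    {"role": "assistant", "content": "earlier reply"},
]

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        # Breakpoint 1: everything up to here is cached as one prefix.
        {"type": "text", "text": LONG_INSTRUCTIONS, "cache_control": {"type": "ephemeral"}},
    ],
    messages=history + [
        {"role": "user", "content": [
            # Breakpoint 2: cache the conversation up to and including this block.
            {"type": "text", "text": "latest question", "cache_control": {"type": "ephemeral"}},
        ]},
    ],
)
```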
5 minutes is very short, and it's worse when you are programming. You can easily have more than 5 minutes between two prompts. It should be 30 minimum; I hope they will update this.
Interesting, but not a replacement for RAG, I don't think.
5 minutes is very short
FYI Google Gemini has been doing prompt caching for some time now.
He mentions that at 13:33
Who actually uses RAG? I've found it so unreliable.
It's as good as your implementation of it is. Use crappy embedding models and crappy text organization, and get crappy output. The inverse is true as well.
Claude is getting stupid (it's being quantized). Too bad.