Another great video from my birthplace. Thank you!
Why are you using an overlap for the chunks when you have to join the chunks for clustering anyway? Or does this have no effect?
Thanks for this! I’ve been extending this to find values for n_neighbors, dims, etc. that maximize the quality of the clusters. I’m applying this to 10-K filings, which are pretty similar in overall “semantic” content and organization, so I’m hoping that as I process more 10-Ks, I’ll gradually find a set of parameters that generalizes well across most of them. Kind of surprising that I haven’t seen anyone talk about this as far as RAPTOR goes.
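A minimal sketch of that kind of parameter search, assuming embeddings are already computed. PCA, KMeans, and silhouette score are stand-ins for the actual UMAP + GMM pipeline from the video, and the embeddings below are random placeholders, not real 10-K data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Placeholder embeddings standing in for real 10-K chunk embeddings.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 128))

best = None
for dims in (5, 10, 20):  # candidate reduced dimensionalities
    reduced = PCA(n_components=dims).fit_transform(embeddings)
    for k in (4, 6, 8):  # candidate cluster counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
        score = silhouette_score(reduced, labels)  # higher = tighter clusters
        if best is None or score > best[0]:
            best = (score, dims, k)

print(f"best silhouette={best[0]:.3f} at dims={best[1]}, k={best[2]}")
```

With real embeddings you’d swap PCA for UMAP and silhouette for whatever cluster-quality metric you’re optimizing; the grid-search shape stays the same.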
Have you had the chance to evaluate its performance compared to other retrieval techniques? Would be interested in the results :)
Hi, is line 157 in the code meant to come before the iteration-summaries loop, or after line 165? I.e., are we updating the all_summaries field with the previous cluster texts, or does it not matter? Otherwise we would be updating iteration_summaries["texts"] with the same value as iteration_summaries["summaries"].
Awesome. If I might suggest: how about a tutorial on CodeGen-specific advanced RAG, i.e. repository-wide code "understanding" and generation? :) Cheers!
I like the idea for a video, but I currently have no clue how I would tackle that yet.
Thanks! Why not use open-source LLMs and embeddings?
The state-of-the-art models change so fast, which is why I prefer OpenAI. But the code should stay pretty much the same, and the model choice doesn't really matter for this concept.
Great :)
Is this subject to the lost-in-the-middle problem?
Yes, like any other ingestion step. You have methods like reranking to fight problems like this :)
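For illustration, here is a toy reranking sketch. In practice you would score query/chunk pairs with a cross-encoder model; plain lexical overlap stands in for that scorer here, and all the chunk texts are made up:

```python
def rerank(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Reorder retrieved chunks by relevance to the query, keep the best top_k.

    Lexical word overlap is a stand-in for a real relevance model
    (e.g. a cross-encoder that scores each query/chunk pair).
    """
    q_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(q_terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]


# Hypothetical retrieval results, most relevant ones buried in the middle/end.
retrieved = [
    "Summaries of unrelated sections about governance.",
    "Revenue grew due to strong cloud segment performance.",
    "The cloud segment revenue increased year over year.",
]
print(rerank("cloud revenue growth", retrieved))
```

Because the reranker re-scores every candidate against the query, the relevant chunks move to the front regardless of where retrieval placed them, which is exactly what mitigates the lost-in-the-middle effect.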
Oh, it looks like my idea from the previous video, no?
So you don’t need the code anymore?
I did not create subclusters :)
Whatever you do is always really interesting. Thanks for sharing