Semantic Chunking for RAG with
- Published: Dec 10, 2024
- In this event, we’ll learn how the semantic chunking algorithm works! Text is split into sentences that are converted to vectors by an embedding model. Similarity is measured between each pair of consecutive sentences. If two consecutive sentences are too dissimilar, as defined by a threshold, a new chunk is started at that point. In theory, this will allow us to achieve better results during retrieval within our RAG system.
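The steps in the description can be sketched in a few lines of Python. This is a minimal illustration, not the code from the event notebook: `split_sentences`, `make_toy_embedder`, and `semantic_chunks` are hypothetical names, the bag-of-words "embedding" is a stand-in for a real embedding model, and the threshold value is an arbitrary assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, List

Vector = List[float]


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_toy_embedder(corpus: str) -> Callable[[str], Vector]:
    # Stand-in for a real embedding model (assumption): word counts over
    # the vocabulary of the corpus. Swap in a real model in practice.
    vocab = sorted(set(re.findall(r"[a-z]+", corpus.lower())))

    def embed(sentence: str) -> Vector:
        counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
        return [float(counts[word]) for word in vocab]

    return embed


def semantic_chunks(
    text: str, embed: Callable[[str], Vector], threshold: float = 0.2
) -> List[str]:
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # Compare each sentence with the one immediately before it.
        if cosine_similarity(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([sentences[i]])    # too dissimilar: start a new chunk
        else:
            chunks[-1].append(sentences[i])  # similar enough: extend the chunk
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    sample = "Cats purr. Cats sleep a lot. Stocks fell today. Stocks may rise tomorrow."
    for chunk in semantic_chunks(sample, make_toy_embedder(sample)):
        print(chunk)
```

A fixed threshold is the simplest choice; it could instead be derived from the distribution of consecutive-sentence similarities in the document, which is a design decision this sketch leaves open.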
Event page: lu.ma/chunkingrag
Have a question for a speaker? Drop them here:
app.sli.do/eve...
Speakers:
Dr. Greg, Co-Founder & CEO
/ gregloughane
The Wiz, Co-Founder & CTO
/ csalexiuk
Join our community to start building, shipping, and sharing with us today!
/ discord
Apply for our new AI Engineering Bootcamp on Maven today!
bit.ly/aie1
How'd we do? Share your feedback and suggestions for future events.
forms.gle/1Uxk...
#chunking #rag
Google Colab notebook: colab.research.google.com/drive/1gGLd-rdPsM1iy4JmL1V1mfZm90CmDcXR?usp=sharing
Event Slides: www.canva.com/design/DAGAtxFPH2M/3oo8gElRKU21fQH-ZzYNNA/view?DAGAtxFPH2M&
Awesome job guys! I watched this video with my coffee this morning and it was a perfect way to start my day (learning, drinking coffee, and listening to really good speakers/teachers)
This is awesome Damian - thank you! We're pumped we got to spend the morning with you :)
Love this video and the new strategy of semantic chunking. Thanks to Greg and Chris for explaining this concept the way it should be explained. Thanks again for making it open source.
Thanks bananamaker!! We enjoyed getting down into the weeds of some often-overlooked pieces today, and we're also fans of the new strategy! Look for more content like this from us soon!
Please make many more awesome explainers like this!
You can count on it @JankayYashwant!
On the scale used to rate chunking techniques, semantic chunking and agentic chunking sit at levels 4 and 5. Experiments show that agentic chunking using LLMs gives the best results.
Level 1: Character splitting - simple static character chunks of data
Level 2: Recursive character text splitting - recursive splitting based on a list of separators
Level 3: Document-specific splitting - different chunking methods for different document types (PDF, Python, Markdown)
Level 4: Semantic splitting - embedding-based chunking. This technique splits text into smaller chunks based on meaning rather than on a fixed length alone.
Level 5: Agentic splitting - Agentic Chunker: the Agentic Chunker automatically groups related propositions into chunks. When a new proposition is added, the system decides whether to add it to an existing chunk or create a new one.
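The level 5 loop described in this comment (add each new proposition to an existing chunk or open a new one) can be sketched as follows. This is only an illustration under stated assumptions: in an agentic chunker the decision would be made by an LLM, whereas `keyword_overlap_chooser` here is a simple keyword-overlap stand-in so the example runs without any model. All names, including `group_propositions` and `STOPWORDS`, are hypothetical.

```python
import re
from typing import Callable, List, Optional

Chunk = List[str]
# Given the existing chunks and a new proposition, return the index of the
# chunk to join, or None to start a new chunk.
Chooser = Callable[[List[Chunk], str], Optional[int]]

STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "by"}


def content_words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def keyword_overlap_chooser(
    chunks: List[Chunk], proposition: str, min_shared: int = 1
) -> Optional[int]:
    # Stand-in for the LLM decision (assumption): pick the chunk that
    # shares the most content words with the proposition.
    words = content_words(proposition)
    best_idx, best_shared = None, 0
    for i, chunk in enumerate(chunks):
        shared = len(words & content_words(" ".join(chunk)))
        if shared > best_shared:
            best_idx, best_shared = i, shared
    return best_idx if best_shared >= min_shared else None


def group_propositions(
    propositions: List[str], choose_chunk: Chooser
) -> List[Chunk]:
    chunks: List[Chunk] = []
    for proposition in propositions:
        idx = choose_chunk(chunks, proposition)
        if idx is None:
            chunks.append([proposition])     # no good fit: create a new chunk
        else:
            chunks[idx].append(proposition)  # add to the best-matching chunk
    return chunks


if __name__ == "__main__":
    props = [
        "Paris is the capital of France.",
        "Python is a programming language.",
        "France is in Europe.",
        "Python was created by Guido van Rossum.",
    ]
    for chunk in group_propositions(props, keyword_overlap_chooser):
        print(chunk)
```

Because the decision function is passed in as a parameter, the keyword heuristic could be replaced by a function that prompts an LLM with the chunk summaries and the new proposition, without changing the grouping loop.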
Can you introduce some related articles? Thanks!
medium.com/the-ai-forum/semantic-chunking-for-rag-f4733025d5f5
heyyy u guys look familiar from the fourthbrain bootcamp i took! nice
When doing RAG in general, is it best to insert the retrieved context into the system prompt or to have an assistant message for it?
It's really up to you - and depends on if you're using examples or not.
You forgot to show how we can combine semantic chunking with a parent document retriever.
I mean, which chunks should we use as parents and which as children?
I'm sorry! We didn't intend to explore this in the session!