Semantic Chunking - 3 Methods for Better RAG

  • Published: 23 Jul 2024
  • Semantic chunking allows us to build more context-aware chunks of information. We can use this for RAG, splitting video and audio, and much more.
    In this video, we will use a simple RAG-focused example. We will learn about three different types of chunkers: StatisticalChunker, ConsecutiveChunker, and CumulativeChunker (see the usage sketch after the chapter list below).
    At the end, we also discuss semantic chunking for video, such as for the new gpt-4o and other multi-modal use cases.
    📌 Code:
    github.com/aurelio-labs/seman...
    ⭐️ Article:
    www.aurelio.ai/learn/semantic...
    👋🏼 AI Consulting:
    aurelio.ai
    👾 Discord:
    / discord
    Twitter: / jamescalam
    LinkedIn: / jamescalam
    #ai #artificialintelligence #chatbot #nlp
    00:00 3 Types of Semantic Chunking
    00:42 Python Prerequisites
    02:44 Statistical Semantic Chunking
    04:38 Consecutive Semantic Chunking
    06:45 Cumulative Semantic Chunking
    08:58 Multi-modal Chunking
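
A minimal sketch of the three chunkers, following the semantic-chunkers intro notebook linked above; the model name, thresholds, and text are placeholders, and constructor arguments may differ between library versions.

```python
# A sketch of the three chunkers covered in the video, following the
# semantic-chunkers intro notebook; exact constructor arguments may
# differ between library versions.
from semantic_chunkers import (
    ConsecutiveChunker,
    CumulativeChunker,
    StatisticalChunker,
)
from semantic_router.encoders import OpenAIEncoder

encoder = OpenAIEncoder(name="text-embedding-3-small")  # needs OPENAI_API_KEY

text = "..."  # placeholder: one long document, e.g. an arXiv paper

# Statistical: finds similarity thresholds automatically (recommended default).
stat_chunks = StatisticalChunker(encoder=encoder)(docs=[text])

# Consecutive: splits where neighbouring sentences diverge; threshold is hand-tuned.
cons_chunks = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)(docs=[text])

# Cumulative: grows context until similarity drops; noise-resistant but slower.
cum_chunks = CumulativeChunker(encoder=encoder, score_threshold=0.3)(docs=[text])
```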

Comments • 23

  • @KenRossPhotography • A month ago

    Super interesting - thanks for that! I'll definitely be experimenting with those chunking variations.

    • @jamesbriggs • A month ago

      Awesome, would love to hear how it goes

  • @wassfila • A month ago +1

    This is really promising, thank you. It's really hard to get an overview of cost/benefit for end results from a RAG end-user's perspective. Something like a comparison table would help.

  • @BB-ou5ui • A month ago +1

    Hi! That's exactly what I was looking for. I've been experimenting with my own implementation, trying some strategies beyond dense vectors... Have you considered using multi-vector models like ColBERT? To some extent, you could work with matrix similarities over bigger contexts... I'm also testing some weighted strategies using SPLADE, but it's too early to make claims 😊
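
(For context on the multi-vector idea mentioned above: ColBERT scores a query against a document with a "MaxSim" over per-token embeddings rather than a single pooled dot product. A toy numpy sketch, with random placeholder embeddings:)

```python
import numpy as np

# Toy MaxSim scoring as used by ColBERT-style multi-vector retrieval:
# each text is a matrix of per-token embeddings, not one pooled vector.
rng = np.random.default_rng(0)
query_embs = rng.normal(size=(8, 128))   # 8 query tokens (placeholder values)
doc_embs = rng.normal(size=(200, 128))   # 200 document tokens (placeholder)

# Normalise rows so dot products become cosine similarities.
query_embs /= np.linalg.norm(query_embs, axis=1, keepdims=True)
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# MaxSim: for each query token take its best-matching doc token, then sum.
sim_matrix = query_embs @ doc_embs.T     # shape (8, 200)
score = sim_matrix.max(axis=1).sum()
print(f"MaxSim score: {score:.3f}")
```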

  • @ariugarte • A month ago

    Hello, it's a fantastic tool! But I encountered some problems with tables in PDFs and with strings that use characters such as '-' to separate phrases or sections. I end up with chunks that are much bigger than the maximum size.

  • @looppp • A month ago

    great video

  • @Piero-xi1yi • A month ago +1

    Could you please explain the logic and concepts behind your code? How does it compare with the semantic chunker from LangChain / LlamaIndex (which uses something like your cumulative approach, with a sliding window of n sentences and an "adaptive" threshold based on percentiles)?

  • @user-eh2ji5xs8k • A month ago

    Can we use Ollama for the embedding?

  • @maxlgemeinderat9202 • A month ago

    Nice video! So, e.g., if I'm reading in docs with Unstructured.io, can I then use the semantic chunker instead of a RecursiveCharacterTextSplitter?

    • @jamesbriggs • A month ago

      yes you can, there's an (old, I should update) example here github.com/aurelio-labs/semantic-router/blob/main/docs/examples/unstructured-element-splitter.ipynb
      ^ the "splitter" here is equivalent to the StatisticalChunker in semantic-chunkers

  • @AGI-Bingo • A month ago +3

    Hi James, could you please cover how to do citing with RAG, with an option to open the original source? That would be cool ❤
    I'd also love to see an example of a "LiveRAG" that watches certain files or folders for changes, then re-chunks, re-embeds, removes outdated entries, and saves diffs.
    What do you think about these?
    Thanks a lot!

    • @tarapogancev • A month ago

      If you are using Pinecone or similar vector database, along with the vector entry you can usually also add specific metadata. I mostly keep the original text stored within that vector as a 'content' metadata field, and then add other fields for the file's name, topic etc. :) This way, you can cross-reference your data for the users to navigate easily.
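
(A rough sketch of the metadata pattern described above, using the Pinecone Python client; the index name, IDs, field names, and vectors are placeholders:)

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-rag-index")          # hypothetical index name

chunk_text = "..."                        # the original chunk text
chunk_embedding = [0.0] * 1536            # placeholder embedding vector

# Store the original text and source info as metadata so retrieved
# matches can be cited and traced back to their source file.
index.upsert(vectors=[{
    "id": "paper-123-chunk-0",
    "values": chunk_embedding,
    "metadata": {
        "content": chunk_text,
        "file_name": "paper-123.pdf",
        "topic": "semantic chunking",
    },
}])

# At query time, the metadata comes back with each match:
results = index.query(vector=[0.0] * 1536, top_k=3, include_metadata=True)
for match in results["matches"]:
    print(match["metadata"]["file_name"], "->", match["metadata"]["content"][:80])
```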

    • @AGI-Bingo • A month ago

      Got it, so you could also add a "filepath" field and trigger opening the file. I wonder if there's a way to jump to and highlight a specific part of the text after opening (e.g. a PDF).
      Also, @tarapogancev, do you know of a way to run diffs on files and delete/re-upload all relevant chunks? I.e. watching files and folders for changes, then triggering re-embedding, to keep everything automatically up to date. Thanks 🙏 👍

    • @tarapogancev • A month ago +1

      @@AGI-Bingo The idea of highlighting relevant text sounds great! I have yet to face the UI portion of this problem while trying to achieve similar results. :)
      I haven't worked with automatic syncs, but they would be very useful! So far, from what I've seen, AWS Knowledge Bases and Azure's AI Search (if I remember correctly) both offer options to sync data manually when needed. It's not as convenient, but I'm thinking it's not a bad solution either, considering it's possibly less work on the server side, and maybe fewer credits for OpenAI or other LLM services.
      Sorry I couldn't offer more help on this topic, but I hope you come up with a great solution! :D
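
(There is no built-in "LiveRAG" in the tools discussed here, but a DIY version of the watch-and-re-embed idea could look like this sketch using the watchdog package; the pipeline hooks are hypothetical and left as comments:)

```python
import hashlib
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

seen_hashes: dict[str, str] = {}  # file path -> last content hash

class ReEmbedHandler(FileSystemEventHandler):
    """Re-chunk and re-embed a file whenever its content actually changes."""

    def on_modified(self, event):
        if event.is_directory:
            return
        path = Path(event.src_path)
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen_hashes.get(str(path)) == digest:
            return  # content unchanged, skip
        seen_hashes[str(path)] = digest
        # Hypothetical pipeline hooks -- replace with your own implementations:
        # delete_chunks_for(path)          # drop stale vectors for this file
        # upsert_chunks(chunk_file(path))  # re-chunk, re-embed, re-upsert

observer = Observer()
observer.schedule(ReEmbedHandler(), path="./docs", recursive=True)
observer.start()  # call observer.stop() / observer.join() on shutdown
```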

  • @talesfromthetrailz • A month ago

    How would you compare the Statistical chunker with the rolling window splitter you used for semantic chunking? Do you prefer one over the other? I'm designing a recommendation system that uses user queries to match to certain outputs they may want. Thanks!

    • @jamesbriggs • A month ago +1

      StatisticalChunker is actually just a more recent version of the rolling window splitter; it includes handling for larger documents and some other optimizations, so I'd recommend the statistical one

  • @samcavalera9489 • A month ago

    Hi James,
    First off, I want to express my immense gratitude for your insightful videos on RAG and other AI topics. Your content is so enriching that I find myself watching each video at least twice!
    I do have a couple of questions that I hope you can shed some light on:
    1) When using OpenAI's small embedding model with the RecursiveCharacterTextSplitter, is there a general guideline for determining the optimal chunk size and overlap size? I'm looking for a rule of thumb that could help me set the right values for these parameters.
    2) My work primarily involves using RAG on scientific papers, which often include figures that sometimes convey more information than the text itself. Is there a technique to incorporate these figures into the vector database along with the paper's text? Essentially, for multi-modal vector embedding that includes both text and images, what's the best approach?
    I greatly appreciate your insight 🙏🙏🙏

    • @jamesbriggs • A month ago +1

      Hey, thanks for the message! For (1) my rule of thumb is 200-300 tokens with a 20-40 token overlap. For (2) you can use multimodal models (like gpt-4o) to describe what is in the image, then embed that description; alternatively you could use a text-image embedding model, but they don't capture as much detail as a multimodal LLM. Hope that helps :)
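
(A rough sketch of the caption-then-embed approach for figures described in (2), using the OpenAI Python client; the prompt, file name, and model names are reasonable defaults, not the video's exact setup:)

```python
import base64

from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY

# 1) Ask a multimodal model to describe the figure in detail.
with open("figure1.png", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode()

caption = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this figure in detail for search indexing."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
).choices[0].message.content

# 2) Embed the description so it sits alongside the paper's text chunks.
embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=caption,
).data[0].embedding
```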

    • @samcavalera9489 • A month ago

      @@jamesbriggs many thanks James 🙏🙏🙏
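
(And for (1), a minimal sketch of that rule of thumb with LangChain's token-based splitter; the exact numbers within the 200-300 / 20-40 ranges are arbitrary picks:)

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "..."  # placeholder: your paper's text

# Token-based sizing per the rule of thumb above: ~200-300 token chunks
# with a 20-40 token overlap.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by OpenAI's recent models
    chunk_size=250,
    chunk_overlap=30,
)
chunks = splitter.split_text(long_document_text)
```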

  • @CBCELIMUPORTALORG • A month ago

    🎯 Key points for quick navigation:
    📘 The video introduces three semantic chunking methods for text data, improving retrieval-augmented generation (RAG) applications.
    💻 Demonstrates use of the "semantic chunkers library," showcasing practical examples via a Colab notebook, requiring OpenAI's API key.
    📊 Focuses on a dataset of AI arXiv papers, applying semantic chunking to manage the data's complexity and improve processing efficiency.
    🤖 Discusses the need for an embedding model to facilitate semantic chunking, highlighting OpenAI's Embedding Model as a primary tool.
    📈 Outlines the "statistical chunking method" as a recommended approach for its efficiency, cost-effectiveness, and automatic parameter adjustments.
    🔍 Explains "consecutive chunking" as being cost-effective and relatively fast, but requiring more manual input for tuning parameters.
    📝 Presents "cumulative chunking" as a method that builds embeddings progressively, offering noise resistance but at a higher computational cost.
    🌐 Notes the adaptability of chunking methods to different data modalities, with specific mention of their suitability for text and potential for video.
    Made with HARPA AI

  • @lavamonkeymc • A month ago

    Where's the advanced LangGraph video?

  • @jamesbriggs • A month ago

    📌 Code:
    github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb
    ⭐ Article:
    www.aurelio.ai/learn/semantic-chunkers-intro