LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

  • Published: 10 Dec 2024

Comments • 101

  • @jamesbriggs
    @jamesbriggs  1 year ago +12

    LangChain docs have moved, so the original wget command in this video will no longer download everything; now you need to use:
    !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @mohitagarwal9007
      @mohitagarwal9007 1 year ago +1

      This is not downloading everything either. Is there anything else we can use to get the necessary files?

    • @jamesbriggs
      @jamesbriggs  1 year ago +9

      @@mohitagarwal9007 yes I have created a copy of the docs on Hugging Face here huggingface.co/datasets/jamescalam/langchain-docs-23-06-27
      You can download by doing a `pip install datasets` followed by:
      ```
      from datasets import load_dataset
      data = load_dataset('jamescalam/langchain-docs-23-06-27', split='train')
      ```
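
      Continuing from that snippet, each record is a plain dict, so you can inspect one directly (the `text` field name is an assumption about this dataset's schema):
      ```
      print(data[0].keys())         # see which fields each record carries
      print(data[0]['text'][:300])  # preview the page text, assuming a 'text' field
      ```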

    • @deniskrr
      @deniskrr 1 year ago +1

      @@mohitagarwal9007 just go to the above link and see where you're getting redirected to now. Then copy the link from the browser into the wget command and it should always work.

  • @RamasubramaniamM
    @RamasubramaniamM 1 year ago +1

    Chunking is the most important idea and largely ignored. Thanks James, love your technical depth.

  • @fgfanta
    @fgfanta 1 year ago

    I need to chunk text for retrieval augmentation and did a search on YouTube and found... James Briggs' video. I know I will find what I need in it. Nice!

  • @dikshyakasaju7541
    @dikshyakasaju7541 1 year ago +1

    Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.

  • @ADHDOCD
    @ADHDOCD 1 year ago +4

    Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in JSON files.

  • @grandplazaunited
    @grandplazaunited 1 year ago +1

    Thank you for sharing your knowledge. These are some of the best videos on LangChain.

  • @harleenmann6280
    @harleenmann6280 1 year ago

    Great video series. Appreciate you sharing your thought process as we go; this is the part most online tech content creators miss. They cover the how and, more often than not, miss the why. Thanks again. Enjoying all the videos in this playlist.

  • @BrianStDenis-pj1tq
    @BrianStDenis-pj1tq 1 year ago

    At first, it seemed like you switched from tiktoken len to char len of your chunks, when explaining RecursiveCharacterTextSplitter. That wasn't going to work, so I went back and found that you did show, maybe not so much explain, that the splitter is using the tiktoken len function. Makes sense now, thanks!

  • @temiwale88
    @temiwale88 1 year ago

    I'm @ 12:34 and this is an amazing explanation thus far. Thank you!

  • @hashiromer7668
    @hashiromer7668 1 year ago +10

    Wouldn't chunking lose information about long-term dependencies between passages? For example, if a term defined at the start of a document is used in the last passage, this dependency won't be captured if we chunk the document.

    • @jamesbriggs
      @jamesbriggs  1 year ago +5

      yes, this is an issue with it. If you're lucky and using a vector DB that returns 5 or so chunks, you might return both chunks and then the LLM sees both, but naturally there's no guarantee of this - I'm not aware of a better approach for tackling this problem with large datasets though
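
      A minimal sketch of that retrieval step, assuming the pre-1.0 openai and pinecone clients used in this series; the index name and the `text` metadata key are assumptions:
      ```
      import openai
      import pinecone

      openai.api_key = "YOUR_OPENAI_KEY"
      pinecone.init(api_key="YOUR_PINECONE_KEY", environment="YOUR_ENV")
      index = pinecone.Index("langchain-docs")  # hypothetical index name

      query = "What does that term mean in the final passage?"
      xq = openai.Embedding.create(
          input=query, model="text-embedding-ada-002"
      )["data"][0]["embedding"]

      # return several chunks so related passages have a chance to co-occur
      res = index.query(vector=xq, top_k=5, include_metadata=True)
      for match in res["matches"]:
          print(match["metadata"]["text"][:100])  # assumes chunks stored under 'text'
      ```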

    • @bobjones7274
      @bobjones7274 1 year ago +4

      @@jamesbriggs Somebody on another video said the following; is it relevant here? "You could aggregate chunks together, asking the LLM to summarize and group them into "meta chunks", and you could repeat the process until all years are contained in a single batch under the max token limit. Then, with the metadata, you'll be able to perform a much more powerful search over the corpus, providing much more context to your LLM with different levels of aggregation."
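
      A rough sketch of that idea, purely illustrative (the grouping size, prompt, and helper names are all assumptions, not an established recipe):
      ```
      from langchain.llms import OpenAI

      llm = OpenAI(temperature=0)  # assumes OPENAI_API_KEY is set

      def summarize_group(chunks):
          joined = "\n\n".join(chunks)
          return llm(f"Summarize these passages into one concise summary:\n\n{joined}")

      def build_meta_chunks(chunks, group_size=5):
          # collapse neighbouring chunks into summaries, level by level,
          # until a single top-level "meta chunk" remains
          while len(chunks) > 1:
              chunks = [
                  summarize_group(chunks[i:i + group_size])
                  for i in range(0, len(chunks), group_size)
              ]
          return chunks[0]
      ```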

    • @rodgerb2645
      @rodgerb2645 1 year ago

      @@bobjones7274 sounds interesting, do you remember the video? Can you provide the link? Tnx

    • @astro_roman
      @astro_roman 1 year ago

      @@bobjones7274 link, please, I beg you

    • @JOHNSMITH-ve3rq
      @JOHNSMITH-ve3rq 1 year ago

      I’ve seen this in many places but where has it been implemented??

  • @jamesbriggs
    @jamesbriggs  1 year ago +1

    if the code isn't loading for you from video links, try opening in Colab here:
    colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb

  • @SnowyMango
    @SnowyMango 1 year ago

    This was great! I made the terrible mistake of chunking without considering this simple math, and embedded and indexed into Pinecone at the larger size. Now I have to go redo them all after realizing that, at their current sizes, they aren't quite suitable for LangChain retrieval.

  • @videowatching9576
    @videowatching9576 1 year ago

    Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.

  • @siamhasan288
    @siamhasan288 1 year ago

    Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.

    • @eRiicBelleT
      @eRiicBelleT 1 year ago +1

      In my case the last two weeks xD

  • @alvinpinoy
    @alvinpinoy 1 year ago +1

    Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels like the missing glue between the open web and all those new AI models popping up.

  • @SuperYoschii
    @SuperYoschii 1 year ago +2

    Thanks for the content James! I think they changed something about downloading the HTML files with wget. When I run the Colab, it only downloads a single index.html file.

  • @redfield126
    @redfield126 1 year ago

    Thank you James for the in depth explanation of data prep. Learning a lot with your videos.

  • @murphp151
    @murphp151 1 year ago

    these videos are pure class

  • @codecritique
    @codecritique 8 months ago

    Thanks for the tutorial, really clear explanation!

  •  1 year ago +1

    Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D

  • @MaciekMorz
    @MaciekMorz 1 year ago +2

    I have seen a lot of material on how to store embeddings in a Pinecone vector DB. But I haven't seen any tutorial yet on how to store embeddings from different users in one index, i.e. how to retrieve embeddings depending on which user they belong to. What would be the best strategy for this: metadata or something else?
    It would be great to see a tutorial on this, especially using LangChain, although it seems to me that the current wrapper doesn't really allow it. BTW, the whole series on LangChain is great!

  • @lf6190
    @lf6190 1 year ago

    Awesome! I was just trying to figure out how to do this with the LangChain docs so that I can learn it quicker!

  • @rodgerb2645
    @rodgerb2645 1 year ago

    Amazing James, I've learned so much from you!

  • @AlexBego
    @AlexBego 1 year ago

    James, I should say Thank You a Lot for your interesting and so useful videos!

    • @jamesbriggs
      @jamesbriggs  1 year ago

      you're welcome, thanks for watching them!

  • @eRiicBelleT
    @eRiicBelleT 1 year ago

    Uff the video that I was expecting! Thank youuu!

  • @TomanswerAi
    @TomanswerAi 1 year ago

    Nice one James. Demystified that step for me there 👍 As you say if people get this part wrong everything else will underperform

  • @muhammadhammadkhan1289
    @muhammadhammadkhan1289 1 year ago +2

    You always know what I am looking for. Thanks for this 🙏

  • @calebmoe9077
    @calebmoe9077 1 year ago

    Thank you James!

  • @matheusrdgsf
    @matheusrdgsf 1 year ago

    James, you are helping me a lot in my work. Thank you.

  • @fraternitas5117
    @fraternitas5117 1 year ago

    James dropping the great content as usual.

  • @sevilnatas
    @sevilnatas 7 months ago +1

    I am struggling with a chunking scenario that involves PDFs with a lot of columnar data in tables, where the primary questions users will ask of the PDF data are contained in the tables. Questions that depend on finding a value in the first column and then retrieving the value in a specified column of the row found with the original value. This means the chunked data needs to maintain the integrity of the table. Any suggestions?

    • @absar66
      @absar66 4 months ago

      Any solution? I am struggling with the same...

    • @sevilnatas
      @sevilnatas 4 months ago

      @@absar66 No silver bullets. I did see a project called Marker, I think, that can take PDFs and convert them to markdown text. If it is effective at translating to markdown, columnar-type text will probably chunk better as markdown. Anyway, just a thought I was thinking about trying. If you give it a try, let me know how it goes.

  • @raypixelz
    @raypixelz 1 year ago

    Awesome. Thank you!

  • @henkhbit5748
    @henkhbit5748 1 year ago

    Thanks James, for sharing this information.👍 I always thought the 4k token limit for ChatGPT-turbo applied independently to the input and the output completion, not combined...

  • @rishniratnam
    @rishniratnam 1 year ago

    Nice video James.

  • @mohammadsunasra
    @mohammadsunasra 1 year ago

    So James, what you mean is: it will first split based on the first separator, then check whether the number of tokens > chunk size, and if so, split again based on the next separator, until the number of tokens < chunk size, right?
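
    A minimal sketch of the behaviour described above, using the splitter from the video (the chunk size and sample text are arbitrary):
    ```
    import tiktoken
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    tokenizer = tiktoken.get_encoding("cl100k_base")

    def tiktoken_len(text):
        return len(tokenizer.encode(text))

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,                      # the limit each piece is checked against
        chunk_overlap=20,
        length_function=tiktoken_len,        # the limit is measured in tokens here
        separators=["\n\n", "\n", " ", ""],  # tried in order, coarsest first
    )

    long_text = "LangChain lets you build LLM apps.\n\n" * 500
    chunks = splitter.split_text(long_text)
    ```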

  • @kevon217
    @kevon217 1 year ago

    Any tips for dealing with datasets that have missing values? It doesn't seem like the various transformer encoding classes have defaults for handling entirely empty strings/values. They'll still spit out a vector, which I assume is just padding tokens?

  • @mintakan003
    @mintakan003 1 year ago +1

    I played with this a while ago in LangChain. My impression is that in order to do Q&A on documents, one has to do a sequential scan: every chunk has to be read in. Wouldn't this be prohibitively expensive for a large document set? I know there are vector databases (indices) which can do a pre-screen based on vector similarity. This would be an improvement, but it still involves a sequential scan, now at the vector level. Are there attempts to address this problem? Perhaps parallelism may be one part of the solution?

    • @jamesbriggs
      @jamesbriggs  1 year ago +3

      it isn't a sequential scan with (most, maybe all) vector DBs - they use approximate search, so the answer is approximated and not everything is fully compared - a good vector DB will make this approximation very accurate (like 99% accuracy)

  • @ketangote
    @ketangote 1 year ago

    Great Video

  • @Sergedable
    @Sergedable 1 year ago +1

    Nice job. Also, it would be great if you could make a video on how to combine multiple documents (doc1, doc2, doc3, etc.) and use LangChain and ChatGPT-4 to analyze them.

  • @GrahamAndersonis
    @GrahamAndersonis 1 year ago

    Is there a best practice for chunking mixed documents that also include tables and images? Are you extracting tables/images (out of the chunk) into a separate CSV/other file, and then providing some kind of 'hey LLM, the table for this chunk is located in this CSV file' note? If so, how do you write the syntax for this note (within the chunk) for the LLM? Much appreciation in advance.

  • @ayushgautam9462
    @ayushgautam9462 1 year ago

    Are you using a Jupyter notebook or working in Google Colab? And how can I run this code in VS Code, if possible?

  • @mrchongnoi
    @mrchongnoi 1 year ago

    You talk about adding context. Where can I get information on adding context? Sorry if it is a remedial question.

  • @younginnovatorscenterofint8986

    Thanks for the content, James. I am trying to build a document conversational assistant using LangChain and Hugging Face, but I keep getting this error: "Token indices sequence length is longer than the specified maximum sequence length for this model (2842 > 512). Running this sequence through the model will result in indexing errors"
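
    One way to avoid that error, sketched with an assumed 512-token sentence-transformers model: measure length with the same tokenizer that raises the warning, and split so every chunk fits.
    ```
    from transformers import AutoTokenizer
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # the model name is an assumption; use whichever 512-token model you embed with
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    def hf_token_len(text):
        return len(tokenizer.encode(text))

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=384,  # comfortably below the 512-token model limit
        chunk_overlap=32,
        length_function=hf_token_len,
    )

    document_text = "your long document here " * 1000
    chunks = splitter.split_text(document_text)  # each chunk now fits the model
    ```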

  • @ylazerson
    @ylazerson 1 year ago

    great video - super appreciated!

  • @generichuman_
    @generichuman_ 1 year ago

    I'm training a transformer model from scratch just to get a better intuition for how they work. I'm curious if you know the best way to set up the text dataset so that each text chunk is its own entity and won't bleed over into other chunks. For example, with a dataset of stories, when one ends and another begins, I don't want the next story to still have context from the last story. I'm using the Hugging Face tokenizer to implement BPE. I hope this makes sense, and I would greatly appreciate any guidance!
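
    One common approach, sketched with the Hugging Face `tokenizers` library (the file name and special token are assumptions about your setup): join documents with an explicit end-of-text token so the model sees a hard boundary.
    ```
    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file("bpe_tokenizer.json")  # your trained BPE tokenizer
    eos_id = tokenizer.token_to_id("<|endoftext|>")        # assumes this special token exists

    stories = ["Once upon a time...", "In a galaxy far away..."]  # your corpus

    ids = []
    for story in stories:
        ids.extend(tokenizer.encode(story).ids)
        ids.append(eos_id)  # hard boundary between stories
    ```
    Many GPT-style training setups also mask attention across that boundary so tokens cannot attend into the previous story.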

  • @nazimtairov
    @nazimtairov 1 year ago

    Thanks for the tutorial. How can text be passed on to an LLMChain after splitting it into chunks?
    I'm getting an error from the OpenAI API:
    ```
    chain = LLMChain(llm=llm, prompt=chat_prompt, verbose=True)
    chain_result = chain.run({
        'source_code': python_code,
        'target_tech': 'python',
        'source_tech': 'Go'
    })
    ```
    This model's maximum context length is 8193 tokens. However, your messages resulted in 13448 tokens. Please reduce the length of the messages.
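
    One common workaround, sketched by reusing `chain` and `python_code` from the snippet above: split the input to below the context limit and run the chain once per chunk (the chunk size is an assumption; ~4000 characters is roughly 1000 tokens).
    ```
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
    code_chunks = splitter.split_text(python_code)

    results = [
        chain.run({
            'source_code': chunk,
            'target_tech': 'python',
            'source_tech': 'Go',
        })
        for chunk in code_chunks
    ]
    ```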

  • @alivecoding4995
    @alivecoding4995 1 year ago

    How do you work remotely in VSCode with the notebook on Colab?

  • @jacobgoldenart
    @jacobgoldenart 1 year ago

    Thanks James! About chunking: what about when your documentation has a lot of code examples interspersed throughout the text? Is the recursive text splitter able to work with, say, Python code, where retaining whitespace is important?

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      it won't make any special distinction between normal text and code unfortunately, so it will just split on newlines, whitespace, etc. as per usual
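
      Later LangChain releases added a code-aware variant of the splitter; a sketch, assuming a version that ships the `Language` enum (the file name is hypothetical):
      ```
      from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

      python_splitter = RecursiveCharacterTextSplitter.from_language(
          language=Language.PYTHON,  # Python-aware separators (class/def boundaries)
          chunk_size=400,
          chunk_overlap=20,
      )

      python_source = open("example.py").read()
      chunks = python_splitter.split_text(python_source)
      ```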

  • @gunderhaven
    @gunderhaven 1 year ago

    Hi James, thanks for sharing your work. In this video, you briefly mention cleaning up the "messy bits" in the plain text page content and that it is not necessary in your estimation. Could you suggest an approach to clean up those messy bits to some degree? Thanks in advance.

  • @maximchuprynsky7472
    @maximchuprynsky7472 1 year ago

    Hello. I have a question/problem. I have a rather large prompt and it exceeds the token limit. Is there any way to split it, along with the underlying information from the PDF file?

  • @jimjones26
    @jimjones26 1 year ago

    I have a question. I am working on loading documentation for several different technologies into one vector database. I want to use this as an AI development assistant for the tech stack I use to create web applications. I am assuming the way you are categorizing your chunks would be an appropriate way to have these different 'columns' of data within one vector DB?

  • @krisszostak4849
    @krisszostak4849 1 year ago

    Hi James, thanks for your amazing work!
    I've been playing with this lately and I'm not sure I understand the connection between the tiktoken_len function and the chunk_size and length_function args in RecursiveCharacterTextSplitter. So the question is this:
    In the RecursiveCharacterTextSplitter, if "length_function=len" (the default), then "chunk_size" sets the max number of CHARACTERS in the chunk, but if "length_function=tiktoken_len" (or any other token counter), then "chunk_size" sets the max number of TOKENS? Is that correct?
    Thanks!
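
    Yes, that is how it works: chunk_size is measured in whatever units length_function counts. A quick sketch to check it (sample text and sizes are arbitrary):
    ```
    import tiktoken
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    tokenizer = tiktoken.get_encoding("cl100k_base")

    def tiktoken_len(text):
        return len(tokenizer.encode(text))

    text = "LangChain makes chunking easy. " * 200

    # chunk_size measured in CHARACTERS (default length_function=len)
    by_chars = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20)

    # chunk_size measured in TOKENS (length_function=tiktoken_len)
    by_tokens = RecursiveCharacterTextSplitter(
        chunk_size=400, chunk_overlap=20, length_function=tiktoken_len
    )

    print(len(by_chars.split_text(text)), len(by_tokens.split_text(text)))
    ```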

  • @artchess0
    @artchess0 1 year ago

    Hi James, thank you very much for your videos. I have a question: what if we need to pass context to our LLM to translate from one language to another? Is it better to chunk into the smallest sizes, or up to the model's token limit per request? I'm thinking of processing the chunks in parallel and then joining the translated results, but I don't know the best approach. Thank you in advance.

  • @ChronicleContent
    @ChronicleContent 1 year ago

    I am kinda clueless and don't know much about any of this, but why are we doing this? Don't you think that in the future ChatGPT or others will use the live internet and have the information available, and also have bigger limits? I am trying to understand the vision behind this. Or is it just for now, to be able to "bypass" the limits and use it on up-to-date material until they find a way to have a live-trained model? Sorry if this sounds totally clueless.

  • @dreamphoenix
    @dreamphoenix 1 year ago

    Thank you.

  • @RedCloudServices
    @RedCloudServices 1 year ago

    James, I hope I am asking this question correctly. Would it not be cheaper to fine-tune an existing GPT model on your entire custom corpus (i.e. your LangChain docs) and then have a chatbot use your finished fine-tuned LLM published on OpenAI?

  • @LucaMainieri68
    @LucaMainieri68 1 year ago

    Thank you for your amazing video and all the work you do. I was wondering how to use LangChain to perform data analysis on one or more datasets. Let's say I have leads, sales, and orders datasets. Can I use LangChain to perform some analysis, such as asking which customers placed the last order, or how sales were last month? Can you point me in the right direction? Thanks 🙏

  • @jesusperdomo8388
    @jesusperdomo8388 1 year ago

    Please, is it possible for you to work through the code in Visual Studio Code?

  • @li_tsz_fung
    @li_tsz_fung 1 year ago

    Is LLaMA + LangChain a thing now? It makes sense to me that we should use open-source stuff, so that we can run it locally soon.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I believe so, but haven't had the chance to check it out yet - for sure, will be focusing more on open source soon

  • @ewanp1396
    @ewanp1396 1 year ago

    Great video. What software are you using for the video (as in the notebook with blue background)?

  • @paenget
    @paenget 1 year ago

    Amazing❤

  • @Sunghoon4life
    @Sunghoon4life 1 year ago

    Does the RecursiveCharacterTextSplitter split the text based on tokens or characters? As per the docs it seems like it's based on characters, but in the video you said it's based on tokens. Could you please confirm?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      it's splitting on characters (the "\n\n", "\n", " ", "" separators), but the length function is based on tokens, so it is kind of doing both, meaning it identifies a satisfactory length based on tokens, but the split itself uses characters

    • @rmehdi5871
      @rmehdi5871 1 year ago

      @@jamesbriggs does this splitting work on any text? My data, which I think came in via XML format, has these tags: , and . Should I split on those rather than on "\n\n", "\n", " ", "", or do both, perhaps? What is your recommendation?

  • @tadavid1999
    @tadavid1999 1 year ago

    Could anyone help me? I'm trying to use !wget -r -A but it is not recognised as a command. I do not understand where I am going wrong; as far as I know I have all modules installed as well as the correct permissions. I have tried running this in the terminal of Microsoft Visual Studio Code, in PowerShell as admin (with ChatGPT to put it in a different format), and as a script importing os. It just is not working for me, and I am very interested in the practical applications of this. Great video by the way; I like how everything is explained step by step!

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      I think it should be recognized as a command, the issue may be that the webpage is outdated, could you try `!wget -r -A.html -P rtdocs python.langchain.com/en/latest/` - also another thought, if you're running in terminal drop the `!`, leaving you with `wget -r -A.html -P rtdocs python.langchain.com/en/latest/`

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I've just figured this out. It's because I'm not Linux-based. This video helped me fix the issue, for anyone wanting to follow along: ruclips.net/video/gCrF8Zx13wg/видео.html

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I'm trying to use the wget command to download my own website for context, but it keeps downloading only the first page. Any tips on how I can get it to fetch the rest?

  • @mohammedsaheer4700
    @mohammedsaheer4700 1 year ago

    Can we pass more than 10,000 tokens into LangChain using chunking?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      Yes you can pass in as many as you like, billions even

  • @StephenStrong-x1s
    @StephenStrong-x1s 1 year ago

    James, this video (and all your postings) are excellent! Exactly what a long time developer, looking to expand into AI needs to get started! Do you do any lectures at conferences?

  • @fraternitas5117
    @fraternitas5117 1 year ago

    Could you make content about Nvidia's NeMo?

  • @rafaelprudencioleite7291
    @rafaelprudencioleite7291 1 year ago

    Thanks so much for the video. When I use
    !wget -r -A.html -P rtdocs link...
    it downloads only the index.html page. I tried it in the terminal and it won't work either. Is there a way to handle that?

    • @Clubcloudcomputing
      @Clubcloudcomputing 1 year ago +2

      Looks like the website changed and now redirects to a different domain, hence you get only 1 file. Instead, index the domain that it redirects to.

  • @TheCloudShepherd
    @TheCloudShepherd 1 year ago