LangChain Data Loaders, Tokenizers, Chunking, and Datasets - Data Prep 101

  • Published: 22 Aug 2024

Comments • 101

  • @jamesbriggs
    @jamesbriggs  1 year ago +12

    LangChain docs have moved, so the original wget command in this video will no longer download everything. Now you need to use:
    !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @mohitagarwal9007
      @mohitagarwal9007 1 year ago +1

      This is not downloading everything either. Is there anything else we can use to get the necessary files?

    • @jamesbriggs
      @jamesbriggs  1 year ago +9

      @@mohitagarwal9007 yes I have created a copy of the docs on Hugging Face here huggingface.co/datasets/jamescalam/langchain-docs-23-06-27
      You can download by doing a `pip install datasets` followed by:
      ```
      from datasets import load_dataset
      data = load_dataset('jamescalam/langchain-docs-23-06-27', split='train')
      ```

    • @deniskrr
      @deniskrr 9 months ago +1

      @@mohitagarwal9007 just go to the above link and see where you're getting redirected to now. Then copy the link from the browser to the wget command and it should always work.

  • @user-wy9fc5vi3j
    @user-wy9fc5vi3j 1 year ago +1

    Chunking is the most important idea and largely ignored. Thanks James, love your technical depth.

  • @fgfanta
    @fgfanta 1 year ago

    I need to chunk text for retrieval augmentation and did a search on YouTube and found... James Briggs' video. I know I will find in it what I need. Nice!

  • @grandplazaunited
    @grandplazaunited 1 year ago +1

    Thank you for sharing your knowledge. These are some of the best videos on LangChain.

  • @ADHDOCD
    @ADHDOCD 1 year ago +4

    Great video. Finally somebody who goes in depth into data prep. I've always wondered about unnecessary (key, value) pairs in JSON files.

  • @dikshyakasaju7541
    @dikshyakasaju7541 1 year ago +1

    Thank you for sharing this informative video showcasing LangChain's powerful text chunking capabilities using the RecursiveCharacterTextSplitter. Previously, I had to write several functions to tokenize and split text while managing context overlap to avoid missing crucial information. However, accomplishing the same task now only requires a few lines of code. Very impressive.

  • @harleenmann6280
    @harleenmann6280 9 months ago

    Great video series. Appreciate you sharing your thought process as we go; this is the part most online tech content creators miss. They cover the how, and more often than not miss the why. Thanks again. Enjoying all the videos in this playlist.

  • @jamesbriggs
    @jamesbriggs  1 year ago +1

    if the code isn't loading for you from video links, try opening in Colab here:
    colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/xx-langchain-chunking.ipynb

  • @muhammadhammadkhan1289
    @muhammadhammadkhan1289 1 year ago +2

    You always know what I am looking for, thanks for this 🙏

  • @codecritique
    @codecritique 4 months ago

    Thanks for the tutorial, really clear explanation!

  • @alvinpinoy
    @alvinpinoy 1 year ago +1

    Very helpful and very well explained. Thanks for sharing your knowledge about this! LangChain really feels like the missing glue between the open web and all those new AI models popping up.

  • @BrianStDenis-pj1tq
    @BrianStDenis-pj1tq 10 months ago

    At first, it seemed like you switched from tiktoken len to char len of your chunks, when explaining RecursiveCharacterTextSplitter. That wasn't going to work, so I went back and found that you did show, maybe not so much explain, that the splitter is using the tiktoken len function. Makes sense now, thanks!

  • @videowatching9576
    @videowatching9576 1 year ago

    Fascinating channel, thanks! Remarkable to learn about LLMs, how to interact with LLMs, what can be built, and what could be possible over time. I look forward to more.

  •  1 year ago +1

    Great content once again, thanks for sharing. I wish I had this a couple weeks ago :D

  • @siamhasan288
    @siamhasan288 1 year ago

    Ayo, I was literally looking for how to prepare my data for the past hour. Thank you for making these.

    • @eRiicBelleT
      @eRiicBelleT 1 year ago +1

      In my case the last two weeks xD

  • @SnowyMango
    @SnowyMango 1 year ago

    This was great! I made a terrible mistake of chunking without considering this simple math and embedded and indexed into Pinecone at a larger size. Now I have to go redo them all after realizing that at their current sizes they aren't quite suitable for LangChain retrieval.

  • @redfield126
    @redfield126 1 year ago

    Thank you James for the in depth explanation of data prep. Learning a lot with your videos.

  • @AlexBego
    @AlexBego 1 year ago

    James, I should say Thank You a Lot for your interesting and so useful videos!

    • @jamesbriggs
      @jamesbriggs  1 year ago

      you're welcome, thanks for watching them!

  • @temiwale88
    @temiwale88 1 year ago

    I'm @ 12:34 and this is an amazing explanation thus far. Thank you!

  • @hashiromer7668
    @hashiromer7668 1 year ago +10

    Wouldn't chunking lose information about long-term dependencies between passages? For example, if a term defined at the start of a document is used in the last passage, this dependency won't be captured if we chunk the document.

    • @jamesbriggs
      @jamesbriggs  1 year ago +5

      yes, this is an issue with it. If you're lucky and using a vector DB returning 5 or so chunks, you might return both chunks and then the LLM sees both, but naturally there's no guarantee of this. I'm not aware of a better approach for tackling this problem with large datasets though

    • @bobjones7274
      @bobjones7274 1 year ago +4

      @@jamesbriggs Somebody on another video said the following, is it relevant here? "You could aggregate chunks together, asking the LLM to summarize and group them into 'meta chunks'; you could repeat the process until all years are contained in a single max-token-limit batch. Then, with the metadata, you'll be able to perform a much more powerful search over the corpus, providing much more context to your LLM with different levels of aggregation."
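      In code, that "meta chunk" idea looks roughly like this (a sketch only, not from the video; `summarize` is whatever LLM call you use):

      ```
      def build_meta_chunks(chunks, summarize, group_size=5):
          """Recursively summarize groups of chunks into 'meta chunks'
          until a single top-level summary remains."""
          levels = [list(chunks)]
          while len(levels[-1]) > 1:
              prev = levels[-1]
              groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
              levels.append([summarize("\n\n".join(g)) for g in groups])
          return levels  # embed every level so search can hit different aggregations
      ```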

    • @rodgerb2645
      @rodgerb2645 1 year ago

      @@bobjones7274 sounds interesting, do you remember the video? Can you provide the link? Tnx

    • @astro_roman
      @astro_roman 1 year ago

      @@bobjones7274 link, please, I beg you

    • @JOHNSMITH-ve3rq
      @JOHNSMITH-ve3rq 1 year ago

      I’ve seen this in many places but where has it been implemented??

  • @murphp151
    @murphp151 1 year ago

    these videos are pure class

  • @matheusrdgsf
    @matheusrdgsf 1 year ago

    James you are helping a lot in my activities. Thank you.

  • @TomanswerAi
    @TomanswerAi 1 year ago

    Nice one James. Demystified that step for me there 👍 As you say if people get this part wrong everything else will underperform

  • @eRiicBelleT
    @eRiicBelleT 1 year ago

    Uff the video that I was expecting! Thank youuu!

  • @lf6190
    @lf6190 1 year ago

    Awesome I was just trying to figure out how to do this with the langchain docs so that I can learn it quicker!

  • @rodgerb2645
    @rodgerb2645 1 year ago

    Amazing James, I've learned so much from you!

  • @MaciekMorz
    @MaciekMorz 1 year ago +2

    I have seen a lot of material on how to store embeddings in a Pinecone vector db, but I haven't seen any tutorial yet on how to store vectorstores with different embeddings of different users in one index, i.e. how to retrieve embeddings depending on which user they belong to. What would be the best strategy for this, whether through metadata or something else?
    It would be great to see a tutorial on this, especially using LangChain, although it seems to me that the current wrapper doesn't really allow this. BTW, the whole series with LangChain is great!
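    One common pattern (a sketch of the pinecone-client v2 API of the time; `embedding` and `query_embedding` are placeholders) is to isolate users with namespaces, or alternatively a metadata filter:

    ```
    import pinecone

    pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
    index = pinecone.Index("my-index")

    # write each user's vectors into their own namespace...
    index.upsert(
        vectors=[("doc1-chunk0", embedding, {"user": "alice"})],
        namespace="alice",
    )

    # ...and query only that namespace (or pass filter={"user": "alice"})
    res = index.query(vector=query_embedding, top_k=5, namespace="alice")
    ```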

  • @SuperYoschii
    @SuperYoschii 1 year ago +2

    Thanks for the content James! I think they changed something when downloading the htmls with wget. When I run the colab, it only downloads a single index.html file

  • @calebmoe9077
    @calebmoe9077 1 year ago

    Thank you James!

  • @raypixelz
    @raypixelz 1 year ago

    Awesome. thank you!

  • @fraternitas5117
    @fraternitas5117 1 year ago

    James dropping the great content as usual.

  • @Sergedable
    @Sergedable 1 year ago +1

    Nice job! Also, it would be great if you could make a video on how to combine multiple documents (doc1, doc2, doc3, etc.) and use LangChain and ChatGPT-4 to analyze them.

  • @rishniratnam
    @rishniratnam 1 year ago

    Nice video James.

  • @ylazerson
    @ylazerson 1 year ago

    great video - super appreciated!

  • @henkhbit5748
    @henkhbit5748 1 year ago

    Thanks James, for sharing this information.👍 I always thought the 4k token limit for chatgpt-turbo was independent for the input and the output completion, not combined...
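    It is combined: prompt tokens plus completion tokens share the window. A quick budget check (a sketch with tiktoken; 4,096 is gpt-3.5-turbo's window at the time, and chat message formatting adds a few extra tokens on top):

    ```
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "system message + retrieved chunks + user question"
    context_window = 4096
    max_completion = context_window - len(enc.encode(prompt))
    print(f"~{max_completion} tokens left for the completion")
    ```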

  • @ketangote
    @ketangote 1 year ago

    Great Video

  • @mohammadsunasra
    @mohammadsunasra 1 year ago

    So James, what you mean is: it will first split based on the first separator, then check whether the number of tokens > chunk size, and if so, split again based on the next separator, until the number of tokens < chunk size. Right?
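    Roughly, yes. A toy sketch of that recursion (illustrative only, not LangChain's actual implementation):

    ```
    def recursive_split(text, separators, chunk_size, length_fn=len):
        """Split on the first separator; recurse with the next separator
        for any piece that is still over chunk_size."""
        sep, rest = separators[0], separators[1:]
        pieces = text.split(sep) if sep else list(text)
        chunks, current = [], ""
        for piece in pieces:
            candidate = f"{current}{sep}{piece}" if current else piece
            if length_fn(candidate) <= chunk_size:
                current = candidate
                continue
            if current:
                chunks.append(current)
            if length_fn(piece) > chunk_size and rest:
                chunks.extend(recursive_split(piece, rest, chunk_size, length_fn))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks

    # e.g. recursive_split(doc, ["\n\n", "\n", " ", ""], 400, tiktoken_len)
    ```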

  • @mintakan003
    @mintakan003 1 year ago +1

    I played with this a while ago in LangChain. My impression is that in order to do Q&A on documents, one has to do a sequential scan: every chunk has to be read in. Wouldn't this be prohibitively expensive for a large document set? I know there are vector databases (indices) which can do a pre-screen based on vector similarity. This would be an improvement, but it still involves a sequential scan, now at the vector level. Are there attempts to address this problem? Perhaps parallelism may be one part of the solution?

    • @jamesbriggs
      @jamesbriggs  1 year ago +3

      it isn't a sequential scan with (most, maybe all) vector DBs, they use approximate search, so the answer is approximated and not everything is fully compared - a good vector db will make this approximation very accurate (like 99% accuracy)
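      A toy illustration of the difference (a sketch using FAISS; any ANN library shows the same trade-off):

      ```
      import numpy as np
      import faiss

      d = 128
      xb = np.random.rand(100_000, d).astype("float32")  # "database" vectors
      xq = np.random.rand(5, d).astype("float32")        # query vectors

      # exact search: every query is compared against every vector
      flat = faiss.IndexFlatL2(d)
      flat.add(xb)

      # approximate search: vectors are clustered, and each query only
      # probes 10 of the 100 clusters
      quantizer = faiss.IndexFlatL2(d)
      ivf = faiss.IndexIVFFlat(quantizer, d, 100)
      ivf.train(xb)
      ivf.add(xb)
      ivf.nprobe = 10

      D1, I1 = flat.search(xq, 5)
      D2, I2 = ivf.search(xq, 5)  # near-identical results, far fewer comparisons
      ```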

  • @dreamphoenix
    @dreamphoenix 1 year ago

    Thank you.

  • @gunderhaven
    @gunderhaven 1 year ago

    Hi James, thanks for sharing your work. In this video, you briefly mention cleaning up the "messy bits" in the plain text page content and that it is not necessary in your estimation. Could you suggest an approach to clean up those messy bits to some degree? Thanks in advance.

  • @ewanp1396
    @ewanp1396 1 year ago

    Great video. What software are you using for the video (as in the notebook with blue background)?

  • @mrchongnoi
    @mrchongnoi 1 year ago

    You talk about adding context. Where can I get information on adding context? Sorry if it is a remedial question.

  • @sevilnatas
    @sevilnatas 3 months ago +1

    I am struggling with a chunking scenario that includes PDFs with a lot of columnar data in tables, and the primary questions users will be asking of the PDF data will be contained in the tables. Questions that depend on finding a value in the first column and then retrieving the value on that row in a specified column. This means the chunked data needs to maintain the integrity of the table. Any suggestions?

    • @absar66
      @absar66 19 days ago

      Any solution? I am struggling with the same..

    • @sevilnatas
      @sevilnatas 18 days ago

      @@absar66 No silver bullets. I did see a project called Marker, I think, that can take PDFs and convert them to markdown text. If it's effective at translating to markdown, columnar text will probably chunk better as markdown. Anyway, just a thought I was thinking about trying. If you give it a try, let me know how it goes.

  • @ayushgautam9462
    @ayushgautam9462 1 year ago

    Are you using a Jupyter notebook or working in Google Colab? And how can I run this code in VS Code, if possible?

  • @kevon217
    @kevon217 1 year ago

    Any tips for dealing with datasets that have missing values? It doesn't seem like the various transformer encoding classes have defaults for handling entirely empty strings/values. It'll still spit out a vector, which I assume is just padding tokens?

  • @user-tk7os4dm5j
    @user-tk7os4dm5j 1 year ago

    James, this video (and all your postings) are excellent! Exactly what a long time developer, looking to expand into AI needs to get started! Do you do any lectures at conferences?

  • @GrahamAndersonis
    @GrahamAndersonis 1 year ago

    Is there a best practice for chunking mixed documents that also include tables and images? Are you extracting tables/images (out of the chunk) and into a separate CSV/other file, and then providing some kind of ‘hey llm, the table for this chunk is located in this CSV file’ ? If so, how do you write the syntax for this note (within the chunk) to the LLM? Much appreciation in advance.

  • @RedCloudServices
    @RedCloudServices 1 year ago

    James, I hope I am asking this question correctly. Would it not be cheaper to fine-tune an existing GPT model with your entire custom corpus (i.e. your langchain docs) and then have a chatbot use your finished fine-tuned LLM published on OpenAI?

  • @krisszostak4849
    @krisszostak4849 1 year ago

    Hi James, thanks for your amazing work!
    I've been playing with this lately and I'm not sure I understand the connection between the tiktoken_len function and the chunk_size and length_function args in RecursiveCharacterTextSplitter. So the question is this:
    In the RecursiveCharacterTextSplitter, if length_function=len (the default), then chunk_size sets the max number of CHARACTERS in the chunk, but if length_function=tiktoken_len (or any other token counter), then chunk_size sets the max number of TOKENS? Is that correct?
    Thanks!
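    That understanding is right: chunk_size is measured in whatever units length_function returns. Both configurations side by side (a sketch mirroring the video's tiktoken_len; the encoding choice is an assumption, pick the one matching your model):

    ```
    import tiktoken
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    tokenizer = tiktoken.get_encoding("cl100k_base")

    def tiktoken_len(text: str) -> int:
        return len(tokenizer.encode(text))

    # chunk_size counts CHARACTERS here...
    char_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400, chunk_overlap=20, length_function=len
    )

    # ...and TOKENS here
    token_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400, chunk_overlap=20, length_function=tiktoken_len
    )
    ```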

  • @paenget
    @paenget 1 year ago

    Amazing❤

  • @ChronicleContent
    @ChronicleContent 1 year ago

    I am kinda clueless and don't know much about any of this, but why are we doing this? Don't you think that in the future ChatGPT or others will use the live internet and have the information available? And also have bigger limits? I am trying to understand the vision behind this. Or is it just for now, to be able to "bypass" the limits and use it on updated stuff until they find a way to have a live-trained model? Sorry if it sounds totally clueless.

  • @LucaMainieri68
    @LucaMainieri68 1 year ago

    Thank you for your amazing video and all the work you do. I was wondering how to use LangChain to perform data analysis on one or more datasets. Let's say I have leads, sales and orders datasets. Can I use LangChain to perform some analysis, such as asking which customers placed the last order? How were sales last month? Can you point me in the right direction? Thanks 🙏

  • @generichuman_
    @generichuman_ 1 year ago

    I'm training a transformer model from scratch just to get a better intuition for how they work. I'm curious if you know the best way to set up the text dataset so that each text chunk is its own entity and won't bleed over into other chunks. For example, if I have a dataset of stories, when one ends and another begins, I don't want the next story to still have context from the last story. I'm using the Hugging Face tokenizer to implement BPE. I hope this makes sense, and I would greatly appreciate any guidance!
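    One common convention (an assumption, not from the video) is to join documents with the tokenizer's end-of-text token, so the model sees an explicit boundary it can learn, and so loss/attention can be masked across it:

    ```
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    stories = ["Once upon a time ...", "In a galaxy far away ..."]
    corpus = tok.eos_token.join(stories)  # "<|endoftext|>" between stories
    ids = tok(corpus).input_ids           # boundaries survive tokenization
    ```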

  • @jacobgoldenart
    @jacobgoldenart 1 year ago

    Thanks James! About chunking: what about when your documentation has a lot of code examples interspersed throughout the text? Is the recursive text splitter able to work with, say, Python code, where retaining whitespace is important?

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      it won't distinguish any special difference between normal text and code unfortunately, so it will just split on newlines, whitespace, etc as per usual
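      Later LangChain releases did add language-aware separators that at least prefer to split Python on class/def boundaries first (a sketch; check your installed version, and `source_code` is a placeholder for your own text):

      ```
      from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

      python_splitter = RecursiveCharacterTextSplitter.from_language(
          language=Language.PYTHON, chunk_size=400, chunk_overlap=20
      )
      chunks = python_splitter.split_text(source_code)
      ```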

  • @maximchuprynsky7472
    @maximchuprynsky7472 1 year ago

    Hello. I have a question/problem. I have a rather large prompt and it exceeds the token limit. Is there any possibility to split it, as well as the basic information from the pdf file?

  • @younginnovatorscenterofint8986

    thanks for the content James. I am trying to build a document conversational assistant using LangChain and Hugging Face, but I have been getting this error: Token indices sequence length is longer than the specified maximum sequence length for this model (2842 > 512). Running this sequence through the model will result in indexing errors
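    That warning means the chunks are longer than the embedding model's 512-token window. One fix (a sketch; the model name is a placeholder, use whichever model raised the warning) is to size chunks with that model's own tokenizer:

    ```
    from transformers import AutoTokenizer
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=256,  # comfortably under the 512-token limit
        chunk_overlap=20,
        length_function=lambda t: len(tok.encode(t)),
    )
    chunks = splitter.split_text(long_document)  # long_document: your text
    ```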

  • @artchess0
    @artchess0 1 year ago

    Hi James, thank you very much for your videos. I have a question: what if we need to pass context to our LLM to translate from one language to another? Is it better to chunk in the smallest sizes or up to the model's token limit per request? I'm thinking of processing the chunks in parallel and then joining the translated results together, but I don't know what the best approach is. Thank you in advance.

  • @jimjones26
    @jimjones26 1 year ago

    I have a question. I am working on loading documentation for several different technologies into one vector database. I want to use this as an AI development assistant for the tech stack I use to create web applications. I am assuming the way you are categorizing your chunks would be an appropriate way to have these different 'columns' of data within one vector db?

  • @nazimtairov
    @nazimtairov 1 year ago

    thanks for the tutorial. How can text be passed on to an LLMChain after splitting it into chunks? I'm getting an error from the OpenAI API:
    ```
    chain = LLMChain(llm=llm, prompt=chat_prompt, verbose=True)
    chain_result = chain.run({
        'source_code': python_code,
        'target_tech': 'python',
        'source_tech': 'Go'
    })
    ```
    This model's maximum context length is 8193 tokens. However, your messages resulted in 13448 tokens. Please reduce the length of the messages.
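    One workaround (a sketch reusing the names from the snippet above) is to split source_code first and run the chain once per chunk, though pieces may break across function boundaries:

    ```
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # ~2,000 characters is roughly 500 tokens, well inside the 8k window
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)

    results = [
        chain.run({
            'source_code': piece,
            'target_tech': 'python',
            'source_tech': 'Go',
        })
        for piece in splitter.split_text(python_code)
    ]
    ```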

  • @alivecoding4995
    @alivecoding4995 1 year ago

    How do you work remotely in VSCode with the notebook on Colab?

  • @jesusperdomo8388
    @jesusperdomo8388 1 year ago

    Please, is it possible for you to run the code in Visual Studio Code?

  • @li_tsz_fung
    @li_tsz_fung 1 year ago

    Is LLaMA + LangChain a thing now? It makes sense to me that we should use open-source stuff, so that we can run it locally soon.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I believe so, but haven't had the chance to check it out yet - for sure, will be focusing more on open source soon

  • @Sunghoon4life
    @Sunghoon4life 1 year ago

    Does the RecursiveCharacterTextSplitter split the text based on tokens or characters? As per the docs it seems to be based on characters, but in the video you said it's based on tokens. Could you please confirm?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      it's splitting on characters (the "\n\n", "\n", " ", "" separators), but the length function is based on tokens, so it is kind of doing both: it identifies a satisfactory length based on tokens, but the split itself uses characters

    • @rmehdi5871
      @rmehdi5871 10 months ago

      @@jamesbriggs does this splitting work on any text? My data, which I think came via XML format, has these tags: , and . Should I split on those rather than with "\n\n", "\n", " ", "", or do both, perhaps? What is your recommendation?

  • @yourmom-in4po
    @yourmom-in4po 1 year ago

    For some reason, when I try to download all the HTML files using wget, it only downloads the index.html file. Is there any reason for this? I used the provided Google Colaboratory notebook and nothing :(

    • @jamesbriggs
      @jamesbriggs  1 year ago

      I don't know why that would happen using the same command; it may be a system difference, I'm not sure. But maybe you can refer to this:
      www.linuxjournal.com/content/downloading-entire-web-site-wget
      and try modifying the command as per the info there?

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      sorry I realize this is because the webpage for the langchain docs moved, it's actually nothing to do with the command, try this:
      !wget -r -A.html -P rtdocs python.langchain.com/en/latest/

    • @yourmom-in4po
      @yourmom-in4po 1 year ago +1

      @@jamesbriggs Thank you so much!

  • @tadavid1999
    @tadavid1999 1 year ago

    Could anyone help me? I'm trying to use !wget -r -A but it is not recognised as a command. I do not understand where I am going wrong; as far as I know I have all modules installed as well as correct permissions. I have tried running this in the terminal of Visual Studio Code, in PowerShell as admin (with ChatGPT to put it in a different format), and as a script importing os. It just is not working for me, and I am very interested in the practical applications of this. Great video by the way, I like how everything is explained step by step!

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      I think it should be recognized as a command, the issue may be that the webpage is outdated, could you try `!wget -r -A.html -P rtdocs python.langchain.com/en/latest/` - also another thought, if you're running in terminal drop the `!`, leaving you with `wget -r -A.html -P rtdocs python.langchain.com/en/latest/`

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I've just figured this out. It's because I'm not Linux-based. This video helped me fix the issue, for anyone wanting to follow along: ruclips.net/video/gCrF8Zx13wg/видео.html

    • @tadavid1999
      @tadavid1999 1 year ago

      @@jamesbriggs I'm trying to use the wget command to download my own website for context, but it keeps downloading the first page only. Any tips on how I can get it to grab the rest?

  • @mohammedsaheer4700
    @mohammedsaheer4700 1 year ago

    Can we pass more than 10,000 tokens into LangChain using chunking?

    • @jamesbriggs
      @jamesbriggs  1 year ago

      Yes you can pass in as many as you like, billions even

  • @fraternitas5117
    @fraternitas5117 1 year ago

    Could you make content about Nvidia's NeMo?

  • @rafaelprudencioleite7291
    @rafaelprudencioleite7291 1 year ago

    Thanks so much for the video. When I use
    !wget -r -A.html -P rtdocs link...
    it only downloads the index.html page. I tried in the terminal and it won't work either. Is there a way to handle that?

    • @Clubcloudcomputing
      @Clubcloudcomputing 1 year ago +2

      Looks like the website changed, and does a redirect to a different domain. Hence you get only 1 file. Instead, index the domain that it redirects to.

  • @TheCloudShepherd
    @TheCloudShepherd 9 months ago