Understanding Embeddings in LLMs (ft LlamaIndex + Chroma db)

  • Published: 20 Apr 2023
  • We do a deep dive into one of the most important pieces of LLMs (large language models, like GPT-4, Alpaca, Llama etc): EMBEDDINGS! :) In every LangChain or LlamaIndex tutorial you'll come across this idea, but it can feel quite "rushed" or opaque, so this video presents a deeper look into what embeddings really are and the role they play in a world dominated by large language models like GPT.
    In Chroma's own words, Embeddings are "the A.I-native way to represent any kind of data, making them the perfect fit for working with all kinds of A.I-powered tools and algorithms. They can represent text, images, and soon audio and video. There are many options for creating embeddings, whether locally using an installed library, or by calling an API".
    LlamaIndex (GPT Index) is a project that provides a central interface to connect your LLMs with external data.
    LangChain is a fantastic tool for developers looking to build AI systems using a variety of LLMs (large language models, like GPT-4, Alpaca, Llama etc), as it helps unify and standardize the developer experience across text embeddings, vector stores / databases (like Chroma), and chaining them for downstream applications through agents.
    Mentioned in the video:
    - Chroma Embeddings database (vector store):
    docs.trychroma.com/embeddings
    - Watch PART 1 of the LangChain / LLM series: • LangChain + OpenAI tut...
    - Watch PART 2 of the LangChain / LLM series:
    • LangChain + OpenAI to ...
    - Watch PART 3 of the LangChain / LLM series:
    LangChain + HuggingFace's Inference API (no OpenAI credits required!)
    • LangChain + HuggingFac...
    - HuggingFace's Sentence-Transformer model
    huggingface.co/sentence-trans...
    All the code for the LLM (large language models) series featuring GPT-3, ChatGPT, LangChain, LlamaIndex and more is on my GitHub repository, so go and ⭐ star or 🍴 fork it. Happy Coding!
    github.com/onlyphantom/llm-py...
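The core idea in the video, that text becomes vectors whose angles encode semantic similarity, can be sketched in a few lines of plain Python. The 3-dimensional vectors below are made up for illustration; real models such as Sentence Transformers emit hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for illustration only.
embeddings = {
    "kitten": [0.9, 0.8, 0.1],
    "cat":    [1.0, 0.7, 0.2],
    "truck":  [0.1, 0.2, 0.95],
}

# "kitten" should score closer to "cat" than to "truck".
print(cosine_similarity(embeddings["kitten"], embeddings["cat"]))    # high
print(cosine_similarity(embeddings["kitten"], embeddings["truck"]))  # low
```

This nearness-in-vector-space is what every embedding store in the video is ultimately computing.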

Comments • 93

  • @ShashankSharma-kg9hu · a year ago · +5

    I am in the office laptop and here we are discussing Harry Potter's first kiss

    • @SamuelChan · a year ago · +3

      Oh I hope it wasn’t any spoilers :( don’t want any unhappy coworkers

    • @ShashankSharma-kg9hu · a year ago · +1

      @@SamuelChan BTW bro, your content is lovely; you could surely be a great teacher. The more you teach, the more you will grow :)

    • @SampadMohanty7 · a year ago

      You are in the office laptop? How are you inside a laptop?

    • @ChitralPatil · 11 months ago

      Hahah exactly

  • @SMCGPRA · a year ago · +5

    Seeing video after video, all things flying off the brain 🧠🧠🧠🧠🧠

    • @SamuelChan · a year ago · +3

      Haha! Slow it down! Get a coffee nap and then continue learning!

  • @pabloandres7973 · 11 months ago

    Legitimately finally a very well paced and yet easy to follow guide! Subscribed, please do lots more and thank you!

    • @SamuelChan · 11 months ago

      Hey thank you! Very encouraging, I appreciate it!

  • @AI-LLM · 11 months ago · +1

    Great video. And no they are NEVER too short. Your detail and extra insights is good 🎉

    • @SamuelChan · 11 months ago

      Hey it means a lot, thank you! ;)

  • @leonardogrig · a year ago · +3

    This was so well explained, and I was learning so much that I kept hovering over the video to see if it was ending yet. It's been ages since I've watched an entire 30-min video at 1x speed. Congrats!

    • @SamuelChan · a year ago · +2

      This is so kind! Thanks for taking the time out to write this - really warms my heart 😊

  • @sanjayplays5010 · 11 months ago

    Nice work on this tutorial, it really helped clear up all the questions I had.

    • @SamuelChan · 11 months ago

      Love to hear! Thank you!

  • @kingofaces97 · 11 months ago

    Very few are as knowledgeable and thorough as you on this topic. Thank you

    • @SamuelChan · 11 months ago

      Wow that’s probably too generous! Thanks for being so kind, appreciate it

  • @SamAnuRhea · 10 months ago

    Fantastic explanation, keep up the great work Samuel.

    • @SamuelChan · 10 months ago

      Thank you! I really appreciate it!

  • @piyakornmunegan6003 · a year ago · +1

    You explained that really well! It was easy to understand

  • @vladimir_egay · a year ago · +1

    Nice explanation! Definitely going to help me get started with LLMs! Thanks~

    • @SamuelChan · a year ago · +1

      Glad you find it helpful! 😊

  • @Spock-AI · a year ago

    Wonderfully taught!! You have a gift for teaching this domain knowledge. Thank You. Liked and previously subscribed.

    • @SamuelChan · a year ago

      Wow thank you! I really appreciate it!

  • @rQ816 · 4 months ago

    Great work, thank you!

  • @jdray · 8 months ago

    Great video. It covered several items in detail that I've been wondering about, so thank you. I struggled some to keep up with your fast pace, but if you'd gone slower the video would have been twice as long and I probably wouldn't have started watching! 😀
    I work for a government organization, and I'm trying to figure out how to index our large corpus of documents. I need to come up with a schema for a vector store, and will probably follow the State's cataloging format for the administrative codes, which follows the "Title-Chapter-Section" format, with each Section being a single document, which can then be broken up into Subsection, Paragraph, and Sentence, with appropriate metadata at each layer. Your video helped me understand how to do that, or at least get started, so thank you.

  • @noualiibrahimyassine1336 · 10 months ago

    Nice work , thank you so much.

    • @SamuelChan · 10 months ago

      Thank you, means a lot!

  • @WelcomemynameisLiam · a year ago

    Watching this now!

  • @gustavofelicidade_ · a year ago

    Thank you so much, it is very helpful!

    • @SamuelChan · a year ago

      Great to hear! Thank you! :)

  • @MrSuntask · 7 months ago · +1

    Great! Thank you!

  • @ekorudiawan · 4 months ago

    Awesome bro, instant subscribe

  • @kevinehsani3358 · a year ago · +1

    Thank you for the great video. I was wondering if you have done any videos or have any comments with regard to Chroma DB and other vector DBs such as Pinecone? As a side question, what editor do you use for coding?

    • @SamuelChan · a year ago · +2

      Hey Kevin, for my personal off-screen work I use NeoVim and VSCode interchangeably. For most of my video recordings I use VSCode as it's more "mainstream", making the tutorials easier to follow. In some of my older videos (on Docker, on Bash coding etc) you'll see me use Vim over VSCode :)
      On Video 8 of the LLM playlist (released a few days ago) I used Pinecone to build an AI Tutor that teaches me German. That gives me the opportunity to briefly compare Pinecone to ChromaDB. I would love to do a deep dive into all the other choices at some point in this series, but for now I'm focusing on the "let's build an LLM application" tutorial videos to scratch some of my own itch :)
      LangChain & LLM tutorials (ft. gpt3, chatgpt, llamaindex, chroma)
      ruclips.net/p/PLXsFtK46HZxUQERRbOmuGoqbMD-KWLkOS

  • @craigcullum4360 · a year ago · +1

    Amazing videos!! Could you do a video on how to read from ChromaDB using persisted storage and LlamaIndex? I've followed your tutorial but can't figure out how to read from an already built Chroma db collection with persisted storage on.

    • @SamuelChan · a year ago · +2

      Hey thank you Craig! That’s a great suggestion - I’ll add it to the backlog; Currently working on a few other langchain videos that will be added to this playlist so this may be a good addition. Thank you!

  • @wilfredomartel7781 · a year ago

    Great work! But many videos don't teach how to train a sentence transformer from scratch to generate sentence embeddings. Maybe a video on fine-tuning an existing sentence-embedding model would be really great!

    • @SamuelChan · a year ago · +1

      I'll see if it can fit thematically into this LLM series (9 videos and counting); I'm afraid to go too theoretical and low-level, as people lose interest, so the series is more about building useful stuff with immediate applications / "end results"

  • @ritikajoshi2372 · a year ago

    Hi Samuel, really great video; you are clear and concise, and I learnt so much! I just have one area I am struggling to wrap my head around: what is the difference between embeddings when indexing data (I'm using GPTVectorStoreIndex) vs embeddings with a vector store/db like Chroma? Are they the same thing, but Chroma makes it more optimised?
    You mentioned that Chroma helps us reuse the embeddings instead of using the GPT embeddings API. Is this referring to embeddings of the user's input or of the custom data we've provided? I noticed that my input is charged to text-embeddings-ada separately from the prompt to gpt-3.5-turbo by OpenAI.

    • @SamuelChan · a year ago · +1

      Hey Ritika, thank you! Yes, your understanding is right. Same thing, but having it as a db gives it more persistence: you can restart a new session and still use the embeddings you created, if they're in a persistent vector store / database!
      GPT's embeddings cost OpenAI credits: platform.openai.com/docs/guides/embeddings. Chroma supports that too, but it defaults to using Sentence Transformer from HuggingFace, thus saving you some money. The vectors produced by both will differ (naturally), but the purpose of the index is still intact, since fundamentally what you want is to be able to tell the semantic similarity between sentences (their positions) in an n-dimensional vector space.
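The "persist once, reuse later" idea from this reply can be sketched without any particular library. Here the `embed` function is a toy stand-in (not Chroma's or OpenAI's API); the point is that embeddings computed in one session can be reloaded in a later one instead of being re-computed:

```python
import json
import os
import tempfile

def embed(text):
    """Toy stand-in for a real embedding model: vowel-frequency vector."""
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

docs = {"doc1": "harry met sally", "doc2": "quantum computing"}

# 1) Compute embeddings once and persist them to disk.
store_path = os.path.join(tempfile.gettempdir(), "toy_vector_store.json")
with open(store_path, "w") as f:
    json.dump({doc_id: embed(text) for doc_id, text in docs.items()}, f)

# 2) In a later session, reload instead of re-embedding (no API cost).
with open(store_path) as f:
    index = json.load(f)

print(sorted(index))  # ['doc1', 'doc2']
```

A persistent vector store like Chroma does the same thing, only with real embeddings and optimized lookup structures instead of a flat JSON file.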

  • @bongimusprime7981 · a year ago · +1

    Thank you for the succinct explanations!
    One question - I am currently using a GPTVectorStoreIndex for my simple use case of indexing a few simple docs. I persist the index to local storage; then, in a separate script, I load it and set it as my query engine.
    Does this mean that my embeddings go to OpenAI every time I query the query engine? If so, does this send ALL of the contents of all of my docs in embedding format? Or is there some "smarts" that only sends the relevant bits, like in your examples? Just trying to understand if that behaviour is exclusive to Chroma db.
    Thanks again!

    • @SamuelChan · a year ago

      If it's a GPTVectorStoreIndex / GPTSimpleVectorIndex, it does seem like you are sending all of the index over, no "smarts"
      You can use logging (import logging; then logging.getLogger().addHandler(...)) and monitor the console output to further confirm. Vector databases implement these "smarts", as you call it, by way of (1) tree structures, e.g. B-trees, (2) keyword tables, (3) other database-optimization techniques

  • @datasciencetoday7127 · 9 months ago

    i love you bro

  • @cstan2381 · a year ago

    Thanks, Samuel, for your video. Extremely in-depth. I understand a bit more about the indexing and tokenizing of the embedding.
    But I am a newbie and starting with langchain embedding to Pinecone and FAISS. I have a few questions, and I hope to get your guidance.
    - Should we split the text up before inserting these chunks into the index? When is this necessary?
    - When we use langchain to do the embedding, we also pass it the embedding object, which calls the LLM, e.g. OpenAIEmbeddings(). Is this the same as llama_index's GPTVectorStoreIndex or GPTChromaIndex? Is there a cost associated?
    - Finally, is r.index(query) the same as langchain's vectorstore.similarity_search(query, k=3)?

    • @SamuelChan · a year ago

      Hey CS, thanks for the kind words!
      - No; you typically don't split text manually. Instead, a library like LlamaIndex takes care of that chunking work for you. If you use Pinecone, you may choose to set up logical chunks in your index like you would in a Python dict (key-value). For this, watch the Pinecone video in my LLM series :)
      - There is a cost to creating embeddings. Prices differ greatly: Davinci is $0.2000 / 1K tokens, while Ada v2 is $0.0004 / 1K tokens, for example. Chroma defaults to SentenceTransformers, an open-source model (explained in the video) which is free, but it provides wrappers over OpenAI's embeddings as well.
      If you watch the full LLM playlist, some of these concepts get clearer through sheer repetition, and seeing how to apply them in different projects helps with building up a better mental model too: ruclips.net/p/PLXsFtK46HZxUQERRbOmuGoqbMD-KWLkOS
      Full source code on my github: github.com/onlyphantom/llm-python if you're stuck somewhere and want to have a code reference. Star it and fork it! :)
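The chunking and pricing points above can be made concrete with a small sketch. The chunk size, overlap, and helper names are illustrative, not LlamaIndex's actual defaults; the $0.0004 / 1K figure is the Ada v2 rate quoted in the reply:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows, the way index builders do
    before embedding each piece separately."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

def embedding_cost_usd(n_tokens, price_per_1k=0.0004):
    """Rough embedding-cost estimate at Ada v2's $0.0004 per 1K tokens."""
    return n_tokens * price_per_1k / 1000

doc = "x" * 1000
print(len(chunk_text(doc)))           # number of chunks for a 1000-char doc
print(embedding_cost_usd(1_000_000))  # ~$0.40 to embed a million tokens
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk.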

  • @kevin238in · 8 months ago

    Thanks, a great video on embeddings. Many times you mentioned that you do not want to spend a lot of money. I am not sure why you said that, as creating embeddings, storing them in a db, and querying are all done using open source, right? So where is the question of spending money here? Are we sending some data to the ChatGPT model internally?

  • @bongimusprime7981 · a year ago · +1

    What kind of VS Code extension are you using to dynamically display the list of available methods and their usage? That was pretty cool!

    • @SamuelChan · a year ago

      It's got to be just the official Python extension. I don't have much else on my VSCode; it's quite bare-bones 😬

  • @aniketkalamkar227 · a year ago · +3

    In this example, when we call the GPTChromaIndex.from_documents() method, is it going to call OpenAI embedding APIs to create embeddings? Or is it only going to call OpenAI during the query part? I am a little bit confused about how it works from a cost perspective.

    • @SamuelChan · a year ago · +2

      Short answer is no: it defaults to SentenceTransformer, and doesn't use OpenAI's embedding API during index construction unless you specify it to. docs.trychroma.com/embeddings
      The query part does call OpenAI's API.
      Also note that this video was recorded with LlamaIndex 0.5.7; if you're on the latest LlamaIndex 0.6.1, the syntax changes a bit (from llama_index import GPTChromaIndex no longer works). If you git clone the repo and pip install -r requirements.txt you'll be good

    • @larawehbee · a year ago

      @@SamuelChan Hello Samuel, how can I use a model other than OpenAI for the query? I'm trying to build an on-premises model, so I need to avoid any API key. Thank you in advance

  • @DUXia · a year ago

    Thank you Samuel for the great explanation. It seems that LlamaIndex has changed a lot in the current version (0.6.9). There is no GPTChromaIndex; it seems to be replaced by ChromaVectorStore. It is not very clear to me how to create a Chroma index (not using the GPT API), because in the LlamaIndex documentation they do this to read from documents: GPTVectorStoreIndex.from_documents ...
    Is there any chance of a new video about the new version of LlamaIndex?

    • @SamuelChan · a year ago · +1

      Hey D, check out the latest video published in the LLM series - they use the newest API (>0.6.x) of LlamaIndex.
      Seeing that all of these tools are new and have yet to reach v1.0, I expect lots of breaking changes until then. At the time of recording I always try to use the latest version possible, but 4 weeks later it might be outdated again 😁
      LangChain & LLM tutorials (ft. gpt3, chatgpt, llamaindex, chroma)
      ruclips.net/p/PLXsFtK46HZxUQERRbOmuGoqbMD-KWLkOS

    • @DUXia · a year ago · +1

      @@SamuelChan Thanks a lot !!

  • @larst4244 · a year ago · +1

    Thanks for the content, bro. uvloop does not support Windows. Do you have any workaround?

    • @SamuelChan · a year ago · +1

      uvloop is a "drop-in replacement for the asyncio event loop", but implemented in Cython for speed. I don't think anything necessarily breaks if you omit it, but I don't have a Windows machine. Would WSL work? ;)

  • @Dani-rt9gk · a year ago · +1

    Thanks for the video Samuel! You might want to edit the title if this is a typo: "ft LlamadIndex" -> "LlamaIndex"

    • @SamuelChan · a year ago · +1

      Don’t know how I’ve never caught that 🤓 thank you!

  • @ST0orz · a year ago · +1

    Am I right in saying that vector databases just allow you to query them using natural language (by using similarity scores based on the dot product of the embeddings), but have nothing to do with LLMs? Or is the LLM somehow being used to generate the embeddings (perhaps via an encoder)?
    And I also don't understand how Pinecone claims that it gives "Long-term Memory for AI"; isn't it more like "AI-powered search"? Given an LLM today like LLaMA, I'm not sure how I could empower it and make it do things it otherwise could not using vector databases.

    • @SamuelChan · a year ago · +2

      Hey Leonard, vector databases are useful, and yes, they were useful well before the surge of LLMs. You have these n-dimensional vectors that represent your sentences, right? So if you have on hand a financial record of 100 pages, you want to:
      financial record pdf -> parsed as text -> text used to generate embeddings -> these embeddings are just n-dimensional vectors -> you also want metadata associated with these vectors, and then store them in a tree / node structure. You use this tree structure to do quick lookups, so when a query like "what is the R&D spending of all Temasek-backed companies in the whole of last year" is executed, it doesn't scan the whole 100 pages of pdf. Instead, it uses similarity (i.e. cosine distance) to search for nodes, which drastically reduces the search space. These are then combined to create the context window that is passed to the LLM.
      In my video I use the Harry Potter first-kiss example, which I still think is a better example, but in a nutshell, that's what vector databases do.
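That lookup step can be sketched end to end. The page vectors and the query vector below are invented toy values standing in for a real embedding model, and production stores use tree or approximate-nearest-neighbour indexes rather than this linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend each PDF page was already embedded into a vector (toy values).
store = {
    "page 12: R&D spending table": [0.9, 0.1, 0.3],
    "page 47: board of directors": [0.1, 0.9, 0.2],
    "page 81: R&D year-over-year": [0.8, 0.2, 0.4],
}

def top_k(query_vec, k=2):
    """Linear-scan nearest neighbours: the search-space reduction step."""
    ranked = sorted(store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [page for page, _ in ranked[:k]]

# Hypothetical embedding of "what is the R&D spending ...?"
query_vec = [0.85, 0.15, 0.35]
context = top_k(query_vec)  # only these pages reach the LLM's context window
print(context)
```

Only the retrieved pages, not the whole 100-page document, are assembled into the prompt sent to the LLM.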

    • @larawehbee · a year ago

      @@SamuelChan Hello Samuel, "These are then combined to create the context window that is passed to the LLM. ", please can you explain more about this point?

  • @kajasheriff · a year ago · +1

    Hi Sam,
    I have a doubt,
    Which embedding is the best?
    (OpenAi Embeddings or Hugging face embeddings)

    • @SamuelChan · a year ago · +1

      Hey Kaja, HuggingFace hosts open-source models that you can use, like OPT, T5, GPT-2 etc. So a better comparison is between OpenAI's embeddings and one of the open-source LLMs.
      Chroma by default uses the Sentence Transformer model, which is also hosted on HuggingFace. That saves you some OpenAI API credits, since you don't have to send your documents to OpenAI for the embedding + indexing. It does reasonably well in my experience, which is why many LLM videos in this playlist feature Chroma as an embedding store :)

  • @unknownpig5957 · 11 months ago · +1

    Hey Samuel, can I ask: is llama_index only usable for GPT with OpenAI? It looks like llama_index is LLM-independent, but all the examples I've seen on the internet seem to only use llama_index with OpenAI, with modules like GPTxxxx, and it also seems like I need an OpenAI key to access the modules in llama_index. If that is the case, is there any alternative with another LLM?

    • @SamuelChan · 11 months ago

      Hey, no, llama_index is agnostic to whichever LLM model you choose. In fact, if you're trying to go from raw data to a vector store, you could even go without any LLM, using the sentence transformer hosted on HuggingFace for the sentence embeddings.
      I have many videos featuring open-source LLMs without OpenAI that still use LlamaIndex and LangChain. Here's the playlist: ruclips.net/p/PLXsFtK46HZxUQERRbOmuGoqbMD-KWLkOS

    • @unknownpig5957 · 11 months ago

      @@SamuelChan, that is what I thought. I can't understand why this line of code: index = VectorStoreIndex.from_documents(documents) is looking for an OpenAI key? I have done the import: from llama_index import SimpleDirectoryReader, VectorStoreIndex; is there something I misunderstood here?

    • @SamuelChan · 11 months ago

      check out the code examples in my repo! this doesn't require the openai key and uses LlamaIndex as well:
      github.com/onlyphantom/llm-python/blob/main/07_custom.py
      Here's the video where I walk you through the process:
      ruclips.net/video/qAvHs6UNb2k/видео.html

  • @kevinehsani3358 · a year ago · +1

    There is one comment you made that I am not clear about, where you said llama_index and langchain are almost the same. GPT's comment on the difference between these two is: "LlamaIndex is a smart storage mechanism that provides a simple interface between LLMs and external data sources like APIs, PDFs, SQL etc. It provides indices over structured and unstructured data, helping to abstract away the differences across data sources. LangChain is a tool that brings multiple tools together. It allows you to leverage multiple instances of ChatGPT, provide them with memory, even multiple instances of LlamaIndex."

    • @SamuelChan · a year ago · +4

      There are plenty of overlaps. You can use LlamaIndex to handle reading in data, but you can also swap out LlamaIndex for LangChain to do the same. You can use LlamaIndex to handle storage, but you can also swap it out for LangChain and still do the same. So yeah, there are overlaps between what both libraries provide.
      The key difference is their emphasis: LlamaIndex was formerly known as GPT-Index and is focused on providing as many interfaces as possible in a unified API to read data (pdf, csv, excel, books, sql, other software etc) and turn it into vectors and embeddings. Its focus is on doing this well.
      LangChain's focus is on the assembly of a "pipeline" (I think the word you'll hear in the future is LLMOps, LLM + dev ops) where you chain multiple systems together. For example, in one of my videos in this series I rely on LangChain to execute Python code or SQL code for me; that is what is meant by "chaining" different ops together.
      Then, to turn data into embeddings, both LangChain and LlamaIndex have wrappers over OpenAI's embeddings API, so you can pick either. Hope this is clear? :)

  • @Bubbalubagus · 5 months ago

    When I try loading GPTSimpleVectorIndex, it doesn't get found and it isn't syntax-highlighted, which indicates that it is not imported

  • @M-ABDULLAH-AZIZ · a year ago · +2

    How can I store JSON data, like a dataset from the datasets package? For Pinecone, I saw in a Colab example they had stored the youtube-transcriptions dataset

    • @SamuelChan · a year ago · +1

      Hey Abdullah, I can expand on that example a bit more, but I'm not sure I understand what a dataset "from the datasets package" means.
      So we have some data, whether it's csv or txt or pdf or sqlite; we build the embeddings for it and then subsequently index them. These are already in JSON in the examples using Chroma, if you use the save-to-disk / save-to-json helper method.

    • @M-ABDULLAH-AZIZ · a year ago · +1

      @@SamuelChan And is it necessary to have documents? I.e., can I pass strings directly, or do I necessarily need txt files?

    • @SamuelChan · a year ago · +1

      You can hard-code your strings in the Python file, but that's only useful in a tutorial / demo context, right? If you're building something real, you would parse the text from the csv, pdf files, text documents etc.
      You could also build a front end and allow users to enter their text strings directly, with your Python functions delegating to OpenAI. An example is this video I did a few months ago:
      GPT3 Tutorial by projects: Teaching GPT-3 to write emails, ad slogans! (Web app)
      ruclips.net/video/aNI8pMjzgqg/видео.html

  • @sumangautam4016 · 9 months ago · +1

    A lot of your code is not working, as it seems a significant change was made to the llama_index code repo. But excellent explanations! Kudos!

    • @SamuelChan · 9 months ago

      Thank you! The GitHub repo for this series is semi-regularly updated though, so I'd use the explanations in this video and then use the code repo as a reference too, if that helps :)

  • @fragileandweak · a year ago · +2

    Why does my llama_index not have GPTChromaIndex in it? I'm using Colab

    • @SamuelChan · a year ago · +1

      This video was recorded with LlamaIndex 0.5.7, and a week later LlamaIndex 0.6.1 was released; the syntax changes a bit (from llama_index import GPTChromaIndex no longer works). If you git clone the repo and pip install -r requirements.txt you'll be good. An easy solution is to just do pip install llama-index==0.5.7 and you're all set!
      Maybe at LlamaIndex v1.0.0 stable I'll do a refreshed course using its latest API!

    • @fragileandweak · a year ago · +1

      @@SamuelChan Thank you

  • @jianitian9692 · a year ago

    Thanks! Where can I find your GitHub page?

    • @SamuelChan · a year ago

      Hey it’s in the description too, but here is the direct link to the repo:
      github.com/onlyphantom/llm-python

  • @chukypedro818 · a year ago · +1

    You are a bit too fast, bro. Nice explanation though

    • @SamuelChan · a year ago · +1

      Thank you bro! Not a native English speaker so sometimes I get the pacing all wrong :(

  • @climatebabes · a year ago

    embeddings is pronounced like 'bad', em - bad- dings

    • @SamuelChan · a year ago

      Thank you! I really appreciate it

  • @mattmcmahon8311 · 7 months ago · +1

    Between the accent and how quickly he speaks I have trouble following. Might just be me.