Check out my playlist of ChromaDB videos: ruclips.net/p/PL58zEckBH8fA-R1ifTjTIjrdc3QKSk6hI
Hi Johnny, my name's Johnny, nice to meet you lol.
Perfect, all I was interested in was persisting the database. Thanks for adding that at the end!
Short and Sweet! Thank you very much.
Concise and precise: that's exactly what you did here.
Now I can put ChromaDB on my resume. Thank you for creating such a crisp and straightforward tutorial.
haha same here!
Just want to say thank you for creating this video.
A+ explanation. Thank you for sharing and explaining your work, that was a HUGE help for me.
Such an amazing video, man. Thanks for this valuable knowledge.
You have a new subscriber. This video was very timely for my proof of concept.
Man, that was amazing. Thank you so much.
Awesome demo
Now I've reviewed this vid yet again so that I can move ahead with CUDA and multiprocessing!
Did you watch my video on ChromaDB with CUDA and Multiprocessing?
ruclips.net/video/7FvdwwvqrD4/видео.html
@@johnnycode YES! I successfully worked through "Chroma DB with CUDA", but that was some time ago, so now I am carefully reviewing it AGAIN! Thank you for your reply. BTW: I am now implementing a live 'Menu Kiosk' machine at my local Chinese restaurant. I need to include a 'Recommender' module for when the customer does not find what they want via their 'semantic searches' :)
Sounds like a great project, good luck with that.
Excellent tutorial. Thank You!
great demo and concise
The POC use case I want to try is querying vehicle parts based on a user description, which can be somewhat vague, and keyword search is not ideal most of the time. I'd also want to do search-as-you-type with a vector db. My understanding is that there are embedding models trained on vehicle parts, and seeing how easy it is to specify a new embedding model (although you have to remember which one a collection was created with), I hope I can prove out my POC. In addition, I want to use the persisted collection for other ideas as well. Thanks for making this video, it really helps.
Thanks for sharing, Ken. Your app would have come in handy for me when I was searching for what turned out to be a "cap" or "flute" oil filter wrench :D
This is a perfect tutorial, thanks!
Excellent tutorial.
Good Intro / Review. Thank You from a NEW Subscriber!!!
Thank you so much for an easy-to-follow, practical example!
I’m curious about something… at one point in the video you illustrate that the default embeddings are returning something unrelated (sesame ball) as the #1 choice. Your solution is to swap it out for another embedding provider.
But how would you go about digging in here and debugging further?
Unfortunately, when it comes to the search results, we are at the mercy of how 'smart' the embedding model is. The term 'sesame ball' is a translation and unique to this restaurant's menu, so I wouldn't expect models to know the meaning of the term. Somehow, its vector representation is close to the word shrimp for the first model, but we don't have a way to 'debug' it. Here are some things that we do have control over:
1. Change the amount of text (a short phrase, a few sentences, paragraphs, or pages) per embedding. The more that is packed into one embedding, the harder it is for the model to be accurate.
2. Switch the distance function to another one like Cosine Similarity: docs.trychroma.com/usage-guide#changing-the-distance-function
3. Switch to a more powerful, possibly paid, model; see the list of models and the section on Custom Embedding Functions: docs.trychroma.com/embeddings
4. Fine-tune a model to understand the terms used by your organization.
@@johnnycode Awesome, thanks for the incredibly thorough and helpful answer!
Thanks for the excellent explanation
What do you do when all-MiniLM-L6-v2 is not very good at judging what's similar? It gets it wrong a lot!
Here are a few suggestions, hope this helps:
1. Are you embedding a short phrase, a few sentences, paragraphs, or pages? The more that is packed into one embedding, the harder it is for the model to be accurate.
2. Try switching the distance function to another one like Cosine Similarity: docs.trychroma.com/usage-guide#changing-the-distance-function
3. Switch to a more powerful, possibly paid, model; see the list of models and the section on Custom Embedding Functions: docs.trychroma.com/embeddings
How do the models know that shrimp and prawn are the same thing? Like, how did the first model not get all 5 dishes while the second model found all 5?
The models are trained to understand language like ChatGPT’s models. The second model in the video is a larger “smarter” model, so it performs better than the first. The disadvantages of a larger model are that it takes up more storage space, uses more computing power and memory, and is slower than a smaller model.
Thanks for the video, it's really helpful. I have a question on ChromaDB in the Google Colab environment. Quite often, I'm facing the issue "OperationalError: attempt to write a readonly database". I've tried several approaches I found on the Internet without success. Do you have any suggestions for me? Thanks.
I've used Colab with the ChromaDB files in Google Drive, is this what you're doing: ruclips.net/video/ziiLezCnXYU/видео.html
@@johnnycode Not exactly the same. I was using LangChain's wrapper for a simple RAG scenario... "Chroma.from_documents" was the function I used to ingest embeddings from a PDF doc.
The only thing I can think of is that you are loading data using multiple Colab sessions or Python multiprocessing, since that would lock up the backend SQLite database.
Very helpful! thank you
Amazing video thank you so much 🙂❤
@johnnycode, any idea how to handle this error that occurs when I try to load a previously saved chromadb file, e.g. "vectordb"? InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768
You must use the same Embedding Function that you used to create that database. Embedding Functions convert text to a vector of numbers, and different Embedding Functions output different dimensions, so you can't use them interchangeably.
For those who are getting an error like 'Collection' object has no attribute 'model_fields':
try downgrading your chromadb version to 0.5.3. My version was 0.5.5, and
my issue was resolved once I downgraded.
Hi there, I have a csv file with 150 rows. I have created a collection and added the documents to it. When I query the collection, the document field gives me None; the ids field gives the correct id, but the document field is None. What should I do? Is there any size limit for the field returned by document?
If you use my code and the CSV provided in my GitHub repo (github.com/johnnycode8/chromadb_quickstart), does the document field show up? If the document field does show up, then you should check your loading logic. If the document field does not show up, check the part of the video that talks about using "include":
collection.query(...
include=[ "documents" ]
)
I don't see a published document field size limit. Are you loading extremely long docs?
Thank you my guy 🙂
I have a CSV file with 150 rows. I have created a collection and added my documents to it. When I query it, the document field always contains None; the id field gives me the correct id, but the document field is always None. Is there any size constraint for the document field? How should I solve this?
If you use my code and the CSV provided in my GitHub repo (github.com/johnnycode8/chromadb_quickstart), does the document field show up? If the document field does show up, then you should check your loading logic. If the document field does not show up, check the part of the video that talks about using "include":
collection.query(...
include=[ "documents" ]
)
I don't see a published document field size limit. Are you loading extremely long docs?
@@johnnycode Got it! The problem was caused by passing multiple columns from a row in the CSV file.
Have you come across this warning: Add of existing embedding ID: 1
Add of existing embedding ID: 2
...till all ids
I am only querying the database, but I am still getting this warning.
I think you should create a fresh database and collection and try again. Inserting records with IDs that already exist can cause weird issues.
@@johnnycode Okay. I deleted the collection and did it again... It worked, thanks!!!
This helped a lot! 👍
Thanks!
Thank you for the support!!!!😀😀😀
Nice video.
thank you
Where can we create a user-defined db space?
Sorry, I don’t understand your question.
@@johnnycode Where does it store the data if it is the default database? Can the user manually create the db and assign that path?
You can run ChromaDB in-memory if you are prototyping and don't need to retain the data. However, if you want to retain the data, use persistence mode: client = chromadb.PersistentClient(path="/path/to/save/to")
You can see that I use persistence mode in my other videos:
ruclips.net/p/PL58zEckBH8fA-R1ifTjTIjrdc3QKSk6hI
I hope this answers your question.
UserWarning: Unsupported Windows version (11). ONNX Runtime supports Windows 10 and above, only. What a pain...
do you have a list of your code or colab link?
Here you go: github.com/johnnycode8/chromadb_quickstart
Thanks@@johnnycode
Thanks. I do have one or two questions if you don't mind. First, client = chromadb.PersistentClient(path='content/drive') does not create the db on Colab even though the folder exists; it just defaults to in-memory and stores it there. Not sure if that is because of Colab? Also, when I retrieve using document = collection.get(ids=[document_id],
include=['documents']), it still brings back the entire record instead of just documents: {'ids': ['kk'], 'embeddings': None, 'metadatas': None, 'documents': ['**Section 1: Numbers 1-5 in ......'. Am I doing this wrong? Thanks a bunch.
For question 1: path='content/drive' points to your Google Drive folder. Change it to something like path='content/myvectordb' or path='content/drive/My Drive/Colab Notebooks/myvectordb'.
For question 2: The 'get' function will always return the entire record structure, but you can see that the fields you did not include are not populated in your example: embeddings: None, metadatas: None.
I tried all sorts of combinations for the persist directory like './', 'content', './content', but nothing works.
I am trying to run this in VS Code, but it's showing the kernel is dead. Why?
At which step?
@@johnnycode Can you tell me the steps before starting this? Like, should we connect to Docker or anything else? Because all I did was:
'pip install chromadb'
and run that basic code,
but the kernel dies immediately when it reaches the collections.add(....) line.
Make sure you have Python 3.10 or 3.11, since older or newer versions might not work. Install Anaconda or MiniConda. Then you can try my other video which starts from creating a Conda environment: ruclips.net/video/u_N1t0CBuqA/видео.htmlsi=g3V3uLfTK6zzPkNd
@@johnnycode I am actually using Python 3.11. Also, will it not work without a conda environment?
It will work without conda. This is more of a vscode problem. When you create a new file in vscode and name it with a .ipynb extension, vscode will pop up a message asking if you want to install the Jupyter notebooks kernel. Maybe you need to reinstall vscode.
Very interesting !
Rad. Thanks.
The interface is like a Jupyter notebook. How do I get it?
Install VSCode and create a file with a .ipynb extension.
You should really sell your code; everyone's benefiting for free.