Scott is awesome in introducing people with humility as well as in making complex concepts simple by explaining it himself.
This was the simplest explanation of Vector Search I could get. Thanks Scott and Liam!
Very well explained, thanks Scott and Liam, keep up the good work.
Really interesting. This led me to a foray into better understanding vectors (in computing), vector space, vector database and ultimately vector search. What fun!
🙏 🙏 🙏
The vector search code is not working and I don't know the reason why
GREAT VIDEO (per usual) FROM YOU GUYS !
@ Liam...
In the world of ambiguity in which we live (and drown), I would suggest there is high merit in describing "Traditional Search" using some term less ambiguous/more self-describing than "traditional"?
Though there is room for debate, I would suggest that instead of calling it "Traditional Search" we might at least refer to it as "Keyword Density Search"?
The phenomenon that you're attempting to describe - and what Scott is (pretending to) trying to understand (he understands EVERYTHING - he's just doing this for his audience ;-) because he's cool like that) can best be described even verbally if we actually try to IMAGINE vectors representing something like individual words in a sentence, or even the subject/verb/predicate of a typical sentence structure. Now imagine a LONG sentence - this would "vectorize" into a LOT of dimensions, so that lots of longer sentences would be "pointing" in many different "directions", have many different final lengths, and therefore it would be quite easy to visibly SEE which small subset of sentences are roughly a SIMILAR length and point in a SIMILAR direction, and therefore ARE "similar"...
Now start imagining that your sentences are getting shorter & shorter & SHORTER, until now they're 3-word, 2-word or even 1-word "sentences" that have pretty much lost all of that "dimensionality" which made them easy to differentiate. Instead, as you look at the "origin" of your vector coordinate system you now have this massive, densely-packed set of itty-bitty-LITTLE vectors all packed like those fluffy seeds on the head of a dandelion, where MANY (thousands) of them could look nearly "identical" and the entire ability to determine only a SMALL SUBSET of 'similar' becomes IMPOSSIBLE, because you have a MYRIAD of nearly-"identical" vectors...
UPSHOT: When the DIMENSIONALITY - the "universe" in which your vectors live - starts to SHRINK, TOO MANY vectors start to look (and BE) "similar" - almost like an "infoglut" of vectors. No matter which "topic" you pick, there are TOO MANY SIMILAR VECTORS :-| ...
The FLIP side can ALSO be said - if your sentences are TOO LONG, then your VECTOR-SPACE "UNIVERSE" becomes TOO LARGE, and you have the opposite-problem-but-STILL-a-PROBLEM: no matter which "topic" you pick, there are now TOO FEW (often NONE) SIMILAR VECTORS :-| ...
SO IT ALL COMES DOWN TO THIS: "WHAT DOES 'The Searcher' KNOW ABOUT HIS/HER SEARCH?"
IF he/she can describe his/her search in LOTS-of-'connected'-words-that-we-term-a-(not-too-long)-SENTENCE, then VECTOR SEARCH can identify a "USEABLE SUBSET" of search results; that is, not-too-few-needles and also not-too-many-needles from the "haystack" of vectors ;-)
IF however he/she can only describe his/her search in a FEW DISCONNECTED words, then KEYWORD DENSITY (aka "Traditional") Search has the potential to produce that "USEABLE SUBSET" of search results.
And I think (hope!) I've helped your viewers understand BOTH of these search strategies, WHY they work, and WHY Microsoft INTELLIGENTLY understood this and opted to devise HYBRID SEARCH :-)
All we have left is to find a way to transform (I'll use the term "distill") OVERLY-LONG sentences into SHORTER sentences with EQUIVALENT MEANING, so that they too can get mapped into lower-dimensional vector spaces and you don't have too LARGE of a dimension in your vector space.
So...HYBRID SEARCH with LONG-SENTENCE-DISTILLING might be a really powerful combination, using the GOAL of seeking that "USEABLE SUBSET of Search Results" as our destination.
Oh - and that LONG-SENTENCE-DISTILLING is just part of a larger body of "transformations" that need to be developed - transformations that handle MANY of the challenges we humans get ourselves into when we put multiple words together to form BAD sentences and DOUBLE-ENTENDRES and - UGH - TABLES !
If Microsoft continues to be at the forefront of not only creating HYBRID SEARCH (we still need a MORE self-describing term here!) but of also TRANSFORMING data currently NOT-well-suited to being 'vectorized' (see my note above about run-on sentences, tables and even "politician-speak" and "consultant-speak") into "equivalent-meaning" sentences which ARE well-suited to vectorization... WOWSERS.
Keep up the great videos !
-Mark Vogt (Principal Solution Architect/Data Scientist, AVANADE)
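For readers wondering how the keyword and vector result lists described in the comment above might be merged in a hybrid search, one widely used approach is Reciprocal Rank Fusion (RRF). The thread does not say how the merge is done, so treat this as a rough sketch under assumptions: the function name, the constant `k=60`, and the document IDs are all illustrative, not the service's actual implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by BOTH searches rise to the top.
    k=60 is a commonly quoted default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from keyword search, one from vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because RRF only looks at rank positions, it does not need the keyword scores and the vector similarity scores to be on the same scale.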
Exciting to see everyone so excited
Please share the notebook! Great video. I'd love to see a more in-depth explanation of the code used and library for azure search.
Is there a doc/tutorial on how to make an Azure indexer create those vector fields from database record fields (say, in CosmosDB) containing text/audio/image?
Searching data will change forever now! It's the rise of an entirely new era and paradigm of data engineering!
Great video! Fantastic walkthrough and explanation of the code.
Hey, very interesting work with lots of uses! I have a question. Cognitive Search computes the vector similarity between a "query" and a "source". In the image shown, "source" and "query" don't have to be the same modality; "query" can be text and "source" can be a video file. Does that mean that the embeddings produced from those two are directly comparable? Or must the query_emb and source_emb be generated by the same model? In other terms, my question is: can I search a query_emb (generated by a textual model) over a source_emb (generated by a visual model)?
Thanks for this demo!
Question about what is the 'vector search' addition, maybe I missed it.
I have been using Azure Cognitive Search for a couple of years and the text search capabilities and features (synonym map, context search, indexers and extensions to handle other files) have always been there, so in a way Az Cog Search was always 'a vector store + in-built search and training with job automation' framework. What is new with vector search that wasn't previously available? thx
It's a new data type in the index (Collection(Edm.Single)) that stores a generated vector (or embedding). The query engine was also extended to perform vector queries and look for the most similar vector in the search index. That's kind of it in a nutshell.
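To make "look for the most similar vector" concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the metric and does a brute-force scan; the reply above does not say which metric or algorithm the query engine uses, and the three-dimensional vectors and document names are made up purely for illustration (real embeddings have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal dimension.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Score every stored vector against the query and return the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "index": each entry stands in for one vector field value.
toy_index = {
    "friends": [0.9, 0.1, 0.0],
    "buddies": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]

results = top_k(query_vector, toy_index)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

A brute-force scan like this is fine for a handful of vectors; a search service would typically rely on an approximate nearest-neighbor index instead.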
I'm receiving a 404 on the create_index call. The delete works successfully. Any idea why?
Don’t you need a container in Azure to keep your files? Or are you only saving the embedded docs? (In the example, the content isn’t files.)
Very clearly explained Liam! Thanks. I have an app that uses the Github project Azure Smart Search, how can I start working with content with Vector embeddings? How can I migrate my Cognitive Search Indexes to have Vector content?
Hi there, Ulises. If your service was created after Jan 1, 2019, you just need to create vector fields in your existing indexes and start indexing your embeddings: msft.it/60559rGlb
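As a rough illustration of what "create vector fields in your existing indexes" can look like, the sketch below builds a field definition as a Python dictionary that could be serialized into an index definition. Treat the details as assumptions to check against the linked documentation: `Collection(Edm.Single)` is the type named earlier in this thread, while the `dimensions` and `vectorSearchConfiguration` property names follow the 2023-07-01-Preview REST API (later API versions use `vectorSearchProfile` instead). The field name, configuration name, and the 1536 dimension count are placeholders.

```python
import json


def vector_field(name, dimensions, config_name):
    """Build a vector field definition for a search index (illustrative).

    Property names are assumed from the 2023-07-01-Preview REST API;
    verify them against the API version you actually target.
    """
    return {
        "name": name,
        "type": "Collection(Edm.Single)",  # collection of single-precision floats
        "searchable": True,
        "retrievable": True,
        "dimensions": dimensions,  # must match the embedding model's output size
        "vectorSearchConfiguration": config_name,  # "vectorSearchProfile" in later versions
    }


# Placeholder values: 1536 matches some popular embedding models, not all.
field = vector_field("contentVector", 1536, "my-vector-config")
print(json.dumps(field, indent=2))
```

The field would be appended to the `fields` array of the existing index definition, after which documents can be re-indexed with their embeddings.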
It was funny how Scott's expression switched from 😀to 😐 when Liam said "I hope this doesn't mean anything to you"
This is fantastic. I thought I understood how vector search works, but how did it know the relationship between "amigo", "buddies" and "friends"? Is there a base model that has all these basic word relationships defined as embeddings? i.e., is the vector for "amigo" already in the vector database? If so, how? Thanks!
That is GPT-3.5
Or the embedding model, which has many languages as its basis
How are the actual vector numbers generated? Is there a white paper that explains?
Thanks for reaching out Mike, for a better understanding of the process, please take a look at our available resources regarding embedding models, here: msft.it/60599yWhL
msft.it/60519yWhF
Very cool video ❤
please provide the link where i can find the code
Hi there! You can find various samples here. msft.it/6052cs6c8
This is extremely useful! Thanks!
Happy to hear that, Biraj! 🙂
Thanks, it is really useful. I want to move my local vector store to Azure, and it looks like this is the solution.
where can I get the source code?
Hi Ben, thanks for your query, please visit the link below for vector search code samples.
msft.it/60559JIu1
Good and timely
I loved it!
this is super cool.
We're happy to hear you enjoyed this, Arup!