ruclips.net/video/tOhWpF5-_z4/видео.html - here the beam length is 2.
ruclips.net/video/tOhWpF5-_z4/видео.html - here the beam length is 3.
ruclips.net/video/tOhWpF5-_z4/видео.html - here the beam length is 6?
Why do we take the top 6 (num_beams * 2), as mentioned here: ruclips.net/video/tOhWpF5-_z4/видео.html?
Also, in ruclips.net/video/tOhWpF5-_z4/видео.html, with 'boy' as input, 'and' and 'who' had the highest probability (you chose the top 2),
but with 'dog' as input only 'who', i.e. the top 1, was chosen?
Are you picking the top 3 across the outputs with inputs 'boy', 'dog', and 'woman'?
In the code example, the beam size is 3, but the batch size is 2. That's why it appears we have 6 sequences at a time, and this illustrates how beam search is combined with batching.
Regarding your question about taking the top 3: we are taking the top 3 beams overall, and they may correspond to any beams from the previous iteration (it's not necessarily a 1-to-1 correspondence). So we might use 2 candidates from the beam ending in "boy", 1 from the beam ending in "dog", and 0 from the beam ending in "woman".
Hope this clarifies things!
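To make that selection concrete, here is a minimal pure-Python sketch of one beam search step (toy numbers and names, not the actual Hugging Face code): every live beam is expanded by its candidate next tokens, and the top num_beams hypotheses are kept across all expansions combined, so one previous beam may contribute several survivors and another may contribute none.

```python
import math

def beam_step(beams, next_token_logprobs, num_beams):
    """One beam search step: expand every beam by every candidate token,
    then keep the top `num_beams` hypotheses across ALL expansions."""
    candidates = []
    for tokens, score in beams:
        last = tokens[-1]
        for token, logp in next_token_logprobs[last].items():
            # Sequence log-prob accumulates: score + log P(token | beam)
            candidates.append((tokens + [token], score + logp))
    # Global top-k: survivors need not come one-per-previous-beam.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:num_beams]

# Hypothetical next-token distributions, loosely echoing the video's example.
logprobs = {
    "boy":   {"and": math.log(0.5), "who": math.log(0.4), "ran": math.log(0.1)},
    "dog":   {"who": math.log(0.6), "and": math.log(0.2), "ran": math.log(0.2)},
    "woman": {"who": math.log(0.3), "and": math.log(0.3), "ran": math.log(0.4)},
}

beams = [(["boy"], math.log(0.5)),
         (["dog"], math.log(0.3)),
         (["woman"], math.log(0.2))]
top3 = beam_step(beams, logprobs, num_beams=3)
# Here 2 survivors come from the "boy" beam, 1 from "dog", and 0 from "woman".
for tokens, score in top3:
    print(tokens, round(score, 3))
```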
nicely explained!
It seems no KV cache is used in the implementation. How can beam search be made compatible with the KV cache to make it more efficient?
I didn't mention it in this video, but the KV cache is supported in the Hugging Face implementation (and is turned on by default) -- it is controlled by the use_cache parameter.
I just read the Hugging Face transformers implementation. Sure, it does support the KV cache; however, beam search in transformers is implemented by simply expanding the batch size. I suspect this is not very efficient, especially for memory, since nothing is reused here; even the KV cache for the prompts from the prefill phase is not reused. Do you know of any implementation that is more mature or optimized? Thanks a lot! @@EfficientNLP
You can pass your custom past_key_values by doing a forward pass once and loading it into generate. @@feixyzliu5432
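As a rough illustration of that idea (a toy sketch with made-up function names, not the transformers API): the expensive prompt forward pass that builds the KV cache runs once, and the resulting cache is shared by all beams, so only the per-beam decode work is repeated.

```python
PREFILL_CALLS = 0

def prefill(prompt):
    """Stand-in for the expensive prompt forward pass that builds the KV cache."""
    global PREFILL_CALLS
    PREFILL_CALLS += 1
    return {"kv": list(prompt)}  # toy "cache": just the prompt tokens

def decode_step(kv_cache, beam_tokens):
    """Stand-in for one decode step that extends a beam using the shared cache."""
    return kv_cache["kv"] + beam_tokens

prompt = ["the", "quick"]
cache = prefill(prompt)            # prefill runs ONCE for the prompt
beams = [["brown"], ["red"], ["lazy"]]
extended = [decode_step(cache, b) for b in beams]  # cache shared by all 3 beams
print(extended)
```

The point of the sketch is only the call pattern: one prefill, many decode steps reusing its result, rather than redoing the prompt work once per beam.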
Hey, great video! I just wanted to ask: what are you using as a debugger to get the intermediate values of the variables? Looks very interesting...
I used PyCharm for this video, but most modern IDEs should have a similar feature.
@@EfficientNLP thank you so much!
Very well explained!
What IDE is this?
This is PyCharm, but VS Code has similar debugging functionality.