A GPT-4V Level Multimodal LLM on Your Phone: MiniCPM-Llama3-V-2_5

  • Published: 6 Sep 2024
  • MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take images and text as inputs and produce high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming for strong performance and efficient deployment. The most notable model in this series currently is:
    • MiniCPM-Llama3-V 2.5: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. Equipped with enhanced OCR and instruction-following capabilities, the model also supports multimodal conversation in over 30 languages, including English, Chinese, French, Spanish and German. With the help of quantization, compilation optimizations, and several efficient inference techniques on CPUs and NPUs, MiniCPM-Llama3-V 2.5 can be deployed efficiently on end-side devices (a minimal usage sketch follows after the links below).
    Relevant Links:
    github.com/Ope...
    huggingface.co...
    huggingface.co...
    If you'd like to support me financially, it is totally optional and voluntary. Buy me a coffee here: www.buymeacoff...
    If you like such content, please subscribe to the channel here:
    www.youtube.co...
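
    For readers who want to try the model themselves, below is a minimal sketch of single-turn multimodal inference with Hugging Face transformers, following the usage pattern documented on the model card. It assumes the `openbmb/MiniCPM-Llama3-V-2_5` checkpoint, a CUDA GPU, and the model's custom `chat()` interface loaded via `trust_remote_code`; the image path, question, and sampling settings are placeholders.

    ```python
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    # Load the model and tokenizer. trust_remote_code is required because
    # the chat() interface ships with the checkpoint's custom code rather
    # than core transformers.
    model = AutoModel.from_pretrained(
        'openbmb/MiniCPM-Llama3-V-2_5',
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ).to('cuda').eval()
    tokenizer = AutoTokenizer.from_pretrained(
        'openbmb/MiniCPM-Llama3-V-2_5',
        trust_remote_code=True,
    )

    # 'example.jpg' and the question are placeholders for your own input.
    image = Image.open('example.jpg').convert('RGB')
    msgs = [{'role': 'user', 'content': 'What is in this image?'}]

    # Single-turn vision-language chat; temperature is illustrative only.
    answer = model.chat(
        image=image,
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        temperature=0.7,
    )
    print(answer)
    ```

    For CPU or end-side deployment as described above, the project also provides quantized (e.g. int4) variants and llama.cpp/ollama support; see the GitHub repository linked above for details.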

Comments • 1

  • @golden--hand • 6 days ago

    Thanks for this video. I actually appreciate your more realistic perspective on the model's performance. Videos that only quote benchmarks, with little to no practical examples, are often unhelpful.