A Retrieval Augmented Generation system to query the scikit-learn documentation

  • Published: Oct 1, 2024
  • 🔊 Recorded at PyCon DE & PyData Berlin 2024, 22.04.2024
    2024.pycon.de/...
    🎓 Watch Guillaume Lemaitre introduce an experimental Retrieval Augmented Generation system for querying the scikit-learn documentation and compare it with the website's current exact-match search and with an LLM-only setup.
    Speakers:
    Guillaume Lemaitre
    Description:
    In his talk, Guillaume Lemaitre, an open-source engineer and core developer of scikit-learn, introduced an experimental Retrieval Augmented Generation (RAG) system for querying the scikit-learn documentation. The current scikit-learn website uses an "exact" search engine that lacks the ability to handle spelling mistakes and natural language queries. To overcome these limitations, Guillaume and his team experimented with large language models (LLMs) and opted for a RAG system due to resource constraints.
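To make the limitation of "exact" search concrete, here is a minimal sketch that is not taken from the talk. It contrasts an exact lookup with approximate string matching from Python's standard `difflib` module. The `NAMES` list and both helper functions are hypothetical, and fuzzy matching is only an illustration of the problem, not the RAG approach the talk presents.

```python
import difflib

# Hypothetical index of documented names; a real index would be far larger.
NAMES = [
    "LogisticRegression",
    "LinearRegression",
    "StandardScaler",
    "RandomForestClassifier",
]


def exact_search(query, names):
    """Return only the names that match the query exactly (ignoring case)."""
    return [name for name in names if name.lower() == query.lower()]


def fuzzy_search(query, names, cutoff=0.8):
    """Return names similar to the query, best match first."""
    lowered = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=3, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


if __name__ == "__main__":
    typo = "LogisticRegresion"  # one "s" is missing
    print("exact:", exact_search(typo, NAMES))
    print("fuzzy:", fuzzy_search(typo, NAMES))
```

The exact lookup returns nothing for the misspelled query, while the approximate lookup still ranks the intended class first. Neither handles a natural language question, which is the gap the RAG system targets.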
    The talk detailed the RAG pipeline stages, including documentation scraping strategies based on numpydoc and sphinx-gallery for lexical and semantic searches. Comparisons were made between the RAG approach and an LLM-only system, highlighting the contextual advantages of the former. The experimental code is available on GitHub for further exploration.
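The retrieval and prompt-building stages described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code from the talk's GitHub repository: the `CHUNKS` corpus, the function names, and the prompt wording are all hypothetical. A small TF-IDF scorer stands in for lexical search. A semantic retriever would replace the scoring step with similarity between embeddings from a model, which is not shown here, and the final call to an LLM is also left out.

```python
import math
import re
from collections import Counter

# Hypothetical chunks standing in for scraped API pages and gallery examples.
CHUNKS = [
    {"source": "api/LogisticRegression",
     "text": "LogisticRegression is a linear classifier. The C parameter "
             "controls the inverse of regularization strength."},
    {"source": "api/StandardScaler",
     "text": "StandardScaler standardizes features by removing the mean "
             "and scaling to unit variance."},
    {"source": "example/plot_cv",
     "text": "This example shows how to evaluate a model with cross "
             "validation using cross_val_score."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) and the IDF table for the texts."""
    counts = [Counter(tokenize(text)) for text in texts]
    n_docs = len(counts)
    doc_freq = Counter(term for doc in counts for term in doc)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in counts]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks whose text is lexically closest to the query."""
    vectors, idf = tfidf_vectors([chunk["text"] for chunk in chunks])
    query_counts = Counter(tokenize(query))
    query_vec = {t: tf * idf[t] for t, tf in query_counts.items() if t in idf}
    ranked = sorted(zip(chunks, vectors),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, retrieved):
    """Assemble the retrieved context and the question into an LLM prompt."""
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in retrieved)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")


if __name__ == "__main__":
    question = "What does the C parameter of LogisticRegression control?"
    print(build_prompt(question, retrieve(question, CHUNKS)))
```

An LLM-only system would send the bare question to the model. The RAG variant sends the prompt built above, so the answer can draw on the retrieved documentation chunks and point back to their sources.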
    The talk also discussed the challenges and benefits of integrating a RAG system into open-source projects, including hosting and cost considerations. Guillaume emphasized the importance of an open-source software stack and open-weight models in developing the RAG system for scikit-learn documentation queries.
    ⭐️ About PyCon DE & PyData Berlin:
    The PyCon DE & PyData conference unites the Python, AI, and data science communities, offering a unique platform for collaboration and innovation. The PyCon DE & PyData Berlin 2024 conference, hosted in partnership with the local Berlin PyData chapter, provided an exceptional experience, fostering deeper connections within the Python community while showcasing advancements in AI and data science. Attendees enjoyed a diverse and engaging program, solidifying the event as a highlight for Python and AI enthusiasts nationwide.
    Follow us:
    • LinkedIn: / 28908640
    • X: www.x.com/pyconde
    • X: www.x.com/pyda...
    Links:
    • Conference website: pycon.de
    • Related sessions: 2024.pycon.de/p...
    The conference is organized by
    • Python Softwareverband e.V.: pysv.org
    • NumFOCUS Inc.: numfocus.org
    • Pioneers Hub gemeinnützige GmbH: pioneershub.org
    If you enjoyed this session, please like, comment, and subscribe to our channel for more insightful talks and discussions.
    Share this video with your network to spread the knowledge!
    Hashtags:
    #Python #PyConDE #PyData #OpenSource #AI #DataScience #MachineLearning #SoftwareDevelopment #LLMs #Community
    Acknowledgements:
    Special thanks to all the volunteers and sponsors who made this event possible.
    About:
    Python Softwareverband e.V.:
    PySV is a non-profit that promotes the use and development of Python in Germany through events, education, and advocacy, fostering an open Python community.
    NumFOCUS Inc.:
    NumFOCUS supports open-source scientific computing by providing financial and logistical support to key projects like NumPy and Jupyter, promoting sustainable development and collaboration.
    Pioneers Hub gemeinnützige GmbH:
    Pioneers Hub is a non-profit fostering innovation in AI and tech by connecting experts and promoting knowledge exchange through events and collaborative initiatives.
    www.pydata.org
    PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
