Building an LLM fine-tuning Dataset

  • Published: 27 Sep 2024

Comments • 77

  • @loudsquad2324
    @loudsquad2324 6 months ago +12

    ofc it pulls the nsfw subreddit first 😆 that's hilarious! Great content as always.

  • @nethrashri486
    @nethrashri486 6 months ago +6

    Waiting for this long time... biggg thank youuuuuuuuu ..

  • @TrueTributes13
    @TrueTributes13 2 months ago

    This was such a fun watch, so much information, you make learning this stuff a blast🙌

  • @varshwalia
    @varshwalia 6 months ago +2

    Man delivers every single time.

  • @BoobieBusiness
    @BoobieBusiness 6 months ago +6

    At 32:15 you mention being unsure of the meaning of the parent IDs in the dataset. The Reddit post you linked for the BigQuery dataset contains a SELECT statement with a REGEXP_REPLACE of 't[0-9]_' on the link_id. According to GPT-4, link_id is the field that holds the ID of the post (submission) a comment belongs to. Reddit IDs for posts and comments are often prefixed with a type indicator, t1 through t6: Comment, Account (user), Link (post or submission), Message (private message), Subreddit, Award.
    If you did not filter the data frame on IDs starting with t1_, then you might have fine-tuned the model on all types of content, not just comments/conversations. If so, it might explain why adding all those other subreddits messed with the training process (as the prompt template is not formatted for other types of content).

    • @sentdex
      @sentdex 6 months ago

      Hmm. something to dig into more for sure, thank you!
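The type prefixes discussed in this thread can be split from the base ID in a few lines of Python. This is a minimal sketch, not the code used in the video: the field name `parent_id` and the sample IDs are assumptions for illustration, and the prefix table simply restates the list given in the comment. In Reddit's API a comment's parent normally carries either `t1_` (a reply to another comment) or `t3_` (a reply directly to the post), so the prefix is one way to tell nested replies from top-level ones.

```python
import re

# Type prefixes as listed in the comment above (t1 through t6).
KIND_BY_PREFIX = {
    "t1": "comment",
    "t2": "account",
    "t3": "link",
    "t4": "message",
    "t5": "subreddit",
    "t6": "award",
}

# Same idea as the REGEXP_REPLACE pattern 't[0-9]_', but capturing both parts.
FULLNAME_RE = re.compile(r"^(t[0-9])_(\w+)$")


def split_fullname(fullname):
    """Return (kind, base_id); kind is None when the prefix is missing or unknown."""
    match = FULLNAME_RE.match(fullname)
    if match is None:
        return None, fullname
    prefix, base_id = match.groups()
    return KIND_BY_PREFIX.get(prefix), base_id


def replies_to_comments(rows):
    """Keep only rows whose parent_id points at another comment (t1_ prefix)."""
    return [row for row in rows if split_fullname(row["parent_id"])[0] == "comment"]


# Hypothetical rows: the IDs and bodies are made up for this example.
rows = [
    {"body": "top-level reply", "parent_id": "t3_34f9rh", "link_id": "t3_34f9rh"},
    {"body": "nested reply", "parent_id": "t1_cqug90g", "link_id": "t3_34f9rh"},
]
nested = replies_to_comments(rows)
```

Whether to keep or drop the `t3_` rows depends on how the prompt template treats top-level replies; the sketch only shows how to tell the two cases apart.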

  • @peepleep7931
    @peepleep7931 6 months ago +1

    hell yeah sentdex is back

  • @bigbena23
    @bigbena23 2 months ago

    First of all, your videos are amazing.
    I was thinking of doing the same, not for Reddit but for Slack discussions.
    In Slack there is only 1 layer of discussion with threads (so I guess it's 2), but not more than that.
    What I couldn't quite understand from your video is how you decide which speaker to replace with the chatbot.
    Is it simply the last one for each tier?
    I'm unsure how to apply it to my use case. Maybe I should just replace a random user in the chats with the bot every X times?

  • @MaxM9000
    @MaxM9000 6 months ago +2

    This project reminds me of Yannic's GPT4-Chan project. How cursed can we get a WSB AI bot in terms of memes and degenerate strategies?

  • @rook451
    @rook451 3 months ago

    Love your website. Thank you.

  • @nomanshiekh26
    @nomanshiekh26 6 months ago

    Really informative video.
    Thanks!

  • @BenYu-v8e
    @BenYu-v8e 6 months ago

    Your video is helpful for me to start fine-tuning models. One question: can the NumPy library achieve the same performance when sorting datasets?

  • @savagejinx8179
    @savagejinx8179 6 months ago +1

    How come you never continued the Neural Networks from Scratch series?

  • @livinthrusound
    @livinthrusound 6 months ago

    Surprised no one mentioned but … r/2007scape?? Love it

  • @samar1900
    @samar1900 5 months ago +1

    I have just started with deep learning. Can anyone suggest which video I should start with? Is there a recommended order on this channel that I can follow?

    • @TheInternalNet
      @TheInternalNet 4 months ago

      Yeah like an absolute beginner crash course. I'm so fired up to learn this.

  • @nidavis
    @nidavis 5 months ago

    To solve for when the bot should respond, perhaps a simple classification model trained on whether or not the bot should reply, which then calls the chatbot based on that result?
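The gating idea in this comment could be wired up roughly as follows. This is only a sketch of the control flow: `keyword_classifier` and `echo_chatbot` are hypothetical stand-ins (a trained classifier and the fine-tuned model would take their places), and nothing here comes from the video itself.

```python
from typing import Callable, Optional


def make_gated_bot(
    should_reply: Callable[[str], bool],
    generate_reply: Callable[[str], str],
) -> Callable[[str], Optional[str]]:
    """Wrap a chatbot so it is only called when the classifier says to reply."""

    def respond(message: str) -> Optional[str]:
        if should_reply(message):
            return generate_reply(message)
        return None  # stay silent

    return respond


# Placeholder classifier: a trained binary model would replace this rule.
def keyword_classifier(message: str) -> bool:
    return "?" in message or "@bot" in message.lower()


# Placeholder chatbot: the fine-tuned LLM call would go here.
def echo_chatbot(message: str) -> str:
    return f"(bot reply to: {message})"


bot = make_gated_bot(keyword_classifier, echo_chatbot)
```

Keeping the classifier and the generator as separate callables means the cheap gate runs on every message, while the expensive model call only happens when the gate returns true.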

  • @kadaliakshay6770
    @kadaliakshay6770 6 months ago +2

    Amazing Explanation bro keep it up

  • @mr.daniish
    @mr.daniish 6 months ago

    Another knowledge bomb!

  • @WetspongeUK
    @WetspongeUK 4 months ago

    would love to see the python from scratch series finish

  • @vipclassic105
    @vipclassic105 8 days ago

    Hello sir can you reverse cython

  • @prathyushmadhu2861
    @prathyushmadhu2861 3 months ago

    Does anybody know about that copilot he used to speed up the decompressing process?

  • @Akhoon_faheem
    @Akhoon_faheem 2 months ago

    In this age of AI, I fell for your videos 😅

  • @bennguyen1313
    @bennguyen1313 6 months ago

    I've seen some people use Google Colab / Jupyter - Spyder. How does training with those compare to Google Cloud?
    Can a Python application access a model running in the cloud for free (Google Colab), or are there no free options? What's the cheapest?
    For example, aside from cloud services that host LLMs (Railway, Modal, Render, Beam Cloud, Replicate, Streamlit, Replit), could I run Ollama on my own computer and run models (Llama2 (XB), Mistral 7B, etc.)?
    Is the downside that my Python code would need to be written for a specific API? For example, OpenAPI, Gemini, OpenAI's Assistant API, Au Mistral, Gemini Pro, llama2, FastAPI are all different?

  • @WL113
    @WL113 6 months ago

    finally! booya!

  • @BlueBearOne
    @BlueBearOne 5 months ago +3

    Did you take down your Discord server?

  • @spxyo
    @spxyo 6 months ago +1

    Hi! Have you checked the bills after downloading so much data from GCS? It seems like a lot of class B operations and transferred data. Did it cost you more than $1000? Thanks

    • @sentdex
      @sentdex 6 months ago +5

      The entire BigQuery cost for the operations here was $89.84, and that includes a few exports/downloads that I ended up doing a couple of times as I developed.

  • @dhyanais
    @dhyanais 6 months ago

    Is it important to differentiate by language? I bet you'll find all kinds of languages there. Is it relevant to distinguish the language first and only use comments from one language?

    • @sentdex
      @sentdex 6 months ago +1

      Good question when it comes to fine-tuning, especially with QLoRA. I would estimate that you'd want to keep it simpler, but we do know when it comes to fully training models that multi-lingual tends to produce better models.

  • @StephenRoseDuo
    @StephenRoseDuo 2 months ago

    You good Sentdex?

  • @MIH20788
    @MIH20788 19 days ago

    Bring back our NNFS tutorial, reading only the book is hard 😊😊😊😊😊😊

  • @mytechnotalent
    @mytechnotalent 6 months ago

    awesome! 34,445 woohoo!

  • @davidschaupp5423
    @davidschaupp5423 6 months ago

    I can't find the dataset on BigQuery?

    • @sentdex
      @sentdex 6 months ago

      Still there, here's the link: bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05

    • @donquixoteth
      @donquixoteth 6 months ago

      @@sentdex It does not work for me

    • @thekingofallblogs
      @thekingofallblogs 4 months ago

      @@sentdex Do you have to create a billing account to access it, or be part of some group? All I see is just-landing-xxx under Explorer.

  • @cod-newbie9166
    @cod-newbie9166 6 months ago

    Why can’t I access the ebooks😢?

    • @sentdex
      @sentdex 6 months ago

      Do you mean you made an order and haven't gotten it?

    • @cod-newbie9166
      @cod-newbie9166 6 months ago

      @@sentdex I mean I can’t open the web page

    • @sentdex
      @sentdex 6 months ago

      @@cod-newbie9166 which one?

  • @adempc
    @adempc 6 months ago

    Word

  • @human_agi
    @human_agi 6 months ago

    Did you download from gcp to your local computer?

    • @sentdex
      @sentdex 6 months ago

      Yes. When you go to export, gcp gives you a gsutil command example. I just took it and used * to get all the files with a single command

  • @60hit99
    @60hit99 6 months ago

    Hi

    • @sentdex
      @sentdex 6 months ago +2

      Hello

  • @hamzashaikh9795
    @hamzashaikh9795 6 months ago

    First one to comment 🎉

  • @johnnywilliams2641
    @johnnywilliams2641 6 months ago

    loving all comments is the same as loving none sentdex. We all know there is no information there.

    • @sentdex
      @sentdex 6 months ago +2

      The good news is: I don't love all comments.

    • @johnnywilliams2641
      @johnnywilliams2641 6 months ago

      @@sentdex You're a master of your craft. Learned a ton from your python tutorials many years ago. Didn't mean to offend. Thought it was a crafty information theory joke. Cuz I'm super witty and good looking too. You better like my damn comments senty.

    • @sentdex
      @sentdex 6 months ago

      @@johnnywilliams2641 1 love, best I can do

  • @AlbertCelmaOrtega
    @AlbertCelmaOrtega 4 months ago +6

    Hi sentdex, Albert from Barcelona here! INCREDIBLE 1.33M SUBS!! YOU ARE AWESOME!!
    I learnt how to code and to do ML thanks to you. I studied civil engineering at Imperial College London and can tell your ability to convey ideas and teach is unprecedented! Plus it's always fun. I recall the first model I made, thanks to you, a binary classification model about breast cancer in 2019! Still here. I am starting a tech startup for logistics. I hope I can make it, and give back to you for so much you've already given me.

  • @codespace
    @codespace 1 month ago +1

    where are you dude? long time no see?

  • @gcm4312
    @gcm4312 6 months ago +1

    52:28 I believe the format Python's json module expects is like `[{"key1":"value1"},{"key2":"value2"}]`. Your data separates records with newlines (not commas) and is not wrapped in a list.
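For data laid out the way this comment describes (one JSON object per line, no enclosing list), a common workaround is to parse each line on its own. A minimal sketch, assuming every record fits on a single line; the sample records are made up and `iter_json_lines` is a hypothetical helper name:

```python
import json


def iter_json_lines(lines):
    """Yield one parsed object per non-empty line of newline-delimited JSON."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which line failed instead of a bare parse error.
            raise ValueError(f"line {line_number}: {err}") from err


# Two records separated by a newline, as in a newline-delimited export.
sample = '{"body": "first", "score": 1}\n{"body": "second", "score": 2}\n'

# Parsing the whole string at once fails, because it is not one JSON document.
try:
    json.loads(sample)
    whole_file_parses = True
except json.JSONDecodeError:
    whole_file_parses = False

records = list(iter_json_lines(sample.splitlines()))
```

An open file handle can be passed in place of `sample.splitlines()`, which avoids loading a large export into memory all at once.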

  • @johnblomberg389
    @johnblomberg389 6 months ago +2

    Hi Sentdex!
    First of all, thanks for the video. It's interesting as always to see you tinker with this stuff, and I'm really learning a lot :)
    After your previous videos with the WSB bot I decided to create my own scraper to collect comments from the daily WSB threads.
    I have just kept it running every now and then on my local computer and collected something like 57 MB of conversation data.
    I believe it is 229 000 comment threads, some longer and some shorter. It has not been properly cleaned, so there are also threads with only one comment in them, but even after removing those it should be at least 150k recent threads (collected from mid-2023 until now).
    If you want to play around with it I can clean it and upload it somewhere :)

  • @lovemedicine
    @lovemedicine 2 months ago

    Hi, thanks for the video. Can you create a video on meta-learning with an example?

  • @yureqandrade
    @yureqandrade 1 month ago

    @sentdex where's your Bitcoin Whitepaper playlist? I couldn't find it. Can anyone shed some light here, please?

  • @kadaliakshay6770
    @kadaliakshay6770 6 months ago +1

    waiting for more amazing videos and also just subscribed and liked the video

  • @mher_22
    @mher_22 1 month ago

    ...maybe come back? pls?

  • @asiddiqi123
    @asiddiqi123 6 months ago +1

    Harrison for President

  • @phils744
    @phils744 6 months ago

    I really need to learn to be patient. You have excellent content, and I would like to install this on my HA cluster, with my own database. From Excel files to PDF, it's cool as heck. Be safe everyone

  • @MrunalAshwinbhaiMania-b1d
    @MrunalAshwinbhaiMania-b1d 5 months ago

    Hello Sentdex!
    Thank you for such a wonderful video. I just have one question: when I tried to get the fh-bigquery data, it was not available at the link. Can you please give us the BigQuery link?
    Much appreciated.
    Thanks,
    Mrunal Ashwinbhai Mania

    • @Zero-tg4dc
      @Zero-tg4dc 1 month ago

      if you still need help with this I know how to access the data

  • @rataash_x
    @rataash_x 6 months ago +1

    You make the whole learning process so fun, it never gets boring.

    • @TrueTributes13
      @TrueTributes13 2 months ago

      I wholeheartedly agree, top tier stuff👌

  • @limajgarcia
    @limajgarcia 6 months ago

    Hit the like and watch. let's go!

  • @shashwatxcodes
    @shashwatxcodes 5 months ago

    Sir, is it true that mostly only folks with a master's or PhD in AI get packages over 100k USD?
    Please reply, sir, as I'm confused between taking BTech CSE or BTech AIML.
    I'm also unsure whether to target AI engineering right from the 1st semester, or do web dev initially and then switch to AI/ML in my 3rd semester.
    My target: a 100k+ USD remote job.
    Please do reply, sir. I'd be extremely thankful to you ❤

    • @shitmandood
      @shitmandood 1 month ago

      You would probably have to know somebody who wants to give you such a high-salary, work-from-home job, because if the work is really important they're going to want you nearby for discussions. I could be wrong, but I'd be surprised; it would depend on your credentials. If you want a >100K remote job, it can be anything. It doesn't have to be AI engineering, so if money is all you want, it doesn't really need to be AI.

  • @tcgvsocg1458
    @tcgvsocg1458 6 months ago

    long time no see