I built my own Reddit API to beat Inflation. Web Scraping for data collection.

  • Published: 12 Jan 2025

Comments • 285

  • @dreamsofcode
    @dreamsofcode  1 year ago +39

    To get $15 credit for use with Brightdata to scrape your own APIs, visit: brdta.com/dreamsofcode

    • @meinkanal13378
      @meinkanal13378 1 year ago

      Just an info: Not working anymore, only $5

    • @dreamsofcode
      @dreamsofcode  1 year ago

      @meinkanal13378 inflation strikes again 😭
      Let me reach out. Thank you for letting me know

    • @PaulSebastianM
      @PaulSebastianM 1 year ago

      Be careful, web scraping is illegal in some countries.

  • @sivuyilemagutywa5286
    @sivuyilemagutywa5286 1 year ago +387

    The video was enjoyable, but it's important to acknowledge that sponsored content can introduce bias. One approach could be to make the entire video centered around the sponsor, or if you choose to feature the sponsor as you did, consider presenting alternative services similar to them. Your videos are consistently excellent, boasting high-quality production, a well-maintained pace, and crystal-clear explanations.

    • @aliengarden
      @aliengarden 1 year ago +10

      that was my exact thought, thanks for pointing it out.

    • @seanthesheep
      @seanthesheep 1 year ago +18

      when ChatGPT focuses more on the sponsor of the video than the video itself

    • @jaumsilveira
      @jaumsilveira 1 year ago +17

      Yeah, bro was talking about making everything as free as possible and then presents a service which is very expensive

    • @hqcart1
      @hqcart1 1 year ago +3

      What about captcha? He didn't mention that his sponsor can get around it, and even his code did not handle captcha.

    • @TheMacWindows
      @TheMacWindows 1 year ago +2

      @hqcart1 Death By Captcha and related services exist for that

  • @foobars3816
    @foobars3816 1 year ago +127

    This was never a technical limitation, it was a legal one.

    • @JustSomeGuy009
      @JustSomeGuy009 1 year ago +4

      uh, no. It's a financial one. The idea that companies are going to offer network and compute resources for the sheer amount of API calls made for free was always comical. It's sad that so many programmers and general public think this stuff is just free or a charity. No matter what you do, eventually these costs will catch up to the business and HAVE to be charged to people or else the service will just die.

    • @fizzcochito
      @fizzcochito 1 year ago

      @JustSomeGuy009 I am going to touch you without your consent

    • @Homiloko2
      @Homiloko2 1 year ago

      @JustSomeGuy009 Yep. People pretend web scraping is 'free', but it still costs the companies. The companies are willing to bear the cost of regular users browsing through pages, but a scraper browsing through the entire catalog is even more expensive for the company than if they just used the API. Scraping is definitely malicious.

    • @tabbytobias2167
      @tabbytobias2167 8 months ago +2

      @JustSomeGuy009 it costs a server less than a penny to serve 1000 requests.

    • @jameskim7565
      @jameskim7565 8 months ago

      @tabbytobias2167 yes, but for a service the size of Reddit, it can lead to hundreds of thousands of dollars in losses, due to the sheer volume of those requests.
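
A rough sketch of the arithmetic behind this exchange. Both inputs are assumptions lifted from the two comments above ("less than a penny" per 1,000 requests treated as an upper bound of one cent, and $100,000 as the low end of "hundreds of thousands of dollars"); neither is a measured Reddit figure, and `requests_for_cost` is a hypothetical helper name.

```python
# Back-of-the-envelope arithmetic. Both constants are assumptions taken from
# the comments above, not measured Reddit costs.
COST_PER_1000_REQUESTS_USD = 0.01  # "less than a penny", used as an upper bound
ASSUMED_LOSS_USD = 100_000         # low end of "hundreds of thousands of dollars"


def requests_for_cost(total_usd: float, usd_per_1000: float) -> int:
    """Return how many requests it takes to accumulate total_usd of serving cost."""
    return round(total_usd / usd_per_1000 * 1000)


if __name__ == "__main__":
    n = requests_for_cost(ASSUMED_LOSS_USD, COST_PER_1000_REQUESTS_USD)
    print(f"{n:,} requests")
```

Under these assumptions the two claims are compatible: a tiny per-request cost only adds up to six figures at a volume in the billions of requests.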

  • @shishsquared
    @shishsquared 1 year ago +465

    Crowdsourcing idea for this to prevent IPs getting blocked: a browser that pays its users for using it. Developers write scripts to scrape data, and pay to use the network of users. Users then get paid for using the web browser, which will create a private session, encrypted away from the user, run the web scraping tasks, and send the data back to the developer. Build it all on top of Chromium, and if done correctly, websites would have a very difficult time blocking based on IP addresses, activity, or fingerprinting because it would be distributed across actual user IPs, and actual user login times (browser only runs when open). My only concern would be how to protect the users when malicious devs start doing illegal activities. You'd have to have very strong terms and conditions, have logging, and be able to trace back requests to devs. But then that opens a dev privacy can of worms. Still, interesting concept

    • @phoneywheeze
      @phoneywheeze 1 year ago +351

      Botnet as a Service

    • @levifig
      @levifig 1 year ago +148

      You just described 99% of the “VPN” apps available for your mobile device… ;)

    • @MuhsinunChowdhury
      @MuhsinunChowdhury 1 year ago +11

      Wouldn't residential sneaker botting proxies be able to accomplish the same thing?

    • @mathisd
      @mathisd 1 year ago +3

      @MuhsinunChowdhury These costs..

    • @ajnart_
      @ajnart_ 1 year ago

      @levifig ahahahah you're not wrong, especially the free ones

  • @shadez221
    @shadez221 1 year ago +267

    For anyone planning to try this, use Puppeteer's headless mode so that it doesn't open multiple browser windows, which improves performance, and route it via a VPN set up on AWS to obfuscate your IP.
    And be ready to have your IP blocked 😊

    • @__sassan__
      @__sassan__ 1 year ago +1

      Even when using the VPN?

    • @tacokoneko
      @tacokoneko 1 year ago

      @__sassan__ VPNs also have an IP, so if they block you when doing this, you need an endless revolving door of new VPNs or proxies

    • @tacokoneko
      @tacokoneko 1 year ago

      which is not that hard, because if you port scan the entire internet with some strategic guessing (downloading public datacenter IP ranges, scanning port 1080 for SOCKS5 proxies) you can find unsecured proxies for free, even some rare ones that work with SSL over SOCKS5

    • @tacokoneko
      @tacokoneko 1 year ago +2

      i asked someone if port scanning the internet to find proxies is illegal and they said no, so i think it's completely legal; they didn't put a password or any authentication on it, so they are allowing people to use it

    • @Dot_UwU
      @Dot_UwU 1 year ago

      @__sassan__ if you send a ton of requests with the same IP, you'll get rate limited. Also most VPN IPs are datacenter IPs which are almost always blocked.
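
The rate limiting mentioned in this thread is usually best handled by pacing requests on the client side. Below is a minimal sketch of a minimum-interval limiter; the class name `MinIntervalLimiter` and the interval value are hypothetical, and the actual thresholds any site enforces are not known from this discussion.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive requests from one client."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval_s - now)
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval_s=0.2)  # assumed, illustrative pace
    for i in range(3):
        slept = limiter.wait()
        print(f"request {i} released after sleeping {slept:.2f}s")
```

Calling `limiter.wait()` before each HTTP request keeps a single scraper at a steady pace; it does nothing about IP reputation, which is the separate problem raised above.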

  • @forresthopkinsa
    @forresthopkinsa 1 year ago +242

    This is an interesting idea but a really impractical approach. New Reddit is an SPA and you can just use the XHR endpoints to fetch the data raw. Don't bother with browser emulation and HTML parsing.
    Besides, the closure of the APIs was never about restricting access to a user like you're circumventing here. As you've acknowledged, that wouldn't really make sense on the Web. The API pricing is about charging for data farming and large-scale user interception. You can't accomplish either of these use cases by scraping; you'll get rate-limited very quickly.
    The only way around this is using Bright Data's borderline-illegal botnet, which seems like a pretty shady way to do business.

    • @tatianatub
      @tatianatub 1 year ago +82

      It's called hostile interoperability, and it's the consequence of fucking over developers. It's time we remind platform hosts why APIs were created in the first place

    • @mathgeniuszach
      @mathgeniuszach 1 year ago +17

      People will use their own embedded browsers and similar scraping methods will occur locally. It's basically the same as an extension modification of the site. People just browsing normally don't need botnets and access to all of reddit, they just want a better stinking interface.

    • @ArizeOW
      @ArizeOW 1 year ago

      @tatianatub It's time to remind you that Reddit doesn't belong to "us". It belongs to Reddit. And they can do whatever they want with it. If they don't want large applications like Apollo to scrape EVERY post, comment, upvote, downvote, user karma and such, there is nothing you can do about it. That's it. It's not that deep.

    • @x--.
      @x--. 1 year ago +6

      The internet is meant to be and should be open.
      That doesn't mean everything has to be free at-scale but fighting hostility to the _idea of an open internet_ is a good thing. You're free to put your content behind a paywall for everyone.

    • @Imperial_Squid
      @Imperial_Squid 1 year ago +1

      To add to your point about the Reddit API stuff being about large users only, I still use a third party app for my browsing, but I have my own API key, looking at my app usage on my phone I spend between 1-3 hours on it daily and haven't heard anything from Reddit HQ trying to shut me down...

  • @conaticus
    @conaticus 1 year ago +61

    Really cool project idea! Loved it

  • @FunctionGermany
    @FunctionGermany 1 year ago +28

    new reddit probably uses an internal API you can pull from by fetching from the browser window. also note another user's comment about old reddit + cheerio (no browser needed).

    • @eoussama
      @eoussama 10 months ago

      He probably used Playwright just to have an excuse to shove the Bright Data sponsorship into the video, which I understand.

  • @DodaGarcia
    @DodaGarcia 1 year ago +10

    Decoupling the data persistence from the business logic is always a good idea, but using a queue service for that is bonkers. It removes none of the existing complexity, since you still eventually have to map the message payload to the database schema, and then introduces more complexity because you now have to keep track of one more service, the publishing code, the consuming code and the asynchronicity itself.
    Just use the repository pattern with an adapter for the chosen database, or an ORM like Prisma if you really don't expect the app to scale much.

    • @goofynose2520
      @goofynose2520 1 year ago

      Agreed. I swear 90% of queues I encounter are needless overcomplications

    • @ShaneZarechian
      @ShaneZarechian 10 months ago

      Someone fork this and make it non-ridiculous
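
For readers unfamiliar with the repository pattern suggested at the top of this thread, here is a minimal sketch of what it could look like. It is not the video author's code: the `Post` fields, the `PostRepository` interface, and the SQLite adapter are all hypothetical, with SQLite standing in for whichever database is actually chosen.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Post:
    """Hypothetical shape of a scraped post."""
    id: str
    subreddit: str
    title: str


class PostRepository(ABC):
    """Storage interface the scraper depends on; it knows nothing about the database."""

    @abstractmethod
    def save(self, post: Post) -> None: ...

    @abstractmethod
    def by_subreddit(self, subreddit: str) -> List[Post]: ...


class SqlitePostRepository(PostRepository):
    """One adapter; a DynamoDB or Postgres adapter would implement the same two methods."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS posts ("
            "id TEXT PRIMARY KEY, subreddit TEXT NOT NULL, title TEXT NOT NULL)"
        )

    def save(self, post: Post) -> None:
        # Upsert, so re-scraping the same post does not create duplicates.
        self._db.execute(
            "INSERT OR REPLACE INTO posts (id, subreddit, title) VALUES (?, ?, ?)",
            (post.id, post.subreddit, post.title),
        )
        self._db.commit()

    def by_subreddit(self, subreddit: str) -> List[Post]:
        rows = self._db.execute(
            "SELECT id, subreddit, title FROM posts WHERE subreddit = ? ORDER BY id",
            (subreddit,),
        )
        return [Post(*row) for row in rows]


def store_scraped(repo: PostRepository, posts: List[Post]) -> None:
    """Business logic sees only the interface, so the database can be swapped later."""
    for post in posts:
        repo.save(post)
```

The decoupling here comes from the interface alone: swapping databases means writing another adapter, with no queue, publisher, or consumer to operate.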

  • @IannoOfAlgodoo
    @IannoOfAlgodoo 1 year ago +62

    Curious how much you spend on Bright Data, as their product is like $20/GB and 0.1/hour

    • @GoldenretriverYT
      @GoldenretriverYT 1 year ago +18

      Yeah, it's expensive as heck. Also, I am wondering how they can claim to have 72 million residential IPs.
      I can only imagine them having spread malware which then gave them a botnet to work with, or, less likely, they offer people money in exchange for running a proxy.
      Edit: I looked it up. Apparently they have an SDK that app developers can integrate, which gives users a choice between ads or allowing their connection to be used by Bright Data as a proxy. That's where they (at least claim to) get the proxies from.

    • @tardistrailers
      @tardistrailers 1 year ago +9

      @GoldenretriverYT It'd be insane to run a resold proxy on your personal IP, just to see no ads somewhere. Worst case you get your home raided by law enforcement, because someone did something highly illegal with it. But I wouldn't be surprised if less educated people still do this.

    • @OrangeYTT
      @OrangeYTT 1 year ago

      @GoldenretriverYT 99% of "residential proxies" are just computers under a botnet.
      Hola (that free VPN) got in trouble a while back for making people who used their VPN join their botnet for this very reason!

  • @the_cobfather
    @the_cobfather 1 year ago +6

    Why use an SQS queue to abstract the db writing interface? The solution that immediately comes to mind is to just make an abstract class.
    The point of SQS is to be able to handle crazy amounts of throughput (like, up to 30,000 messages per second), which isn't really what you're doing.

  • @teamredstudio7012
    @teamredstudio7012 1 year ago +23

    I would do this in a different way. I would simply write a script, in whatever language, that has GET and POST functions, so you can call the main page first and then parse the data. Websites often already use APIs to fetch their content; use Fiddler Classic or some other proxy server to inspect which API the website uses. When the website loads more content after scrolling, it needs to fetch the data from somewhere. Simply reproduce this API call by copying the authentication tokens from the headers and providing the required headers in your requests, then parse the response body and add it to some database. I would make it store everything, so that if something needs to be fetched repeatedly it simply comes from the offline copy instead of wasting resources on fetching and parsing. I never automate browsers: if your browser can fetch the data, you can fetch it too, without a front end. You can also get the URL for loading more content by fetching the raw main page, because the browser needs to know where to fetch it from anyway, so it's definitely defined somewhere. It's super simple to scrape websites; you only need to know how to make requests and parse JSON and XML in your preferred language! Don't automate browsers, just fetch the data directly!

    • @unforgettable31
      @unforgettable31 1 year ago +6

      I come from a cracking background, and back in the day this is exactly what we would do. We would write GET/POST requests with token grabbing methods and get the job done. We’d launch hundreds of threads all connected to different proxies, instead of a single web browser. Sometimes it was challenging for particular platforms because of cookies, but at the end of the day it was doable.

    • @rossimac
      @rossimac 1 year ago

      Websites that use recaptcha2 are ones that I've found that I need a browser to interact with. Ones that don't then yes, totally, inspect the network traffic and understand how your browser is creating the requests and then replicate them.

    • @S0L4RE
      @S0L4RE 1 year ago +4

      +1 it’s such a massive pet peeve of mine seeing people use selenium when it could just be achieved with requests.

    • @cheemzboi
      @cheemzboi 1 year ago +1

      @unforgettable31 what about captchas then

    • @unforgettable31
      @unforgettable31 1 year ago

      @cheemzboi Most platforms use captchas when they detect ongoing suspicious activity, which is avoided when using proxies.
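
The "fetch directly and keep an offline copy" approach described at the top of this thread can be sketched as follows. Everything named here is an assumption for illustration: `CachedJsonFetcher`, the `example-scraper/0.1` User-Agent, and the idea that the endpoint returns JSON. A real site would also need whatever headers or tokens its own front end sends.

```python
import hashlib
import json
import urllib.request
from pathlib import Path
from typing import Any, Callable


def http_get(url: str) -> str:
    """Plain HTTP GET with no browser involved; the User-Agent is a placeholder."""
    req = urllib.request.Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")


class CachedJsonFetcher:
    """Fetch a JSON endpoint once, then serve later reads from an offline copy."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], str] = http_get) -> None:
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._fetch = fetch  # injectable, so the logic can be exercised offline

    def _path_for(self, url: str) -> Path:
        # Hash the URL so any query string maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self._dir / f"{digest}.json"

    def get(self, url: str) -> Any:
        path = self._path_for(url)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        body = self._fetch(url)
        data = json.loads(body)  # parse first, so invalid bodies are never cached
        path.write_text(body, encoding="utf-8")
        return data
```

A second call to `get` with the same URL reads the file on disk and makes no network request, which is the resource saving the comment describes.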

  • @wierdnes
    @wierdnes 1 year ago +29

    Great video. I liked the step-by-step thought process of getting the scraper to fetch data. One major flaw in the cost analysis you presented was the absence of any cost for Bright Data. Checking the pricing myself, it looks like 20€ per GB of data?

  • @WarlordEnthusiast
    @WarlordEnthusiast 1 year ago +1

    I actually did something similar, we needed financial data for a project we were working on and the APIs we found were very limiting and some were very expensive.
    We tried using one of the cheaper ones and it straight up did not work, it had downtime of sometimes hours and when we contacted the company they basically told us it wasn't their problem.
    So I built a web scraper, hosted it on my server at home and scraped all the forex data I needed from their website for free.

  • @takennmc
    @takennmc 1 year ago +55

    8 cents for 3 weeks damn this really makes reddit unreasonable

    • @rockshankar
      @rockshankar 1 year ago

      That does come with significant management overhead. The project is a simple way to get it working. Once you dig deeper there are lots of problems. Lambda and DynamoDB are only cheaper depending on the number of requests. If you post your API endpoint in public, 1 million requests will be gone in seconds, and then using Lambda will be more expensive than running your own server.
      If it were cheaper, someone else would have done it already.

  • @dancinglazer1628
    @dancinglazer1628 1 year ago +27

    Honestly, I think this infrastructure is too complicated for what it is doing. I don't really care about the sponsored bit, but I think it would have been better to simply create a Lambda that writes directly to a database (assume a cacheFactory -> RedisCache | MongoCache | JsonCache) along with a "freshness" param. Due to the relative simplicity of the data, I think Redis would be a good candidate. Then all you would need to do in the API is simply fetch the data based on the query param, something which can probably be achieved in a single file.

    • @joopie46614
      @joopie46614 1 year ago +5

      Yeah I feel it's been quite overengineered with all this message queue and database/service stuff, this could be done fully locally realistically and at not much of a bigger cost since nowadays OSS databases and caching solutions are really efficient

    • @hqcart1
      @hqcart1 1 year ago +1

      he will need a 2-4GB ram VM to do that. AWS is expensive

    • @dancinglazer1628
      @dancinglazer1628 1 year ago +4

      @hqcart1 he is deferring the scraping to the sponsored service anyway, but I think we can just fetch the HTML instead of running a headless browser

    • @dancinglazer1628
      @dancinglazer1628 1 year ago

      @joopie46614 This could be a single service in a Docker image: run a cron scheduler that fetches and writes to a JSON file, and have a server running that uses the JSON file as a database

    • @hqcart1
      @hqcart1 1 year ago

      @dancinglazer1628 even if he uses a sponsored service, at some point you will get a captcha, and my point was that his code does not handle that. As for fetching HTML: no, it does not work for complex sites where the HTML or class names get rewritten by JS. I tried that and failed, and ended up using a headless browser.
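
The cache factory with a "freshness" parameter proposed at the top of this thread could look roughly like the sketch below. It is an illustration under assumptions, not anyone's actual implementation: only an in-memory backend and a JSON-file backend are shown, because Redis and Mongo backends would need third-party clients, and every class and parameter name is hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Tuple

Clock = Callable[[], float]


class MemoryCache:
    """In-process cache; stands in for a Redis backend in this sketch."""

    def __init__(self, clock: Clock = time.time) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock(), value)

    def get(self, key: str, freshness_s: float) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # Entries older than the caller's freshness window count as misses.
        return value if self._clock() - stored_at <= freshness_s else None


class JsonCache:
    """Same interface, persisted to a single JSON file."""

    def __init__(self, path: Path, clock: Clock = time.time) -> None:
        self._path = Path(path)
        self._clock = clock

    def _load(self) -> Dict[str, Any]:
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text(encoding="utf-8"))

    def put(self, key: str, value: Any) -> None:
        items = self._load()
        items[key] = [self._clock(), value]
        self._path.write_text(json.dumps(items), encoding="utf-8")

    def get(self, key: str, freshness_s: float) -> Optional[Any]:
        entry = self._load().get(key)
        if entry is None:
            return None
        stored_at, value = entry
        return value if self._clock() - stored_at <= freshness_s else None


def cache_factory(kind: str, **kwargs: Any):
    """Pick a backend by name; Redis or Mongo backends would be registered here."""
    backends = {"memory": MemoryCache, "json": JsonCache}
    if kind not in backends:
        raise ValueError(f"unknown cache backend: {kind!r}")
    return backends[kind](**kwargs)
```

An API handler would call `cache.get(key, freshness_s=...)` first and only scrape on a miss, then `cache.put(key, result)`; the JSON-file variant is essentially the "cron job plus JSON file as a database" setup mentioned in the replies.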

  • @poggybitz513
    @poggybitz513 1 year ago +1

    I did the same thing for my app using Selenium bindings in Rust and used Vagrant to manage instances. You can use Docker if you want. Please mark this video as an ad, because no one in their right mind would do it this way. I am so tired of people shoving ads down my throat and claiming it's good education.

  • @nigerianprince5389
    @nigerianprince5389 1 year ago

    1st off, thanks for this buddy, you're a godsend.
    it does feel a bit over-engineered but i guess you've gone this route because you want to build your own Reddit API.
    for folks like me who have only been coding everyday for 1 month using GPT - knowing how to pull the data from reddit and store in a database is the main thing i need (i think most people as well but i could be wrong).
    keep up the good work still and thank you again !

  • @primo_geniture
    @primo_geniture 1 year ago +6

    I'm curious as to what the total time for the project was.

  • @Jana-se4kv
    @Jana-se4kv 1 year ago +2

    THANK YOU!
    Very helpful!

  • @louishuort7969
    @louishuort7969 1 year ago +5

    What about the cost of bright data ?

  • @TheHotMrDuck
    @TheHotMrDuck 1 year ago +5

    i hope this doesnt kill old reddit, if they remove it im gone

  • @EarlZMoade
    @EarlZMoade 1 year ago +5

    Unrelated to this video - would you show how you version your dotfiles (if you do)? It would make for a good video.

  • @chofmann
    @chofmann 1 year ago +5

    you are aware of the JSON API that things like rif are using? basically, for every link, there is also a JSON file you can just access
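
A minimal sketch of the idea in this comment: derive the JSON variant of a page URL by appending `.json` to its path, then pull fields out of the response. The sample payload is hand-written and heavily trimmed, and its field names (`data`, `children`, `title`, `permalink`) reflect the listing shape as commonly described, so treat them as assumptions to verify against a real response. No request is made here.

```python
import json
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit


def to_json_url(page_url: str) -> str:
    """Append .json to the path of a page URL, keeping any query string."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def titles_from_listing(payload: str) -> List[Dict[str, str]]:
    """Extract title and permalink from a listing-shaped JSON document."""
    listing = json.loads(payload)
    return [
        {"title": child["data"]["title"], "permalink": child["data"]["permalink"]}
        for child in listing["data"]["children"]
    ]


# Hand-written, trimmed sample for illustration only (not a captured response).
SAMPLE_LISTING = json.dumps({
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post",
                                    "permalink": "/r/golang/comments/abc/first_post/"}},
            {"kind": "t3", "data": {"title": "Second post",
                                    "permalink": "/r/golang/comments/def/second_post/"}},
        ]
    },
})

if __name__ == "__main__":
    print(to_json_url("https://old.reddit.com/r/golang/new/?limit=5"))
    for item in titles_from_listing(SAMPLE_LISTING):
        print(item["title"], item["permalink"])
```

Compared with driving a browser, this needs one HTTP request per page and no HTML parsing at all, though it remains subject to the same rate limits and terms of service.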

  • @xXtim128Xx
    @xXtim128Xx 1 year ago +3

    Using a full web browser when a simple HTTP request and HTML parser would suffice...

    • @dreamsofcode
      @dreamsofcode  1 year ago +1

      You're correct. It would have. However a browser is a more versatile option for other use cases.
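
To make the "HTTP request plus HTML parser" alternative concrete, here is a small sketch using only the Python standard library. The HTML snippet is hand-written, and the assumption that post links carry a `title` class is illustrative; the real markup would have to be checked in the page source.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class TitleLinkParser(HTMLParser):
    """Collect (text, href) for every <a> whose class list contains 'title'."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # set while inside a matching <a>
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "title" in classes:
            self._href = attr.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_titles(html: str) -> List[Tuple[str, str]]:
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hand-written markup for illustration; class names are assumptions.
SAMPLE_HTML = """
<div class="thing"><a class="title may-blank" href="/r/golang/comments/abc/first/">First post</a></div>
<div class="thing"><a class="author" href="/user/someone">someone</a></div>
<div class="thing"><a class="title" href="https://example.com/x">Second &amp; last</a></div>
"""

if __name__ == "__main__":
    for text, href in extract_titles(SAMPLE_HTML):
        print(text, "->", href)
```

This works only when the server sends the content in the initial HTML; pages that build their DOM with JavaScript are the case where the browser-based approach from the reply above still applies.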

  • @sworatex1683
    @sworatex1683 1 year ago +2

    Why didn't you use curl? It would be way more lightweight than using a browser. Most programming languages will let you manage DOM objects with built-in libraries

  • @pelic9608
    @pelic9608 1 year ago +3

    Every modern website has an API.
    Most just aren't documented. 🤷‍♂️
    Copy their own website's auth flow and use those tokens to drive your app. What are they gonna do? Paywall their entire site?
    (Ok, ok; SSR is a thing, but there's still almost always some pure-data endpoint around)

  • @ltecheroffical
    @ltecheroffical 8 months ago

    You can remove the browser part by using a web scraping framework that works without a browser instance.

  • @jakestrouse12
    @jakestrouse12 1 year ago +11

    You can also reverse engineer their private api by looking at the browser network requests. The scraping will be much faster

    • @S0L4RE
      @S0L4RE 1 year ago

      Although Cloudflare IUAM makes it an immense pain in the ass

    • @batmanatkinson1188
      @batmanatkinson1188 1 year ago

      And keep in mind that private APIs are susceptible to change, so today it’s gonna work, tomorrow you have to start over

    • @nmlss-r9
      @nmlss-r9 1 year ago

      @batmanatkinson1188 less often than the HTML

    • @TheSaintsVEVO
      @TheSaintsVEVO 1 year ago

      @S0L4RE what’s that? Does Reddit use it?

    • @S0L4RE
      @S0L4RE 1 year ago

      @TheSaintsVEVO I’m not sure if Reddit uses it, but IUAM detects very low-level characteristics of the request (e.g. cipher mode, SSL configuration) to determine whether it looks automated.

  • @antonjoacir
    @antonjoacir 1 year ago +2

    Man, could you make a video about the configurations of your terminal?

  • @heckerhecker8246
    @heckerhecker8246 1 year ago +1

    How to get four hitmen at your door:

  • @k98killer
    @k98killer 1 year ago +3

    Would it have cost more without the brightdata sponsorship?

    • @louishuort7969
      @louishuort7969 1 year ago +2

      Ohh yes, a lot, bright data is very expensive

  • @-Siknakaliux-II
    @-Siknakaliux-II 1 year ago

    So this vid popped up in my recs. Unrelated off-topic comment, but I remember getting into a programming phase in grade 6-7. I pretty much obsessed over the thought of doing something great with it. Got myself to do a few courses, but it never really stuck as I've moved on to finance. Now I kinda wanna get into it again as I did in the past...

  • @scaffus
    @scaffus 1 year ago +1

    Great vid! Love your work

  • @creeperlolthetrouble
    @creeperlolthetrouble 1 year ago +1

    xD i've seen this coming for months but why not keep AWS and tunnel the requests through a proxy

  • @christianjedro
    @christianjedro 1 year ago +1

    How do you avoid vendor/database lock in by using AWS SQS?!

  • @socks5proxy
    @socks5proxy 1 year ago

    absolutely brilliant video. so very well done.

  • @dandandev
    @dandandev 1 year ago +1

    Heya! I'd recommend Railway to host your apps, it's usage-based and pretty cheap!

  • @veshal.s3690
    @veshal.s3690 1 year ago +1

    Would love a post on your powerlevel10k config and your terminal config

  • @kale_bhai
    @kale_bhai 1 year ago

    Learned about using a queueing system. But that's pretty much the only thing new to me.

  • @shadyworld1
    @shadyworld1 1 year ago +1

    If you could use RSS to pull the data and store it in a proper format for the API, you’d save at least 40% of the time and effort of your current approach!
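
As a sketch of the feed-based approach suggested here, the snippet below parses a feed into plain dictionaries that could back an API. It assumes the feed is Atom-formatted; the sample document is hand-written, and both its contents and the `entries_from_atom` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace prefix map for Atom elements.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def entries_from_atom(xml_text: str) -> List[Dict[str, str]]:
    """Turn an Atom feed into a list of {id, title, url, updated} records."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for entry in root.findall("atom:entry", ATOM):
        link = entry.find("atom:link", ATOM)
        records.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "url": link.get("href", "") if link is not None else "",
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM),
        })
    return records


# Hand-written sample feed for illustration only.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>newest submissions</title>
  <entry>
    <id>t3_abc</id>
    <title>First post</title>
    <link href="https://example.com/r/golang/comments/abc/first_post/"/>
    <updated>2023-09-10T12:00:00+00:00</updated>
  </entry>
  <entry>
    <id>t3_def</id>
    <title>Second post</title>
    <link href="https://example.com/r/golang/comments/def/second_post/"/>
    <updated>2023-09-10T12:05:00+00:00</updated>
  </entry>
</feed>
"""

if __name__ == "__main__":
    for record in entries_from_atom(SAMPLE_FEED):
        print(record["updated"], record["title"])
```

Feeds typically expose only the most recent items and a subset of fields, so whether this replaces scraping depends on what the API needs to return.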

  • @Dev-Siri
    @Dev-Siri 1 year ago +6

    tip: bun 1.0 was released just yesterday, and you can use it as a drop-in replacement for node.
    it executes JS much faster without breaking anything, so it can magically make your API faster. for deployment, you need to use a Docker image because it's still very early and not supported by any platforms (yet)

    • @ac130kz
      @ac130kz 1 year ago

      it just gets stuck if I try to run puppeteer with whatsapp-web.js. yeah, fast and cool, but too early

  • @livtown
    @livtown 1 year ago +1

    I see you're using a Mac now, what terminal is that? How are your rounded window corners so much less rounded than mine? Have you changed anything?

  • @trainsurf
    @trainsurf 10 months ago

    I watched this video for 4 hours because it was on repeat and I fell asleep

  • @pchris
    @pchris 1 year ago +2

    Would something like this work for third-party applications like Reddit Apollo?

  • @betapacket
    @betapacket 1 year ago

    2:02 isn't Playwright yet another ECM and not a web scraper?

  • @EarlZMoade
    @EarlZMoade 1 year ago +3

    Are there any issues with legality when using the data you extract? I.e. could you use the data for commercial purposes, or research?

    • @ristekostadinov2820
      @ristekostadinov2820 1 year ago +7

      I think Microsoft has taken someone to court for web scraping and won. I think it was a company that was scraping public LinkedIn user data and building its own recruiting app, and Microsoft argued that the users didn't consent to that (which is true, but then again the data is public). So it's a very tricky problem, and it's best to read a website's terms of service.

  • @cooperqmarshall
    @cooperqmarshall 1 year ago +1

    The quality of this project is supreme there. Love the detail and consideration for the infrastructure

  • @jerryaugusto95
    @jerryaugusto95 1 year ago +1

    Is it just me or are the icons for the Go files different? How do you change these icons please?

  • @Meleeman011
    @Meleeman011 1 year ago

    why do you use playwright and not just puppeteer?

  • @sheldonsays9922
    @sheldonsays9922 1 year ago

    How long did it actually take for you to complete this project?

  • @sumirandahal76
    @sumirandahal76 1 year ago

    Quality project ❤ content worth watching, hooks through the time. 🎉

  • @jasondoubleoseven
    @jasondoubleoseven 1 year ago

    Good job, one improvement would be to go with a single-table design with DynamoDB

  • @5criptcom
    @5criptcom 1 year ago

    Good one sir!

  • @glitchy_weasel
    @glitchy_weasel 1 year ago

    Fantastic! Very informative, always nice to stick it to big tech lol

  • @stylrart
    @stylrart 1 year ago

    Nice you are using JB Mono, like me.
    what theme are you using, the colors are handsome ;)

  • @guillemgarcia3630
    @guillemgarcia3630 1 year ago +2

    jesus there's more terraform configuration than code

  • @engineeringjoe
    @engineeringjoe 1 year ago

    Which Editor is he using? Vim?

  • @JoshIbbotson-
    @JoshIbbotson- 1 year ago

    How long have you been programming? Loved this video btw!

    • @dreamsofcode
      @dreamsofcode  1 year ago

      Thank you! I've been writing code since 2008.

  • @mx338
    @mx338 1 year ago

    DynamoDB isn't really low cost, so I would definitely look into switching to ScyllaDB which offers a DynamoDB compatible API.

  • @zack_beard
    @zack_beard 1 year ago

    Great content! Quick question. Did you do this after logging in to Reddit with your userid/pwd or without? IIRC Reddit does not show new content if you are not logged in. Thanks!

    • @dreamsofcode
      @dreamsofcode  1 year ago

      Thank you!
      Logged out, which causes it to fall under publicly accessible content. Reddit still shows content on the old Reddit website under /new when you're not logged in.

  • @rando521
    @rando521 1 year ago +2

    hi dreams i love your vids on vim and tried it on my own due to them
    while trying c++ i want to know if there is a better option than cmake?
    i come from python so i plan on rpc-ing the python part and move to mostly c++ or golang any ideas on how to do this?

    • @FaZekiller-qe3uf
      @FaZekiller-qe3uf 1 year ago +2

      The better option is to use a language with good tooling. Zig, Rust, Go, etc. cmake L, Make L.

    • @jacksonsmith4648
      @jacksonsmith4648 1 year ago

      Meson! It's basically CMake, but with syntax similar to python, and a lot less stupid design decisions. Definitely worth a look.

    • @S0L4RE
      @S0L4RE 1 year ago

      @jacksonsmith4648 why are we hating on cmake?

  • @metalspoon69
    @metalspoon69 1 year ago +16

    "Just build your own API"
    *builds own API*
    "NOO NOT LIKE THAT!!!!"

  • @Rundik
    @Rundik 1 year ago +6

    You don't need any browser to scrape HTML from Reddit. How did you even manage to configure vim with that kind of skills?

    • @SAsquirtle
      @SAsquirtle 1 year ago +1

      what about pressing the next button, don't you need a browser emulator for that?

    • @Rundik
      @Rundik 1 year ago

      @SAsquirtle unless you need to take a screenshot or you don't have much experience/time, using puppeteer-like tools is extremely wasteful. And for simple text scraping you don't even need that much experience at all

  • @TrueDetectivePikachu
    @TrueDetectivePikachu 1 year ago

    Genuine question, why use puppeteer that relies on an active browser and not something like cheerio?

    • @dreamsofcode
      @dreamsofcode  1 year ago +1

      It's a great question. Cheerio would work really well in this case as there was little to no javascript for the old version of reddit. Initially I wanted to go with the new reddit so had scoped out using an active browser (which I think has more application beyond reddit). Cheerio is always preferable in a case with no javascript, but it's not as applicable as puppeteer is. TLDR is that I wanted to showcase active browser scraping in the video.

  • @iamrafiqulislam
    @iamrafiqulislam 1 year ago

    what is the Font you are using for Nvim and tmux status bar, please?

    • @dreamsofcode
      @dreamsofcode  1 year ago +1

      I am using JetBrainsMono Nerd Font! I have a video on both of my Nvim and tmux configs on my channel :)

  • @houstonbova3136
    @houstonbova3136 1 year ago

    DataStore and FireStore work roughly the same as Dynamo, no?

  • @robinbinder8658
    @robinbinder8658 1 year ago

    boi do i smell a cease and desist

  • @flor.7797
    @flor.7797 1 year ago

    There’s no AI without API

  • @xybersurfer
    @xybersurfer 1 year ago +2

    i was with you until you started putting things in a database and the cloud. was it because your video was sponsored by a cloud provider? (i really can't tell) it would be more interesting to see you justifying decisions. seeing all the code is really not that interesting. the overall idea of creating your own reddit API is interesting though, so i will give this a like

  • @willmil1199
    @willmil1199 1 year ago

    How do we use your api then ?

  • @_Mackan
    @_Mackan 1 year ago

    virgin api consumer vs chad scraper

  • @jondoe79
    @jondoe79 1 year ago +2

    Great content, real examples of use case for different tools for a simple but useful project.

  • @Lamin777-N
    @Lamin777-N 1 year ago

    if you used python you could easily bypass ip blocking with torpy

  • @Shudshudu
    @Shudshudu 1 year ago

    Sir am learning c and am new to programming. Currently am learning control structure. But when i look into real world projects I don’t understand anything why

    • @coda-n6u
      @coda-n6u 1 year ago +1

      It takes time! Also C is a VERY different level of abstraction than Javascript / Go like he used here.

  • @grif5307
    @grif5307 1 year ago

    One of my favourite videos in a while, great job!!!!

  • @Puwunda
    @Puwunda 1 year ago

    Intercontinental Lawsuit Inbound!!!

  • @dannyeg-glitch
    @dannyeg-glitch 1 year ago

    sorry oot, did you use mac sir?

  • @vekoze9872
    @vekoze9872 1 year ago

    what is the tmux font ?

  • @siniarskimar
    @siniarskimar 1 year ago

    How about developing a browser extension for "enhancing" Reddit that would additionally scrape any post that the user sees 🤔

  • @ultimatetoast2739
    @ultimatetoast2739 1 year ago

    Apicels be seething over scrapechads

  • @qCJLbggG4IWAY9nTH6o
    @qCJLbggG4IWAY9nTH6o 1 year ago

    why not use their rss feed?

  • @hemant_san
    @hemant_san 1 year ago

    how to bypass captcha?

  • @bieggerm
    @bieggerm 1 year ago

    This video shows the only way an arms race should be visualized

  • @pixel690
    @pixel690 1 year ago +1

    $20 per GB is something different jesus

  • @lollermann
    @lollermann 1 year ago

    Don't let pyrocynical see this video he'll become a web dev

  • @hqcart1
    @hqcart1 1 year ago

    what about captcha?

  • @mayar2047
    @mayar2047 1 year ago

    I'm thinking of just scraping Reddit directly from a mobile device, and maybe saving the data to the device for caching. I wouldn't need to pay for anything

  • @makeshift27015
    @makeshift27015 1 year ago

    Oh god, I hope this doesn't give them even more reason to kill old reddit, it's the only way I can bear using reddit now.
    As an aside, would it be possible to decompile/packet sniff their mobile app and emulate the requests it makes for a pseudo-api? I haven't decompiled android apps in a hot minute, but I imagine it uses some sort of api rather than downloading a massive html payload that requires parsing

  • @navaneeth6157
    @navaneeth6157 1 year ago

    chromedp for golang is also an option

  • @appelnonsurtaxe
    @appelnonsurtaxe 1 year ago +3

    Why use playwright and not just parse html directly if you're going to disable CSS, JS and media loading anyway? Most HTML parsing libraries support CSS selectors.
    I hate dishonest/corrupt sponsored content like this, showing bad and expensive solutions to a problem, just because you're paid for it.
    I'd honestly be happy to be proven wrong.

  • @dimagass7801
    @dimagass7801 1 year ago

    I have no clue how to use apis I still don't completely understand but data is the new oil😅

  • @mr.togrul--9383
    @mr.togrul--9383 1 year ago +3

    Great video btw! In the future I also want to make my own web scraper project and this just simplified everything I need to do.
    Is there any reason why you didn't just use Golang for the whole thing, for the scraper as well? Just curious, since as you said, writing Golang would be faster than Node.js

    • @jean_hirtz
      @jean_hirtz 1 year ago

      Curious about Golang - any repo / vids ?

  • @thygrrr
    @thygrrr 1 year ago +2

    No, just no.
    Nice project, yet absolutely impractical. Just imagine the API calls you could have paid for with the developer-hours required. And your browser will get rate-limited into the dirt really quickly.

  • @eddi-y4e
    @eddi-y4e 1 year ago

    They will block things like this with Web Environment Integrity

  • @RodolfoOchoa
    @RodolfoOchoa 1 year ago

    so you pay AWS instead....

  • @reihanboo
    @reihanboo 1 year ago

    didn't understand anything but great video

  • @_soundwave_
    @_soundwave_ 1 year ago

    A very interesting comment section.

  • @iliabeliaev2260
    @iliabeliaev2260 1 year ago

    Old reddit is the only version I use...

  • @earu_arcana
    @earu_arcana 1 year ago

    Nice video, but your setup is a lot more complex than it needs to be IMO.