To get $15 credit for use with Brightdata to scrape your own APIs, visit: brdta.com/dreamsofcode
Just an FYI: not working anymore, it's only $5 now
@@meinkanal13378 inflation strikes again 😭
Let me reach out. Thank you for letting me know
Be careful, web scraping is illegal in some countries.
The video was enjoyable, but it's important to acknowledge that sponsored content can introduce bias. One approach could be to make the entire video centered around the sponsor, or if you choose to feature the sponsor as you did, consider presenting alternative services similar to them. Your videos are consistently excellent, boasting high-quality production, a well-maintained pace, and crystal-clear explanations.
that was my exact thought, thanks for pointing it out.
when ChatGPT focuses more on the sponsor of the video than the video itself
Yeah, bro was talking about making everything as free as possible and then presents a service which is very expensive
what about captcha?????? he didn't mention that his sponsor can get around it, and even his code does not handle captchas.
@@hqcart1 Death by captcha and related services exist for that
This was never a technical limitation, it was a legal one.
uh, no. It's a financial one. The idea that companies are going to offer network and compute resources for the sheer amount of API calls made for free was always comical. It's sad that so many programmers and the general public think this stuff is just free or a charity. No matter what you do, eventually these costs will catch up to the business and HAVE to be charged to people or else the service will just die.
@@JustSomeGuy009 I am going to touch you without your consent
@@JustSomeGuy009 Yep. People pretend webscraping is 'free', but it still costs the companies. The companies are willing to bear the cost of regular users browsing through pages, but a scraper browsing through the entire catalog is even more expensive for the company than if they just used the API. Scraping is definitely malicious.
@@JustSomeGuy009 it costs a server less than a penny to serve 1000 requests.
@@tabbytobias2167 yes, but for a service the size of reddit, it can lead to hundreds of thousands of dollars in losses, due to the sheer volume of those requests.
Crowdsourcing idea for this to prevent IPs getting blocked: a browser that pays its users for using it. Developers write scripts to scrape data, and pay to use the network of users. Users then get paid for using the web browser, which will create a private session, encrypted away from the user, run the web scraping tasks, and send the data back to the developer. Build it all on top of Chromium, and if done correctly, websites would have a very difficult time blocking based on IP addresses, activity, or fingerprinting, because it would be distributed across actual user IPs and actual user login times (the browser only runs when open). My only concern would be how to protect the users when malicious devs start doing illegal activities. You'd have to have very strong terms and conditions, have logging, and be able to trace requests back to devs. But then that opens a dev-privacy can of worms. Still, an interesting concept
Botnet as a Service
You just described 99% of the “VPN” apps available for your mobile device… ;)
Wouldn't residential sneaker botting proxies be able to accomplish the same thing?
@@MuhsinunChowdhury These costs..
ahahahah you're not wrong, especially the free ones @@levifig
For anyone planning to try this: use Puppeteer's headless mode so it doesn't open multiple browser windows (better performance), and route it via a VPN set up on AWS to obfuscate your traffic.
And be ready to have your ip blocked 😊
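For illustration, a minimal sketch of that setup with Puppeteer; the proxy address and the `a.title` selector are placeholders/assumptions, not taken from the video:

```typescript
import puppeteer from "puppeteer";

// Placeholder proxy endpoint -- point this at your own VPN/SOCKS proxy.
const PROXY = "socks5://127.0.0.1:9050";

async function scrapeNewPosts(): Promise<string[]> {
  const browser = await puppeteer.launch({
    headless: true,                    // no visible browser windows
    args: [`--proxy-server=${PROXY}`], // route all traffic through the proxy
  });
  const page = await browser.newPage();
  await page.goto("https://old.reddit.com/r/programming/new/", {
    waitUntil: "domcontentloaded",
  });
  // "a.title" is assumed to be the post-title link on old reddit.
  const titles = await page.$$eval("a.title", (links) =>
    links.map((a) => a.textContent ?? "")
  );
  await browser.close();
  return titles;
}

scrapeNewPosts().then(console.log).catch(console.error);
```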
Even when using the VPN?
VPNs also have an IP, so when doing this, if they block you, you need an endless revolving door of new VPNs or proxies @@__sassan__
which is not that hard, because if you port scan the entire internet with some strategic guessing (downloading public datacenter IP ranges, scanning port 1080 for SOCKS5 proxies) you can find unsecured proxies for free, even some rare ones that work with SSL over SOCKS5
i asked someone if port scanning the internet to find proxies is illegal and they said no, so i think it's completely legal. they didn't put a password or any authentication, so they are allowing people to use it
@@__sassan__ if you send a ton of requests with the same IP, you'll get rate limited. Also most VPN IPs are datacenter IPs, which are almost always blocked.
This is an interesting idea but a really impractical approach. New Reddit is an SPA and you can just use the XHR endpoints to fetch the data raw. Don't bother with browser emulation and HTML parsing.
Besides, the closure of the APIs was never about restricting access to a user like you're circumventing here. As you've acknowledged, that wouldn't really make sense on the Web. The API pricing is about charging for data farming and large-scale user interception. You can't accomplish either of these use cases by scraping; you'll get rate-limited very quickly.
The only way around this is using Bright Data's borderline-illegal botnet, which seems like a pretty shady way to do business.
it's called hostile interoperability and it's the consequence of fucking over developers. it's time we remind platform hosts why APIs were created in the first place
People will use their own embedded browsers and similar scraping methods will occur locally. It's basically the same as an extension modification of the site. People just browsing normally don't need botnets and access to all of reddit, they just want a better stinking interface.
@@tatianatub It's time to remind you that Reddit doesn't belong to "us". It belongs to Reddit. And they can do whatever they want with it. If they don't want large applications like Apollo to scrape EVERY post, comment, upvote, downvote, user karma and such, there is nothing you can do about it. That's it. It's not that deep.
The internet is meant to be and should be open.
That doesn't mean everything has to be free at-scale but fighting hostility to the _idea of an open internet_ is a good thing. You're free to put your content behind a paywall for everyone.
To add to your point about the Reddit API stuff being about large users only: I still use a third-party app for my browsing, but I have my own API key. Looking at my app usage on my phone, I spend between 1-3 hours on it daily and haven't heard anything from Reddit HQ trying to shut me down...
Really cool project idea! Loved it
yooo
Its the rust guy
Yoo thank you! Love your videos as well.
rusty boi
new reddit probably uses an internal API you can pull from by fetching from the browser window. also note another user's comment about old reddit + cheerio (no browser needed).
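As a sketch of that old reddit + cheerio route (the `a.title` selector and the User-Agent string are assumptions; old reddit serves plain server-rendered HTML):

```typescript
import * as cheerio from "cheerio";

async function fetchNewPosts(subreddit: string): Promise<string[]> {
  // Plain HTTP request -- no browser needed for server-rendered old reddit.
  const res = await fetch(`https://old.reddit.com/r/${subreddit}/new/`, {
    headers: { "User-Agent": "my-scraper/0.1" }, // reddit tends to block default UAs
  });
  const html = await res.text();
  const $ = cheerio.load(html);
  // "a.title" is assumed to be the post-title selector on old reddit.
  return $("a.title")
    .map((_, el) => $(el).text())
    .get();
}

fetchNewPosts("programming").then(console.log);
```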
He probably used Playwright just to have an excuse to shove the Bright Data sponsorship into the video, which I understand.
Decoupling the data persistence from the business logic is always a good idea, but using a queue service for that is bonkers. It removes none of the existing complexity, since you still eventually have to map the message payload to the database schema, and then introduces more complexity because you now have to keep track of one more service, the publishing code, the consuming code and the asynchronicity itself.
Just use the repository pattern with an adapter for the chosen database, or an ORM like Prisma if you really don't expect the app to scale much.
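A rough sketch of the repository idea this comment describes (all names are made up): the scraper talks to one interface, and each database gets its own adapter, with no queue in between. A DynamoDB or Prisma adapter would just implement the same two methods.

```typescript
interface Post {
  id: string;
  title: string;
  url: string;
  postedAt: Date;
}

// The scraper and the API only ever see this interface.
interface PostRepository {
  save(post: Post): Promise<void>;
  listRecent(limit: number): Promise<Post[]>;
}

// One adapter per backend; an in-memory one is handy for tests or local runs.
class InMemoryPostRepository implements PostRepository {
  private posts = new Map<string, Post>();

  async save(post: Post): Promise<void> {
    this.posts.set(post.id, post);
  }

  async listRecent(limit: number): Promise<Post[]> {
    return [...this.posts.values()]
      .sort((a, b) => b.postedAt.getTime() - a.postedAt.getTime())
      .slice(0, limit);
  }
}
```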
Agreed. I swear 90% of queues I encounter are needless overcomplications
Someone fork this and make it non-ridiculous
Curious how much you spend on Bright Data, as their product is like $20/GB and $0.10/hour
Yeah, it's expensive as heck. Also I am wondering how they claim they have 72 million residential IPs?
I can only imagine them having spread malware which then gave them a botnet to work with, or, less likely, they offer people money in exchange for running a proxy.
Edit: I looked it up, apparently they have an SDK which app developers can integrate which gives the users a choice between ads or allowing their connection to be used by BrightData as a proxy, that's where they (at least claim to) have the proxies from.
@@GoldenretriverYT It'd be insane to run a resold proxy on your personal IP, just to see no ads somewhere. Worst case you get your home raided by law enforcement, because someone did something highly illegal with it. But I wouldn't be surprised if less educated people still do this.
@@GoldenretriverYT99% of "residential proxies" are just computers under a botnet.
Hola (that free Vpn) got in trouble a while back for making people who used their VPN join their botnet for this very reason!
Why use an SQS queue to abstract the db writing interface? The solution that immediately comes to mind is to just make an abstract class.
The point of SQS is to be able to handle crazy amounts of throughput (like, up to 30,000 messages per second), which isn't really what you're doing.
I would do this in a different way. I would simply write a script in whatever language that has GET and POST functions, so you can call the main page first and then parse the data. Websites often already use APIs to fetch their content; use Fiddler Classic or some other proxy server to inspect which API the website uses. When the website loads more content after scrolling, it needs to fetch that data from somewhere. Simply reproduce this API by copying the authentication tokens from the headers and providing the required headers in your requests, then parse the response body and add it to some database. I would make it store everything, so if something needs to be fetched repeatedly it's served from the offline copy instead of wasting resources fetching and parsing again. I never automate browsers: if your browser can fetch the data, you can fetch it too without the front end. You can also get the URL for loading more content from the raw main page, because the browser needs to know where to fetch it from anyway, so it's definitely defined somewhere. It's super simple to scrape websites; you only need to know how to make requests and parse JSON and XML in your preferred language! Don't automate browsers, just fetch directly!
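A minimal sketch of that approach; the headers and the in-memory "offline copy" are placeholders, and the real endpoint would come from watching the network traffic as this comment describes:

```typescript
// Serve repeat lookups from a local copy so nothing is fetched twice.
const offlineCopy = new Map<string, unknown>();

async function getListing(url: string): Promise<unknown> {
  if (offlineCopy.has(url)) return offlineCopy.get(url);

  const res = await fetch(url, {
    headers: {
      "User-Agent": "my-scraper/0.1",
      // Copy whatever auth token the real page sends, e.g.:
      // Authorization: `Bearer ${tokenLiftedFromTheBrowser}`,
    },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);

  const body = await res.json();
  offlineCopy.set(url, body);
  return body;
}
```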
I come from a cracking background, and back in the day this is exactly what we would do. We would write GET/POST requests with token-grabbing methods and get the job done. We'd launch hundreds of threads, all connected to different proxies, instead of a single web browser. Sometimes it was challenging for particular platforms because of cookies, but at the end of the day it was doable.
Websites that use reCAPTCHA v2 are the ones I've found I need a browser to interact with. For ones that don't, then yes, totally: inspect the network traffic, understand how your browser is creating the requests, and then replicate them.
+1 it’s such a massive pet peeve of mine seeing people use selenium when it could just be achieved with requests.
@@unforgettable31 what about captchas then
@@cheemzboi Most platforms use captchas when they detect ongoing suspicious activity, which is avoided when using proxies.
Great video. I liked the step-by-step thought process of getting the scraper to pull data. One major flaw in the cost analysis you presented was the absence of any cost for Bright Data. Checking the pricing myself, it looks like €20 per GB of data?
I actually did something similar, we needed financial data for a project we were working on and the APIs we found were very limiting and some were very expensive.
We tried using one of the cheaper ones and it straight up did not work; it had downtime of sometimes hours, and when we contacted the company they basically told us it wasn't their problem.
So I built a web scraper, hosted it on my server at home and scraped all the forex data I needed from their website for free.
8 cents for 3 weeks damn this really makes reddit unreasonable
That does come with significant management overhead. The project is a simple way to get it working; once you dig deeper there are lots of problems. Lambda and DynamoDB are cheaper depending on the number of requests. If you post your API endpoint in public, 1 million requests will be gone in seconds, and then using Lambda will make it more expensive than running your own server.
If it were cheaper, someone else would have done it already.
Honestly, I think this infrastructure is too complicated for what it is doing. I don't really care about the sponsored bit, but I think it would have been better to simply create a lambda that writes directly to a database (assume a cacheFactory -> RedisCache | MongoCache | JsonCache) along with a "freshness" param. Given the relative simplicity of the data, I think Redis would be a good candidate. Then all the API would need to do is fetch the data based on the query param, something which can probably be achieved in a single file.
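A rough sketch of that cacheFactory + "freshness" idea (all names are hypothetical); here an in-memory Map stands in for the Redis/Mongo/JSON backends, and the handler only re-scrapes when the stored copy is older than the freshness window:

```typescript
// One cache entry per subreddit, with the time it was stored.
const cache = new Map<string, { value: unknown; storedAt: number }>();

// Placeholder for whatever actually does the scraping.
async function scrape(subreddit: string): Promise<unknown> {
  return { subreddit, posts: [] };
}

// Only re-scrape when the cached copy is older than the freshness window.
async function getPosts(subreddit: string, freshnessMs: number): Promise<unknown> {
  const hit = cache.get(subreddit);
  if (hit && Date.now() - hit.storedAt < freshnessMs) return hit.value;

  const fresh = await scrape(subreddit);
  cache.set(subreddit, { value: fresh, storedAt: Date.now() });
  return fresh;
}
```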
Yeah, I feel it's been quite overengineered with all this message queue and database/service stuff. This could realistically be done fully locally, and at not much of a bigger cost, since nowadays OSS databases and caching solutions are really efficient
he will need a 2-4GB ram VM to do that. AWS is expensive
@@hqcart1 he is deferring the scraping to the sponsored service anyway, but I think we can just fetch the HTML instead of running a headless browser
@@joopie46614 This could be a single service in a Docker image: run a cron scheduler that fetches and writes to a JSON file, and have a server running that uses the JSON as its database
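Something like the following; node-cron and express are assumptions, and the endpoint and file path are placeholders. One container, one cron job, one JSON file.

```typescript
import { readFile, writeFile } from "node:fs/promises";
import cron from "node-cron";
import express from "express";

const DATA_FILE = "./posts.json";

// The cron job scrapes and overwrites the JSON file.
async function refresh(): Promise<void> {
  const res = await fetch("https://old.reddit.com/r/programming/new.json", {
    headers: { "User-Agent": "my-scraper/0.1" },
  });
  await writeFile(DATA_FILE, JSON.stringify(await res.json()));
}

cron.schedule("*/15 * * * *", refresh); // re-scrape every 15 minutes
refresh().catch(console.error);         // warm the file on startup

// The JSON file doubles as the "database" for the API.
const app = express();
app.get("/posts", async (_req, res) => {
  res.type("json").send(await readFile(DATA_FILE, "utf8"));
});
app.listen(3000);
```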
@@dancinglazer1628 even if he uses a sponsored service, at some point you will get a captcha, and my point was that his code does not handle that. And about fetching HTML: no, it does not work for complex sites where the HTML or classes get rewritten by JS. I tried that and failed, and ended up using a headless browser.
I did the same thing for my app using selenium bindings in rust and used vagrant to manage instances. You can use docker if you want. Please mark this video as an ad, because no one in their right mind would do it this way. I am so tired of people shoving ads down my throat and claiming it's good education.
1st off, thanks for this buddy, you're a godsend.
it does feel a bit over-engineered but i guess you've gone this route because you want to build your own Reddit API.
for folks like me who have only been coding every day for 1 month using GPT - knowing how to pull the data from reddit and store it in a database is the main thing i need (i think most people as well but i could be wrong).
keep up the good work still and thank you again !
I'm curious as to what the total time for the project was.
THANK YOU!
Very helpful!
What about the cost of bright data ?
i hope this doesnt kill old reddit, if they remove it im gone
Unrelated to this video - would you show how you version your dotfiles (if you do)? It would make for a good video.
you are aware of the JSON API that things like RIF are using? basically, for every link, there is also a JSON file you can just access
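That trick in a few lines: append .json to more or less any reddit URL (the User-Agent header here is an assumption about what reddit accepts):

```typescript
const res = await fetch("https://www.reddit.com/r/programming/new.json?limit=25", {
  headers: { "User-Agent": "my-scraper/0.1" },
});
const listing = await res.json();
// The listing nests each post under data.children[i].data
for (const child of listing.data.children) {
  console.log(child.data.title, child.data.url);
}
```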
Using a full webbrowser when a simple HTTP request and HTML parser would suffice...
You're correct. It would have. However a browser is a more versatile option for other use cases.
Why didn't you use curl? It would be way more lightweight than using a browser. Most programming languages will let you manage DOM objects with built-in libraries
Every modern website has an API.
Most just aren't documented. 🤷♂️
Copy their own website's auth flow and use those tokens to drive your app. What are they gonna do? Paywall their entire site?
(Ok, ok; SSR is a thing, but there's still almost always some pure-data endpoint around)
You can remove the browser part by using a web scraping framework that works without a browser instance.
You can also reverse engineer their private api by looking at the browser network requests. The scraping will be much faster
Although Cloudflare IUAM makes it an immense pain in the ass
And keep in mind that private APIs are susceptible to change, so today it’s gonna work, tomorrow you have to start over
@@batmanatkinson1188 less often than the HTML
@@S0L4RE what's that? Does Reddit use it?
@@TheSaintsVEVO I’m not sure if Reddit uses it, but IUAM detects very low-level characteristics about the request (i.e cipher mode, SSL configuration) to determine whether it looks automated.
Man, could you make a video about the configurations of your terminal?
How to get four hitmen at your door:
Would it have cost more without the brightdata sponsorship?
Ohh yes, a lot, bright data is very expensive
So this vid popped up in my recs. Unrelated off-topic comment, but I remember getting into a programming phase in grade 6-7. I pretty much obsessed over the thought of doing something great with it. Got myself to do a few courses but never really stuck with it, as I've moved on to finance. Now I kinda wanna get into it again like I did in the past...
Great vid! Love your work
xD i've seen this coming for months but why not keep AWS and tunnel the requests through a proxy
How do you avoid vendor/database lock in by using AWS SQS?!
absolutely brilliant video. so very well done.
Thank you! I'm glad you enjoyed it!
Heya! I'd recommend Railway to host your apps, its usage based and pretty cheap!
Would love a post on your powerlevel10k config and your terminal config
Learned about the queuing system utilization. But that's pretty much the only thing new to me.
If you could use RSS to pull the data and store it in a proper format to be served through the API, you'd save at least 40% of the time and effort of your current approach!
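For reference, reddit also exposes Atom feeds at ".rss" URLs; a minimal sketch (the regex-based title extraction is just to keep the example dependency-free, a real version would use an XML parser):

```typescript
async function fetchFeedTitles(subreddit: string): Promise<string[]> {
  // Reddit serves an Atom feed at ".rss" for most listing URLs.
  const res = await fetch(`https://www.reddit.com/r/${subreddit}/new/.rss`, {
    headers: { "User-Agent": "my-scraper/0.1" },
  });
  const xml = await res.text();
  // Naive extraction of <title> elements from the feed.
  return [...xml.matchAll(/<title>([^<]*)<\/title>/g)].map((m) => m[1]);
}

fetchFeedTitles("programming").then(console.log);
```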
tip: bun 1.0 was released just yesterday, and you can use it as a drop-in replacement for node.
it executes js much faster, without breaking anything, so it can magically make your api faster. for deployment, you need to use a docker image because it's still very early and not supported by any platforms (yet)
it just gets stuck if I try to run puppeteer with whatsapp-web.js. yeah, fast and cool, but too early
I see you're using a Mac now, what terminal is that? How are your rounded window corners so much less rounded than mine? Have you changed anything?
I watched this video for 4 hours because it was on repeat and I fell asleep
Would something like this work for third-party applications like Reddit Apollo?
It can work... until you get a captcha
2:02 isn't playright yet another ECM and not a web scraper?
Are there any issues with legality when using the data you extract? I.e. could you use the data for commercial purposes, or research?
Microsoft, I think, have taken someone to court for web scraping and won. I think it was a company that was scraping LinkedIn users' public data and building their own app for recruiting people, and Microsoft argued that the users didn't consent to that (which is true, but then again the data is public). So it's a very tricky problem, and it's best to read websites' terms of service.
The quality of this project is supreme. Love the detail and consideration for the infrastructure
Is it just me or are the icons for the Go files different? How do you change these icons please?
why do you use playwright and not just puppeteer?
How long did it actually take for you to complete this project.
Quality project ❤ content worth watching, keeps you hooked the whole time. 🎉
Good job, one improvement would be to go with a single table design with DynamoDb
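A rough sketch of what single-table design could look like here (the table name, key shape and attribute names are all made up): posts share one table, partitioned by subreddit and sorted by time, so "latest posts for a subreddit" is a single Query.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "reddit-scraper"; // hypothetical table name

async function savePost(subreddit: string, id: string, title: string, postedAt: string) {
  await ddb.send(new PutCommand({
    TableName: TABLE,
    Item: {
      pk: `SUB#${subreddit}`,          // partition key groups posts by subreddit
      sk: `POST#${postedAt}#${id}`,    // sort key orders them by time
      title,
    },
  }));
}

async function latestPosts(subreddit: string, limit = 25) {
  const res = await ddb.send(new QueryCommand({
    TableName: TABLE,
    KeyConditionExpression: "pk = :pk",
    ExpressionAttributeValues: { ":pk": `SUB#${subreddit}` },
    ScanIndexForward: false, // newest first
    Limit: limit,
  }));
  return res.Items ?? [];
}
```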
Good one sir!
Fantastic! Very informative, always nice to stick it to big tech lol
Nice you are using JB Mono, like me.
what theme are you using, the colors are handsome ;)
jesus there's more terraform configuration than code
Which Editor is he using? Vim?
How long have you been programming? Loved this video btw!
Thank you! I've been writing code since 2008.
DynamoDB isn't really low cost, so I would definitely look into switching to ScyllaDB which offers a DynamoDB compatible API.
Great content! Quick question. Did you do this after logging into to Reddit with your userid/pwd o without? IIRC Reddit does not show new content if you are not logged in. Thanks!
Thank you!
Logged out, which causes it to fall under publicly accessible. Reddit still shows content on the old reddit website under /new when you're not logged in.
hi dreams, i love your vids on vim and tried it on my own because of them
while trying C++, i want to know if there is a better option than CMake?
i come from python, so i plan on RPC-ing the python part and moving to mostly C++ or Golang. any ideas on how to do this?
The better option is to use a language with good tooling. Zig, Rust, Go, etc. cmake L, Make L.
Meson! It's basically CMake, but with syntax similar to python, and a lot less stupid design decisions. Definitely worth a look.
@@jacksonsmith4648why are we hating on cmake?
"Just build your own API"
*builds own API*
"NOO NOT LIKE THAT!!!!"
Bahahaha...
You don't need any browser to scrape HTML from reddit. How did you even manage to configure vim with that kind of skill?
what about pressing the next button, don't you need a browser emulator for that?
@@SAsquirtle unless you need to take a screenshot or you don't have much experience/time, using puppeteer-like tools is extremely wasteful. And for simple text scraping you don't even need that much experience at all
Genuine question, why use puppeteer that relies on an active browser and not something like cheerio?
It's a great question. Cheerio would work really well in this case as there was little to no javascript for the old version of reddit. Initially I wanted to go with the new reddit so had scoped out using an active browser (which I think has more application beyond reddit). Cheerio is always preferable in a case with no javascript, but it's not as applicable as puppeteer is. TLDR is that I wanted to showcase active browser scraping in the video.
what is the Font you are using for Nvim and tmux status bar, please?
I am using JetBrainsMono Nerd Font! I have a video on both of my Nvim and tmux configs on my channel :)
DataStore and FireStore work roughly the same as Dynamo, no?
boi do i smell a cease and desist
There’s no AI without API
i was with you until you started putting things in a database and the cloud. was it because your video was sponsored by a cloud provider? (i really can't tell) it would be more interesting to see you justifying decisions. seeing all the code is really not that interesting. the overall idea of creating your own reddit API is interesting though, so i will give this a like
How do we use your api then ?
virgin api consumer vs chad scraper
Great content, real examples of use case for different tools for a simple but useful project.
if you used python you could easily bypass ip blocking with torpy
Sir, I am learning C and am new to programming. Currently I am learning control structures. But when I look into real-world projects I don't understand anything - why?
It takes time! Also C is a VERY different level of abstraction than Javascript / Go like he used here.
One of my favourite videos in a while, great job!!!!
Intercontinental Lawsuit Inbound!!!
sorry, off-topic: are you using a Mac, sir?
what is the tmux font ?
How about developing a browser extension for "enhancing" reddit that would additionaly scrape any post that user sees 🤔
Apicels be seething over scrapechads
why not use their rss feed?
how to bypass captcha?
This video shows the only way an arms race should be visualized
$20 per GB is something different jesus
Don't let pyrocynical see this video he'll become a web dev
what about captcha ??????????????????????
I'm thinking of just scrape reddit directly from a mobile device, and maybe save the data to the device for caching. I don't need to pay for anything
Oh god, I hope this doesn't give them even more reason to kill old reddit, it's the only way I can bear using reddit now.
As an aside, would it be possible to decompile/packet sniff their mobile app and emulate the requests it makes for a pseudo-api? I haven't decompiled android apps in a hot minute, but I imagine it uses some sort of api rather than downloading a massive html payload that requires parsing
chromedp for golang is also an option
Why use playwright and not just parse html directly if you're going to disable CSS, JS and media loading anyway? Most HTML parsing libraries support CSS selectors.
I hate dishonest/corrupt sponsored content like this, showing bad and expensive solutions to a problem, just because you're paid for it.
I'd honestly be happy to be proven wrong.
I have no clue how to use APIs, I still don't completely understand, but data is the new oil 😅
Great video btw! In the future I also want to make my own web scraper project and this just simplified everything I need to do.
Is there any reason why you didn't just use Golang for the whole thing, for the scraper as well? Just curious, since as you said Golang would be faster than Node.js
Curious about Golang - any repo / vids ?
No, just no.
Nice project, yet absolutely impractical. Just imagine the API calls you could have paid for with the developer-hours required. And your browser will get rate-limited into the dirt really quickly.
They will block things like this with Web Environment Integrity
so you pay AWS instead....
didn't understand anything but great video
A very interesting comment section.
Old reddit is the only version I use...
Nice video, but your setup is a lot more complex than it needs to be IMO.