Design a Basic Search Engine (Google or Bing) | System Design Interview Prep

Interview Pen

416K views

Published: Dec 21, 2024

Comments

@interviewpen 1 year ago ⁺²⁶
Thanks for watching! Visit interviewpen.com/? for more great Data Structures & Algorithms + System Design content 🧎
@SauravSingh-mz9hc 2 months ago
Want to give interview of ai😢
@prathamshenoy9840 1 year ago ⁺⁹¹
needless to say.... your channel will grow superbly. this video was RECOMMENDED by youtube
@interviewpen 1 year ago ⁺⁸
Thanks for the kind words! Yeah we will be posting a lot more & hope to create more quality stuff.
@newman6492 1 year ago ⁺²
Yes.
@artemvolsh387 1 year ago ⁺¹⁹
Channel currently hugely underrated, material is just delicious, especially for those who seek examples of complex system schemes.
Love it.
@interviewpen 1 year ago ⁺¹
Thanks! We have more coming - production starts this week!
@artemvolsh387 1 year ago
@interviewpen Great to hear!
@SuperGojeto 24 days ago
I had a feeling that I would have to crawl YouTube for a quality video on Google system design! But BRAVO! This was the first video I found, and it has the most views. Great job!
@marwanezzat2637 1 year ago ⁺³²
Dude, your content quality is superior, keep going.
Yesterday you had 145 subscribers and now you have 245, I am so happy for you.
@interviewpen 1 year ago ⁺³
Thanks! haha - we'll be posting a lot more so stay tuned!
@recursion. 1 year ago
He's got 12.5k now lmaooo
@scorcism. 1 year ago
18k now
@interviewpen 1 year ago ⁺¹
24k now!
@scorcism. 1 year ago
24.8k -> 25-05-2023
@dmitrydmitriev2554 1 year ago ⁺⁵
Greetings, I just came to YouTube to watch a video about SQL optimization and your channel was offered. And I started to watch this video. It is amazing, the way you explain is brilliant and outstanding. Very clear, full of information, not boring because too obvious, not difficult because too sophisticated and convoluted - a golden middle.
Thank you!
@interviewpen 1 year ago ⁺¹
thanks for the kind words! and thanks for watching - more coming
@raghuboyapati7311 1 year ago ⁺⁴
This channel is gonna explode. The content is just too good. Thank you.
@interviewpen 1 year ago
Thanks for watching - we'll be posting a lot more!
@michaelmaloy6378 18 days ago
I hope this channel grows exponentially.
This is amazing content!
Thank you, and well done! :)
@lucasoliveira-xs5yh 1 year ago ⁺³¹
Awesome content! I liked seeing some data structures (such as queue and heap) used in practice, because the simple examples are good in the beginning, but they stop being useful over time. Continue with this, this channel is really a hidden gem
@interviewpen 1 year ago
thanks for watching - more coming
@nader2560 2 months ago ⁺¹
Honestly one of the best videos on the internet for system design!
@interviewpen 1 month ago ⁺¹
Thank you!
@bandr-dev 1 year ago ⁺⁴⁰
lmao if this was the interview question I'd just not do it.. but I'm not there yet
@interviewpen 1 year ago ⁺⁶
we'll get u there 🧎🧎 be brave
@theuniverse2268 1 year ago
@interviewpen only slaves need to do this
It's not worth it 🤷‍♂️ If you know how to build a search engine you're already the top 1% of the human population just make your own company and forget about a job lol
@williefr 1 year ago ⁺¹⁵
I really enjoyed the video! Thank you guys for taking your time and posting it, it was very entertaining and educational. Best regards
@interviewpen 1 year ago ⁺¹
Thanks! Stay tuned for more content
@ĀRYAN_GENE 1 year ago
woow loved it
by 1:48 I was in love because you laid out every small detail required + the planning, which makes understanding a lot easier than jumping directly into code and explaining on the go
@interviewpen 1 year ago
Thanks for watching
@carlboneri7772 1 year ago ⁺³²
One of the best walkthroughs I've ever seen, regardless of the topic or technical depth. Superb work man.
@interviewpen 1 year ago
thanks for commenting & the nice words - more videos coming soon!
@vladd3172 1 year ago ⁺⁷
Clean, clear, efficient. ❤
I’d love to see more videos like this from you!
@interviewpen 1 year ago
Will do, thanks for watching!
@juanitoMint 9 months ago
Really appreciate the back-of-the-envelope calculations in between!
Great work!
@interviewpen 9 months ago ⁺¹
Thanks, glad you enjoyed it!
@linonator 1 year ago ⁺²
Wooooah!!!! This is what I needed in my life 😢. I’m now complete
@interviewpen 1 year ago
awesome haha - thanks for watching
@pingqiu7318 1 month ago
Very good video! Thanks for sharing! One tiny thing: I would prefer NFS over a blob store like S3 for keeping the downloaded pages. A webpage references lots of resources, like JSON/CSS/JavaScript files. Bold or highlighted words are more important than plain text, and that's very important information for ranking. If we don't keep the CSS files, that information will be lost, so we need to keep them together with the HTML file. With S3, tracking all the files that make up a single webpage in metadata would be very complex. So I suggest just downloading the web page with everything into a folder in NFS and letting the indexers help themselves.
@andydataguy 1 year ago ⁺²⁹
Your course looks great. I love that you have a teaching assistant and the explaining styles are awesome. Will try it out! Only thing is I really wish you supported Rust 🦀🙏🏾
@interviewpen 1 year ago ⁺⁵
Thanks! We can add language support in under an hour. (from the engineering angle) We can push changes in a day. Just let us know in Discord.
@timSquash 1 year ago
yeah I've just started learning rust. It's such a cool language
@MrRetroboyish 1 year ago ⁺³
Only 1/3 of the way through and already one of the best I've seen. Focused, logical leaps from topic to topic, minimal digressions. Keep it up
@interviewpen 1 year ago
Thanks! and thanks for watching
@JM_utube 1 year ago ⁺¹
I really appreciate this video! Information was clear and concise. Levels of depth are perfect for the viewer to be able to continue educating themselves about any of the topics mentioned here. Thank you so much
@interviewpen 1 year ago
Sure - thanks for watching!
@BraisonsCrece 1 year ago ⁺²
keep it going!
High quality content and a very solid platform! Without a doubt, I will buy the subscription soon and start learning!
a hug from a new Spanish subscriber
@interviewpen 1 year ago ⁺¹
cool! Thanks for watching. Let us know in Discord if u need any help.
@marko3808 1 year ago ⁺²
This is amazing! I honestly can't wait to look into your other videos!
@interviewpen 1 year ago ⁺¹
More videos coming! Thanks for watching.
@marko3808 1 year ago ⁺¹
@interviewpen eagerly waiting!
@s8x. 1 year ago ⁺¹
Wow, this is information all for free. Thank you for making this video
@interviewpen 1 year ago
thanks for watching - more coming
@Roshen_Nair 1 year ago ⁺²
Loved the video! A video I'd love to see in the future is system design for video streaming applications e.g. YouTube, Netflix.
@interviewpen 1 year ago ⁺¹
Will do - thanks for watching
@strawberriesandcream2863 1 year ago
amazing video, thanks👏👏i like how you guys dig deep into complex aspects of every system that some other content just glosses over
@interviewpen 1 year ago
Thanks!
@frankguo1748 10 months ago
Really clear, concise and efficient explanation and narrative. 👍
@interviewpen 10 months ago
Thanks!
@johnny_silverhand 1 year ago ⁺³
Exceptional way of explaining things, I'm subscribed to you guys now
@interviewpen 1 year ago
great!!
@dave6012 1 year ago
Dang, I never thought I could understand this whole process. I typically wrote off most of the implementation details as a black box, but this seems halfway approachable.
Has me thinking a lot about single page applications, and how the crawlers handle them. A similar type of video would be awesome if you had it.
@interviewpen 1 year ago
Glad you liked it! Yes, SPAs are notoriously hard to optimize for crawlers. However, strategies like static rendering and routing can make SPAs look more like typical websites to a crawler. I'm not an SEO expert though :)
@dave6012 1 year ago
@interviewpen haha I appreciate the legal disclaimer
@syn3rman65 1 year ago ⁺¹
Holy shit I'm glad I found this before you've blown up 🙌
@interviewpen 1 year ago
Thanks for watching! More coming!
@henrythomas7112 9 months ago
I really like the video, man. Very helpful and informative. Thank you very much. It is presented so well too. Great, positive work.
@interviewpen 9 months ago
Thanks, glad it helped you!
@rembautimes8808 9 months ago
One application of this solution is horizon risk scanning. The use case is that a large multinational corporation wants an idea of new risks that are emerging, and adopting this approach gives them traceability back to the web source. Of course they won’t be crawling 100M pages but maybe 100k pages.
@interviewpen 9 months ago
Interesting! Thanks for watching :)
@Sgene9 1 year ago ⁺¹
This was amazing. Now I want to try building a search engine!
@interviewpen 1 year ago
lets gooooo - thanks for watching
@gmanonDominicana 1 year ago ⁺¹
I was looking for something like this for a while. This content is worth the time spent.
@interviewpen 1 year ago
thanks for watching
@theprovego2934 1 year ago
2:00 This is how to make an ad, good job!
@interviewpen 1 year ago
Lol thanks :)
@sinnloses746 1 year ago
Second video I've watched from you. It’s so good. thank you
@interviewpen 1 year ago
Glad you liked it!
@yipmong 7 months ago
I am impressed, you really deserved my sub❤
@interviewpen 7 months ago ⁺¹
Thanks!
@jeromeeusebius 8 months ago
Thank you for sharing the great design prep video. What tool or combination of tools/software is used to create the figures (with the black background)? Thanks
@interviewpen 8 months ago
Thanks for watching! We use GoodNotes on an iPad.
@sperpflerperberg8147 1 year ago ⁺³
This channel is amazing
@interviewpen 1 year ago ⁺¹
thanks - we have a lot more coming!
@MuscleTeamOfficial 1 year ago ⁺²
This is high quality content.
@interviewpen 1 year ago
Thanks, glad you enjoyed it!
@chandrasekharmandapalli9181 1 year ago ⁺¹
Great work buddy....very detailed explanation... cheers
@interviewpen 1 year ago
thanks for watching!
@ekanshmishra4517 1 year ago ⁺⁴
Never saw such a difficult problem explained so easily❤️ subscribed instantly
Love from India❤
@interviewpen 1 year ago
Thanks for watching!
@notenlish 7 months ago
Great video man, wish I had found this before
@interviewpen 7 months ago
Thanks!
@Pankaj.Pilkhwal 1 year ago ⁺¹
really wow!!!!!!! amazing content.
@interviewpen 1 year ago ⁺¹
thanks, more coming soon!
@amigos786 1 year ago ⁺¹
Hey awesome video. Just subbed. What is the app you are using on the iPad for this?
@interviewpen 1 year ago ⁺¹
GoodNotes - thanks for watching!
@maharshiguin7813 1 year ago ⁺²
Great video, really like your way of explaining stuff.
@interviewpen 1 year ago
thanks - more videos on the way!
@tofahub 1 year ago ⁺⁷
How does sorting by frequency give us the most popular results? The frequency is the number of times the word occurs on the page at that specific URL. A word may appear on a page many times simply because it is a common word, and that doesn't make the page the most popular search result.
@interviewpen 1 year ago ⁺⁶
You're completely right! Google uses the PageRank algorithm in addition to a more advanced index to handle that--we glossed over this for our "basic" search engine since it's more of an algorithms problem than a system design one. Regardless, there's some cool infrastructure that goes into calculating PageRank at scale so that's certainly something to look into if you're curious. Thanks for watching!
@esm2000 1 year ago ⁺⁵
ironically sorting by frequency was the original implementation of the page rank algorithm, long before it became more advanced
@H3llsHero 1 year ago
You can look up tf-idf (term frequency-inverse document frequency) to learn more about how common "filler" words are down-weighted in a basic search engine.
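A minimal sketch of the tf-idf idea raised in this thread, assuming whitespace tokenization and a tiny in-memory corpus. The function name `tf_idf` and the example URLs are hypothetical, and this is not the ranking used in the video. It shows how a word that appears on every page scores zero, while a rarer word scores higher.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return {url: {word: tf-idf score}} for a small in-memory corpus."""
    tokenized = {url: text.lower().split() for url, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many pages does each word appear at all?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))

    scores = {}
    for url, words in tokenized.items():
        counts = Counter(words)
        scores[url] = {
            # Term frequency, scaled down by how common the word is overall.
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical pages, for illustration only.
pages = {
    "a.example/cats": "the cat sat on the mat",
    "b.example/dogs": "the dog chased the cat",
    "c.example/news": "the market fell",
}
scores = tf_idf(pages)
```

Here "the" occurs on all three pages, so its inverse document frequency is log(3/3) = 0 and it contributes nothing to ranking, no matter how often it appears on a page.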
@BrianStDenis-pj1tq 1 year ago
This is great content. Regarding shingles, that takes a LOT to implement - lots of space and lots of CPU to compare them. The idea of personalized recommendations is a huge success for Google and is surely difficult to implement, considering the entire search, rank (personalize) and retrieve has to be done in a second.
@interviewpen 1 year ago
Thanks! You're exactly right--Google has built an incredibly impressive system :)
@khuntasaurus88 1 year ago ⁺¹
Well that's an instant sub!!
@interviewpen 1 year ago
yes! thanks for watching!
@maksym7703 1 year ago ⁺²
man, it's such good content, who are you personally btw?)
@interviewpen 1 year ago ⁺¹
The instructor is named Bobby - I am Benyam, I do our Data Structures & Algorithms. Thanks for watching.
@govardhannarayan3907 1 year ago
Great video..
Keep it up folks.
@interviewpen 1 year ago
Thx for watching 👍
@rockosaji9400 1 year ago ⁺²
Wow...Super impressed
@interviewpen 1 year ago
Thanks! A lot more coming! We will be posting consistently.
@TarrenHassman 1 year ago ⁺¹
Also important to remember that search engines are moving to vector databases with machine learning matrices
@interviewpen 1 year ago
Good point!
@chenhaofeng4842 9 months ago
Really appreciate it. I have several questions about the politeness part. If there are 10k hosts, are we supposed to have 10k queues for politeness? Let's say one host has only 3 URLs: after all 3 URLs are visited, are we supposed to delete the idle queue? Each time we have a new host, are we supposed to create a new queue?
@interviewpen 9 months ago ⁺¹
Yep, we'd need one queue for each host. There'd probably be far more than 10k in fact! Of course, these would simply be logical partitions residing on a far smaller set of physical machines. We would need to add a queue when a host is visited for the first time (this would be trivial since a queue is just a logical abstraction), but we probably wouldn't need to worry about deleting since we'll keep re-crawling hosts. Hope that helps!
@danielghani3903 1 year ago ⁺¹
thank you, ma'am
@interviewpen 1 year ago
sure - thanks for watching!
@moacir8663 1 year ago ⁺²
I'd like to watch a deeper explanation of how to search for data in a sharded database like the one you explained.
@interviewpen 1 year ago ⁺²
we'll cover sharding in-depth soon! thanks for watching!
@moacir8663 1 year ago
@interviewpen I'm looking forward to watching it.
@nikitaluparev6478 10 months ago
While you were explaining the schema, you mentioned a hash as a way to make sure something is unique. Can you explain in detail how a hash helps with that?
@interviewpen 10 months ago
Sure--hashing a large piece of data (such as a webpage) yields a far shorter, fixed-length string that, for practical purposes, uniquely represents that data and can be stored in a database. By checking if this hash already exists in our database, we can effectively check if the webpage has already been seen without having to compare the page content against petabytes of other pages.
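The hash check described in this reply can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the hash and an in-memory set standing in for the hash index; the class name `SeenPages` is hypothetical. It only catches byte-for-byte duplicates, and collisions are possible in principle but negligible for a cryptographic hash.

```python
import hashlib


class SeenPages:
    """Tracks content hashes; a plain set stands in for the hash index."""

    def __init__(self) -> None:
        self._hashes: set[str] = set()

    def is_new(self, content: bytes) -> bool:
        """Return True the first time this exact content is seen."""
        # SHA-256 always yields 64 hex characters, however large the page is.
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True


seen = SeenPages()
first = seen.is_new(b"<html>hello</html>")   # True: never seen before
repeat = seen.is_new(b"<html>hello</html>")  # False: same bytes, same hash
```

The lookup compares one short digest instead of the full page body, which is what makes the check cheap at scale.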
@basharatwani3948 1 year ago
Thank you for sharing, good content and good work. I'd suggest starting with the core functional and non-functional requirements, then the capacity planning numbers and the reads/writes per second needed to support the core functional needs. Otherwise it seems we go straight into the solution, which is OK, but some may want to know how we think through ambiguity and the problem space and have a conversation with the interviewer about what we want to do. Maybe also consider adding how to handle copyright issues when extracting and rendering HTML, a de-dupe service and bloom filter, how nested cyclic loops within a site will be handled, a caching strategy, etc.
@interviewpen 1 year ago
Thanks for watching. You're right, addressing the requirements ahead of time is very important in this process, and our more recent videos tend to be better about that :)
@FeyroozeCode 1 year ago
Very Simple and Good
@interviewpen 1 year ago
Thanks!
@dibll 1 year ago ⁺³
Could someone pls explain what text and hash indexes are? Are they separate DBs storing partial information compared to the main DB, or something else? Thanks!
@interviewpen 1 year ago
You're exactly right. You can think of global indexes as a copy of the database but organized onto nodes differently, and the records generally only include enough data to be able to look up the corresponding record in the primary.
@SaveCount-bh8tp 6 months ago
Your Channel is very good
@interviewpen 6 months ago
Thank you!
@scottthornton4220 10 months ago
Love the video but I'm perplexed as to why you want to store the site contents. I figure that you would just scrape it for word frequencies for matching later to queries?
@interviewpen 10 months ago ⁺¹
Good question--we store the site contents so we don't have to scrape them again later if we want to change our algorithms. Google does this too! Thanks for watching.
@shs4293 1 year ago ⁺¹
Instead of sharding right off the bat, we could use partitioning. Sharding should be the final resort
@interviewpen 1 year ago ⁺¹
Good point, but 31TB of metadata is a lot to store on one node so it's necessary in this case to scale horizontally. Our query patterns work very nicely here (always single-record reads/writes by a unique key), so it shouldn't be a problem. Thanks for watching!
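To illustrate why single-record reads and writes by a unique key shard cleanly, here is a minimal sketch that assumes the URL is the key and uses simple hash-modulo placement. The function name `shard_for` and the shard count are hypothetical, and the video does not specify a placement scheme.

```python
import hashlib


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard number using a stable hash of the key."""
    # Python's built-in hash() is salted per process, so use SHA-256 to get
    # the same shard for the same URL on every machine and every run.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for("https://example.com/page", num_shards=16)
```

Every read or write for a URL touches exactly one shard, so no cross-shard query is needed. One caveat of plain modulo placement is that changing `num_shards` moves most keys; consistent hashing is a common way to reduce that.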
@VermeilChan 1 year ago ⁺¹
The amount of time u put in this video is crazy 😭
Keep it up 😼😼
@interviewpen 1 year ago
Thanks, more is on the way :)
@johnnybravo964 1 year ago ⁺¹
I don't understand why this search engine has to actually store the contents of each website, when it can just store the domain/IP address of where that website is.
@interviewpen 1 year ago
Good point, that wasn't covered very well in this video. We usually want to keep the contents of the page in case we change things later on. Rather than having to re-crawl every page, we can use our cached copy to change site priorities, metadata, etc. But for the core functionality of this system, it's not entirely necessary. Thanks!
@johnnybravo964 1 year ago
@interviewpen That must take a lot of storage to pretty much store a copy of the entire internet! Do they also scrape all of the videos on the internet and store that?
@johnnybravo964 1 year ago
@interviewpen The internet is estimated to be 64 billion terabytes!
@PouyanNosrati 1 year ago
It was an incredibly detailed explanation
@interviewpen 1 year ago
Thanks for watching 👍
@FranciscoGomez-tw1ii 1 year ago ⁺²
Amazing!!!
@interviewpen 1 year ago
Thanks for watching.
@CertificationTerminal 1 year ago
Awesome!
@interviewpen 1 year ago
Thanks for watching 👍
@christhornham 1 year ago
Outstanding! Thank you!
@interviewpen 1 year ago
Thanks for watching!
@premparihar 1 year ago ⁺¹
The video is really awesome and helpful ❤.
@interviewpen 1 year ago
thanks for watching!
@El_Remolino19 1 year ago
what are you using to draw on and the software to make this? i find it super helpful and would like to make my own videos using it, thank you
@interviewpen 1 year ago
Cool, we're using GoodNotes on an iPad. Thanks!
@langtuyetvuanh1999 1 year ago ⁺¹
great video, but can I ask? can we use elasticsearch instead? I'm not a professor but I see a lot of systems using Elasticsearch to optimize their query performance.
@interviewpen 1 year ago ⁺¹
Glad you liked it! ElasticSearch actually uses a very similar data structure to the "text index" we described, and this could certainly be swapped out for our database in this system. It's just about tradeoffs between ease of use in a managed service and flexibility.
@andrewkamoha4666 1 year ago ⁺¹
Piece of cake !!!
@interviewpen 1 year ago ⁺¹
ye
@andrewkamoha4666 1 year ago ⁺¹
@interviewpen I'm gonna build one now to smash Google !!! kkk :-D
@darkwoodmovies 1 year ago ⁺¹
The fact that when you crunch the numbers, the metadata is only
@dzuchun 1 year ago ⁺¹
having trouble finding the shingles technique the author mentioned close to the end. can anyone give some sort of reference?
@interviewpen 1 year ago
Thanks for watching! It's a bit math heavy but here's a reference for shingling: nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
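As a rough illustration of shingling, the sketch below builds word-level k-shingles and compares two pages by Jaccard similarity. It assumes whitespace tokenization and k = 3; the function names are hypothetical. Production systems typically hash and sample the shingles (for example with MinHash) rather than comparing full sets.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(page_a), shingles(page_b))
```

The two sentences differ by one word, which changes three of the seven shingles on each side: 4 shared shingles out of 10 distinct ones gives a similarity of 0.4. A near-duplicate threshold would be chosen empirically.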
@Vinod_Kumar827 1 year ago
Very nicely explained
@interviewpen 1 year ago
Thx for watching!
@rushio8673 1 year ago
Please explain how the prioritizer works here
@interviewpen 1 year ago
Sure. There are a number of algorithms we could implement here, but the general idea is to analyze the page and how frequently it changes to determine how frequently to crawl it. The prioritizer will take in all the data and insert the page into the correct queue based on its calculated priority. Thanks for watching!
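One possible shape for such a prioritizer, purely as an assumption: track how often a page had changed when it was re-crawled and map that rate onto a fixed number of priority queues. The function name, the five levels, and the rule for never-crawled pages are all hypothetical; the reply above leaves the actual algorithm open.

```python
def priority_for(changes_seen: int, crawls: int, num_levels: int = 5) -> int:
    """Return a queue index in [0, num_levels); higher means crawl sooner."""
    if crawls == 0:
        # Assumption: a page we have never crawled is treated as urgent.
        return num_levels - 1
    change_rate = changes_seen / crawls  # fraction of crawls that saw a change
    return min(int(change_rate * num_levels), num_levels - 1)


static_page = priority_for(changes_seen=0, crawls=10)   # never changes
news_page = priority_for(changes_seen=10, crawls=10)    # changes every time
```

A page that changed on every visit lands in the top queue, and one that never changed lands in the bottom queue.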
@dombat44 1 year ago
Great content, yours are the best system design interview mocks I've seen on here. Could you do one on an RSS feed website?
@interviewpen 1 year ago
Thanks! Sure, we'll add it to the backlog :)
@wayneisthebestable 1 year ago
Great video, but I'm curious: is it really necessary to sort by the frequency of a word in the URL?
I think most well-designed URLs won't have a keyword like cat appear more than once in the URL?
Also, if there's cat and dog in a URL, should I have two records for that URL?
@interviewpen 1 year ago
No, we're searching the content of the pages here, not the url. Thanks for watching!
@satyamkumaryadav1560 1 year ago ⁺³
Which app are you using for writing?
BTW quality content 👌🏿
@interviewpen 1 year ago ⁺¹
We use GoodNotes on an iPad 👍
@yourlogarithm8607 1 year ago
Could you explain a thing I'm confused about here at 13:35? When the router selects an element from the priority queue, it adds it to the politeness queue. By doing that, wouldn't we lose the initial prioritization, given that the politeness queues are sorted just by domain?
@interviewpen 1 year ago
Sure. The router uses a weighted random algorithm to select a priority queue, so the higher priority queues are more likely to be selected. This ensures that higher priority pages are crawled more frequently, regardless of what politeness queue they end up in. Thanks!
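The weighted random selection described in this reply could look roughly like the sketch below. It assumes queue `i` gets weight `i + 1`, which is an arbitrary choice for illustration; the function name `pick_url` is hypothetical.

```python
import random
from collections import deque


def pick_url(priority_queues: list[deque], rng: random.Random):
    """Pop a URL from a non-empty queue chosen by weighted random choice."""
    candidates = [i for i, q in enumerate(priority_queues) if q]
    if not candidates:
        return None  # nothing left to crawl
    # Assumption: queue i has weight i + 1, so higher indexes win more often
    # but low-priority queues are never starved completely.
    weights = [i + 1 for i in candidates]
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return priority_queues[chosen].popleft()


queues = [deque(["low-priority.example"]), deque(), deque(["high-priority.example"])]
rng = random.Random(0)  # seeded only to make the example repeatable
picked = [pick_url(queues, rng), pick_url(queues, rng)]
```

Because selection is probabilistic rather than strict, every queue is eventually drained, while higher-priority pages reach the politeness queues sooner on average.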
@ShueFig 1 year ago ⁺¹
recognised the B2B SWE voice :)
@interviewpen 1 year ago
Yep :D
@EntertainerOnline 5 months ago
I'm not sure if I understood correctly but why are we not using any ES cluster to speed up our search? No DB can be as efficient as ES when it comes to search.
@interviewpen 5 months ago
ElasticSearch is essentially a sharded db with full text search at its core, so a properly architected database will do the same thing. But you’re absolutely right, ES is certainly a viable option if we want a pre-built solution.
@NitinVarmaManthena 9 months ago
What software do you use for the UI for the workflow and for the highlighter pen?
@interviewpen 9 months ago
We use GoodNotes on an iPad. Thanks!
@eazypeazy8559 1 year ago ⁺¹
cool guide, thanks
@interviewpen 1 year ago
sure - thanks for watching, more videos coming
@cankuter 11 months ago
Very nice walkthrough, appreciate the effort. I have a question tho, maybe a stupid one. I didn't quite get whether "heap" means the heap data structure or the heap as a general memory space, as it is called in Java. I mean, if it's the data structure, wouldn't it be very inefficient to search for the correct pointer for the politeness queue you are looking for? From your explanation I am inferring that this heap is more like a memory space and works more like a hash map. Is this correct?
@interviewpen 11 months ago
We did mean the heap data structure--this works very efficiently here since the earliest timestamp will always be at the top of the heap. The heap just tells us which politeness queue to look at next; no searching necessary. Thanks!
@savanpatel4938 1 year ago ⁺¹
awesome
@interviewpen 1 year ago
thanks for watching - more videos coming soon!
@69k_gold 1 year ago ⁺³
Bro developed Google Search in 19 minutes
@interviewpen 1 year ago ⁺¹
hahaha - thanks for watching.
@tirthdoshi1337 1 year ago
Can someone explain how the priority queue really works for choosing the next element? Is it like a min priority queue where the top element has the earliest time, and we compare the current time with that minimum time, process the element, then multiply the rendering time by 10 and put it back into the queue and the priority queue? In that case, if two elements have the same time in the priority queue, how do we choose which one to pick?
@interviewpen 1 year ago
Yep you got it right, we’re looking for the earliest timestamp. If two elements have the same timestamp, it doesn’t matter which one we pick. Thanks!
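The heap-plus-host-queues mechanism discussed in this thread can be sketched with Python's `heapq`. This is a simplified single-process illustration: the class and method names are hypothetical, time is passed in explicitly, and dropping idle hosts is an assumption of this sketch rather than something the video prescribes.

```python
import heapq
from collections import deque


class PolitenessScheduler:
    def __init__(self) -> None:
        self.host_queues: dict[str, deque] = {}  # one logical queue per host
        self.heap: list[tuple[float, str]] = []  # (next allowed time, host)

    def add(self, host: str, url: str, now: float) -> None:
        if host not in self.host_queues:
            # First time we see this host: create its queue and register it.
            self.host_queues[host] = deque()
            heapq.heappush(self.heap, (now, host))
        self.host_queues[host].append(url)

    def next_url(self, now: float):
        """Return (host, url) if the earliest host is due, else None."""
        while self.heap and self.heap[0][0] <= now:
            _, host = heapq.heappop(self.heap)  # earliest timestamp is on top
            queue = self.host_queues[host]
            if queue:
                return host, queue.popleft()
            # Assumption: forget idle hosts; add() re-registers them later.
            del self.host_queues[host]
        return None

    def done(self, host: str, now: float, fetch_seconds: float) -> None:
        # Wait 10x the fetch time before this host becomes eligible again.
        heapq.heappush(self.heap, (now + 10 * fetch_seconds, host))


s = PolitenessScheduler()
s.add("a.example", "a.example/1", now=0.0)
s.add("a.example", "a.example/2", now=0.0)
s.add("b.example", "b.example/1", now=0.0)
first = s.next_url(now=0.0)
s.done("a.example", now=0.0, fetch_seconds=0.5)  # a.example eligible at t=5.0
second = s.next_url(now=0.0)
too_early = s.next_url(now=1.0)
third = s.next_url(now=5.0)
```

Peeking at `heap[0]` is O(1) and each push or pop is O(log n), so no search over the host queues is needed. When two hosts share a timestamp, the tuple comparison falls back to the host name, which is as good a tie-break as any.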
@TungLe-mm7eo 1 year ago
what is the tool you are using for presentation? thank you
@interviewpen 1 year ago
We're using GoodNotes on an iPad. Thanks!
@Rockyzach88 1 year ago ⁺²
*ChatGPT gave me a "rudimentary" outline of some Python code for this explanation based on the YouTube transcript. What do you think:*
*python*
class API:
    def __init__(self):
        self.load_balancer = LoadBalancer()
        self.text_index = TextIndex()
        self.metadata_db = MetadataDatabase()
        self.blob_store = BlobStore()

    def search(self, query):
        urls = self.text_index.search(query)
        results = []
        for url in urls:
            metadata = self.metadata_db.get_metadata(url)
            page_content = self.blob_store.get_page_content(url)
            results.append((metadata, page_content))
        return results


class LoadBalancer:
    def __init__(self):
        self.api_servers = []

    def distribute(self, query):
        api_server = self.select_api_server()
        return api_server.search(query)

    def select_api_server(self):
        # Logic to select API server based on load balancing
        pass


class TextIndex:
    def search(self, query):
        # Implement search logic to return URLs based on query
        pass


class MetadataDatabase:
    def get_metadata(self, url):
        # Retrieve metadata for the given URL
        pass


class BlobStore:
    def get_page_content(self, url):
        # Retrieve page content for the given URL
        pass


class Crawler:
    def __init__(self, url_frontier):
        self.url_frontier = url_frontier
        self.hash_index = HashIndex()
        self.metadata_db = MetadataDatabase()
        self.blob_store = BlobStore()

    def crawl(self):
        while True:
            url = self.url_frontier.get_next_url()
            if url is None:
                # Frontier is empty (the stub returns None): stop, don't spin
                break
            if self.check_robots_txt(url):
                page = self.fetch_page(url)
                if not self.is_duplicate(page):
                    self.save_page(page)
                    new_urls = self.extract_urls(page)
                    self.url_frontier.add_urls(new_urls)

    def check_robots_txt(self, url):
        # Check robots.txt for the given URL
        pass

    def fetch_page(self, url):
        # Download the page content for the given URL
        pass

    def is_duplicate(self, page):
        # Check if the page is a duplicate using HashIndex
        pass

    def save_page(self, page):
        # Save the page content and metadata
        pass

    def extract_urls(self, page):
        # Extract new URLs from the page content
        pass


class HashIndex:
    def check_duplicate(self, page):
        # Check if the page content is a duplicate
        pass


class URLFrontier:
    def __init__(self):
        self.priority_queues = []
        self.host_queues = []
        self.heap = []

    def get_next_url(self):
        # Get the next URL based on priority and politeness
        pass

    def add_urls(self, urls):
        # Add new URLs to the appropriate queues
        pass


# Initialize components
url_frontier = URLFrontier()
api = API()
crawler = Crawler(url_frontier)

# Start crawling and serving API requests
crawler.crawl()
@interviewpen 1 year ago
Might be a little more complex in practice :D Thanks for watching!
@dibll 1 year ago ⁺¹
Informative video! Very nicely explained. Could you pls do one on distributed key/value stores?
@interviewpen 1 year ago
thanks for watching - yes that's in our backlog
@ahmad-ali14 1 year ago ⁺¹
Thanks
@interviewpen 1 year ago
sure
@CanRau 11 months ago
Is there some kind of open dataset to get the database going without having to crawl the whole web from 0?
@interviewpen 11 months ago ⁺¹
There is! Check out www.commoncrawl.org/ (just one example)
@CanRau 11 months ago
@interviewpen ooooh that's incredible thank you so much 🙏🥰
@yuganderkrishansingh3733 1 year ago ⁺²
I don't think the schema design for the query pattern "Search for a word" is included. The video says there is a text index but I don't see "word" or "frequency" at ruclips.net/video/0LTXCcVRQi0/видео.html
I think the schema needs to include these so that the index automatically creates a table on top of them.
Also, regarding the router routing URLs to the correct queue: it's mentioned that if there is no queue corresponding to the domain, the URL will be added to an "empty" queue. But then what about updating the heap and the selector?
Also, the mapping of a domain to a queue has to be stored somewhere, most likely in a Redis cache, as it seems to change a lot when queues become empty.
@interviewpen 1 year ago ⁺²
1. The "site content" field in the schema should hold the full text of the site, so words and their associated frequencies can be calculated when records are added/updated, and this data is what propagates to the text index.
2. Yep, when a new host is added to the second set of queues, the router is responsible for adding that host to the heap so the selector knows about it.
3. The host-to-queue mapping would be stored in the router, that way the router is able to quickly check which queue the next URL should be added to. It's worth noting that the router is low-traffic enough (
@yuganderkrishansingh3733 1 year ago
@interviewpen For point 1, you mentioned that the word and frequency are calculated when a record is added or updated. But then the schema also needs the corresponding attributes so that they can be added to the database when a record is added or updated.
As per timestamp 3:32 the schema doesn't contain word or frequency. Am I missing something? It might be something dumb, apologies.
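To make the relationship between the "site content" field and the text index concrete, here is a minimal in-memory sketch in which word frequencies are derived from the stored content at write time, so the primary schema does not need its own word or frequency columns. The class name `InvertedIndex` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter, defaultdict


class InvertedIndex:
    """word -> {url: frequency}, derived from each page's stored content."""

    def __init__(self) -> None:
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)

    def add_page(self, url: str, site_content: str) -> None:
        # Frequencies are computed here, when the record is written,
        # rather than being stored as fields on the primary record.
        for word, freq in Counter(site_content.lower().split()).items():
            self.postings[word][url] = freq

    def search(self, word: str) -> list[str]:
        """Return URLs containing the word, highest frequency first."""
        hits = self.postings.get(word.lower(), {})
        return sorted(hits, key=hits.get, reverse=True)


index = InvertedIndex()
index.add_page("a.example", "cat cat dog")
index.add_page("b.example", "cat dog dog dog")
cat_results = index.search("cat")
dog_results = index.search("dog")
```

This sketch does not remove stale words when a page is re-crawled with different content; a real index would need to handle updates and deletions.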
@andrecorreia8568 1 year ago
Thanks, great video but I have 1 comment. You are saying that you are going to cache the robots.txt file. How does Google's system then know that the robots.txt was updated? From what you mentioned, you always take it from the cache as long as it is there, but you didn't mention cache invalidation.
@interviewpen 1 year ago
Thanks for watching! Really good point. In this system it’s not critical for the robots.txt to be constantly up to date, but there definitely should be some TTL set in the cache to make sure the data is re-fetched periodically.
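A TTL on the robots.txt cache, as suggested in this reply, might be sketched like this. The `fetch` callable, the injectable `clock`, and the class name `RobotsCache` are hypothetical; a real crawler would fetch over HTTP and likely use a shared cache rather than a per-process dict.

```python
import time
from typing import Callable


class RobotsCache:
    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # downloads robots.txt for a host
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the example is testable
        self._entries: dict[str, tuple[float, str]] = {}  # host -> (expiry, body)

    def get(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: serve the cached copy
        body = self._fetch(host)  # missing or expired: re-fetch
        self._entries[host] = (now + self._ttl, body)
        return body


# Fake fetcher and clock, for illustration only.
calls: list[str] = []
fake_now = [0.0]

def fake_fetch(host: str) -> str:
    calls.append(host)
    return f"rules v{len(calls)}"

cache = RobotsCache(fake_fetch, ttl_seconds=60.0, clock=lambda: fake_now[0])
v1 = cache.get("example.com")
fake_now[0] = 30.0
still_v1 = cache.get("example.com")  # within the TTL, so no new fetch
fake_now[0] = 61.0
v2 = cache.get("example.com")        # TTL expired, so fetched again
```

The TTL bounds how stale the crawler's view of a site's rules can get without any explicit invalidation signal from the site.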
@mus_g117 1 year ago
nice content thank you
@interviewpen 1 year ago
Thanks!
@pieter5466 1 year ago
4:22 anyone know how the database storage size estimates (in bytes) were determined?
@interviewpen 1 year ago
The sizes for each field are simply estimates based on the data being stored.
@pieter5466 1 year ago
@interviewpen Thanks!
@JATINJUYAL 1 year ago
kindly make a systems design video for algorithms
@congminhluu5068 1 year ago
For resolving the politeness issue, why a heap?
@interviewpen 1 year ago
Good question; the heap data structure enables us to efficiently look up the host with the smallest timestamp, i.e. the host that we crawled the longest ago. With a significant number of host queues, this operation could add notable latency without using a heap. Thanks!
@congminhluu5068 1 year ago
@interviewpen oh I see. I haven’t encountered any heaps and was surprised to see it can be used as a HashMap of some kind.
Why are you thanking me lol I should be thanking you for the video
