System Design of GitHub Code Search - SDC Episode 1 with
- Published: Jun 12, 2024
- GitHub code search allows developers to view and edit code online. This is particularly useful when debugging code in remote locations (like on vacation)!
GitHub manages permissions, storage, and retrieval through a set of services. In this video, we look at systems that power its search APIs.
If you have any doubts or suggestions, please share them in the comments below.
This is the first episode of the System Design Charcha series. Subscribe for notifications and updates!
00:00 Problem Statement
01:55 Capacity Estimations
02:52 Brute Force Approach
04:00 High Level Architecture
06:30 API calls
11:04 Form of Response Object?
16:10 API flow
17:40 Search Engine
31:10 Summary
33:10 Peek under the hood
36:14 Final thoughts
37:00 Thank you!
References:
Numbers every programmer should know: gist.github.com/jboner/2841832
GitHub statistics: github.blog/2023-01-25-100-mi...
Useful Resources:
InterviewReady: interviewready.io/
Designing Data-Intensive Applications Book: amzn.to/3SyNAOy
Social Links:
GitHub: github.com/InterviewReady/sys...
LinkedIn: / interview-ready
Twitter: / gkcs_
#SystemDesign #InterviewReady #Coding
Excited about the series!
Ma'am, will we have a separate trie for each starting letter (a-z) per org? How do we decide the starting letter when there are so many words?
The discussion at 16:10 was basically the core issue with all backend systems. Extremely useful discussion.
I would have liked to see a bit more detail about:
1. How do we deal with concurrent reads and writes on the trie?
2. How do we partition the trie?
3. If it is an in-memory trie, what are the memory requirements, and how do we rebuild the trie after a pod failure?
Shouldn't the architecture be more practical and detailed rather than theoretical? You discussed using tries, but how do you handle distributed reads and writes to them? Wouldn't Elasticsearch be a better fit? The primary task here is document search. Why rely on theoretical data structures instead of real projects that use those data structures and are already common in the industry? Text search is almost always done with Elasticsearch at most companies.
Could we have used Elasticsearch instead of managing the trie ourselves? Would Elasticsearch be able to update the index efficiently when the underlying document (a code file in this case) changes?
We could add another preprocessing step that builds a trie per file. If a file is deleted, we delete just that file's trie and generate a new one, instead of going through the entire trie of the repo.
That's an interesting idea!
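The per-file idea above can be sketched roughly as follows. This is only an illustration, not something shown in the video: the names `Trie`, `RepoIndex`, `index_file`, and `delete_file` are hypothetical, and words are split on whitespace for simplicity.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a whole word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


class RepoIndex:
    """Hypothetical repo-level index holding one small trie per file."""

    def __init__(self):
        self.file_tries = {}  # file path -> Trie

    def index_file(self, path, text):
        # Build a fresh trie and swap it in; any old trie for this path is dropped.
        trie = Trie()
        for word in text.split():
            trie.insert(word)
        self.file_tries[path] = trie

    def delete_file(self, path):
        # Deleting a file only discards that file's trie.
        self.file_tries.pop(path, None)

    def search(self, word):
        # Trade-off: every file's trie has to be consulted.
        return sorted(p for p, t in self.file_tries.items() if t.contains(word))


idx = RepoIndex()
idx.index_file("a.py", "import os")
idx.index_file("b.py", "import sys")
print(idx.search("import"))
idx.delete_file("a.py")
print(idx.search("import"))
```

The trade-off is visible in `search`: deletes and re-indexing become cheap, but a query now touches one trie per file instead of a single repo-wide trie.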
In the search engine, can we use something like a Lucene index, which uses an inverted index?
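For readers unfamiliar with the term, here is a minimal sketch of an inverted index: a map from each term to the set of files containing it. It is a toy illustration only, not how Lucene or GitHub implements it. The class name `InvertedIndex`, the tokenizer regex, and the AND-style query semantics are all assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits and underscores.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.doc_terms = {}               # doc id -> terms, kept to make updates cheap

    def remove(self, doc_id):
        # Only the postings of terms that were in this document are touched.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def add(self, doc_id, text):
        # Re-adding an existing document replaces its old entries.
        self.remove(doc_id)
        terms = tokenize(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query):
        # A document matches if it contains every term in the query.
        terms = tokenize(query)
        if not terms:
            return []
        sets = [self.postings.get(t, set()) for t in terms]
        return sorted(set.intersection(*sets))


index = InvertedIndex()
index.add("main.c", "#include <stdio.h>\nint main() {}")
index.add("util.c", "int helper() {}")
print(index.search("int main"))
```

Keeping `doc_terms` alongside the postings is one simple way to handle a changed file: the old terms are removed and the new ones inserted without scanning the whole index.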
Is it like they (GitHub) probably implemented something equivalent to a Razor View for returning the HTML response?
Great video!! Does this also match a substring inside a larger word? For example, a file contains the string "include" but I am only searching for "clude". Would we get results? It doesn't seem like it.
It would be possible with tries that store the reversed words.
There are also suffix tries. At that point, it's better to use a known solution like Elasticsearch, which internally uses these algorithms.
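To make the suffix trie suggestion concrete, here is a naive sketch: every suffix of every word is inserted, so a substring query becomes a prefix walk. This is a teaching example under simplifying assumptions, not what GitHub or Elasticsearch actually runs, and the names `SuffixTrie`, `add_word`, and `find` are hypothetical.

```python
class Node:
    __slots__ = ("children", "words")

    def __init__(self):
        self.children = {}  # char -> Node
        self.words = set()  # words that contain the path leading to this node


class SuffixTrie:
    def __init__(self):
        self.root = Node()

    def add_word(self, word):
        # Insert every suffix of the word, tagging each visited node with the word.
        for start in range(len(word)):
            node = self.root
            for ch in word[start:]:
                node = node.children.setdefault(ch, Node())
                node.words.add(word)

    def find(self, fragment):
        # Any substring of a word is a prefix of one of its suffixes.
        node = self.root
        for ch in fragment:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.words)


trie = SuffixTrie()
for w in ["include", "exclude", "import"]:
    trie.add_word(w)
print(sorted(trie.find("clude")))
```

The cost is memory: a word of length n adds on the order of n^2 characters to the trie, which is one reason compressed suffix structures or n-gram indexes are usually preferred in practice.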
Can we get a Gaurav Sen + Arpit Bhayani collab someday?
Don't you think you could have used a ready-made solution for text indexing, like Elasticsearch, instead of using a trie?
What is the tool he is using to draw flowcharts?
Miro
With a trie, you can only search for full words. GitHub can find matches even when the query starts in the middle of a word. It can also search for two words combined. A trie cannot support all of these features.
Have a look at suffix tries. Explaining the algorithm in a system design interview wouldn't be feasible, but it's an interesting real-life implementation.
@gkcs Understood. Thanks for the video. Very informative.