Videos: 19
Views: 6,004

Re-ranking search results on the client side in Rust

Mwmbl is an open source and non-profit search engine

Videos

Mysterious CPU usage in Python search engine

4:21

108 views, 4 months ago

Mwmbl is a non-profit open source search engine powered by our community. Check it out at mwmbl.org - thanks!

How do we know how good our search results are?

5:59

96 views, 4 months ago

Mwmbl is a non-profit open source search engine powered by our community. In this video we discuss how we evaluate our search ranking algorithm.

The search engine where you choose the sites to crawl

2:11

144 views, 4 months ago

Mwmbl is a non-profit open source search engine powered by our community. In this video we showcase a new feature that allows the community to choose which sites we should crawl.

Why we stopped crawling the web (and how we got it started again)

4:30

2.4K views, 4 months ago

Our open source search engine Mwmbl relies on crawling done by our community. The crawling stopped working for a while, and in this video I explain how we got it going again.

10:52

What sites are we ACTUALLY crawling?

91 views, 1 year ago

Try out the search engine at mwmbl.org/
Join the community: matrix.to/#/#mwmbl:matrix.org
Firefox extension: addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/
More info: book.mwmbl.org/

11:41

I accidentally deleted the index

102 views, 1 year ago

1:07

Why am I building a search engine?

166 views, 1 year ago

1:11

COMMUNITY powered search - part 1

93 views, 1 year ago

10:51

crawler = BROKEN

114 views, 1 year ago

@mintoo2cool 1 month ago
One way to do this would be to let users/community members run the crawler on their own machines, build the index there, and share it with the main site, saving the main site the time it takes to crawl a site politely.
@dunno23731 4 months ago
Is there a textbook you are taking the screenshots from?
@mwmbl 4 months ago
Actually, they're from Wikipedia! ;)
@idowhatiwantdowhatisaygoog2361 4 months ago
Things to keep in mind:

Visiting links can cause actions to be performed on websites, especially if you visit POST links. Some sites are programmed badly: I've heard of websites being wiped out by Google's crawlers because their "delete content" functions were accessible via GET request links, without authentication.

Wikipedia is completely "open": they provide resources/links to download their entire website via torrent (or directly). IIRC it's about 70 GB, and you'll be able to construct the URL for each page using the data they provide in XML format. As such, they ask developers not to crawl the website, as it adds to their costs. And since they're one of the only truly free websites on the internet, you should look at how you can respect their wishes, or at least pause crawling their site until you've finished your project and you're ready to deploy. The Wikipedia XML data includes everything, even the discussion pages for each wiki page. The same may also be true for websites like Stack Overflow. I understand that developing a way to parse data from downloads to include in your results will add a lot of work, but it's the right thing to do. It will reduce your costs and their costs, it will process faster, and you're less likely to get blocked, so there are benefits.

It's possible to visit websites you're not supposed to. I was looking for data on NASA's website, and they publicly provided a link that took me to a private website that was only supposed to be accessed by their employees. A large warning from the US government came up saying that everything I did would be monitored, and after reading it I left. A crawler won't be able to read and understand bespoke warnings like that. Plus there are unscrupulous websites out there that you won't want to visit either, particularly ones that may be legal in your country but not in others.

Some websites have circular links (with randomized URLs) in order to trap crawlers.

You won't catch a virus if you shake hands with one person, but if you shake hands with a million people, you're sure to catch something. If you're running this from your home network, make sure you have a secure router (and a subnet). If you're running this from a VPS, make sure you have hardened it.
@mwmbl 4 months ago
Thanks for your suggestions! Contributions are very welcome - feel free to get in touch if you would like to implement your ideas for us :)
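Some of the precautions raised in the comment above (not following links a site has marked off-limits, and not getting stuck in circular or randomized links) can be sketched as a small "should I fetch this URL?" check. This is only an illustrative sketch, not Mwmbl's actual crawler: the class name `CrawlPolicy`, the user agent string, and the limits are hypothetical, and using robots.txt is an assumption that the comment itself does not mention. The per-host page cap and the path-depth limit are crude guards against crawler traps, not a complete defence.

```python
from collections import Counter
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class CrawlPolicy:
    """Decides whether a crawler should fetch a URL (hypothetical sketch)."""

    def __init__(self, robots_txt_by_host, user_agent="example-crawler",
                 max_pages_per_host=1000, max_path_depth=8):
        self.user_agent = user_agent
        self.max_pages_per_host = max_pages_per_host
        self.max_path_depth = max_path_depth
        self.seen = set()                # URLs already approved once
        self.pages_per_host = Counter()  # approved page count per host
        self.robots = {}
        # robots.txt text is passed in, so no network access is needed here.
        for host, text in robots_txt_by_host.items():
            parser = RobotFileParser()
            parser.parse(text.splitlines())
            self.robots[host] = parser

    def allow(self, url):
        parts = urlparse(url)
        # Only plain web pages; the crawler itself should only issue GETs.
        if parts.scheme not in ("http", "https"):
            return False
        # Never fetch the same URL twice.
        if url in self.seen:
            return False
        host = parts.netloc
        robots = self.robots.get(host)
        if robots is not None and not robots.can_fetch(self.user_agent, url):
            return False
        # Very deep paths are a common symptom of a crawler trap.
        depth = len([segment for segment in parts.path.split("/") if segment])
        if depth > self.max_path_depth:
            return False
        # Randomized URLs defeat the "seen" set, so also cap pages per host.
        if self.pages_per_host[host] >= self.max_pages_per_host:
            return False
        self.seen.add(url)
        self.pages_per_host[host] += 1
        return True
```

A crawler would call `policy.allow(url)` before each request and skip the URL when it returns `False`. Sites that publish full data dumps, as the comment describes for Wikipedia, are better handled by excluding the host from crawling entirely and importing the dump instead.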
@rojorum2433 4 months ago
Very interesting, I have always wondered how the big search engines handled all the link data. I will be watching your career with great interest.
@mwmbl 4 months ago
Thanks! Glad you found it interesting
@IgorAlexeyM 4 months ago
Good job! I just learned about the project. I was looking for an alternative search engine a few months ago, and this looks good!
@mwmbl 4 months ago
Cool, thanks!
@pryl 5 months ago
cool
@emojigang4 1 year ago
What is crawling?
@mwmbl 1 year ago
Ah - thanks for the question. It means reading the pages on the web to store in the search engine index. I will make a separate video explaining it!
@mwmbl 1 year ago
Watch this video next! ruclips.net/user/shortsikYVNKh6OuI
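To make the answer above concrete, here is a minimal sketch of one crawl step: read a page, store its words in an index, and collect the links to visit next. It is an illustration under stated assumptions, not how Mwmbl implements crawling; `PageParser` and `index_page` are hypothetical names, the HTML is passed in as a string rather than downloaded, and the index is a plain in-memory mapping from word to the set of URLs containing it.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the words in text nodes and the outgoing links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Resolve relative links against the page URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Naive tokenization: lowercase and split on whitespace.
        self.words.extend(data.lower().split())


def index_page(index, url, html):
    """Add one page's words to the index and return the links found on it."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    for word in parser.words:
        index[word].add(url)
    return parser.links


# The index maps each word to the set of URLs whose text contains it.
index = defaultdict(set)
```

A real crawler would repeat this step for each returned link, which is where politeness rules and duplicate detection come in.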
@themofo 1 year ago
Very cool
@mwmbl 1 year ago
Thank you so much Moses!

Mwmbl

Comments