Processing Large XML Wikipedia Dumps that won't fit in RAM in Python without Spark

  • Published: 31 Jul 2024
  • The Python ElementTree module allows you to read XML files of any size that you have time to process. Unlike a DOM parser, the entire XML document does not need to be loaded into memory. This video shows how all of Wikipedia can be processed in Python without a large amount of RAM.
    My blog post for this video:
    www.heatonresearch.com/2017/0...
    The code for this video can be found here:
    github.com/jeffheaton/present...
  • Science
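The streaming approach the description refers to relies on `xml.etree.ElementTree.iterparse`, which fires an event as each element is completed, so every `<page>` can be handled and then discarded. A minimal sketch of the idea (the `iter_pages` helper and the dump-file layout here are illustrative, not taken from the video's code):

```python
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield the title of each <page> in a MediaWiki dump, one at a time.

    iterparse streams the file, so memory use stays flat no matter how
    large the dump is -- as long as each processed element is cleared.
    """
    for event, elem in ET.iterparse(path, events=("end",)):
        # Dump elements carry an XML namespace; match on the local tag name.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            yield elem.findtext("{*}title")  # {*} wildcard needs Python 3.8+
            elem.clear()  # release the subtree we just processed
```

Calling `elem.clear()` after handling each page is what keeps memory bounded; without it, ElementTree retains every parsed element for the life of the run.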

Comments • 26

  • @opalkabert
    @opalkabert 4 years ago +12

    I am not just liking this but want to thank you for your time to show this. It is awesome Jeff!

  • @biologyigcse
    @biologyigcse 4 years ago +6

    As a person who is just starting out in the research domain and has to work with wiki dumps, this was a godsend. THANKS a ton, you just saved me tons of time and mental stress. Did I say thanks yet? THANKS A TON.
    You sir, get a like, subscribe, notification enabling and I am sharing your channel on my twitter space.

  • @sadiko3000
    @sadiko3000 4 years ago

    I took a look at the content of your channel and it is very impressive. Please keep doing this!

  • @mariagraetsch3700
    @mariagraetsch3700 4 years ago

    Thank you Jeff - your video provides a really structured example.

  • @DanielWeikert
    @DanielWeikert 4 years ago +2

    Thanks a lot for your videos. Love to see more on how to deal with big data in python. Best regards

  • @woetotheconquered3451
    @woetotheconquered3451 2 years ago

    You're amazing. Just what I needed

  • @BiancaAguglia
    @BiancaAguglia 4 years ago +4

    Thank you for another great video, Jeff. Not only is it useful but, as the zombie apocalypse **has** been on my mind lately, it is also very timely. 😁
    As others have already commented, I also think it would be nice to see the same process in spark. Keep up the great work.

  • @mariumbegum7325
    @mariumbegum7325 1 year ago

    Interesting video, keep it up!

  • @tonym5857
    @tonym5857 4 years ago +1

    * stars video 👏👏👏. It would be nice to see the same process using big data tech like HDFS, Spark, etc.

  • @paulowiz
    @paulowiz 3 years ago

    I'm a beginner at this; I will try this code after the file downloads =). Thanks for it

  • @quackcharge
    @quackcharge 3 years ago

    thanks so much!

  • @nonenogood
    @nonenogood 1 year ago

    Hello Mr. Heaton. I wonder, can we get the 'text' data from the dataset into CSV too?

  • @RollingcoleW
    @RollingcoleW 1 year ago

    Helpful !

  • @saleem801
    @saleem801 4 years ago

    Has a Spark implementation been made since?

  • @lisanoorarida4009
    @lisanoorarida4009 4 years ago

    Thank you so much.
    I am working on this right now.
    For the output, I need to generate a new XML file after filtering the wiki. I tried to use the module, but it said "ElementTree is not a streaming writer". What do you recommend?

    • @HeatonResearch
      @HeatonResearch  4 years ago

      I have seen lxml used for that before, but have not done it myself.
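Jeff points to lxml (its `etree.xmlfile` context manager provides a true streaming writer). A stdlib alternative that also streams is `xml.sax.saxutils.XMLGenerator`, which emits elements one at a time without buffering the whole document. A hedged sketch (the `write_filtered_pages` helper and its output layout are made up for illustration):

```python
from xml.sax.saxutils import XMLGenerator

def write_filtered_pages(out_path, titles):
    """Stream filtered <page> records to a new XML file, one at a time."""
    with open(out_path, "w", encoding="utf-8") as out:
        gen = XMLGenerator(out, encoding="utf-8")
        gen.startDocument()
        gen.startElement("pages", {})
        for title in titles:
            gen.startElement("page", {})
            gen.startElement("title", {})
            gen.characters(title)  # text content is escaped automatically
            gen.endElement("title")
            gen.endElement("page")
        gen.endElement("pages")
        gen.endDocument()
```

Because each record is written as soon as it is produced, this pairs naturally with an `iterparse` read loop: filter pages on the way in, write survivors on the way out, with flat memory use on both sides.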

  • @rohitreddy3609
    @rohitreddy3609 3 years ago

    Thank you for this amazing tutorial. It's very informative. Can you please explain how to create a dataset of topics from the Wikipedia dump, say retrieving 100 topics, for example?
    My question is: how can we crawl Wikipedia to get documents and images? Thanks in advance.

  • @tamastarisnyas1191
    @tamastarisnyas1191 3 years ago

    Hi there, thank you for the video, but there's an issue: when I use your code, it won't fill the redirect column for some reason. Could you help me with this problem?

    • @HeatonResearch
      @HeatonResearch  3 years ago +1

      Let me have a look at that!

    • @tamastarisnyas1191
      @tamastarisnyas1191 3 years ago

      @@HeatonResearch Another thing I wanted to do is to grab the text of each article and attach it to the table as a separate column for each title. Could you give me some pointers or tips on how I can do this, please? It would help a lot. I've been trying to do it, but without success.

  • @sarasmith1647
    @sarasmith1647 1 year ago

    I get FileNotFoundError: [Errno 2] No such file or directory, although it created the 2 CSV files in the directory

  • @victoriar8179
    @victoriar8179 4 years ago +2

    thanks for the video! would be awesome to have this to process with spark

    • @HeatonResearch
      @HeatonResearch  4 years ago +2

      Yes, that is coming. Once you start to add any NLP functions on that Wikipedia text the process can take weeks without Spark.

  • @623-x7b
    @623-x7b 4 years ago

    You can also torrent it; it's much faster to download.

  • @Knightmare535
    @Knightmare535 4 years ago +1

    3:53 Funny you say that...