Processing Large XML Wikipedia Dumps that won't fit in RAM in Python without Spark

  • Published: 31 Jul 2024
  • The Python ElementTree module allows you to read XML files of any size that you have time to process. Unlike a DOM parser, the entire XML document does not need to be loaded into memory. This video shows how all of Wikipedia can be processed in Python without a large amount of RAM.
    My blog post for this video:
    www.heatonresearch.com/2017/0...
    The code for this video can be found here:
    github.com/jeffheaton/present...
  • Science
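The streaming approach the description refers to relies on `xml.etree.ElementTree.iterparse`, which fires an event as each element is completed, so every `<page>` can be handled and then discarded. A minimal sketch of the idea (the `iter_pages` helper and the dump-file layout here are illustrative, not taken from the video's code):

```python
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield the title of each <page> in a MediaWiki dump, one at a time.

    iterparse streams the file, so memory use stays flat no matter how
    large the dump is -- as long as each processed element is cleared.
    """
    for event, elem in ET.iterparse(path, events=("end",)):
        # Dump elements carry an XML namespace; match on the local tag name.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            yield elem.findtext("{*}title")  # {*} wildcard needs Python 3.8+
            elem.clear()  # release the subtree we just processed
```

Calling `elem.clear()` after handling each page is what keeps memory bounded; without it, ElementTree retains every parsed element for the life of the run.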

Comments • 26

  • @opalkabert
    @opalkabert 4 years ago +12

    I am not just liking this but want to thank you for your time to show this. It is awesome Jeff!

  • @biologyigcse
    @biologyigcse 4 years ago +6

    As a person who is just starting out in the research domain and has to work with wiki dumps, this was a godsend. THANKS a ton, you just saved me tons of time and mental stress. Did I say thanks yet? THANKS A TON.
    You sir, get a like, subscribe, notification enabling and I am sharing your channel on my twitter space.

  • @sadiko3000
    @sadiko3000 4 years ago

    I took a look at the content of your channel and it is very impressive. Please keep doing this!

  • @mariagraetsch3700
    @mariagraetsch3700 4 years ago

    Thank you Jeff - your video provides a really structured example.

  • @DanielWeikert
    @DanielWeikert 4 years ago +2

    Thanks a lot for your videos. Love to see more on how to deal with big data in python. Best regards

  • @woetotheconquered3451
    @woetotheconquered3451 2 years ago

    You're amazing. Just what I needed

  • @BiancaAguglia
    @BiancaAguglia 4 years ago +4

    Thank you for another great video, Jeff. Not only is it useful but, as the zombie apocalypse **has** been on my mind lately, it is also very timely. 😁
    As others have already commented, I also think it would be nice to see the same process in spark. Keep up the great work.

  • @mariumbegum7325
    @mariumbegum7325 1 year ago

    Interesting video, keep it up!

  • @tonym5857
    @tonym5857 4 years ago +1

    * stars video 👏👏👏. It would be nice to see the same process using big data tech like HDFS, Spark, etc.

  • @paulowiz
    @paulowiz 3 years ago

    I'm a beginner at this; I will try this code after the file downloads =). Thanks for it

  • @quackcharge
    @quackcharge 3 years ago

    thanks so much!

  • @nonenogood
    @nonenogood 1 year ago

    Hello Mr. Heaton. I wonder, can we get the 'text' data from the dataset into CSV too?

  • @RollingcoleW
    @RollingcoleW 1 year ago

    Helpful !

  • @saleem801
    @saleem801 4 years ago

    Has a Spark implementation been made since?

  • @lisanoorarida4009
    @lisanoorarida4009 4 years ago

    Thank you so much.
    I am working on this right now.
    For the output, I need to generate a new XML file after filtering the wiki. I tried to use the module, but it said "ElementTree is not a streaming writer". What do you recommend?

    • @HeatonResearch
      @HeatonResearch  4 years ago

      I have seen lxml used for that before, but have not done it myself.
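Jeff points to lxml (its `etree.xmlfile` context manager provides a true streaming writer). A stdlib alternative that also streams is `xml.sax.saxutils.XMLGenerator`, which emits elements one at a time without buffering the whole document. A hedged sketch (the `write_filtered_pages` helper and its output layout are made up for illustration):

```python
from xml.sax.saxutils import XMLGenerator

def write_filtered_pages(out_path, titles):
    """Stream filtered <page> records to a new XML file, one at a time."""
    with open(out_path, "w", encoding="utf-8") as out:
        gen = XMLGenerator(out, encoding="utf-8")
        gen.startDocument()
        gen.startElement("pages", {})
        for title in titles:
            gen.startElement("page", {})
            gen.startElement("title", {})
            gen.characters(title)  # text content is escaped automatically
            gen.endElement("title")
            gen.endElement("page")
        gen.endElement("pages")
        gen.endDocument()
```

Because each record is written as soon as it is produced, this pairs naturally with an `iterparse` read loop: filter pages on the way in, write survivors on the way out, with flat memory use on both sides.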

  • @rohitreddy3609
    @rohitreddy3609 3 years ago

    Thank you for this amazing tutorial. It's very informative. Can you please explain how to create a dataset of topics from the Wikipedia dump, say retrieving 100 topics, for example?
    My question is: how can we crawl Wikipedia to get documents and images? Thanks in advance.

  • @tamastarisnyas1191
    @tamastarisnyas1191 3 years ago

    Hi there, thank you for the video, but there's an issue: when I use your code, it won't fill the redirect column for some reason. Could you help me with this problem?

    • @HeatonResearch
      @HeatonResearch  3 years ago +1

      Let me have a look at that!

    • @tamastarisnyas1191
      @tamastarisnyas1191 3 years ago

      @@HeatonResearch Another thing I wanted to do is to grab the text of each article and attach it to the table as a separate column for each title. Could you give me some pointers or tips on how I can do this, please? It would help a lot. I've been trying to do it, but without success.

  • @sarasmith1647
    @sarasmith1647 1 year ago

    I get FileNotFoundError: [Errno 2] No such file or directory, although it created the 2 CSV files in the directory

  • @victoriar8179
    @victoriar8179 4 years ago +2

    thanks for the video! would be awesome to have this to process with spark

    • @HeatonResearch
      @HeatonResearch  4 years ago +2

      Yes, that is coming. Once you start to add any NLP functions on that Wikipedia text the process can take weeks without Spark.

  • @623-x7b
    @623-x7b 4 years ago

    You can also torrent it; it's much faster to download.

  • @Knightmare535
    @Knightmare535 4 years ago +1

    3:53 Funny you say that...