Apache Spark / PySpark Tutorial: Basics In 15 Mins

  • Published: 24 Mar 2021
  • Thank you for watching the video! Here is the notebook: github.com/gahogg/RUclips-I-m...
    I offer 1 on 1 tutoring for Data Structures & Algos, and Analytics / ML! Book a free consultation here: calendly.com/greghogg/30min
    Learn Python, SQL, & Data Science for free at mlnow.ai/ :)
    Subscribe if you enjoyed the video!
    Best Courses for Analytics:
    ---------------------------------------------------------------------------------------------------------
    + IBM Data Science (Python): bit.ly/3Rn00ZA
    + Google Analytics (R): bit.ly/3cPikLQ
    + SQL Basics: bit.ly/3Bd9nFu
    Best Courses for Programming:
    ---------------------------------------------------------------------------------------------------------
    + Data Science in R: bit.ly/3RhvfFp
    + Python for Everybody: bit.ly/3ARQ1Ei
    + Data Structures & Algorithms: bit.ly/3CYR6wR
    Best Courses for Machine Learning:
    ---------------------------------------------------------------------------------------------------------
    + Math Prerequisites: bit.ly/3ASUtTi
    + Machine Learning: bit.ly/3d1QATT
    + Deep Learning: bit.ly/3KPfint
    + ML Ops: bit.ly/3AWRrxE
    Best Courses for Statistics:
    ---------------------------------------------------------------------------------------------------------
    + Introduction to Statistics: bit.ly/3QkEgvM
    + Statistics with Python: bit.ly/3BfwejF
    + Statistics with R: bit.ly/3QkicBJ
    Best Courses for Big Data:
    ---------------------------------------------------------------------------------------------------------
    + Google Cloud Data Engineering: bit.ly/3RjHJw6
    + AWS Data Science: bit.ly/3TKnoBS
    + Big Data Specialization: bit.ly/3ANqSut
    More Courses:
    ---------------------------------------------------------------------------------------------------------
    + Tableau: bit.ly/3q966AN
    + Excel: bit.ly/3RBxind
    + Computer Vision: bit.ly/3esxVS5
    + Natural Language Processing: bit.ly/3edXAgW
    + IBM Dev Ops: bit.ly/3RlVKt2
    + IBM Full Stack Cloud: bit.ly/3x0pOm6
    + Object Oriented Programming (Java): bit.ly/3Bfjn0K
    + TensorFlow Advanced Techniques: bit.ly/3BePQV2
    + TensorFlow Data and Deployment: bit.ly/3BbC5Xb
    + Generative Adversarial Networks / GANs (PyTorch): bit.ly/3RHQiRj
  • Science

Comments • 146

  • @mongooon6931
    @mongooon6931 2 years ago +30

    We use spark for our data pipeline at work -- we have tables with 10+ billion records, and our applications end up moving trillions upon trillions of records of data per month. Unfathomable numbers that spark is capable of. Great video!

    • @GregHogg
      @GregHogg 2 years ago +1

      Yeah, it's insane! Thanks so much.

    • @EclipsyChannel
      @EclipsyChannel 1 year ago +2

      that's the power of distributed systems and parallel computing... computer science is beautiful

  • @ashleyb5849
    @ashleyb5849 3 years ago +6

    Awesome video. I love using spark at work

  • @AnVinhNguyen
    @AnVinhNguyen 2 years ago +11

    Your explanation is clear and the examples are practical and useful for beginners. Thanks a lot and keep it up!

    • @GregHogg
      @GregHogg 2 years ago +3

      I really appreciate this. You're very welcome 😃

  • @Nedwin
    @Nedwin 2 years ago +37

    I'm a freelance data scientist and I'm really thankful to have found this video, Greg. Couldn't ask for more! Thank you so much. Good luck with everything. 🙏

    • @GregHogg
      @GregHogg 2 years ago +5

      That's awesome, best of luck with that! And you're very welcome, it's my pleasure 😊

    • @evanshlom1
      @evanshlom1 2 years ago

      Hey that's super interesting to hear a freelance data scientist who actually needs pyspark!

  • @parvathirajan.n
    @parvathirajan.n 2 years ago +1

    No words man! Simply loved it. Appreciate your efforts.

    • @GregHogg
      @GregHogg 2 years ago +1

      Really glad to hear that! Thank you 😊

  • @mtamjidhossain
    @mtamjidhossain 2 years ago +1

    You are awesome, just delivering the right videos. I subscribed a few days back already, but I just hit notifications on for you, because I want to watch all your videos.

    • @GregHogg
      @GregHogg 2 years ago +1

      Well that's really great to hear! Thanks so much Tamzid!

  • @GuilhermeMendesG
    @GuilhermeMendesG 2 years ago

    What amazing content you're putting out here, man... thanks for everything!

    • @GregHogg
      @GregHogg 2 years ago +1

      Thanks so much for the kind words. You're very welcome 🤠

  • @BHANUCHAUDHARY-eb4ul
    @BHANUCHAUDHARY-eb4ul 1 year ago

    Thanks Greg for the wonderful explanation !!

  • @EclipsyChannel
    @EclipsyChannel 1 year ago

    you are a great teacher... keep doing what you do my man

  • @ericcarmichael3322
    @ericcarmichael3322 1 year ago +1

    Thanks for sharing, appreciate the quick run down on this stuff

  • @dominicaleung7329
    @dominicaleung7329 1 year ago +1

    Greg, thank you so much. I am new to PySpark, and your video explains things very well. You used simple examples, and I was able to follow along and write them out in my own Python notebook to try them. Will watch your DataFrame basics video next.

    • @GregHogg
      @GregHogg 1 year ago +1

      Amazing! Sorry for the late reply

  • @mehmetkaya4330
    @mehmetkaya4330 1 year ago

    Concise and very well explained! Thank you so much!!

    • @GregHogg
      @GregHogg 1 year ago

      Thank you and you're very welcome!

  • @MUSKAN0896
    @MUSKAN0896 2 years ago

    this was an amazing and clear video! thanks so much!

    • @GregHogg
      @GregHogg 2 years ago

      Very glad to hear that!!

  • @samusaran1692
    @samusaran1692 3 years ago +2

    Very good examples. Thanks man :)

  • @victorroy525
    @victorroy525 1 year ago

    Just the type of examples we need to begin with. Meaningful content. Thanks.

    • @GregHogg
      @GregHogg 1 year ago

      Glad you enjoyed it!

  • @aeigreen
    @aeigreen 2 years ago

    Explained so well. 5 stars. Love to see more videos..

    • @GregHogg
      @GregHogg 2 years ago

      Really glad to hear it thanks so much!

  • @manpritsingh3972
    @manpritsingh3972 2 years ago

    This video is really helpful. Thanks a lot Gregg.

    • @GregHogg
      @GregHogg 2 years ago

      You're super welcome!

  • @andersborum9267
    @andersborum9267 9 months ago

    I'm just getting into DataBricks and PySpark and this introductory tutorial was a great starter.

    • @GregHogg
      @GregHogg 9 months ago +1

      Awesome! Hope that goes well :)

  • @krishj8011
    @krishj8011 1 year ago

    Very fine details covered. Really useful, and it makes the Spark concepts easy to understand.

    • @GregHogg
      @GregHogg 1 year ago

      Really glad to hear that.

  • @hsoley
    @hsoley 2 years ago +1

    You are awesome, thanks for sharing your knowledge with the world

    • @GregHogg
      @GregHogg 2 years ago +1

      I really appreciate that Hamid!!!

  • @joshuabradshaw1647
    @joshuabradshaw1647 1 year ago

    Thank you for sharing this with the world. I'm currently a supply chain analyst and aspiring supply chain data scientist 🙏

    • @GregHogg
      @GregHogg 1 year ago

      That's excellent to hear and very exciting Joshua! I wish you the best of luck 🥰

  • @JaylaScousa
    @JaylaScousa 3 years ago +1

    Concise and well presented 👍

    • @GregHogg
      @GregHogg 3 years ago +1

      Very glad you found it useful, James!!

  • @aparfeno
    @aparfeno 2 years ago

    Thank you for the great video and for the useful education links!

    • @GregHogg
      @GregHogg 2 years ago

      You're super welcome 😃

  • @demohub
    @demohub 1 year ago

    Great overview. Thanks

  • @clintp3504
    @clintp3504 3 years ago +1

    Great stuff! Thanks

    • @GregHogg
      @GregHogg 3 years ago +1

      You're very welcome ☺️

  • @antarcticadventure
    @antarcticadventure 3 years ago

    Never used Spark before. Thank you.

    • @GregHogg
      @GregHogg 3 years ago +3

      Me too for the longest time; PySpark is a life changer though!

  • @boudhayanism
    @boudhayanism 1 year ago

    Cool video, thanks for making it

  • @lakshaydulani
    @lakshaydulani 1 year ago

    Now that's what I was looking for.

  • @caiocalo1
    @caiocalo1 1 year ago

    such a good tutorial

  • @PranitKothari
    @PranitKothari 5 months ago

    Nicely explained.

  • @hyeonjukwon3638
    @hyeonjukwon3638 2 months ago

    Very useful and interesting! Subscribed :)

    • @GregHogg
      @GregHogg 2 months ago

      Glad to hear it, thanks a ton!

  • @nataliaresende1121
    @nataliaresende1121 2 years ago

    very good, thanks!

    • @GregHogg
      @GregHogg 2 years ago

      You're very welcome Natalia!

  • @cetilly
    @cetilly 2 years ago

    Sensational!

    • @GregHogg
      @GregHogg 2 years ago

      Thank you 😊😊😊

  • @pardonmasuka2
    @pardonmasuka2 4 months ago

    Awesome starter!

  • @RossittoS
    @RossittoS 3 years ago +1

    Great!

  • @noorhake9087
    @noorhake9087 2 years ago

    Hi, I'd like to ask you a question.
    I'm working on a project about linear regression feature selection with Apache Spark. When I try to execute the PySpark code, it gives an error that pyspark is not defined, and I've tried to figure it out in many ways but couldn't solve the problem. 💔

  • @260056
    @260056 2 years ago +1

    @greg, please share the link to the 1-hour video. I am unable to find it.

  • @charlescoult
    @charlescoult 1 year ago

    Took a minute to get going but well done

  • @javidhesenov7611
    @javidhesenov7611 2 years ago

    nice explanation

    • @GregHogg
      @GregHogg 2 years ago

      Thanks a bunch Javid! :)

  • @tarunodaysarma9741
    @tarunodaysarma9741 3 years ago +2

    Greg, I had a question on PySpark... how do I find the latest Parquet files stored in an HDFS path using PySpark code?

    • @GregHogg
      @GregHogg 3 years ago +2

      Sorry I don't know! 🤔

  • @user-ri5gu3qe4b
    @user-ri5gu3qe4b 1 year ago

    Great, great content! BTW, please give us the link to the hour-long Spark tutorial mentioned at the end. Thanks a lot.

    • @GregHogg
      @GregHogg 1 year ago

      Thanks! Here you go: ruclips.net/video/8ypIRp6DPew/видео.html

  • @r3d_robot594
    @r3d_robot594 2 years ago

    Good PySpark primer! Others are either too lengthy, or too short and vague.

    • @GregHogg
      @GregHogg 2 years ago

      Thanks so much I'm really glad to hear that! :)

  • @smash4929
    @smash4929 4 months ago

    Hey Greg,
    The knowledge in the video is great but the background music is distracting.

  • @paraklesis2253
    @paraklesis2253 3 years ago

    Thank you

    • @GregHogg
      @GregHogg 3 years ago

      You're very welcome!

  • @emirhanbilgic2475
    @emirhanbilgic2475 1 year ago

    thanks mate

  • @MrChilo89
    @MrChilo89 2 years ago

    Hello, and thanks for this video. I've been trying to follow along with your averaging approach, but I receive an error:
    avg = nyt.map(lambda x: (x.title, int(x.rank[0])))
    grouped = avg.groupByKey()
    grouped = grouped.map(lambda x: (x[0], list(x[1])))
    averaged = grouped.map(lambda x: (x[0], sum(x[1]) / len(x[1])))
    averaged.collect()
    TypeError: Invalid argument, not a string or column: [1, 3, 7, 8, 12, 14, 20] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
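
    A likely cause (an assumption, not something confirmed in this thread) is that a wildcard import such as from pyspark.sql.functions import * shadows Python's built-in sum(), which then rejects the plain list of ranks. A minimal, self-contained sketch of the same per-key average, using stand-in data in place of the commenter's nyt RDD:

    from pyspark.sql import Row, SparkSession
    import builtins  # use the built-in sum() even if pyspark.sql.functions was star-imported

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext

    # Tiny stand-in for the commenter's `nyt` RDD of Rows.
    nyt = sc.parallelize([Row(title="A", rank="1"), Row(title="A", rank="3"), Row(title="B", rank="7")])

    pairs = nyt.map(lambda x: (x.title, int(x.rank[0])))  # same (title, rank) extraction as above
    averaged = (pairs.groupByKey()                        # group ranks by title
                     .mapValues(list)                     # materialize each group as a list
                     .mapValues(lambda r: builtins.sum(r) / len(r)))  # per-title average
    print(averaged.collect())  # e.g. [('A', 2.0), ('B', 7.0)]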

  • @hypebeastuchiha9229
    @hypebeastuchiha9229 2 years ago +3

    What was your degree in, Computer Science or a Data Science course?
    I'm in my third year of a Computer Science BSc and I feel like I'm at a disadvantage for Data Science. We didn't learn statistics or have a lot of math modules.
    Most Data Science jobs require a Masters or PhD, but I don't want to get a Masters straight after uni, so I'm looking at Data Engineering since they accept BScs. Is that a realistic path into Data Science, or am I wasting my time?

    • @GregHogg
      @GregHogg 2 years ago +3

      I'm a statistics major. I don't think you're at a disadvantage; people very widely respect computer science majors. If anything, I'd feel I'm at a disadvantage lol. But agreed, you get fewer stats courses. I would think some certificates and projects would be enough without needing a masters, unless you're aiming for FAANG or the other top jobs.

    • @GregHogg
      @GregHogg 2 years ago +2

      This video may help; ruclips.net/video/08G-u9HN8Kc/видео.html

  • @yashmodi6762
    @yashmodi6762 3 years ago +3

    Which big data tools must a beginner learn, and where can they learn them? (Please provide some resources.)

    • @GregHogg
      @GregHogg 3 years ago +3

      Of course I'd recommend my channel - SQL and Spark are the most important ones in my opinion :)

  • @ganeshkaushik2290
    @ganeshkaushik2290 3 years ago +1

    Hi bro, could you please make a video on the learning process for big data, and on which job roles need which big data skills? I'm really confused about where to start and what to learn!
    I know Python and SQL.
    I learned some basics of HDFS, Hive, and Sqoop.
    Now I'm trying to learn PySpark.

    • @GregHogg
      @GregHogg 3 years ago +1

      Thanks for the feedback, I'll keep this in mind!

  • @rohanjoseph1531
    @rohanjoseph1531 3 years ago

    Hi @Greg Hogg,
    I can't seem to access the "sc" object on Google Colab. Which library lets you use that object?

    • @GregHogg
      @GregHogg 3 years ago

      github.com/gahogg/RUclips/blob/master/PySpark%20In%2015%20Minutes.ipynb

    • @rohanjoseph1531
      @rohanjoseph1531 3 years ago

      @@GregHogg cheers!

  • @Vlapstone
    @Vlapstone 1 year ago

    The sc command is not working in my Colab the way it's working in this video... can anyone help?

  • @e.s298
    @e.s298 1 year ago

    Good for learning RDDs.

  • @nataliaresende1121
    @nataliaresende1121 2 years ago +2

    Hi Greg, how can I convert .csv files into .txt files (with a comma as the delimiter) using PySpark? Do you have a code snippet?

    • @GregHogg
      @GregHogg 2 years ago +4

      I think you can just change the extension from CSV to txt
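
      That usually works because a CSV is already comma-delimited text. If an explicit rewrite is wanted, a rough sketch (not from the video; "data.csv" and "out_txt" are placeholder paths) could look like this in PySpark:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.master("local[*]").getOrCreate()

      df = spark.read.csv("data.csv", header=True)

      # Join each row's values with commas and write plain text part files to out_txt/.
      lines = df.rdd.map(lambda row: ",".join("" if v is None else str(v) for v in row))
      lines.saveAsTextFile("out_txt")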

  • @geethsn1866
    @geethsn1866 2 years ago +1

    Thanks for the tutorial. It was simple and easy to follow. However, when I tried the code in Colab, just typing "sc" did not invoke Spark. Are there any prerequisites to be installed in Colab before using "sc"?

    • @GregHogg
      @GregHogg 2 years ago +2

      Please check out my notebook. You'll need to pip install PySpark, and write a line or two of code to set it up
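
      The notebook has the exact setup; as a rough sketch of what that "line or two of code" might look like in Colab (assuming PySpark has been pip-installed first):

      # Run once in a Colab cell:  !pip install pyspark
      from pyspark import SparkConf, SparkContext

      conf = SparkConf().setAppName("pyspark-tutorial").setMaster("local[*]")
      sc = SparkContext.getOrCreate(conf)

      # Quick check that `sc` works, as in the video.
      rdd = sc.parallelize([1, 2, 3, 4, 5])
      print(rdd.sum())  # 15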

    • @geethsn1866
      @geethsn1866 2 years ago

      @@GregHogg Thank you Greg.

  • @capper3360
    @capper3360 1 year ago

    Can you share the link to the hour-long tutorial you mentioned at the end? I couldn't find it in your Spark playlist.

    • @GregHogg
      @GregHogg 1 year ago +1

      Here you go: ruclips.net/video/8ypIRp6DPew/видео.html

  • @sohamsonone5440
    @sohamsonone5440 2 years ago

    Hi Greg, which is best among data science, data analytics, machine learning, and AI? Could you please give a suggestion?

    • @GregHogg
      @GregHogg 2 years ago

      Data science / ML

  • @ajanieniola9172
    @ajanieniola9172 1 year ago

    Can you also use apply instead of map?

  • @maxitube30
    @maxitube30 2 years ago

    And what are the machines we parallelize the work on?
    Do they have to be configured?
    I mean, if PySpark or Spark parallelizes on a cluster, do we have to configure the cluster too?

    • @GregHogg
      @GregHogg 2 years ago +1

      Someone has to configure it. Probably won't be your job though. You'll just select it, kinda like a Python virtual environment, and act as if it's the same as in this video because nothing changes from the programming point of view :)

    • @maxitube30
      @maxitube30 2 years ago

      @@GregHogg understood. Thx :)

  • @keerthanamurugesan-xe6mr
    @keerthanamurugesan-xe6mr 1 month ago +1

    It looks like using NumPy or pandas. What is the difference between those and PySpark?

    • @GregHogg
      @GregHogg 1 month ago +1

      It looks very similar to us coders, which is great. But pandas and numpy are mainly for dealing with data on the computer you're using. Spark allows us to distribute our workloads across a cluster of machines
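
      For illustration (not from the video), here is the same aggregation side by side; `spark` here is a locally created SparkSession, but the PySpark version could just as well run on a cluster:

      import pandas as pd
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.master("local[*]").getOrCreate()

      pdf = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 30]})
      print(pdf.groupby("city")["sales"].mean())  # single-machine pandas

      sdf = spark.createDataFrame(pdf)            # same data as a distributed DataFrame
      sdf.groupBy("city").avg("sales").show()     # same aggregation, handled by Spark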

    • @keerthanamurugesan-xe6mr
      @keerthanamurugesan-xe6mr 1 month ago

      @@GregHogg Thank you

  • @chanta2809
    @chanta2809 2 years ago

    What is the URL to practice with? How do I set up the data for practicing?

    • @GregHogg
      @GregHogg 2 years ago +2

      Thank you! You made me notice I accidentally removed the notebook from the video description. You can grab the notebook code in the video description now. You can actually get PySpark in Google Colab very easily: simply !pip install pyspark, then import pyspark, and continue following the steps in this video.

  • @abdullahsiddique7787
    @abdullahsiddique7787 3 years ago

    How is the future of Spark? Is Flink replacing it? Is it worth learning for a career in big data?

    • @GregHogg
      @GregHogg 3 years ago +1

      I don't know what Flink is.

    • @abdullahsiddique7787
      @abdullahsiddique7787 3 years ago

      @@GregHogg Thanks for the reply, Greg. Can you please also tell me the future career scope of Apache Spark?

    • @GregHogg
      @GregHogg 3 years ago +3

      @@abdullahsiddique7787 Spark is and will stay essential for data science, ML, analytics, and big data for a long time.

    • @abdullahsiddique7787
      @abdullahsiddique7787 3 years ago

      @@GregHogg Thanks Greg, I appreciate your quick response.

    • @GregHogg
      @GregHogg 3 years ago +1

      @@abdullahsiddique7787 Of course!

  • @Somethingaweful
    @Somethingaweful 1 year ago

    Back up from the camera, my dude. I feel like you're staring directly at my soul.

  • @pinkomoore
    @pinkomoore 4 days ago

    PySpark seems to be pandas on steroids, plus distributed resource usage.

  • @ahmadsaad1888
    @ahmadsaad1888 3 years ago +1

    You mentioned an hour-long Spark video, but I can't find it.

    • @GregHogg
      @GregHogg 3 years ago +1

      ruclips.net/video/8ypIRp6DPew/видео.html

    • @agnelamodia
      @agnelamodia 3 years ago

      @@GregHogg Could you please paste this link in the description?

    • @GregHogg
      @GregHogg 3 years ago

      @@agnelamodia please see above

  • @grantholomeu3725
    @grantholomeu3725 2 years ago

    I don't understand why in tutorials like this I often get errors saying, "module x has no attribute 'y.'" In this case, I can't get Python to recognize parallelize.

  • @sndselecta
    @sndselecta 3 years ago +1

    I thought the performance gap between Scala and Python isn't an issue anymore.

    • @GregHogg
      @GregHogg 3 years ago

      I personally doubt it. I'm not an expert on this one, but I'd be pretty surprised if Python wasn't significantly slower than Scala. Of course, if we're talking practically, they're both very fast, but in computational time I would suspect Python is much slower. Thanks!

    • @GregHogg
      @GregHogg 3 years ago

      You are correct, and I am incorrect! Thank you for updating me!

    • @sndselecta
      @sndselecta 3 years ago

      I think we are both correct. I've been reading up on it, deciding whether to refresh my Scala or keep chugging away with PySpark. Bottom line: it's good to know both. It depends on the use case. In general Scala will consistently perform better, but from what I've read it isn't always about gains based solely on performance, or on any one factor; there are pros and cons, and the cumulative gains can weigh either way. For example, Python's rich ecosystem can make it faster to get a result than trying to do the same thing in Scala. Another interesting discussion you should start is Koalas. I wrote a blog post trying to get people to weigh in: forums.databricks.com/questions/65646/thoughts-on-if-its-worth-it-to-work-in-koalas.html

    • @GregHogg
      @GregHogg 2 years ago

      @@sndselecta Sorry I missed this! Absolutely and thank you for the great reply.

    • @jimbocho660
      @jimbocho660 2 years ago

      The Spark people themselves are advising against learning Scala for only marginal gains over PySpark.

  • @diegowang9597
    @diegowang9597 2 years ago

    How is this more useful than NumPy?

    • @GregHogg
      @GregHogg 2 years ago +1

      NumPy works on one computer. Spark works on as many as you want

    • @diegowang9597
      @diegowang9597 2 years ago

      @@GregHogg thanks!

  • @MrPeacefulsoul2610
    @MrPeacefulsoul2610 3 years ago

    A detailed video probably would be more helpful.

  • @vladx3539
    @vladx3539 1 year ago

    Great video… but please step away from the camera, sir.

    • @GregHogg
      @GregHogg 1 year ago

      Ouch

    • @vladx3539
      @vladx3539 1 year ago

      @@GregHogg just kidding with you! great content

  • @yelnil
    @yelnil 4 months ago

    So you’re just gonna teach us the wrong way of doing things then leave us on a cliff hanger? 😅

  • @ranvijaymehta
    @ranvijaymehta 1 year ago

    Thanks Sir

  • @pauweldalmeidaayivi5310
    @pauweldalmeidaayivi5310 4 months ago

    Great!