Data Wrangling with PySpark for Data Scientists Who Know Pandas - Andrew Ray

  • Published: 19 Jun 2024
  • Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.
    In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance considerations, and debugging.
    Session hashtag: #SFds12
    Learn more:
    Developing Custom Machine Learning Algorithms in PySpark
    databricks.com/blog/2017/08/3...
    Introducing Pandas UDF for PySpark
    databricks.com/blog/2017/10/3...
    Best Practices for Running PySpark
    databricks.com/session/best-p...
    Session Overview:
    - Why?
    - What do I get with PySpark?
    - Primer
    - Important Concepts
    - Architecture
    - Setup
    - Run
    - Load CSV
    - View Dataframe
    - Rename Columns
    - Drop Column
    - Filtering
    - Add Column
    - Fill Nulls
    - Aggregation
    - Standard Transformations
    - Keep it in the JVM
    - Row Conditional Statements
    - Python when Required
    - Merge/join DataFrames
    - Pivot table
    - Summary Statistics
    - Histogram
    - SQL
    - Make sure to
    - Things not to do
    - If things go wrong
    - Thank you
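Several of the steps named in the overview above (load CSV, rename columns, drop column, filtering, fill nulls, add column, aggregation) can be sketched side by side. This is a minimal illustration, not the code shown in the talk: the column names (`id`, `grp`, `x`, `unused`), the `path` argument, and both function names are hypothetical, and the PySpark half assumes pyspark is installed and a `SparkSession` already exists.

```python
import pandas as pd


def wrangle_pandas(path):
    """pandas version; all column names are hypothetical."""
    df = pd.read_csv(path)                         # load CSV
    df = df.rename(columns={"grp": "category"})    # rename column
    df = df.drop(columns=["unused"])               # drop column
    df = df[df["id"] > 0]                          # filtering
    df = df.assign(x=df["x"].fillna(0.0))          # fill nulls
    df = df.assign(x2=df["x"] * 2)                 # add column
    # aggregation: one output row per category
    return df.groupby("category", as_index=False).agg(x2_sum=("x2", "sum"))


def wrangle_pyspark(spark, path):
    """PySpark version of the same steps; needs pyspark and a SparkSession."""
    from pyspark.sql import functions as F         # imported lazily on purpose

    df = spark.read.csv(path, header=True, inferSchema=True)  # load CSV
    df = df.withColumnRenamed("grp", "category")   # rename column
    df = df.drop("unused")                         # drop column
    df = df.filter(F.col("id") > 0)                # filtering
    df = df.fillna({"x": 0.0})                     # fill nulls
    df = df.withColumn("x2", F.col("x") * 2)       # add column
    # aggregation: one output row per category
    return df.groupBy("category").agg(F.sum("x2").alias("x2_sum"))
```

The pandas function returns an in-memory DataFrame immediately, while the PySpark function returns a lazy DataFrame that is only computed when an action such as `.show()` or `.collect()` is called.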
    About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
    Read more here: databricks.com/product/unifie...
  • Science

Comments • 38

  • @fiddlepants5947 5 years ago +44

    Really nice how we see pandas and pyspark functions side-by-side!

  • @AlessandroBottoni 3 years ago +10

    Fantastic introduction to PySpark for beginners. Hope to see Andrew Ray again on the stage for other presentations.

  • @ratkush 6 years ago +10

    The Q&A session at the end is a must-watch. I loved it.

  • @enes-the-cat-father 4 years ago +3

    Thank you for such a great presentation for beginners!

  • @kevinlin5486 4 years ago +2

    This is a great video. Exactly what I'm looking for, thanks very much.

  • @tanishasharma3665 3 years ago

    He provided a really good comparison between the two!

  • @thedarkknight579 2 years ago

    Thank you so much for the Session ❤️

  • @VishalSharma16 3 years ago

    Super helpful, thanks for sharing!

  • @toygraphers240 2 years ago

    Thank you very much for your contribution.

  • @ZenvilleErasmus 5 years ago +3

    Cool talk and key differences nicely illustrated.

    • @harjeetkumar4632 5 years ago

      Here are some more videos on Spark. Spark Interview Questions: ruclips.net/p/PL9sbKmQTkW05mXqnq1vrrT8pCsEa53std

  • @alexnim4873 3 years ago

    great presentation!

  • @willwright5181 2 years ago

    Great intro!

  • @pratikmehta1152 5 years ago +39

    Volume is low! :(

  • @goedzo4361 2 years ago

    Really helpful

  • @santil.7072 3 years ago +1

    Does this mean that using PySpark SQL is the best practice for data wrangling in Spark?

  • @abrahamf80 1 year ago

    My path to data was a little bit unusual, to say the least. I started out in the financial industry using Databricks, and now I've started using pandas on side projects... funny that I actually used this video backwards, hehe.

  • @elliottharris4526 4 years ago

    Would this be a good tool for combining large numbers of CSVs into a single DataFrame quickly and then performing manipulations on that DataFrame before outputting a single CSV?
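The workflow described in this question can be sketched in both libraries. This is an illustration under stated assumptions, not an answer from the speaker: the function names, the glob `pattern`, the output directory, and the `value` column are all hypothetical, and the PySpark half assumes pyspark is installed and a `SparkSession` already exists.

```python
import glob

import pandas as pd


def combine_csvs_pandas(pattern):
    """Read every CSV matching `pattern` into one in-memory DataFrame."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


def combine_csvs_pyspark(spark, pattern, out_dir):
    """Same idea with PySpark; needs pyspark and an existing SparkSession."""
    # spark.read.csv accepts a glob pattern and reads all matching files.
    df = spark.read.csv(pattern, header=True, inferSchema=True)
    # Example manipulation on a hypothetical column named "value".
    df = df.filter(df["value"].isNotNull())
    # coalesce(1) yields a single part file inside out_dir; all data then
    # flows through one task, so this only suits modestly sized outputs.
    df.coalesce(1).write.csv(out_dir, header=True, mode="overwrite")
    return df
```

Note that Spark writes a directory containing a `part-*.csv` file rather than a single named file, so a rename step is usually needed if one specific file name is required.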

  • @raphaels2103 4 years ago +6

    19:12, pandas now has SQL support
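The comment does not say which SQL support it means. One standard route, sketched here as an assumption about what is intended, is the `DataFrame.to_sql` / `pandas.read_sql_query` pair backed by the standard-library `sqlite3` module. The helper name `sql_on_pandas`, the table name `events`, and the columns are hypothetical.

```python
import sqlite3

import pandas as pd


def sql_on_pandas(df, query, table="events"):
    """Load `df` into an in-memory SQLite table and run `query` against it."""
    conn = sqlite3.connect(":memory:")
    try:
        df.to_sql(table, conn, index=False)      # DataFrame -> SQL table
        return pd.read_sql_query(query, conn)    # SQL result -> DataFrame
    finally:
        conn.close()


df = pd.DataFrame({"grp": ["a", "a", "b"], "x": [1, 2, 3]})
result = sql_on_pandas(
    df, "SELECT grp, SUM(x) AS total FROM events GROUP BY grp ORDER BY grp"
)
```

Unlike Spark SQL, this copies the data into SQLite first, so it is a convenience for small frames rather than a query engine over the DataFrame itself.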

  • @musasall5740 6 years ago +4

    Just downloading and writing this code will not work. You have to create a session.
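As this comment points out, PySpark code run outside an environment that pre-creates a `spark` object needs a session first. A minimal sketch, assuming pyspark and a Java runtime are installed and a local run is wanted; the function name, the app name, and the `local[*]` master are illustrative choices, not settings taken from the talk.

```python
def make_session(app_name="pandas-to-pyspark", master="local[*]"):
    """Create (or reuse) a SparkSession; requires pyspark and a Java runtime."""
    # Imported lazily so this module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)       # local[*]: run locally on all available cores
        .appName(app_name)
        .getOrCreate()        # returns the active session if one already exists
    )
```

A typical use would be `spark = make_session()`, followed by calls such as `spark.read.csv(...)`, and `spark.stop()` when finished.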

  • @1over137 2 years ago

    PySpark is great when it's read-only. It all goes badly wrong when you try to write anything with a typed schema.

  • @Arjun147gtk 3 years ago +5

    I think I need a soundbox on full volume to hear this.

    • @jaspreet0305 3 years ago

      I have the same issue. Thanks to the captions, I saved a lot of money.

  • @francischab2262 5 years ago +13

    7:49

  • @Rabixter 4 years ago

    What's with the volume?

  • @xiaoyunzhang6878 2 years ago +1

    Nebraska Alumni

  • @krishnakishorepeddisetti4387 3 years ago +1

    Which is better in the Databricks environment: Python, R, or SQL? Reply in the comments.

    • @jimbocho660 2 years ago

      Most people seem to find SQL better.

  • @trespittman1055 3 years ago +2

    Too quiet, please fix.

  • @Tyokok 4 years ago

    great tech video, but volume really ...

  • @kaixianghuang8589 6 years ago

    LOL, good presentation, but unprepared for the Q&A.

    • @TheBjjninja 5 years ago +2

      Why did someone ask about UDFs? What does a UDF have to do with Spark?
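For context on this question: a UDF (user-defined function) lets Spark apply custom Python logic to a column, which relates to the "Keep it in the JVM" and "Python when Required" items in the session overview. The sketch below contrasts a built-in column expression with a Python UDF implementing the same logic. It assumes pyspark is installed; the column name `x` and both function names are hypothetical.

```python
def double_builtin(df):
    """Built-in column arithmetic: evaluated inside the JVM."""
    from pyspark.sql import functions as F

    return df.withColumn("x2", F.col("x") * 2)


def double_udf(df):
    """Same logic via a Python UDF: values are shipped to Python workers."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    @F.udf(returnType=DoubleType())
    def times_two(v):
        # UDFs receive nulls as None and must handle them explicitly.
        return None if v is None else float(v) * 2

    return df.withColumn("x2", times_two(F.col("x")))
```

The built-in form is generally preferable when it exists, since the UDF adds serialization between the JVM and Python for every row.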

  • @Drivebyeasy 6 years ago

    Hey Andrew, could you send me your GitHub link?

  • @Atlas-ck9vm 3 years ago

    Just use koalas.