Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data

  • Published: 18 Dec 2024

Comments • 97

  • @sjob12
    @sjob12 7 years ago +10

    He has this incredible gift of explaining something complex very simply and in a relaxed manner. I watched other videos about Spark and got nowhere. When I watched this, I felt like I now know Spark. I'll probably watch his advanced video.

    • @blueplasticvideos
      @blueplasticvideos 7 years ago

      Thanks a lot for your comment! Spark is a pretty easy-to-use system, and I hope you keep exploring Spark Machine Learning and Spark Streaming, which I didn't have time to cover in this presentation.

  • @timko3135
    @timko3135 2 years ago +2

    Sameer is truly amazing. I haven't seen such a good teacher in a long time.

  • @jorgealbmar
    @jorgealbmar 8 years ago

    Fantastic tutorial! Just an observation for viewers: the quality of the audio drops at 56:51 but comes back at 58:26.

  • @ylcnky9406
    @ylcnky9406 7 years ago +3

    This is the best and most simply explained tutorial. Thanks for the great lecture.

  • @skkkks2321
    @skkkks2321 5 years ago +1

    Hi Sameer, what a great presentation in a short amount of time, with a plethora of information and hands-on practice. I am really amazed by your knowledge and presentation skills. Hats off to you, mate. Well done!

  • @bobmickus3319
    @bobmickus3319 8 years ago +8

    Great job Sameer! Very well done demo that was super informative. Keep 'em coming! Thanks!

  • @VenkataRamaRajuLolabhattu
    @VenkataRamaRajuLolabhattu 7 years ago +2

    Excellent job Sameer! Thank you very much for the class.

  • @pradeepnagaraj7347
    @pradeepnagaraj7347 3 years ago

    Excellent explanation, Sameer!

  • @TrevorHigbee
    @TrevorHigbee 5 years ago +2

    This is so great. Perfect for someone who has a little SQL, Python, and pandas experience. Thank you!

    • @gasmikaouther6887
      @gasmikaouther6887 3 years ago

      Hi, could you please send me the labs file? The link doesn't work. Thank you.

  • @tanshai9870
    @tanshai9870 8 years ago

    Hi Sameer, very interesting tutorial. At 59:09 you save the content of the DataFrame from memory to a file. I tried this, but when reading the file back I didn't find the same number of partitions. Is there a configuration parameter to correct this?
    Thanks.
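
    One way to make the round trip deterministic (a sketch, not from the video; the DataFrame name and paths are placeholders): the partition count on read is derived from the written files and the input split size, not from the writer's in-memory partitioning, so pin it explicitly where it matters.

    df.repartition(13).write.mode("overwrite").parquet("/tmp/fire_calls")
    reloaded = spark.read.parquet("/tmp/fire_calls")
    print(reloaded.rdd.getNumPartitions())  # driven by file count and split size
    reloaded = reloaded.repartition(13)     # pin the count if downstream code needs it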

  • @bandehali
    @bandehali 8 years ago +20

    Sameer, I don't have any question for you - I just wanted to say thank you very much for explaining Apache Spark in the simplest way. Very good job!! I'll definitely search for more of your videos here - if you have your own secret channel for technology trainings, please do let me know. Thanks,

    • @blueplasticvideos
      @blueplasticvideos 7 years ago +13

      Nah, no secret channel. I prefer to release any video recordings publicly via YouTube.

    • @aleekazmi
      @aleekazmi 7 years ago +1

      You might wanna check out his secret videos on some questionable websites.

    • @praveenmail2him
      @praveenmail2him 6 years ago

      Sameer, you are a Spark guru. I really appreciate your gesture and simple methodology for explaining even to a non-technical person! Amazing job!!

  • @geomichelon
    @geomichelon 7 years ago

    Sameer, I'm new to Spark, but you gave me a lot of information and a full view of this powerful tool. Thanks a lot!

  • @ibrahimibrahim613
    @ibrahimibrahim613 6 years ago +2

    You are amazing, Sameer. It was a really good lecture, rich with information.

    • @raniataha9876
      @raniataha9876 4 years ago +1

      I need a course to learn how to deal with big data analysis in Databricks using Spark/Python.

  • @jubinsoni4694
    @jubinsoni4694 5 years ago

    Thank you, Sameer. I learned a lot about Spark after watching your videos... Will be waiting for your next 5-hour hands-on video at the next Summit.

  • @ugursopaoglu
    @ugursopaoglu 7 years ago +1

    Such an instructive presentation. Thank you so much, Sameer.

  • @jbott9250
    @jbott9250 7 years ago +2

    This was great! Thank you very much Sameer!

  • @govindkhator5632
    @govindkhator5632 5 years ago +1

    Hi Sameer,
    Great explanation. Do you have any recent videos on YouTube about Spark?
    Thanks,
    Govind

  • @mohiuddinshaik5939
    @mohiuddinshaik5939 7 years ago +2

    Excellent presentation, thank you.

  • @rahulgulati890
    @rahulgulati890 8 years ago +2

    Sameer Farooqui, is there any other session on Spark ML on such datasets going to happen in the future? Like the one you described as "trying decision trees to predict the boolean column" in this video. Thanks.

    • @blueplasticvideos
      @blueplasticvideos 8 years ago +3

      Hey Rahul, I dunno, maybe in a month I could do another recorded talk on actually implementing ML like decision trees on a data set. No definitive plans yet though.

  • @vaibhavipatel5397
    @vaibhavipatel5397 6 years ago +1

    Thank you so much, Sameer, for explaining this in such a wonderful way; I picked up the basics as well as in-depth knowledge of Spark. I appreciate your help.

  • @kamalkannan9282
    @kamalkannan9282 6 years ago

    Great job, Sameer! It looks like your 6-hour video on Spark is blocked in India. Could you please fix that, as it was very useful?

  • @forrestbajbek3900
    @forrestbajbek3900 6 years ago +1

    This was excellent. Thank you!

  • @taruninbox
    @taruninbox 7 years ago +1

    Very informative and nicely presented.

  • @zivfriedman2312
    @zivfriedman2312 8 years ago +1

    8:50 minutes into the clip, it shows an Amazon access-key-id along with the secret-access-key. I hope this pair is no longer active....
    Great video!

    • @aydinsvlogs
      @aydinsvlogs 8 years ago

      That's a public key, and it's still active - it only lets you get the file he is sharing with everyone.

  • @sylphes
    @sylphes 8 years ago

    Hi Sameer. I appreciate your long tutorials on PySpark. I am new to it and learning on my own. I just want to understand: if I want to compute cosine similarity on a TF-IDF matrix, how do I go about it? I have the TF-IDF in a mapPartitions RDD.

  • @avishekdatta2006
    @avishekdatta2006 4 years ago

    Hi Sameer,
    Your video on Spark is very educational.
    The two links beside "Labs" and "Learning Material" do not open any page, so can you give me the links from which I can download these files?
    Also, can I import this 1.5 GB dataset into my own Databricks Community Edition and use it without downloading the entire dataset?
    Looking forward to your inputs..

  • @rabynovych4809
    @rabynovych4809 7 years ago +1

    Is it possible to visualize data lineage (how each column was transformed, instantiated, or derived from other column(s)) - maybe using the DAG? BTW: that was an outstanding presentation - so much useful information in such a short period of time! Thank you very much for that.

  • @johnieharward5679
    @johnieharward5679 7 years ago

    Wow! Amazing, very detailed and clear. I have watched it 3 times in the last 2 days and there's still so much to grab. Very hard to believe that 2 people disliked this video.

    • @blueplasticvideos
      @blueplasticvideos 7 years ago +1

      Johnie Harward Hi Johnie, thanks for your comment. I'm glad you found this useful.

  • @magdalenapadlewska4581
    @magdalenapadlewska4581 4 years ago +1

    The link to the Labs and Learning Material is no longer valid; could you update it?

  • @ranjitbehera947
    @ranjitbehera947 5 years ago

    In cmd 47, show() takes two parameters. Can we use only "false"?
    Because we don't know how many rows we will get, is there an alternative to hardcoding the first parameter as 35?
    I need false as my second parameter, but passing only false throws an error.
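
    For reference, PySpark's show has the signature show(n=20, truncate=True), so passing False positionally sets n - which is why it errors. Passing truncate as a keyword avoids hardcoding the row count (a sketch; "df" is a placeholder DataFrame):

    df.show(truncate=False)               # default 20 rows, columns untruncated
    df.show(df.count(), truncate=False)   # print every row; note count() runs a job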

  • @shankarkuchibhotla5088
    @shankarkuchibhotla5088 7 years ago +1

    Hi Sameer,
    I am trying your example with Spark 2.1, and when I use the caching option to cache the table and run the count, it throws an error:
    #spark.catalog.cacheTable("FSView")
    FSViewDF = spark.table("FSView")
    FSViewDF.count()
    Commenting out the cache statement returns the correct value. Is this a change in behavior with Spark 2.1?

    • @blueplasticvideos
      @blueplasticvideos 7 years ago

      spark.catalog is typically used for accessing metadata in Parquet/Hive. Try running spark.cacheTable("tableName") instead.
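
      For reference, a minimal caching sketch against Spark 2.x, using the catalog form of the API and the view name from the question (an assumption: the temp view "FSView" already exists in the session):

      spark.catalog.cacheTable("FSView")    # mark the table for caching (lazy)
      FSViewDF = spark.table("FSView")
      FSViewDF.count()                      # the first action materializes the cache
      spark.catalog.uncacheTable("FSView")  # release the memory when done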

  • @ganeshreddy4430
    @ganeshreddy4430 8 years ago

    Hi Sameer, awesome session.

  • @vadivelan4228
    @vadivelan4228 4 years ago

    Great... a nice presentation with many takeaways.

  • @AnandSharma-yb3cv
    @AnandSharma-yb3cv 4 years ago +1

    Not able to download the lab files.

  • @chhanditachowdhury5565
    @chhanditachowdhury5565 3 years ago

    this is gold

  • @sanjeebkumar8539
    @sanjeebkumar8539 8 years ago

    Hi Sameer. Thanks a lot for the very nice explanation. Currently I am using DataFrames from Spark 1.6 (PySpark). First question: will there be any performance impact when we use Datasets in Spark 2.0, in PySpark or Scala? Second: which language would be best for doing ML in Spark?

    • @blueplasticvideos
      @blueplasticvideos 8 years ago +3

      For ML in Spark, Scala will give you the latest algorithms, including alpha/experimental ones. Typically the algorithms are ported to Python or R a few months later. However, if you're using the newer, DataFrame-based spark.ml package in MLlib, then feel free to use Python, R, Scala or Java (if the language has the algorithm you need), b/c the performance will be basically the same. If you use the older RDD-based spark.mllib package, then you're probably better off using Scala or Java.
      Regarding the first question, if you're using Datasets in Spark 2.0 w/ Scala, then you will get excellent speed and type safety. The main thing to understand is that when you use SQL queries, both syntax and analysis errors are caught at runtime. With DataFrames, syntax errors are caught at compile time, while analysis errors are caught at runtime. Finally, with Datasets (the best option), both syntax and analysis errors are caught early at compile time.
      If you're using PySpark, UDFs will run much slower in Python processes next to the Executor JVMs. So, as long as you're not using custom UDFs, you should be fine with Python b/c the rest of the SQL/DF/DS code and function calls will run at native speeds in the JVM (as fast as Scala/Java). If you really need UDFs in PySpark, consider writing them in Java as Hive UDFs and then importing them into PySpark... then you get the best of both worlds... Python and JVM-based UDFs.
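
      To make the UDF point concrete, a small PySpark sketch (the "calls" DataFrame and its column are made-up names, not from the talk): the built-in function runs entirely inside the executor JVMs, while the equivalent Python UDF ships every row to a Python worker process and back.

      from pyspark.sql.functions import udf, upper, col
      from pyspark.sql.types import StringType

      # Fast path: built-in function, executed in the JVM.
      calls.select(upper(col("CallType"))).count()

      # Slow path: the same logic as a Python UDF, paying per-row serialization.
      upper_udf = udf(lambda s: s.upper() if s else None, StringType())
      calls.select(upper_udf(col("CallType"))).count()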

  • @IslamicMotivationRe
    @IslamicMotivationRe 2 years ago

    I can't open the link for the labs and learning material; please update it.
    It would be helpful for me.

  • @brijchavda
    @brijchavda 6 years ago

    Awesome video. Thanks for sharing.

  • @kashishsehgal288
    @kashishsehgal288 7 years ago

    Hi Sameer,
    Can I use other data from the same open data site, like the SF crime data, and work on the same cluster?
    If not, how should I proceed with the crime data?

  • @madhukirans
    @madhukirans 8 years ago

    Awesome training

  • @lolcorporation7308
    @lolcorporation7308 8 years ago

    +NewCircle Training Hey, fantastic tutorial, but I had a question about your old Android internals tutorial: can you please make a new one, since most of the material is outdated? And one more thing: since Win10 has Bash, can I use it to build a ROM?

  • @ArunGoudargdg
    @ArunGoudargdg 6 years ago

    Excellent!! Is there any way to learn more from you, Sameer? Can you guide us or help us understand the concepts in person or in any other way? Please let me know.

    • @blueplasticvideos
      @blueplasticvideos 6 years ago

      Hey, sorry, I don't teach big data classes any more since joining Google, but if you follow me on LinkedIn or Twitter, I am planning on posting some Deep Learning & TensorFlow tutorials on YouTube later this year.

    • @ArunGoudargdg
      @ArunGoudargdg 6 years ago

      Thank you for your reply, sir. Sure, I do.

    • @raniataha9876
      @raniataha9876 4 years ago

      Arun Goudar, I need a course to learn how to deal with big data analysis in Databricks using Spark/Python.

  • @yashdholakia3999
    @yashdholakia3999 5 years ago

    Does the Driver JVM always have just 3 Executors in each case, or can that be changed?

  • @balupabbi
    @balupabbi 8 years ago +1

    Wow, this is so awesome.

  • @mouradhamoud1495
    @mouradhamoud1495 8 years ago

    Thank you for this great intro to Spark!

  • @djibb.7876
    @djibb.7876 7 years ago

    Great!
    I am a beginner with Spark and I would like to compute some joins over CSV files so that at the end I have one DataFrame. File1 is about 8 GB, File2 5 MB, File3 5 MB, File4 15 MB, File5 150 MB.
    - I loaded the files, saved them as Parquet, read them again, and finally created views (createOrReplaceTempView) of these DataFrames.
    - I join dataframe_File1 with dataframe_File2 as merge1.
    - Now I would like to save merge1 as Parquet again, read it, do the other merges, and so on...
    Problem:
    - merge1.spark.read.parquet(/path) launches the Spark job and the job hangs.
    Any suggestions?
    PS: I am using a standalone cluster (1 master = 1 slave).
    I did apply some join optimization concepts and configurations (shuffle.partitions, etc.).
    Best regards.
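
    A sketch of the write-then-reload pattern described above (paths, names, and join keys are illustrative, not from the comment's actual data). Note that tables of a few MB will typically be broadcast-joined against the 8 GB table, so materializing every intermediate result to Parquet may be unnecessary:

    df1 = spark.read.parquet("/data/file1.parquet")    # ~8 GB fact table
    df2 = spark.read.parquet("/data/file2.parquet")    # small dimension table
    merge1 = df1.join(df2, "join_key")
    merge1.write.mode("overwrite").parquet("/data/merge1.parquet")
    merge1 = spark.read.parquet("/data/merge1.parquet")  # reload, then keep joining
    merge2 = merge1.join(spark.read.parquet("/data/file3.parquet"), "another_key")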

  • @saimourya3579
    @saimourya3579 4 years ago

    Hi Sameer, where can I get those files? The links you've shown appear to be broken.

  • @gauravkumar796
    @gauravkumar796 7 years ago

    I have a question about the number of partitions: we got 13 partitions from a file of 1.6 GB, so each partition is about 128 MB, whereas you mentioned that Spark reads 64 MB at a time. It's 128 MB, right?
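
    For context (an editorial aside, not from the video): in Spark 2.x, file-based DataFrame reads are split according to spark.sql.files.maxPartitionBytes, which defaults to 128 MB - and 1.6 GB / 128 MB ≈ 13 partitions. 64 MB was the classic HDFS block-size default. A sketch:

    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # 134217728, i.e. 128 MB
    spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # ~2x the partitions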

  • @eljay0
    @eljay0 7 years ago

    Great work Sameer! Quick question: the sfopenreadalong link on your second slide seems to be broken (404 error). Do you have another URL to suggest for accessing that file? Thank you.

    • @mahumtofiq4930
      @mahumtofiq4930 2 years ago

      Hey! Did you ever find another working link?

    • @eljay0
      @eljay0 2 years ago

      @@mahumtofiq4930 Unfortunately not.

    • @mahumtofiq4930
      @mahumtofiq4930 2 years ago

      @@eljay0 Aww man!
      Thanks for replying though!

  • @swapnavengalam191
    @swapnavengalam191 8 years ago

    1. Is there JDBC/ODBC connectivity to Teradata from Spark 2.0, to access the table directly? 2. The file on S3 is not accessible; can this be made accessible?

    • @blueplasticvideos
      @blueplasticvideos 8 years ago

      Yeah, check out the Spark docs at Apache for info about the JDBC connector. It exists.
      And to use the file in the demo, you'll need to import the notebooks and then mount the S3 bucket (see instructions in the first 10 mins of the video).
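
      The generic JDBC reader looks roughly like this (a sketch; the Teradata URL, table, credentials, and driver class are placeholders, and the Teradata JDBC driver JAR must be on the classpath):

      df = (spark.read.format("jdbc")
            .option("url", "jdbc:teradata://host/DATABASE=mydb")
            .option("dbtable", "mytable")
            .option("user", "user").option("password", "password")
            .option("driver", "com.teradata.jdbc.TeraDriver")
            .load())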

  • @enriquewilliams8676
    @enriquewilliams8676 2 years ago

    Guys, what are the best steps for learning data science?

  • @lavishsharma6248
    @lavishsharma6248 5 years ago +1

    pandas2016DF = joinedDF.filter(year('CallDateTS') == '2016').toPandas()
    When I run this code in cmd 135, I get an error:
    ValueError: Cannot convert NA to integer
    Please help!
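
    A possible workaround (a sketch, and an assumption about the cause: older pandas integer dtypes cannot represent NA, so nulls in an integer column break toPandas()): fill or drop the nulls on the Spark side first.

    from pyspark.sql.functions import year
    filtered = joinedDF.filter(year('CallDateTS') == 2016)
    pandas2016DF = filtered.na.fill(0).toPandas()   # or: filtered.na.drop().toPandas()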

  • @karannayyar604
    @karannayyar604 4 years ago

    I am starting to learn Spark from your video, but I'm not able to get the labs file. Any help?

  • @sharifahmad6876
    @sharifahmad6876 4 years ago

    Can you please renew the links for the lab data and learning material? They have gone down.

  • @inatckeraban2704
    @inatckeraban2704 7 years ago

    Is the display() function specific to Databricks notebooks? I use Zeppelin, and I have searched and looked at the API documentation but found nothing. How can we use the display() function?

    • @DuyNguyenHoangPHP
      @DuyNguyenHoangPHP 7 years ago

      It is a Databricks-specific function, not a Spark function. As a workaround I use pandas instead, and it works as expected. You can follow the code snippet below:
      import pandas as pd
      pd_df = fireServiceCallsTsDf.limit(10).toPandas()  # pull only 10 rows to the driver
      pd.set_option('display.max_columns', None)         # show all columns, not a subset
      pd_df

  • @raniataha9876
    @raniataha9876 4 years ago

    GREAT JOB.

  • @kanikvijay524
    @kanikvijay524 4 years ago

    The provided links are not working, so I am not able to start the tutorial.

  • @sofiaparadise7810
    @sofiaparadise7810 7 years ago

    Great lecture! Thank you!

  • @oateurman
    @oateurman 8 years ago +1

    Great intro :)

  • @abhaygodbole6435
    @abhaygodbole6435 8 years ago

    Hi Sameer... This was an excellent session. I have just started my journey with Spark. I also viewed your long 6-hour YouTube video on advanced Apache Spark; it was also excellent...
    I am actually planning to pursue a certification in Spark, but I found that all the vendors use old Spark versions for their certifications - even Databricks uses Spark 1.1... Would that certification help for entering the big data / Spark domain? Should I wait? I am confused... Please get back with your inputs.....

    • @blueplasticvideos
      @blueplasticvideos 8 years ago

      Hey Abhay... focus on learning Spark 2.0 and the new Structured Streaming API. I wouldn't bother learning Spark 1.x today if you're just starting your Spark journey.

    • @abhaygodbole6435
      @abhaygodbole6435 8 years ago

      Thanks a lot, Sameer, for your inputs...

  • @balupabbi
    @balupabbi 8 years ago

    Can you run Scala code in cells like PySpark?

  • @9955540727
    @9955540727 4 years ago

    Is the lab file still accessible?

  • @ranaivosonherimanitra5110
    @ranaivosonherimanitra5110 7 years ago

    Can we attach data stored in Google Cloud Platform instead of AWS?

    • @blueplasticvideos
      @blueplasticvideos 7 years ago

      If you're using Databricks, then it's best to store the data on S3 within the AWS ecosystem. But if you're using open source Apache Spark, then you can use it within Google's cloud. From Google's website: "The Google Cloud Storage connector for Hadoop lets you run Hadoop or Spark jobs directly on data in Cloud Storage."
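
      With that connector on the classpath and credentials configured, reading from a bucket looks like reading any other path (a sketch; the bucket and file names are placeholders):

      df = spark.read.csv("gs://my-bucket/Fire_Department_Calls.csv",
                          header=True, inferSchema=True)
      df.show(5)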

  • @rajanice100
    @rajanice100 3 years ago

    Why does he look like that antagonist in the Venom movie??

  • @Sheddy29
    @Sheddy29 8 years ago

    I see you run Databricks in Python, but what if I need to run Scala? How do I do this? Please help :)

    • @blueplasticvideos
      @blueplasticvideos 7 years ago

      Databricks supports Python, Scala, R, or SQL in their interactive notebooks. You can also use other third-party notebooks with Spark, like the Zeppelin notebook or Jupyter with a Spark/Scala kernel.
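
      In Databricks notebooks, the per-cell language magics look like this (a sketch of the convention, which also covers @balupabbi's question above; the view name is a placeholder): a cell beginning with %scala runs as Scala against the same SparkSession.

      %scala
      val count = spark.table("FSView").count()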

  • @shaharshid6355
    @shaharshid6355 5 years ago

    Gold :)

  • @lijeff6277
    @lijeff6277 8 years ago

    Apache Spark 2.0 release date?

    • @blueplasticvideos
      @blueplasticvideos 8 years ago

      It's in release candidate 4 now; it should be out within a week or two.

  • @cloveramv
    @cloveramv 5 years ago

    Dude, you look like the guy from the Nightcrawler movie.

  • @susmitvengurlekar
    @susmitvengurlekar 3 years ago

    Great work. Thanks a lot.