PySpark Tutorial

  • Published: Oct 1, 2024
  • Learn PySpark, an interface for Apache Spark in Python. PySpark is often used for large-scale data processing and machine learning.
    💻 Code: github.com/kri...
    ✏️ Course from Krish Naik. Check out his channel: / krishnaik06
    ⌨️ (0:00:10) PySpark Introduction
    ⌨️ (0:15:25) PySpark DataFrame Part 1
    ⌨️ (0:31:35) PySpark Handling Missing Values
    ⌨️ (0:45:19) PySpark DataFrame Part 2
    ⌨️ (0:52:44) PySpark GroupBy And Aggregate Functions
    ⌨️ (1:02:58) PySpark MLlib Installation And Implementation
    ⌨️ (1:12:46) Introduction To Databricks
    ⌨️ (1:24:65) Implementing Linear Regression using Databricks in Single Clusters
    --
    🎉 Thanks to our Champion and Sponsor supporters:
    👾 Wong Voon jinq
    👾 hexploitation
    👾 Katia Moran
    👾 BlckPhantom
    👾 Nick Raker
    👾 Otis Morgan
    👾 DeezMaster
    👾 Treehouse
    --
    Learn to code for free and get a developer job: www.freecodeca...
    Read hundreds of articles on programming: freecodecamp.o...

Comments • 536

  • @anikinskywalker7127
    @anikinskywalker7127 3 years ago +339

    Why are u uploading the good stuff during my exams bro

  • @stingfiretube
    @stingfiretube 8 months ago +55

    This man is singlehandedly responsible for spawning data scientists in the industry.

  • @krishnakumar-ye9kp
    @krishnakumar-ye9kp 3 years ago +1

    FreeCodeCamp has a very good reputation; don't lose it by allowing these fake people to teach. He is a fraud and a copy-paste guy.

  • @arturo.gonzalex
    @arturo.gonzalex 1 year ago +34

    IMPORTANT NOTICE:
    The na.fill() method now works only on subsets with matching datatypes: e.g. if the value is a string and the subset contains a non-string column, the non-string column is simply ignored.
    So it is no longer possible to replace NaN values across columns of different datatypes in one call.
    The other important question is: how come the values in his csv file are treated as strings if he has set inferSchema=True?

    • @kinghezzy
      @kinghezzy 1 year ago +1

      This observation is true.

    • @aadilrashidnajar9468
      @aadilrashidnajar9468 1 year ago

      Indeed, I also observed the same issue. Don't set inferSchema=True while reading the csv; then .na.fill() will work fine.

    • @sathishp3180
      @sathishp3180 1 year ago +1

      Yes, I found the same.
      Fill won't work if the data type of the filling value is different from the columns we are filling. So it's preferable to fill NA in each column using a dictionary, as below:
      df_pyspark.na.fill({'Name': 'Missing Names', 'age': 0, 'Experience': 0}).show()

    • @aruna5472
      @aruna5472 1 year ago +1

      Correct. Even if we give values using a dictionary like @Sathish P, if those data types are not string, it will ignore the value; once again, we need to read the csv without inferSchema=True. Maybe the instructor missed saying that filling missing values applies only to string columns (look at 43:03, all strings ;-) ). But this is good material to follow, I appreciate the good help!

    • @gunjankum
      @gunjankum 1 year ago

      Yes, I found the same thing.

  • @MSuriyaPrakaashJL
    @MSuriyaPrakaashJL 1 year ago +16

    I am happy that I completed this video in one sitting

  • @ygproduction8568
    @ygproduction8568 3 years ago +102

    Dear Mr Beau, thank you so much for the amazing courses on this channel.
    I am really grateful that such invaluable courses are available for free.

    • @sunny10528
      @sunny10528 2 years ago +5

      Please thank Mr Krish Naik

  • @vivekadithyamohankumar6134
    @vivekadithyamohankumar6134 3 years ago +27

    I ran into an issue while importing pyspark (ImportError) in my notebook even after installing it within the environment. After doing some research, I found that the kernel used by the notebook would be the default kernel, even if the notebook resides within the virtual env. We need to create a new kernel within the virtual env and select that kernel in the notebook.
    Steps:
    1. Activate the env by executing "source bin/activate" inside the environment directory
    2. From within the environment, execute "pip install ipykernel" to install IPyKernel
    3. Create a new kernel by executing "ipython kernel install --user --name=projectname"
    4. Launch jupyter notebook
    5. In the notebook, go to Kernel > Change kernel and pick the new kernel you created.
    Hope this helps! :)

  • @jorge1869
    @jorge1869 3 years ago +7

    The full installation of PySpark was omitted in this course.

  • @JuanR223
    @JuanR223 1 month ago +1

    Bro, you're teaching Spark, not pandas. Don't confuse people; we are focused only on Spark things.

  • @shritishaw7510
    @shritishaw7510 3 years ago +87

    Sir Krish Naik is an amazing tutor, learned a lot about statistics and data science from his channel

  • @antonmursid3505
    @antonmursid3505 2 years ago +1

    Antonmursid🙏🙏🙏🙏🙏✌🇸🇬🇸🇬🇸🇬🇸🇬🇸🇬✌💝👌🙏

  • @candicerusser9095
    @candicerusser9095 3 years ago +30

    Uploaded at the right time. I was looking for this course. Thank you so much.

  • @sharanphadke4954
    @sharanphadke4954 3 years ago +32

    Biggest crossover : Krish Naik sir teaching for free code camp

  • @topluverking
    @topluverking 1 year ago +1

    I have a dataset that is 25 GB in size. Whenever I try to perform an operation and use the function .show(), it takes an extremely long time and I eventually receive an out of memory error message. Could you assist me with resolving this issue?

  • @simoncottle9626
    @simoncottle9626 6 months ago +1

    The video is informative, but I wish he wouldn't constantly scroll up and down very quickly. It's hard to read the text when it is constantly moving.

  • @oiwelder
    @oiwelder 2 years ago +9

    0:52:44 - complementing PySpark GroupBy and aggregate functions (needs: from pyspark.sql.functions import sum, max, min):
    df3 = df3.groupBy("departaments").agg(
        sum("salary").alias("sum_salary"),
        max("salary").alias("max_salary"),
        min("salary").alias("min_salary")
    )

  • @DivyaChindam-h3p
    @DivyaChindam-h3p 3 months ago +1

    How do I create a new environment? I am getting a PySpark runtime error after creating the SparkSession: PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number. Please rectify it.

    • @kusumalatha9635
      @kusumalatha9635 7 days ago

      Same for me. Did you get any pointers..?

  • @PallabM-bi5uo
    @PallabM-bi5uo 1 year ago +5

    Hi, thanks for this tutorial, If my dataset has 20 columns, why describe output is not showing in a nice table like the above? It is coming all distorted. Is there a way to get a nice tabular format like above for a large dataset?

  • @mohandev7385
    @mohandev7385 3 years ago +23

    I didn't expect krish.... Amazingly explained

  • @BennyHarassi
    @BennyHarassi 2 years ago +1

    can we get 1 tutorial vid without some corny dude with a thick hindi accent...

  • @pavanraibagi4045
    @pavanraibagi4045 3 years ago +5

    I love python,
    not language......but snake

    • @mukundjajadiya
      @mukundjajadiya 3 years ago +1

      I also love python,
      Language......but not snake😊

    • @pavanraibagi4045
      @pavanraibagi4045 3 years ago

      @@mukundjajadiya haha...install sololearn from playstore to learn coding for free

    • @freecodecamp
      @freecodecamp 3 years ago +5

      🐍🐍🐍🐍🐍🐍

    • @arnavmehta3669
      @arnavmehta3669 3 years ago +1

      I didn't know that freecodecamp ever replies

    • @pavanraibagi4045
      @pavanraibagi4045 3 years ago

      @@arnavmehta3669 im lucky to get reply on my first comment

  • @MiguelPerez-nv2yw
    @MiguelPerez-nv2yw 2 years ago +3

    I just love how he says
    “Very very simple guys”
    And it turns out to be simple xD

  • @ujjawalhanda4748
    @ujjawalhanda4748 2 years ago +9

    There is an update in na.fill(): an integer value inside fill() will replace nulls only in columns having integer data types, and the same applies to string values.

    • @harshaleo4373
      @harshaleo4373 2 years ago +1

      Yeah. If we are trying to fill with a string, it is filling only the Name column nulls.

    • @austinchettiar6784
      @austinchettiar6784 1 year ago +3

      @@harshaleo4373 so whats the exact keyword to replace all null values?

  • @johanrodriguez241
    @johanrodriguez241 3 years ago +5

    Finished! But I still want to see the power of this tool.

  • @kanwalzahoor7828
    @kanwalzahoor7828 3 years ago +2

    I am having this error Exception: Java gateway process exited before sending its port number

    • @vitazamb3375
      @vitazamb3375 2 years ago

      Me too. Did you manage to solve this problem?

  • @dev-skills
    @dev-skills 2 years ago +1

    Note: as a prerequisite for this session, Apache Spark must be installed on your system.

  • @Leandro-es3sn
    @Leandro-es3sn 2 years ago +1

    bro, there's a hard cut at 1:12:34

  • @himanshu_kumar3714
    @himanshu_kumar3714 1 year ago +1

    What should I do?
    RuntimeError Traceback (most recent call last)
    Input In [3], in ()
    ----> 1 spaark=SparkSession.builder.appName('Pactise').getOrCreate()
    File ~\anaconda3\lib\site-packages\pyspark\sql\session.py:269, in SparkSession.Builder.getOrCreate(self)
    267 sparkConf.set(key, value)
    268 # This SparkContext may be an existing one.
    --> 269 sc = SparkContext.getOrCreate(sparkConf)
    270 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    271 # by all sessions.
    272 session = SparkSession(sc, options=self._options)
    File ~\anaconda3\lib\site-packages\pyspark\context.py:483, in SparkContext.getOrCreate(cls, conf)
    481 with SparkContext._lock:
    482 if SparkContext._active_spark_context is None:
    --> 483 SparkContext(conf=conf or SparkConf())
    484 assert SparkContext._active_spark_context is not None
    485 return SparkContext._active_spark_context
    File ~\anaconda3\lib\site-packages\pyspark\context.py:195, in SparkContext.__init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls, udf_profiler_cls)
    189 if gateway is not None and gateway.gateway_parameters.auth_token is None:
    190 raise ValueError(
    191 "You are trying to pass an insecure Py4j gateway to Spark. This"
    192 " is not allowed as it is a security risk."
    193 )
    --> 195 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    196 try:
    197 self._do_init(
    198 master,
    199 appName,
    (...)
    208 udf_profiler_cls,
    209 )
    File ~\anaconda3\lib\site-packages\pyspark\context.py:417, in SparkContext._ensure_initialized(cls, instance, gateway, conf)
    415 with SparkContext._lock:
    416 if not SparkContext._gateway:
    --> 417 SparkContext._gateway = gateway or launch_gateway(conf)
    418 SparkContext._jvm = SparkContext._gateway.jvm
    420 if instance:
    File ~\anaconda3\lib\site-packages\pyspark\java_gateway.py:106, in launch_gateway(conf, popen_kwargs)
    103 time.sleep(0.1)
    105 if not os.path.isfile(conn_info_file):
    --> 106 raise RuntimeError("Java gateway process exited before sending its port number")
    108 with open(conn_info_file, "rb") as info:
    109 gateway_port = read_int(info)
    RuntimeError: Java gateway process exited before sending its port number

  • @raghunandanreddy9324
    @raghunandanreddy9324 3 years ago +4

    Exception: Java gateway process exited before sending its port number

    • @vivekn993
      @vivekn993 3 years ago +1

      how did u solve the error?

    • @raghunandanreddy9324
      @raghunandanreddy9324 3 years ago

      @@vivekn993 Not yet solved. If you know, please help me

    • @migueljr6147
      @migueljr6147 3 years ago

      @@raghunandanreddy9324 Did you install Java? I think that's the problem... because I've never used Java before and I'm having the same error.

    • @kazekagetech988
      @kazekagetech988 2 years ago

      @@migueljr6147 did you solve it?

  • @hounddog1
    @hounddog1 1 year ago +1

    A good tutorial, but it would also be really great to have a tutor with clear English, since it is a tutorial in the English language. For somebody who is not a native speaker and also not an Indian used to this kind of accent, it sometimes becomes an extra challenge.
    P.S. No need to make a racial issue out of it, it's just a simple truth.

  • @nagarjunp23
    @nagarjunp23 3 years ago +29

    You guys are literally reading everyone's mind. Just yesterday I searched for pyspark tutorial and today it's here. Thank you so much. ❤️

  • @rayyanamir8560
    @rayyanamir8560 3 years ago +2

    Is this info of pyspark enough to get a relevant job?

  • @thecaptain2000
    @thecaptain2000 1 year ago +1

    in your example, df_pyspark.na.fill('missing value').show() replaces null values with "missing value" just in the "Name" column

  • @sqlacademy
    @sqlacademy 1 year ago +1

    Hi Krish, I am getting the below error:
    RuntimeError Traceback (most recent call last)
    ~\AppData\Local\Temp\ipykernel_15708\3930265577.py in
    ----> 1 spark=SparkSession.builder.appName('Test').getOrCreate()
    ~\anaconda3\lib\site-packages\pyspark\sql\session.py in getOrCreate(self)
    475 sparkConf.set(key, value)
    476 # This SparkContext may be an existing one.
    --> 477 sc = SparkContext.getOrCreate(sparkConf)
    478 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    479 # by all sessions.
    Can you please help?

  • @dipakkuchhadiya9333
    @dipakkuchhadiya9333 3 years ago +3

    I like it 👌🏻
    We request you to make a video on blockchain programming.

  • @yitezeng1035
    @yitezeng1035 2 years ago +13

    I have to say, it is nice and clear. The pace is really good as well. There are many tutorials online that are either too fast or too slow.

  • @dataisfun4964
    @dataisfun4964 1 year ago +7

    Hi krishnaik,
    All I can say is: just beautiful. I followed from start to finish and you were amazing. I was most interested in the transformation and cleaning aspect, and you did it justice. I realized some lines of code didn't work like yours, but thanks to Google for the rescue.
    This is a great resource for an introduction to PySpark; keep up the good work.

  • @baneous18
    @baneous18 1 year ago +4

    42:17 Here the 'Missing values' string is only replaced in the 'Name' column, not anywhere else. Even if I specify the column names as 'age' or 'experience', it's not replacing the null values in those columns.

    • @Star.22lofd
      @Star.22lofd 1 year ago

      Lemme know if you get the answer

    • @WhoForgot2Flush
      @WhoForgot2Flush 27 days ago

      Because they are not strings. If you cast the other columns to strings it will work as you expect, but I wouldn't do that; just keep them as ints.

  • @SporteeGamer
    @SporteeGamer 3 years ago +8

    Thank you so much for giving us these types of courses for free

  • @Uboom123
    @Uboom123 3 years ago +22

    Hey Krish, thanks for the simple training on pyspark. Can you add a sample video on merging data frames? And adding rows to a data frame?

  • @migueljr6147
    @migueljr6147 3 years ago +4

    Anyone with the same error "Exception: Java gateway process exited before sending its port number" after spark = SparkSession.builder.appName('Practise').getOrCreate() ?????

    • @pwrtricks
      @pwrtricks 3 years ago

      Yes. I migrated from Jupyter to Google Colab and it worked normally.

    • @migueljr6147
      @migueljr6147 3 years ago +1

      @@pwrtricks I needed to watch another video explaining how to install Java... Apache Spark... configure it... and install pyspark... That was a tough thing to do...

    • @ManoharNathGuptaMann
      @ManoharNathGuptaMann 3 years ago

      Getting the same error, please can anyone help me to resolve this issue

    • @migueljr6147
      @migueljr6147 3 years ago +1

      @@ManoharNathGuptaMann of course man... I watched this video, which is in Portuguese (Brazil), but it is possible to follow the instructions: ruclips.net/video/7tDOUrl7Aoc/видео.html
      That video helped me a lot; good luck buddy

    • @ManoharNathGuptaMann
      @ManoharNathGuptaMann 3 years ago +1

      @@migueljr6147 thanks mate, the issue was resolved once I downloaded the latest Java version

  • @akshayagwl1
    @akshayagwl1 3 years ago +4

    I was having the error "Exception: Java gateway process exited before sending its port number". This was due to having JAVA 17.
    I uninstalled JAVA 17 and installed JAVA 8 and it worked.

    • @Paras7627
      @Paras7627 2 years ago

      Hey, can you send me the syntax to install JAVA 8? I'm facing the issue too.

    • @bansal02
      @bansal02 1 year ago

      Hi, I am still facing the issue, can you please advise me?

  • @yarramneediravindraswamy6804
    @yarramneediravindraswamy6804 1 year ago

    It is showing the below error.
    Could you please help me with this?
    running install_lib
    copying build\lib\pyspark\python\pyspark\shell.py -> C:\Users\HP\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pyspark\python\pyspark
    byte-compiling C:\Users\HP\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pyspark\examples\src\main\python\ml\multiclass_logistic_regression_with_elastic_net.py to multiclass_logistic_regression_with_elastic_net.cpython-310.pyc
    error: [Errno 2] No such file or directory: 'C:\\Users\\HP\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python310\\site-packages\\pyspark\\examples\\src\\main\\python\\ml\\__pycache__\\multiclass_logistic_regression_with_elastic_net.cpython-310.pyc.2644507691136'
    [end of output]
    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    × Encountered error while trying to install package.
    ╰─> pyspark
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
    C:\Users\HP>cd C:\Users\HP\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pyspark\examples\src\main\python\ml\__pycache__\multiclass_logistic_regression_with_elastic_net.cpython-310.pyc.2644507691136

  • @sarthak1314
    @sarthak1314 1 year ago +2

    Sorry to say, the instructor only knows how to use things (by making API calls) but lacks a thorough understanding of how things run in the background. I've also observed this in other videos. First learn, then teach.

    • @toyosi_ogunbiyi
      @toyosi_ogunbiyi 6 months ago

      What happens to learning while teaching? 😅

  • @lakshyapratapsigh3518
    @lakshyapratapsigh3518 3 years ago +14

    VERY MUCH HAPPY IN SEEING MY FAVORITE TEACHER COLLABORATING WITH THE FREE CODE CAMP

  • @antonmursid3505
    @antonmursid3505 2 years ago +1

    Antonmursid
    🙏🙏🙏🙏🙏✌🇸🇬🇸🇬🇸🇬🇸🇬🇸🇬✌💝👌✌🙏

  • @Nari_Nizar
    @Nari_Nizar 2 years ago +1

    At 1:09:00 when you try to add Independent feature I get the below error:
    Py4JJavaError Traceback (most recent call last)
    in
    1 output = featureassembler.transform(trainning)
    ----> 2 output.show()
    C:\ProgramData\Anaconda3\lib\site-packages\pyspark\sql\dataframe.py in show(self, n, truncate, vertical)
    492
    493 if isinstance(truncate, bool) and truncate:
    --> 494 print(self._jdf.showString(n, 20, vertical))
    495 else:
    496 try:

  • @TheBarkali
    @TheBarkali 2 years ago +2

    Dear Krish. This is only W.O.N.D.E.R.F.U.L.L 😉.
    Thanks so Much and thanks to professor Hayth.... who showed me the link to your training. Cheers to both of U guys

  • @IvanSedov-i7f
    @IvanSedov-i7f 2 years ago +11

    A wonderful video and a wonderful manner of presenting the material. Thank you very much!

  • @barzhikevil6873
    @barzhikevil6873 3 years ago +4

    For the filling exercise at around minute 42:00, I cannot do it with integer type data; I had to use string data like you did. But then in the next exercise, the one at minute 44:00, the function won't run unless you use integer data for the columns you are trying to fill.

    • @Richard-DE
      @Richard-DE 2 years ago +1

      @@caferacerkid you can try to read with/without inferSchema = True and check the schema, you will see the difference. Try to read again for Imputer.

  • @siddhantbhagat7216
    @siddhantbhagat7216 1 year ago +5

    I am very happy to see krish sir on this channel.

  • @gkranasinghe
    @gkranasinghe 2 years ago

    Truth is, this pyspark tutorial could have been a lot better. It definitely needs a lot of improvement and doesn't meet the standard of the freeCodeCamp channel. Why do I say so? 49:00 minutes in, it's still doing basic pyspark stuff.

  • @yashbhawsar0872
    @yashbhawsar0872 3 years ago +8

    @Krish Naik Sir, just to clarify: at 26:33 I think the Name column min/max is decided by lexicographic order, not by index number.

    • @shankiiz
      @shankiiz 1 year ago

      yep, you are right!

  • @hvbosna
    @hvbosna 1 year ago +1

    Thank you for this video. A lot of effort and time, I appreciated it.
    I have a technical question: @10:23 I had an error like:
    RuntimeError: Java gateway process exited before sending its port number
    I have Java installed and JAVA_HOME set.
    Thank you for your time.

    • @vincentchidiebere1652
      @vincentchidiebere1652 9 months ago

      Hello, have you resolved this? How did you do it? I am encountering the same.

  • @nagarajannethi
    @nagarajannethi 3 years ago +5

    🥺🥺🙌🙌❣️❣️❤️❤️❤️ This is what we need

  • @raunakghosh3552
    @raunakghosh3552 2 years ago +2

    Hi, I tried running type(df_pyspark) but it's giving an output of "NoneType" instead of "DataFrame". Can you please suggest what I should be doing?

    • @ranroun3
      @ranroun3 2 years ago

      having the same issue

    • @ranroun3
      @ranroun3 2 years ago +2

      Just fixed it. Remove the '.show()' from the df_pyspark declaration (line 13 in his version). The .show() command apparently makes the object a NoneType.

    • @raunakghosh3552
      @raunakghosh3552 2 years ago

      @@ranroun3 Thank you. Real helpful tip

  • @kirantodekar3481
    @kirantodekar3481 2 years ago +1

    RuntimeError: Java gateway process exited before sending its port number
    I am getting this error after executing the getOrCreate command

    • @vitazamb3375
      @vitazamb3375 2 years ago

      Me too. Did you manage to solve this problem?

  • @aanchalgujrathi9985
    @aanchalgujrathi9985 3 years ago +1

    Hi, could you please tell me how to skip the header while reading a csv file? .option("header", "False") is not working.

  • @dilandanuka620
    @dilandanuka620 3 months ago

    Can you do a review of the newest promising crypto? So much new stuff is coming out, it's hard to keep track of it all...

  • @suriyab8143
    @suriyab8143 2 years ago +1

    I am getting a Java exception handling error when I try to execute the code.
    I installed Java also.
    I tried to solve it on the net, but I could not.
    Kindly help me to solve this issue

    • @vitazamb3375
      @vitazamb3375 2 years ago

      Me too. Did you manage to solve this problem?

  • @LRondan
    @LRondan 3 years ago +4

    Nice video, could you add timestamps?

  • @trevorweber4771
    @trevorweber4771 9 days ago

    Does anyone have any recommendations for what to look into after finishing this tutorial? Project ideas or more advanced tutorials?

  • @vigneshjaisankar7087
    @vigneshjaisankar7087 2 years ago +1

    RuntimeError: Java gateway process exited before sending its port number - do I need to install Java on my laptop to avoid this error? Kindly help me

    • @sanjaybalikar870
      @sanjaybalikar870 2 years ago

      did you get an answer for this?

    • @vigneshjaisankar7087
      @vigneshjaisankar7087 2 years ago +2

      @@sanjaybalikar870 yeah... Java was not there. Installing Java on my PC resolved the issue

    • @sanjaybalikar870
      @sanjaybalikar870 2 years ago

      @@vigneshjaisankar7087 Thank you

  • @alanhenry9850
    @alanhenry9850 3 years ago +8

    At last, krish naik sir on freecodecamp 😍

  • @roverteam4914
    @roverteam4914 1 month ago

    I got this error when trying to create the session 10:30 "RuntimeError: Java gateway process exited before sending its port number"

  • @akashk2824
    @akashk2824 2 years ago +4

    Thank you so much sir, 100 % satisfied with your tutorial. Loved it.

  • @owaisahmad8336
    @owaisahmad8336 1 year ago +1

    At 1:09:56 when I try to run the following command in my jupyter notebook
    finalized_data=output.select("Independent Features", "Salary")
    I am getting the following error:
    AnalysisException: cannot resolve '`Independent Features`' given input columns: [ Independent Features, Age, Experience, Name, Salary];
    'Project ['Independent Features, Salary#19]
    +- Project [Name#16, Age#17, Experience#18, Salary#19, UDF(struct(Age_double_VectorAssembler_7452ec0fbd38, cast(Age#17 as double), Experience_double_VectorAssembler_7452ec0fbd38, cast(Experience#18 as double))) AS Independent Features#48]
    +- Relation [Name#16,Age#17,Experience#18,Salary#19] csv
    Kindly help!!!!

    • @owaisahmad8336
      @owaisahmad8336 1 year ago +1

      Resolved the issue!!
      There was a typo. I had mistakenly given a space at the beginning of the string like this " Independent Features".

  • @cosmeligion
    @cosmeligion 3 years ago +6

    Hi Krish, this is a very helpful video. I have a question: when I try to run pyspark from a jupyter notebook, I always need to import findspark and initialize it. But I saw that you were able to directly import pyspark. What could be the problem?

    • @geethanshr
      @geethanshr 4 months ago

      I think he already downloaded Apache Spark

  • @jagmeetsond6075
    @jagmeetsond6075 3 years ago +1

    Freecodecamp, please cover the concept of recursion thoroughly. Thanks

  • @cherishpotluri957
    @cherishpotluri957 3 years ago +7

    Krish Naik on FCC🤯🔥🔥

  • @mohammedmussadiq8934
    @mohammedmussadiq8934 1 year ago

    Is this Pyspark Tutorial enough for a Data Engineer?

  • @convel
    @convel 3 years ago +1

    In the linear regression part, shouldn't all the categorical cols be transformed into dummy variables? For binary categorical variables it doesn't matter, but which method should be used for multi-category variables? StringIndexer only transforms them into int numbers, which doesn't make any sense for the coef estimation... is there another StringIndexer-like method?

  • @ZohanSyahFatomi
    @ZohanSyahFatomi 1 year ago

    Thank you for the tutorial. I have one question: does Hadoop have a similar role to pyspark? Please let me know.

  • @khangnguyendac7184
    @khangnguyendac7184 11 months ago +1

    42:15 PySpark has now updated na.fill(). It can only fill a value whose type matches the column type. For example, in the video the professor could replace all 4 columns only because all 4 column types are string, the same as "Missing value". This is explained at 43:02.

  • @konstantingorskiy5716
    @konstantingorskiy5716 1 year ago +3

    Used this video to prepare for the tech interview, hope it will help)))

    • @michasikorski6671
      @michasikorski6671 1 year ago +1

      Is this enough to say that you know spark/databricks?

  • @MrAhmedUA
    @MrAhmedUA 3 years ago +1

    you can also do this:
    spark.read.csv('user.csv', header=True)

  • @shubhamgupta6408
    @shubhamgupta6408 3 years ago +1

    I am trying to initialize my spark setup but I am facing this error. I did look at stackoverflow but the issue remains the same. Can anyone help?
    Python: Current version 3.8
    Pyspark :3.0.3
    Py4JError: org.apache.spark.api.python.PythonUtils.getPythonAuthSocketTimeout does not exist in the JVM

  • @critiquessanscomplaisance8353
    @critiquessanscomplaisance8353 1 year ago +2

    That, for free, is charity, literally! Thanks a lot!!!

  • @sushilkamble8379
    @sushilkamble8379 3 years ago +1

    10:00 | Whoever is getting Exception: Java gateway process exited before sending the driver its port error, Install Java SE 8 (Oracle). The error will be solved.

    • @kazekagetech988
      @kazekagetech988 2 years ago

      Did you solve it, bro? I'm facing it now.

    • @vitazamb3375
      @vitazamb3375 2 years ago

      Me too. Did you manage to solve this problem?

  • @gaurangagarwal9640
    @gaurangagarwal9640 2 years ago

    I am getting an error while starting the SparkSession. Can somebody help me out?

    • @kailashkangne6288
      @kailashkangne6288 2 years ago

      read instruction 3 in the video below
      ruclips.net/video/nSIzZeuC9pY/видео.html&ab_channel=kailashkangne

  • @am010delson.d2
    @am010delson.d2 3 years ago +3

    You have a good heart ❤️

  • @rajvashisthsharma5102
    @rajvashisthsharma5102 2 years ago

    Hi Krish Naik, I am getting this error while using pyspark: "Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
    : java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils". Please help me with this; I have tried all the steps.

  • @Kronu.
    @Kronu. 3 years ago

    I want to be a hacker!
    Where should I start! 🥺
    Got scammed 1200$, it's payback time so I need FreeCodeCamp *Sensei* to teach me the Legendary art of Hacking!! ❤️❤️

  • @christsciple
    @christsciple 2 years ago +1

    I receive the following error: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ when trying to run spark = SparkSession.builder.appName('Practise').getOrCreate()
    Researching on Google suggests it's an issue with the version of the Java JDK I'm running. I've tried 18, 11, and now 8 and run into the same issue. Anyone know the solution?

    • @kailashkangne6288
      @kailashkangne6288 2 years ago

      read instruction 3 in the video below
      ruclips.net/video/nSIzZeuC9pY/видео.html&ab_channel=kailashkangne

  • @pushtikapadia542
    @pushtikapadia542 2 years ago

    ParseException:
    mismatched input '>=' expecting {<EOF>, '-'}(line 1, pos 14)
    == SQL ==
    No. of Orders >=10
    --------------^^^
    can anyone please help me to solve this error?

  • @Nivedavelan
    @Nivedavelan 10 months ago

    Hi, I am trying to do pip install pyspark and getting the error 'ValueError: Cannot set verify_mode to CERT_NONE when check_hostname is enabled'. Please suggest.

  • @TheRedGauntlet
    @TheRedGauntlet 2 years ago +1

    Thank you for this. But I'm having a weird problem where I import a csv file and everything ends up inside one column. I tried making a dataset in Excel and even downloading a ready-made one, and it still kept importing as one column.

  • @thatwasavailable
    @thatwasavailable 2 years ago

    On my notebook, it is only replacing null values as 'missing values' on Name column, on others it is still showing null. What could be the issue ?

  • @gauravchaturvedi3615
    @gauravchaturvedi3615 7 months ago

    Sorry to be naive, but I am just starting. It seems I first need to have a Jupyter notebook for Python installed on my computer before I can pip install Spark? Is there any recommended way to get Jupyter installed for the learning use case?

  • @rohitvisave4447
    @rohitvisave4447 1 year ago

    At 10:00, while executing
    Spark=SparkSession.builder.appname().
    it gives me a FileNotFound error!!!
    Can you please help me?

  • @lahariprogram4317
    @lahariprogram4317 1 year ago

    Hi, can anyone help me out with a question:
    while learning PySpark, do I need basic knowledge of ML, or can I jump directly into the PySpark library with no background beyond Python?

  • @skateforlife3679
    @skateforlife3679 2 years ago +7

    Nice video, clear and precise. But it would be better with a richer dataset, to show more options in the data analysis (grouping more columns, max(column), etc.)

  • @ccuny1
    @ccuny1 3 years ago +5

    Yet another excellent offering. Thank you so much.

  • @tech-n-data
    @tech-n-data 6 months ago

    42:11 As of 3/9/24, na.fill or fillna will not fill integer columns with a string.
    51:31 also df_pyspark.filter('Salary<=15000')

  • @troy5842
    @troy5842 1 year ago

    I am facing a problem while creating the spark session... It's showing me "Java gateway process exited before sending its port number". Please tell me why this issue is arising and what's the solution.

  • @stkmgr00
    @stkmgr00 1 year ago

    I am using ipynb
    at line "spark=SparkSession.builder.appName('test').getOrCreate()"
    getting following error.
    Py4JError: An error occurred while calling None.org.apache.spark.sql.SparkSession. Trace:
    py4j.Py4JException: Constructor org.apache.spark.sql.SparkSession([class org.apache.spark.SparkContext, class java.util.HashMap]) does not exist
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)