Pyspark Scenarios 11 : how to handle double delimiter or multi delimiters in pyspark

  • Published: 17 Oct 2024
  • Pyspark Scenarios 11 : how to handle double delimiter or multi delimiters in pyspark #pyspark
    Pyspark Interview question
    Pyspark Scenario Based Interview Questions
    Pyspark Scenario Based Questions
    Scenario Based Questions
    #PysparkScenarioBasedInterviewQuestions
    #ScenarioBasedInterviewQuestions
    #PysparkInterviewQuestions
    Notebook location :
    github.com/rav...
    Complete Pyspark Real Time Scenarios Videos.
    Pyspark Scenarios 1: How to create partition by month and year in pyspark
    • Pyspark Scenarios 1: H...
    pyspark scenarios 2 : how to read variable number of columns data in pyspark dataframe #pyspark
    • pyspark scenarios 2 : ...
    Pyspark Scenarios 3 : how to skip first few rows from data file in pyspark
    • Pyspark Scenarios 3 : ...
    Pyspark Scenarios 4 : how to remove duplicate rows in pyspark dataframe #pyspark #Databricks
    • Pyspark Scenarios 4 : ...
    Pyspark Scenarios 5 : how read all files from nested folder in pySpark dataframe
    • Pyspark Scenarios 5 : ...
    Pyspark Scenarios 6 How to Get no of rows from each file in pyspark dataframe
    • Pyspark Scenarios 6 Ho...
    Pyspark Scenarios 7 : how to get no of rows at each partition in pyspark dataframe
    • Pyspark Scenarios 7 : ...
    Pyspark Scenarios 8: How to add Sequence generated surrogate key as a column in dataframe.
    • Pyspark Scenarios 8: H...
    Pyspark Scenarios 9 : How to get Individual column wise null records count
    • Pyspark Scenarios 9 : ...
    Pyspark Scenarios 10:Why we should not use crc32 for Surrogate Keys Generation?
    • Pyspark Scenarios 10:W...
    Pyspark Scenarios 11 : how to handle double delimiter or multi delimiters in pyspark
    • Pyspark Scenarios 11 :...
    Pyspark Scenarios 12 : how to get 53 week number years in pyspark extract 53rd week number in spark
    • Pyspark Scenarios 12 :...
    Pyspark Scenarios 13 : how to handle complex json data file in pyspark
    • Pyspark Scenarios 13 :...
    Pyspark Scenarios 14 : How to implement Multiprocessing in Azure Databricks
    • Pyspark Scenarios 14 :...
    Pyspark Scenarios 15 : how to take table ddl backup in databricks
    • Pyspark Scenarios 15 :...
    Pyspark Scenarios 16: Convert pyspark string to date format issue dd-mm-yy old format
    • Pyspark Scenarios 16: ...
    Pyspark Scenarios 17 : How to handle duplicate column errors in delta table
    • Pyspark Scenarios 17 :...
    Pyspark Scenarios 18 : How to Handle Bad Data in pyspark dataframe using pyspark schema
    • Pyspark Scenarios 18 :...
    Pyspark Scenarios 19 : difference between #OrderBy #Sort and #sortWithinPartitions Transformations
    • Pyspark Scenarios 19 :...
    Pyspark Scenarios 20 : difference between coalesce and repartition in pyspark #coalesce #repartition
    • Pyspark Scenarios 20 :...
    Pyspark Scenarios 21 : Dynamically processing complex json file in pyspark #complexjson #databricks
    • Pyspark Scenarios 21 :...
    Pyspark Scenarios 22 : How To create data files based on the number of rows in PySpark #pyspark
    • Pyspark Scenarios 22 :...
    pyspark sql
    pyspark
    hive
    which
    databricks
    apache spark
    sql server
    spark sql functions
    spark interview questions
    sql interview questions
    spark sql interview questions
    spark sql tutorial
    spark architecture
    coalesce in sql
    hadoop vs spark
    window function in sql
    which role is most likely to use azure data factory to define a data pipeline for an etl process?
    what is data warehouse
    broadcast variable in spark
    pyspark documentation
    apache spark architecture
    which single service would you use to implement data pipelines, sql analytics, and spark analytics?
    which one of the following tasks is the responsibility of a database administrator?
    google colab
    case class in scala
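
Before the comments, a minimal sketch of the technique the video covers (the sample rows, column names, and the "||" delimiter are assumptions for illustration; an existing SparkSession named spark is assumed, as in a Databricks notebook): on older Spark versions the CSV reader only accepts a single-character separator, so read the double-delimited text and split it yourself.

    from pyspark.sql.functions import split, col

    # In-memory stand-in for a "||"-delimited file, so the sketch is self-contained.
    raw = spark.createDataFrame(
        [("1001||Ravi||3000",), ("1002||Raju||4000",)], ["value"]
    )

    # split() takes a Java regex, so the pipes must be escaped.
    parts = split(col("value"), "\\|\\|")

    df = raw.select(
        parts.getItem(0).alias("id"),
        parts.getItem(1).alias("name"),
        parts.getItem(2).alias("salary"),
    )
    df.show()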

Comments • 30

  • @pokemongatcha122
    @pokemongatcha122 1 year ago +2

    Hi Ravi, I'm trying to split a column by a delimiter where each cell has a different number of commas. Can you write code to split the column at each occurrence of a comma? E.g. if row 1 has 4 commas it generates 4 columns, but row 2 has 10 commas, so it further generates another 6 columns.
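
A minimal sketch for the question above (df, the column name data_col, and an existing SparkSession are assumptions): split on the comma, take the largest piece count across rows, and generate that many columns in one select, so rows with fewer commas simply get nulls.

    from pyspark.sql.functions import split, size, col
    from pyspark.sql.functions import max as max_

    # Split the comma-separated column into an array per row.
    df = df.withColumn("parts", split(col("data_col"), ","))

    # The largest number of pieces in any row decides how many columns to create.
    n = df.select(max_(size(col("parts")))).collect()[0][0]

    # One column per position; shorter rows are padded with nulls automatically.
    df_wide = df.select(
        "*", *[col("parts").getItem(i).alias("col_" + str(i)) for i in range(n)]
    )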

  • @sravankumar1767
    @sravankumar1767 2 years ago +2

    Nice explanation 👌 👍 👏

  • @rajeshk1276
    @rajeshk1276 2 years ago +2

    Very well explained. Loved it!

  • @gobinathmuralitharan1997
    @gobinathmuralitharan1997 2 years ago +1

    Clear explanation 👍👏thank you 🙂

  • @tanushreenagar3116
    @tanushreenagar3116 2 years ago +1

    Nice

  • @prabhakaranvelusamy
    @prabhakaranvelusamy 2 years ago +1

    Excellent explanation!

  • @udaynayak4788
    @udaynayak4788 1 year ago

    Hi Ravi, I have a .txt file with multiple consecutive spaces as the delimiter, e.g. accountID Acctnbm acctadd branch and so on. I have almost 76 columns separated this way. Can you please suggest an approach?
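
A minimal sketch for the question above (the file path and the first few column names are assumptions): read the file as plain text and split each line on a run of whitespace, then extend the same pattern to the remaining columns.

    from pyspark.sql.functions import split, col

    # Each line of the .txt file arrives as a single string column named "value".
    raw = spark.read.text("/path/to/accounts.txt")

    # "\\s+" matches one or more consecutive spaces/tabs between fields.
    parts = split(col("value"), "\\s+")

    df = raw.select(
        parts.getItem(0).alias("accountID"),
        parts.getItem(1).alias("Acctnbm"),
        parts.getItem(2).alias("acctadd"),
        parts.getItem(3).alias("branch"),
        # ...repeat up to getItem(75) for all 76 columns, or build the list in a loop
    )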

  • @JustForFun-oy8fu
    @JustForFun-oy8fu 1 year ago

    Hi Ravi, thanks. I have one doubt: how can we generalize the above logic? If we end up with a large number of columns after splitting the data, it's obvious we can't do it manually.
    What could be our approach in that case?
    Thanks,
    Anonymous

    • @DanishAnsari-hw7so
      @DanishAnsari-hw7so 1 year ago

      # Case 1. when no of columns is known
      col = 4
      i = 0
      while i < col:
          df_multi = df_multi.withColumn("sub" + str(i), df_multi["marks_split"][i])
          i += 1
      df_1 = df_multi.drop("marks").drop("marks_split")
      display(df_1)

    • @DanishAnsari-hw7so
      @DanishAnsari-hw7so 1 year ago

      # Case 2. when no of columns is not known
      from pyspark.sql.functions import max, size
      df_multi = df_multi.withColumn('marks_size', size('marks_split'))
      max_size = df_multi.select(max('marks_size')).collect()[0][0]
      j = 0
      while j < max_size:
          df_multi = df_multi.withColumn("subject" + str(j), df_multi["marks_split"][j])
          j += 1
      df_2 = df_multi.drop("marks").drop("marks_split").drop('marks_size')
      display(df_2)

  • @NaveenKumar-kb2fm
    @NaveenKumar-kb2fm 1 year ago

    Very well explained. I have a scenario with schema (id, name, age, technology) and data arriving in a single row like (1001|Ram|28|Java|1002|Raj|24|Database|1004|Jam|28|DotNet|1005|Kesh|25|Java) in a single csv file.
    Can we turn it into multiple rows as per the schema, as a single table like below?
    id,name,age,technology
    1001|Ram|28|Java
    1002|Raj|24|Database
    1004|Jam|28|DotNet
    1005|Kesh|25|Java
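
A minimal sketch for the scenario above, assuming every record is exactly 4 fields wide and an existing SparkSession named spark: explode each field with its position, then regroup every 4 consecutive fields into one row.

    from pyspark.sql.functions import split, posexplode, col, floor, first

    # The single incoming row, inlined here so the sketch is self-contained.
    raw = spark.createDataFrame(
        [("1001|Ram|28|Java|1002|Raj|24|Database|1004|Jam|28|DotNet|1005|Kesh|25|Java",)],
        ["value"],
    )

    # One output row per field, together with its position in the original string.
    fields = raw.select(posexplode(split(col("value"), "\\|")).alias("pos", "val"))

    # Every block of 4 positions is one record; pivot the 4 offsets back into columns.
    records = (
        fields.withColumn("rec", floor(col("pos") / 4))
              .withColumn("idx", col("pos") % 4)
              .groupBy("rec")
              .pivot("idx", [0, 1, 2, 3])
              .agg(first("val"))
              .orderBy("rec")
              .selectExpr("`0` as id", "`1` as name", "`2` as age", "`3` as technology")
    )
    records.show()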

  • @V-Barah
    @V-Barah 1 year ago

    This looks simple in the example, but in real time we can't do a withColumn for each column if there are 200-300 columns.
    Is there any other way?

    • @DanishAnsari-hw7so
      @DanishAnsari-hw7so 1 year ago +1

      # Case 1. when no of columns is known
      col = 4
      i = 0
      while i < col:
          df_multi = df_multi.withColumn("sub" + str(i), df_multi["marks_split"][i])
          i += 1
      df_1 = df_multi.drop("marks").drop("marks_split")
      display(df_1)

    • @DanishAnsari-hw7so
      @DanishAnsari-hw7so 1 year ago +1

      # Case 2. when no of columns is not known
      from pyspark.sql.functions import max, size
      df_multi = df_multi.withColumn('marks_size', size('marks_split'))
      max_size = df_multi.select(max('marks_size')).collect()[0][0]
      j = 0
      while j < max_size:
          df_multi = df_multi.withColumn("subject" + str(j), df_multi["marks_split"][j])
          j += 1
      df_2 = df_multi.drop("marks").drop("marks_split").drop('marks_size')
      display(df_2)
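
As a follow-up to the replies above (same assumptions about df_multi, marks_split, and max_size as in the snippet): the while loop of withColumn calls can be collapsed into a single select built with a list comprehension, which avoids adding one projection per column and scales better to hundreds of columns.

    # Same result as Case 2 above, produced in one pass instead of a withColumn loop.
    df_flat = df_multi.select(
        "*", *[df_multi["marks_split"][i].alias("subject" + str(i)) for i in range(max_size)]
    ).drop("marks", "marks_split", "marks_size")
    display(df_flat)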

  • @penchalaiahnarakatla9396
    @penchalaiahnarakatla9396 2 years ago +1

    Hi, good video. One clarification: while writing the dataframe output to CSV, leading zeros are missing. How to handle this scenario? If possible, make a video on this.
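
A minimal sketch for the leading-zeros question (the column name acct_id, the width 6, and the output path are assumptions): leading zeros only survive in string columns, so keep the column as a string, or re-pad it with lpad just before writing.

    from pyspark.sql.functions import lpad, col

    # Cast to string and left-pad to a fixed width, so 123 is written as "000123".
    df_out = df.withColumn("acct_id", lpad(col("acct_id").cast("string"), 6, "0"))

    df_out.write.mode("overwrite").option("header", True).csv("/path/to/output")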

  • @fratkalkan7850
    @fratkalkan7850 2 years ago

    perfection

  • @gobinathmuralitharan1997
    @gobinathmuralitharan1997 2 years ago +1

    Subscribed 🔔

  • @snagendra5415
    @snagendra5415 2 years ago +1

    Could you explain the Spark small files problem using PySpark?
    Thank you in advance

    • @TRRaveendra
      @TRRaveendra 2 years ago +2

      Sure, I will do a video on the small files problem.

    • @snagendra5415
      @snagendra5415 2 years ago

      @@TRRaveendra thank you for your reply, and waiting for the video 🤩

  • @vikrammore-y4t
    @vikrammore-y4t 1 year ago

    Spark 3.x supports a multi-character delimiter, like .option("delimiter","||")
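
A minimal sketch of that option on Spark 3.x (the file path and header setting are assumptions):

    # On Spark 3.x the CSV source accepts a multi-character delimiter directly.
    df = (
        spark.read
             .option("header", True)
             .option("delimiter", "||")
             .csv("/path/to/double_delimiter_file.csv")
    )
    df.show()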

  • @dinsan4044
    @dinsan4044 1 year ago

    Hi,
    Could you please create a video on combining the 3 csv data files below into one data frame dynamically?
    File name: Class_01.csv
    StudentID Student Name Gender Subject B Subject C Subject D
    1 Balbinder Male 91 56 65
    2 Sushma Female 90 60 70
    3 Simon Male 75 67 89
    4 Banita Female 52 65 73
    5 Anita Female 78 92 57
    File name: Class_02.csv
    StudentID Student Name Gender Subject A Subject B Subject C Subject E
    1 Richard Male 50 55 64 66
    2 Sam Male 44 67 84 72
    3 Rohan Male 67 54 75 96
    4 Reshma Female 64 83 46 78
    5 Kamal Male 78 89 91 90
    File name: Class_03.csv
    StudentID Student Name Gender Subject A Subject D Subject E
    1 Mohan Male 70 39 45
    2 Sohan Male 56 73 80
    3 shyam Male 60 50 55
    4 Radha Female 75 80 72
    5 Kirthi Female 60 50 55
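
A minimal sketch for the request above (file paths are assumptions; requires Spark 3.1+ for allowMissingColumns): read each class file and union them by column name, so subjects missing from a class come through as nulls.

    from functools import reduce

    # Paths to the three class files are assumptions for illustration.
    paths = ["/path/Class_01.csv", "/path/Class_02.csv", "/path/Class_03.csv"]
    dfs = [spark.read.option("header", True).csv(p) for p in paths]

    # unionByName aligns columns by name; allowMissingColumns fills absent subjects with nulls.
    combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)
    combined.show()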