109. Databricks | PySpark | Coding Interview Question: PySpark and Spark SQL

  • Published: Jan 3, 2025

Comments • 38

  • @prasadtelu9873
    @prasadtelu9873 1 year ago +6

    First time I've seen scenario-based, interview-style solutions in YouTube videos. Thanks for your commitment and for sharing the knowledge.

  • @adiityagupta-wu1tz
    @adiityagupta-wu1tz 1 year ago +3

    Please continue this series; it will be very helpful for cracking interviews. Thanks for starting it.

  • @Vk-gw6ii
    @Vk-gw6ii 8 months ago +1

    Excellent 👌

  • @prabhatgupta6415
    @prabhatgupta6415 1 year ago +1

    Thanks, sir. Please create a playlist of frequently asked coding questions.

  • @harithad1757
    @harithad1757 8 months ago +1

    Can I get the code pasted in the description, or maybe a link to the notebook?

  • @DataEngineering-ni2ot
    @DataEngineering-ni2ot 1 year ago +1

    At 08:10, shouldn't the first row of change_event be 1, since the previous value is not the same as the first row's event_status? Why is it coming out as 0?

    • @johnsonrajendran6194
      @johnsonrajendran6194 9 months ago

      For the first row, the previous value is null, and a null cannot be compared with anything,
      so by default the logic falls through to the else condition, which is 0 in this case.
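
      A minimal sketch of the behavior being described, assuming the column names from the video (event_date, event_status) and illustrative data:

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F
      from pyspark.sql.window import Window

      spark = SparkSession.builder.getOrCreate()

      df = spark.createDataFrame(
          [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost")],
          ["event_date", "event_status"],
      )

      # lag() returns null for the first row; a null comparison is neither
      # true nor false, so when() falls through to otherwise(0).
      w = Window.orderBy("event_date")
      df.withColumn(
          "change_event",
          F.when(F.lag("event_status").over(w) != F.col("event_status"), 1).otherwise(0),
      ).show()
      ```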

  • @jinsonfernandez
    @jinsonfernandez 5 months ago +1

    Thanks for this video. But I am curious why you didn't directly use min/max with a group by, which would have fetched the same result:
    ```python
    from pyspark.sql import functions as F

    result = (
        df.withColumn("event_date", F.to_date("event_date"))
        .groupBy("event_status")
        .agg(
            F.min("event_date").alias("event_start_date"),
            F.max("event_date").alias("event_end_date"),
        )
        .orderBy("event_start_date")
    )
    result.show()
    ```

    • @rajasdataengineering7585
      @rajasdataengineering7585  5 months ago

      Thanks for sharing your approach. Yes, there are various approaches.

    • @boyasuresh5998
      @boyasuresh5998 5 months ago

      This won't work; please check your code.

    • @rajasdataengineering7585
      @rajasdataengineering7585  5 months ago

      It worked

    • @funworld8659
      @funworld8659 4 months ago

      @rajasdataengineering7585 It did not show the last 'Won' run's start date and end date.

    • @beingnagur
      @beingnagur 2 months ago

      How will this approach work? Since we are grouping by event_status alone, the 7th (a second 'Won') will also be grouped into the first run, making the start date the 1st and the end date the 7th. See the sketch below.
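
      A quick way to see this objection, with assumed sample data containing two separate 'Won' runs; the gaps-and-islands trick used in other answers in this thread keeps the runs apart:

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F
      from pyspark.sql.window import Window

      spark = SparkSession.builder.getOrCreate()

      # Assumed sample data: two separate 'Won' runs around a 'Lost' run.
      df = spark.createDataFrame(
          [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost"),
           ("2020-06-04", "Lost"), ("2020-06-05", "Won")],
          ["event_date", "event_status"],
      ).withColumn("event_date", F.to_date("event_date"))

      # Plain groupBy merges both 'Won' runs into one row (start 06-01, end 06-05).
      df.groupBy("event_status").agg(F.min("event_date"), F.max("event_date")).show()

      # Gaps-and-islands: the difference of the two row numbers is constant within
      # each consecutive run, so grouping on it keeps the two 'Won' runs separate.
      grp = (
          F.row_number().over(Window.orderBy("event_date"))
          - F.row_number().over(Window.partitionBy("event_status").orderBy("event_date"))
      )
      (
          df.withColumn("grp", grp)
          .groupBy("event_status", "grp")
          .agg(F.min("event_date").alias("start_date"), F.max("event_date").alias("end_date"))
          .orderBy("start_date")
          .show()
      )
      ```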

  • @prabhatgupta6415
    @prabhatgupta6415 1 year ago +1

    One more suggestion: please do put the dataset in the description.

  • @namratachavan6317
    @namratachavan6317 1 year ago

    Hi sir, could you please share the notebook and the GitHub repository link to access the code?

  • @arrooow9019
    @arrooow9019 1 year ago

    Hi sir, could you please share the notebook and dataset in the description? It will be helpful for our practice. Thanks in advance.

  • @rawat2608
    @rawat2608 1 year ago +1

    Thank you, sir.

  • @saurabh011192
    @saurabh011192 1 year ago +1

    This solution will work only when the dates are in order with respect to events. I tried jumbling them and it didn't work.

    • @rohitsharma-mg7hd
      @rohitsharma-mg7hd 1 month ago

      lol brother, you can apply a sort first in the window function and it will work.
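
      A minimal sketch of that fix, assuming the lag-based approach from the video: the orderBy inside the Window sorts the rows before lag() is evaluated, so a jumbled input order does not matter.

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F
      from pyspark.sql.window import Window

      spark = SparkSession.builder.getOrCreate()

      # Deliberately jumbled input order.
      df = spark.createDataFrame(
          [("2020-06-03", "Lost"), ("2020-06-01", "Won"), ("2020-06-02", "Won")],
          ["event_date", "event_status"],
      ).withColumn("event_date", F.to_date("event_date"))

      # The window's orderBy sorts rows by date before lag() runs, so the
      # change flag follows date order, not the physical input order.
      w = Window.orderBy("event_date")
      df.withColumn(
          "change_event",
          F.when(F.lag("event_status").over(w) != F.col("event_status"), 1).otherwise(0),
      ).orderBy("event_date").show()
      ```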

  • @landchennai8549
    @landchennai8549 9 months ago +1

    Here is my SQL query (T-SQL) for the same:

    ```sql
    declare @Event_Table table (Event_date date, Event_status varchar(8));

    insert into @Event_Table
    select getdate() + value,
           case when value > 3 and value < 7 then 'Lost' else 'Won' end
    from generate_series(1, 10, 1);

    with cte as (
        select *,
               row_number() over (order by Event_date)
             - row_number() over (order by Event_status, Event_date) as GroupId
        from @Event_Table
    )
    select GroupId,
           min(Event_status) as Event_status,
           min(Event_date)   as Start_date,
           max(Event_date)   as End_Date,
           count(1)          as Consecutive_Events
    from cte
    group by GroupId;
    ```
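
    Since the video covers Spark SQL, here is how the same gaps-and-islands query could look on Databricks (a sketch; the table variable and generate_series in the T-SQL above have no direct Spark equivalent, so the sample rows below are assumptions):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed sample rows; the T-SQL above generates similar data with generate_series.
    spark.createDataFrame(
        [("2020-06-01", "Won"), ("2020-06-02", "Won"), ("2020-06-03", "Lost"),
         ("2020-06-04", "Lost"), ("2020-06-05", "Won")],
        ["Event_date", "Event_status"],
    ).createOrReplaceTempView("event_table")

    spark.sql('''
        with cte as (
            select *,
                   row_number() over (order by Event_date)
                 - row_number() over (order by Event_status, Event_date) as GroupId
            from event_table
        )
        select GroupId,
               min(Event_status) as Event_status,
               min(Event_date)   as Start_date,
               max(Event_date)   as End_Date,
               count(1)          as Consecutive_Events
        from cte
        group by GroupId
    ''').show()
    ```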

  • @bonysrivastava7779
    @bonysrivastava7779 8 months ago

    🙌

  • @roshniagrawal4777
    @roshniagrawal4777 5 months ago

    This solution will not work if you have data like this; some tweak may be needed:

    ```python
    data = [
        ("2020-06-01", "Won"),
        ("2020-06-02", "Won"),
        ("2020-06-03", "Won"),
        ("2020-06-03", "Lost"),
        ("2020-06-04", "Lost"),
        ("2020-06-05", "Lost"),
        ("2020-06-06", "Lost"),
        ("2020-06-07", "Won"),
    ]
    ```

  • @starmscloud
    @starmscloud 1 year ago +3

    I did it something like this, using a default date, a running number, and datediff:

    ```python
    from pyspark.sql.functions import to_date, row_number, asc, date_add, lit, datediff, min, max, col
    from pyspark.sql.window import Window

    result = (
        eventDF.withColumn("event_date", to_date("event_date", "dd-MM-yyyy"))
        .withColumn("rank", row_number().over(
            Window.partitionBy("event_status").orderBy(asc("event_date"))))
        # Anchor date + running number: datediff stays constant within each consecutive run.
        .withColumn("startDate", date_add(lit("1900-01-01"), col("rank")))
        .withColumn("datediff", datediff("event_date", "startDate"))
        .groupBy("datediff", "event_status")
        .agg(min("event_date").alias("start_date"), max("event_date").alias("end_date"))
        .drop("datediff")
        .sort("start_date")
    )
    result.show()
    ```

  • @ShivamGupta-wn9mo
    @ShivamGupta-wn9mo 1 month ago +1

    My solution:

    ```python
    from pyspark.sql.functions import dense_rank, min, max
    from pyspark.sql.window import Window

    window_base = Window.orderBy("date")

    # DataFrame API
    df_t = (
        df.withColumn(
            "diff",
            dense_rank().over(window_base)
            - dense_rank().over(window_base.partitionBy("status")),
        )
        .groupBy("status", "diff")
        .agg(min("date").alias("start_date"), max("date").alias("end_date"))
        .orderBy("start_date")
    )
    df_t.show()

    # Spark SQL 1
    df.createOrReplaceTempView("match")
    spark.sql('''
        with cte as (
            select *,
                   dense_rank() over (order by date)
                 - dense_rank() over (partition by status order by date) as diff
            from match
        )
        select status, diff,
               min(date) as start_date,
               max(date) as end_date
        from cte
        group by status, diff
        order by start_date
    ''').show()

    # Spark SQL 2
    spark.sql('''
        with cte as (
            select *,
                   dense_rank() over (order by date) as rn1,
                   dense_rank() over (partition by status order by date) as rn2,
                   dense_rank() over (order by date)
                 - dense_rank() over (partition by status order by date) as diff
            from match
        )
        select a.status,
               max(a.start_date) as start_date,
               max(a.end_date) as end_date
        from (
            select date, status, diff,
                   min(date) over (partition by status, diff) as start_date,
                   max(date) over (partition by status, diff) as end_date
            from cte
        ) a
        group by a.status, a.diff
        order by start_date asc
    ''').show()
    ```
    Enjoy!