All videos in this pyspark interview playlist are highly useful Sagar. Big Thanks for your efforts man!!
Use the pivot function on the Subject column to get a new column for each distinct value in that column, and apply the aggregate function sum on Marks. Note that the order of the Eng/Math columns may not be the same.
df1=df.groupBy("Name").pivot("sub").agg(sum("Marks"))
Using CASE WHEN is better, as collect_list may give incorrect output when a subject mark is null or missing for a person. Even in production queries, CASE WHEN would be used in such scenarios. 🙂
Yeah could be
Hi Sagar, to master PySpark, which of your courses should I buy?
Very useful, Sagar.
Was this asked in Tiger Analytics (Canada)?
pivoted_df = df.groupBy("Name").pivot("Subject").agg({"Marks": "first"})
pivoted_df.show()
This is the better answer
I don't know why he didn't use pivot function and made solution complicated. LoL
What does {"Marks": "first"} at the end mean?
@@prachideokar7639, first is an aggregate function that picks the first value encountered for each group.
In the above case:
GroupBy value: Name (Rudra)
Pivot value: Sub (math)
Marks: the first value for that combination is 79.
I tried the below:
df = df.groupBy(F.col("Name")).pivot(F.col("sub")).agg(F.max(F.col("Marks")))
df.show()
but it throws an error: jgd = self._jgd.pivot(pivot_col) Column is not iterable
My Solution :
df.withColumn("math",when(df.Subject=="math",df.Marks).otherwise(0))\
.withColumn("eng",when(df.Subject=="eng",df.Marks).otherwise(0))\
.groupBy("Name").agg(max("math").alias("math"),max("eng").alias("eng")).show()
df1 = df.groupBy("Name").pivot("sub").agg({"Marks": "last"})
df1.show()
This code will give you a separate column for each subject, irrespective of how many subjects the Sub column contains.
Hi Sir
My Way:
df1 = df.groupBy("Name").pivot("Sub").agg(first(col("Marks")))
df2 = df1.select("Name", "math", "eng").orderBy(col('math').desc(),col('eng').desc())
df2.show()
Sagar, I had a query: when using the collect_list command, we have to sort the dataset by subject first, right?
My Solution:
df_1 = spark.createDataFrame(data=data,schema=["Name","Sub","Marks"])
df_2 = df_1.groupBy(col("Name")).pivot("Sub",["math","eng"]).agg(sum("Marks"))
or,
df_1.createOrReplaceTempView("Pivot_Data")
display(spark.sql("Select Name, SUM(CASE WHEN sub like 'math' THEN Marks ELSE 0 END) as Math, SUM(CASE WHEN sub like 'eng' THEN Marks ELSE 0 END) as Eng from Pivot_Data GROUP BY Name"))
You are right, collect_list can return values in random order.
Please make Azure Databricks videos in English, required please.
df.groupBy("Name").agg(max(when(df.Sub=='math',df.Marks).otherwise(0)).alias("Math"),max(when(df.Sub=='eng',df.Marks).otherwise(0)).alias("eng"))
df.groupBy(df.Name).pivot(df.sub).agg(max(df.marks)).show()
Or
df.groupBy(df.Name).pivot(df.sub).agg(first(df.marks)).show()
It is giving me the "Column is not iterable" error. Can you please suggest a fix?
The pivot function accepts only a column name string. Try passing the name as a string instead of using dot notation. @@vasisultan8896
df.groupBy("Name").pivot("sub").agg(max("marks"))
or
df.groupBy("Name").pivot("sub").agg(first("marks"))
try this one@@vasisultan8896
Why max("marks")?
My solution
result = df.groupBy('Name').agg(collect_list(col('sub')).alias('subs'),collect_list(col('Marks')).alias('marks'))
col_name = [result.select('subs').first()[0][0],result.select('subs').first()[0][1]]
result_df = result.withColumn(col_name[0],col('marks')[0]) \
.withColumn(col_name[1],col('marks')[1]).select('Name','Math','Eng')
result_df.show()
df.groupby(col("Name")).agg(
sum(when(col("Sub")=="math",col("Marks")).otherwise(0)).alias("maths"),
sum(when(col("Sub")=="eng",col("Marks")).otherwise(0)).alias("eng")
).show()
Good approach
df.groupBy("Name").pivot("Sub").sum("Marks")
df.groupBy('name').pivot('Sub', ['math','eng']).sum('Marks').display()
df_sub1 = df_sub.groupBy('Name').agg(collect_list('Marks').alias('Sub_Marks'))
df_sub1.withColumn('math',df_sub1.Sub_Marks[0]).withColumn('eng',df_sub1.Sub_Marks[1]).select('Name','math','eng').show()
df.groupBy(f.col("Name")).pivot("Sub",[i[0] for i in df.select("Sub").distinct().collect()]).agg(f.sum("Marks"))
Good Solution
df.groupBy("Name").pivot("Sub").agg(first("Marks")).orderBy("Name").show()
df.groupBy("Name").pivot("Sub").sum("Marks").show()