Data Validation with Pyspark || Schema Comparison || Dynamically || Real Time Scenario
- Published: 17 Oct 2024
- In this video we cover how to perform a quick data validation: a schema comparison between source and target. In the next video we will look into a date/timestamp format check and a duplicate count check.
Column Comparison link :
• Data Validation with P...
#dataanalytics #dataengineeringessentials #azuredatabricks #dataanalysis #pyspark #pythonprogramming #sql #databricks #Spark #DatabricksNotebook #PySparkLogic
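As a minimal, Spark-free sketch of the idea in the video (all names here, like compare_schemas, are illustrative assumptions; schemas are represented as (column, type) pairs, the shape PySpark's DataFrame.dtypes returns):

```python
# Minimal sketch: compare a source vs. target schema, each given as a list
# of (column_name, type_string) pairs. Names and sample data are illustrative.

def compare_schemas(source_schema, target_schema):
    """Return columns missing on either side and columns whose types differ."""
    src = {name.lower(): dtype.lower() for name, dtype in source_schema}
    tgt = {name.lower(): dtype.lower() for name, dtype in target_schema}
    return {
        "missing_in_target": sorted(set(src) - set(tgt)),
        "missing_in_source": sorted(set(tgt) - set(src)),
        "type_mismatches": [
            (c, src[c], tgt[c])
            for c in sorted(set(src) & set(tgt))
            if src[c] != tgt[c]
        ],
    }

source = [("Row ID", "int"), ("Order Date", "string"), ("Sales", "double")]
target = [("Row ID", "int"), ("Order Date", "date"), ("Profit", "double")]
print(compare_schemas(source, target))
```

With the sample schemas above, the report flags "sales" as missing in the target, "profit" as missing in the source, and "order date" as a string-vs-date type mismatch.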
Thank you for your work!!! It would be amazing if you could enhance the video with "chapters" to add more context to what you explain in the different sections of the video :)!
Great suggestion!
I found this very useful as I had a similar issue with data validations. It helped a lot while completing my project.
Glad it helped!
I think schema comparison is an important topic in PySpark. Great explanation, sir ❤
thank you bro
code in github?
Here is the link bro : drive.google.com/drive/folders/1I6rqtiKh1ChM_dkLJwxfxyxcHWPgyiKZ?usp=sharing
source code ?
from pyspark.sql.functions import col

def SchemaComparison(controldf, spsession, refdf):
    try:
        # iterate over controldf to get each filename and filepath
        for x in controldf.collect():
            filename = x['filename']
            filepath = x['filepath']
            # create a dataframe from the filepath
            print("Creating data frame for {} ({})".format(filepath, filename))
            dfs = spsession.read.format('csv').option('header', True).option('inferSchema', True).load(filepath)
            print("DF created for {} ({})".format(filepath, filename))
            # look up the reference (expected) schema row for this file
            ref_filter = refdf.filter(col('SrcFileName') == filename)
            for ref_row in ref_filter.collect():
                columnNames = ref_row['SrcColumns']      # comma-separated column names
                refTypes = ref_row['SrcColumnType']      # comma-separated expected types
                columnNamesList = [c.strip().lower() for c in columnNames.split(",")]
                refTypesList = [t.strip().lower() for t in refTypes.split(",")]
                # actual types from the dataframe, keyed by lower-cased column name;
                # simpleString() maps StringType() -> 'string', IntegerType() -> 'int'
                dfsSchema = {f.name.strip().lower(): f.dataType.simpleString().lower() for f in dfs.schema.fields}
                dfsTypesList = [dfsSchema.get(c) for c in columnNamesList]
                # compare each column's actual type against its reference type
                mismatchedColumns = [(col_name, df_type, ref_type)
                                     for (col_name, df_type, ref_type) in zip(columnNamesList, dfsTypesList, refTypesList)
                                     if df_type != ref_type]
                if mismatchedColumns:
                    print("Schema comparison failed (mismatch) for {}".format(filename))
                    for col_name, df_type, ref_type in mismatchedColumns:
                        print(f"columnName : {col_name}, DataFrameType : {df_type}, referenceType : {ref_type}")
                else:
                    print("Schema comparison done successfully for {}".format(filename))
        return True
    except Exception as e:
        print("An error occurred : ", str(e))
        return False
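The core check in the function boils down to zipping the expected and actual type lists column by column and keeping only the disagreements. As a standalone, Spark-free sketch (the sample column names and types are illustrative, not from the video):

```python
# Spark-free sketch of the per-column type check: zip column names with the
# actual types (as dfs.schema + simpleString() would yield) and the reference
# types (parsed from the control table), keeping only mismatches.
column_names = ["row id", "order id", "sales"]     # parsed from SrcColumns
actual_types = ["int", "string", "string"]         # from the loaded dataframe
reference_types = ["int", "string", "double"]      # parsed from SrcColumnType

mismatched = [
    (name, actual, expected)
    for name, actual, expected in zip(column_names, actual_types, reference_types)
    if actual != expected
]
print(mismatched)  # [('sales', 'string', 'double')]
```

Here only "sales" is reported, because its inferred type (string) disagrees with the reference type (double) while the other two columns match.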