The question about designing a pipeline for historical data and then implementing CDC is a very good one. I have not found any other resources with design questions like this. Data design questions of this type, and questions where you are given an API and have to manipulate an existing dataset in Spark, are hard to find, whereas Spark optimization and SQL material is readily available elsewhere.
If you could include more such questions in upcoming videos, it would be very helpful for FAANG prep.
Here are some answers to questions that I think the candidate didn't answer correctly.
Q- To avoid duplication when an ETL job is rerun: use incremental loads, or load the data into staging tables first and run deduplication steps there before writing to the target table.
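A rough sketch of the staging-table idea, in case it helps. This is only one way to do it, using Python's built-in sqlite3 so it runs anywhere. The table names (orders, orders_staging), the columns, and the load_batch helper are all made up for illustration; a real warehouse would more likely use MERGE or its own upsert syntax.

```python
import sqlite3

def load_batch(conn, rows):
    """Load rows via a staging table so that rerunning the job adds no duplicates.

    Hypothetical schema: (order_id, amount, updated_at). order_id is the business key.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders_staging ("
        "order_id INTEGER, amount REAL, updated_at TEXT)"
    )
    # Start every run from an empty staging table.
    cur.execute("DELETE FROM orders_staging")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?, ?)", rows)
    # Deduplicate inside staging (keep the latest row per key), then upsert.
    # INSERT OR REPLACE overwrites an existing key instead of adding a second row.
    cur.execute(
        """
        INSERT OR REPLACE INTO orders (order_id, amount, updated_at)
        SELECT s.order_id, s.amount, s.updated_at
        FROM orders_staging AS s
        JOIN (
            SELECT order_id, MAX(updated_at) AS max_ts
            FROM orders_staging
            GROUP BY order_id
        ) AS latest
          ON s.order_id = latest.order_id AND s.updated_at = latest.max_ts
        """
    )
    conn.commit()

# Usage: running the same batch twice simulates a rerun of the ETL job.
demo_conn = sqlite3.connect(":memory:")
demo_batch = [
    (1, 10.0, "2024-01-01"),
    (1, 12.0, "2024-01-02"),  # duplicate key within the batch, newer version
    (2, 5.0, "2024-01-01"),
]
load_batch(demo_conn, demo_batch)
load_batch(demo_conn, demo_batch)
print(demo_conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

The point is that the write to the target is keyed on order_id, so the job is idempotent: the second run replaces the same two rows instead of appending.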
Q- If a query is taking too much time and resources: check the DDL of the tables to see which indexes exist and whether the filters use indexed columns. Use the explain plan to check whether full table scans are being performed. If so, try rewriting the query so that index-based scans are used. Certain joins, such as cross joins or full outer joins, can also cause the query to run slowly.
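A small sketch of the explain-plan check, again with sqlite3 only because it is easy to run. The orders table, the idx_orders_customer index, and the explain_plan helper are hypothetical names for illustration. The exact plan text differs between SQLite versions and looks different again in Postgres, MySQL, or Spark, so treat the output as indicative only.

```python
import sqlite3

def explain_plan(conn, sql, params=()):
    """Return the plan SQLite reports for a query, as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT amount FROM orders WHERE customer_id = ?"

# 1. No index on customer_id yet: expect a full table scan.
plan_without_index = explain_plan(conn, query, (42,))

# 2. After adding an index on the filter column: expect an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = explain_plan(conn, query, (42,))

# 3. Wrapping the column in an expression stops the index from being used,
#    so the same filter falls back to a scan.
plan_expression = explain_plan(
    conn, "SELECT amount FROM orders WHERE customer_id + 0 = ?", (42,)
)

print(plan_without_index)
print(plan_with_index)
print(plan_expression)
```

The third query is the "write the query so the index can be used" part: the filter has to be on the bare indexed column, not on an expression or function of it.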
Please correct me if I'm wrong.
@shomailnajeeb230 🤝💯👏