2nd Data Engineering Interview | Apache Spark Interview | Live Big Data Interview

  • Published: 8 Sep 2024
  • This video is part of the Spark Interview Questions Series.
    Many subscribers have asked me to show what an actual Big Data interview looks like. In this video we cover what typically happens in a Big Data or data engineering interview.
    There will be more videos covering different aspects of Data Engineering Interviews.
    Here are a few Links useful for you
    Git Repo: github.com/har...
    Spark Interview Questions: • Spark Interview Questions
    If you are interested in joining our community, please join the following groups:
    Telegram: t.me/bigdata_hkr
    Whatsapp: chat.whatsapp....
    You can drop me an email for any queries at
    aforalgo@gmail.com
    #apachespark #sparktutorial #bigdata
    #spark #hadoop #spark3 #bigdata #dataengineer

Comments • 63

  • @ansarhayat6276
    @ansarhayat6276 3 years ago +6

    1. What are your current working tasks?
    2. What type of problems have you faced in your current task?
    3. When loading data into a data lake, which challenges did you face?
    4. How do you handle incremental data: by batch or by stream? What is the size of the daily processed data?
    5. Scenario: sales data grouped by product category per hour; the report needs half historical plus half real-time data.
    6. Which tools could be used for the above scenario? Kafka, Event Hub?
    7. How do you do transformations in Kafka?
    ****Hive****
    8. How do Hive external and internal tables differ? Give a use case.
    9. When do you use static vs. dynamic partitioning in a Hive table?
    10. A daily transactional table has year and date columns and we can partition on either one; is that static or dynamic partitioning?
    Solution: we partition on the date column, which is dynamic; each day's data lands in that day's partition.
    ***Spark***
    11. Why and which language do you use with Spark? Describe its benefits.
    12. Have you used DataFrames and Datasets? Any runtime errors with DataFrames/Datasets? Give an example.
    13. What are Spark Encoders?
    14. To process 1 TB of data with Spark, how do you distribute memory across cores, the driver, and the executors?
    15. What is the difference between a Scala case class and a regular class?
    ***DB***
    16. Have you worked with any non-relational DB?
    17. Given a table with three columns, show one row of data per group using grouping:
    CREATE DATABASE big_data;
    USE big_data;
    CREATE TABLE user_info
    (user_name NVARCHAR(255), user_age INT, user_loc NVARCHAR(255));
    INSERT INTO user_info (user_name, user_age, user_loc) VALUES ('ansar', 30, 'bang'), ('ansar', 30, 'fsk');
    SELECT * FROM user_info;
    SELECT DISTINCT user_name, user_age FROM user_info;
    SELECT DISTINCT user_name, user_age, user_loc FROM user_info
    GROUP BY user_name, user_age;
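    A rough worked answer to question 14, assuming a hypothetical 20-node cluster (16 cores and 64 GB per node) and the common "~5 cores per executor" rule of thumb; all numbers here are illustrative assumptions, and the right answer always depends on the actual cluster:

```python
# Back-of-the-envelope Spark sizing for 1 TB of input.
# The cluster shape below is an invented example, not a fixed answer.
TB = 1024 ** 4
MB = 1024 ** 2

input_bytes = 1 * TB
partition_bytes = 128 * MB                   # typical HDFS/Parquet split size
num_tasks = input_bytes // partition_bytes   # one task per partition

nodes = 20
cores_per_node = 16
mem_per_node_gb = 64

cores_per_executor = 5                       # rule of thumb for good HDFS throughput
executors_per_node = cores_per_node // cores_per_executor
total_executors = nodes * executors_per_node - 1   # leave one slot for the driver

# Split each node's memory across its executors, keeping ~10% for overhead.
mem_per_executor_gb = int(mem_per_node_gb / executors_per_node * 0.9)

print(f"tasks: {num_tasks}")                 # tasks: 8192
print(f"executors: {total_executors}, {cores_per_executor} cores, "
      f"~{mem_per_executor_gb} GB each")     # executors: 59, 5 cores, ~19 GB each
```

    The point of the exercise is the reasoning, not the numbers: the partition count follows from the input size, and the executor count and memory follow from the cluster shape.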

    • @0yustas0
      @0yustas0 3 years ago +2

      Just for fun with Hive:
      SET hivevar:rnd = CAST(ROUND(RAND()) AS INT);
      SELECT user_name,user_age,collect_list(user_loc)[${hivevar:rnd}] AS c1, MAX(user_loc) AS c2
      FROM user_info
      GROUP BY user_name,user_age;

  • @priyankadhamija886
    @priyankadhamija886 2 years ago +2

    I have seen so many videos, but yours is the best on all topics: precise, and it covers almost all the interview questions.

    • @DataSavvy
      @DataSavvy  2 years ago

      Thanks Priyanka... I am happy that you like it

  • @nikhilv199138
    @nikhilv199138 3 years ago +17

    These types of videos are extremely helpful. If you could prepare a video about Scala interview questions, that would be of great help!!

    • @DataSavvy
      @DataSavvy  3 years ago +4

      Sure Nikhil... That is already in plan.. it is just difficult to find volunteers for Mock Interview

    • @somasundaramvalliappan3851
      @somasundaramvalliappan3851 3 years ago

      Can anyone please help me with sample resumes for Scala? It's very hard for me to find Scala resumes on the internet.

  • @atanu4321
    @atanu4321 3 years ago +16

    Good initiative, Data Savvy; this will really be helpful for those who are preparing for interviews. One suggestion: create a follow-up video where you explain which answers the candidate got right or wrong, and what the correct answers would be in order to gain more acceptance from the interviewer; a kind of analysis of this interview.

    • @DataSavvy
      @DataSavvy  3 years ago +8

      Thanks Atanu... I will plan for that... Your suggestion is very valuable

    • @nikhilv199138
      @nikhilv199138 3 years ago +2

      Exactly

    • @DataSavvy
      @DataSavvy  3 years ago +3

      Point noted... If any of you can volunteer, it will help me create more of these videos...

    • @omkarjoshi3750
      @omkarjoshi3750 3 years ago

      @@DataSavvy Hello sir,
      I am interested in volunteering, but I am a fresher (2020 pass-out). If that is OK, I can volunteer.

    • @manojkumar-oc1sp
      @manojkumar-oc1sp 3 years ago

      @@DataSavvy I am interested in volunteering..

  • @user-fz4mz6ym6f
    @user-fz4mz6ym6f 1 year ago

    Some scenarios where you drop the schema but not the data: 1) reorganizing the database structure, 2) cleaning up unused schemas, 3) rebuilding the schema from scratch.

  • @ankurrunthala
    @ankurrunthala 3 years ago +1

    I wish I had a senior like you, I could learn so much more ❤️ Nice questions ⁉️ ..... Sir, you should also make the answers, especially for the problematic ones ❤️❤️❤️❤️

  • @graceindia3122
    @graceindia3122 3 years ago +3

    Great video. But the candidate seems to have ETL Informatica developer experience, not data engineer experience; he was not able to answer the major Spark questions. 😀 Still, a good initiative, Data Savvy: it helps me test my knowledge of Spark and big data, and I was able to answer many questions.

  • @user-fz4mz6ym6f
    @user-fz4mz6ym6f 1 year ago

    Based on the problem statement, my answer would be: first, use Apache Kafka or Amazon Kinesis to handle the streaming data and dump it into AWS S3, since S3 acts like a data lake. Then use Apache Spark for the essential data processing, and finally ingest the data into AWS Redshift or another data warehouse, using AWS Glue as the ETL tool.

  • @Chittaluri
    @Chittaluri 3 years ago +5

    Thanks a lot team, especially Harjeet; this video boosted my confidence for interviews. Please kindly post more interview videos.

    • @DataSavvy
      @DataSavvy  3 years ago +2

      Thanks Sai... Yes I plan to create more videos

  • @johnsonrajendran6194
    @johnsonrajendran6194 3 years ago +1

    I found this video to be really helpful sir....Please create more such videos🙏

  • @bhavaniv1721
    @bhavaniv1721 3 years ago

    Thank you so much for sharing these kinds of videos; now I really understand how an interview happens 🙏

  • @rahulmaheshwari5582
    @rahulmaheshwari5582 3 years ago +1

    Very informative. Thanks for the video. 🙏

  • @phanidbd7284
    @phanidbd7284 3 years ago +2

    Great, thanks.... Can you please create a video with answers to these questions? It would really help... or add your comments at the end of the video.

    • @DataSavvy
      @DataSavvy  3 years ago +5

      That's a good suggestion... Let me look into this

  • @mamamiakool
    @mamamiakool 3 years ago +3

    Hi Harjit, you are doing a great job for the community. Is there a way I can connect with you on LinkedIn or via email?
    Also, do you plan to conduct similar interviews about Spark Streaming/Kafka?

  • @vibhavaribellutagi9439
    @vibhavaribellutagi9439 3 years ago

    Really helpful. Thanks a lot for the video.

  • @raviyadav-dt1tb
    @raviyadav-dt1tb 9 months ago

    Can you please provide AWS questions for data engineers too? It will be helpful for us, thanks 🙏

  • @yelururao1
    @yelururao1 3 years ago

    Hi sir..
    Please do more videos like this..

  • @prachigupta7688
    @prachigupta7688 3 years ago +3

    In the last question, to combine employee data on name and age and select a random location: if we use groupBy and collect_list, won't it create a list of all locations for each (name, age) group? Shouldn't another function like max or first help in this scenario?

  • @anshusharaf2019
    @anshusharaf2019 7 months ago

    For this scenario-based question, could we build an end-to-end pipeline using Kafka and a Power BI dashboard? E.g., connect to the database with a source connector, use ksqlDB to perform the business-level transformations, store the results in a Kafka topic, and then connect Power BI for the dashboard.
    @dataSavvy or someone, can you check whether my thinking is right?

  • @Iamsatya_1
    @Iamsatya_1 3 years ago

    Can we solve the sales problem using classification? I.e., we could train on our historical data with logistic regression and then predict the sales value using an evaluate function on new data.

  • @ghumredhanu6381
    @ghumredhanu6381 2 years ago

    Sir, can you make a Kafka community?

  • @usharani7125
    @usharani7125 2 years ago

    Harjit, please let me know if you are taking any training sessions.

  • @nitinm1473
    @nitinm1473 3 years ago +1

    What is the function to group by distinct values and select a random value from another column?

    • @ashutoshsamanta4244
      @ashutoshsamanta4244 3 years ago +1

      You can use first()
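      In Spark that would be df.groupBy("user_name", "user_age").agg(first("user_loc")). The same "pick one value per group" idea, sketched in plain Python on the two sample rows from the SQL question above:

```python
from itertools import groupby
from operator import itemgetter

# The sample rows from the interview question: same (name, age), different locations.
rows = [("ansar", 30, "bang"), ("ansar", 30, "fsk")]

# Group by (name, age) and keep the first location seen in each group,
# mirroring Spark's groupBy(...).agg(first(...)).
rows.sort(key=itemgetter(0, 1))  # itertools.groupby needs sorted input
result = [
    (name, age, next(iter(group))[2])
    for (name, age), group in groupby(rows, key=itemgetter(0, 1))
]

print(result)  # [('ansar', 30, 'bang')]
```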

    • @prabhaker9031
      @prabhaker9031 3 years ago

      @@ashutoshsamanta4244 Ah thanks man. I was trying to put a max filter and whatnot

  • @anirbandatta2037
    @anirbandatta2037 3 years ago

    Hi @Data Savvy, can you plan for a senior-level interview, maybe with people with more than 16-20 yrs of experience?

  • @sachinchandanshiv7578
    @sachinchandanshiv7578 2 years ago

    Hi Sir,
    How important is it for a big data engineer to know Snowflake?

  • @ashokkodari5042
    @ashokkodari5042 3 years ago +1

    What would be the use case for dropping the schema instead of truncating the complete table? Only to be able to restore the data in the future, or is there another major reason?

    • @RakeshGupta23
      @RakeshGupta23 3 years ago +3

      The major use case for dropping the schema (i.e., creating an external table) is when you have a storage area outside your Hadoop cluster, e.g., the client wants the data stored in S3 or in MongoDB.

    • @manojkumar-oc1sp
      @manojkumar-oc1sp 3 years ago +1

      @@RakeshGupta23 Thanks bro.. One more question: what will happen if we delete the external table's file folder?

    • @DataSavvy
      @DataSavvy  3 years ago +2

      This is usually done when more than one team is consuming the same data, each using different tech to consume it

    • @RakeshGupta23
      @RakeshGupta23 3 years ago +2

      @@manojkumar-oc1sp You mean keeping the external table schema but deleting the folder and files from HDFS? In that case you won't be able to access the data; when you create a database or table, under the hood it is always just files and folders in HDFS.

    • @ashokkodari5042
      @ashokkodari5042 3 years ago

      @@RakeshGupta23 Thanks Rakesh for your quick response

  • @rashmidogra7792
    @rashmidogra7792 3 years ago

    What is the SQL question he is asking? I could not understand it completely.

  • @yadlapallipriyanka9000
    @yadlapallipriyanka9000 3 years ago

    Hi Harjeet... I am Priyanka and I would like to volunteer for mock interview on BigData

  • @rajasekharreddy7624
    @rajasekharreddy7624 3 years ago

    Hi DataSavvy, please let me know when you are free to discuss the mock interview with me.

  • @paul4367
    @paul4367 3 years ago +1

    Can u call some data analysts for a mock interview too plz??

    • @DataSavvy
      @DataSavvy  3 years ago +1

      I am finding it difficult to get volunteers... Let me explore that

    • @ashleylemos3977
      @ashleylemos3977 3 years ago

      @@DataSavvy I would love to give mock interviews for all of your data engineering questions, in case you are looking for candidates with 12+ years of experience in PySpark, AWS, Spark SQL, Jenkins CI/CD, Glue, Kafka, Python, Hive, Athena, Presto, Bash, Airflow, Nifi.

  • @nikhilmishra7572
    @nikhilmishra7572 3 years ago

    @30.02 what would be the solution? HAVING COUNT(*) > 1 after GROUP BY?
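    That HAVING COUNT(*) > 1 pattern can be sketched end to end with SQLite (the table mirrors the user_info example from the question thread above, with one extra invented row for contrast):

```python
import sqlite3

# In-memory table with duplicate (user_name, user_age) rows,
# modelled on the user_info table from the interview question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE user_info (user_name TEXT, user_age INT, user_loc TEXT)")
con.executemany(
    "INSERT INTO user_info VALUES (?, ?, ?)",
    [("ansar", 30, "bang"), ("ansar", 30, "fsk"), ("ravi", 25, "del")],
)

# GROUP BY ... HAVING COUNT(*) > 1 keeps only the groups that occur more than once.
dupes = con.execute(
    """SELECT user_name, user_age, COUNT(*) AS n
       FROM user_info
       GROUP BY user_name, user_age
       HAVING COUNT(*) > 1"""
).fetchall()

print(dupes)  # [('ansar', 30, 2)]
```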

  • @uditsethia7
    @uditsethia7 2 years ago

    LAMBDA ARCHITECTURE

  • @awanishkumar6308
    @awanishkumar6308 3 years ago

    The sir in the red t-shirt is making the interview questions very uninteresting, even though Spark itself is very interesting in terms of its concepts and working principles. Sorry, but you are ruining the interest of learning.