Data Engineering Interview at top product based company | First Round

  • Published: 29 Sep 2024
  • Data Engineering Mock Interview
    In top product-based companies like #meta #amazon #google #netflix etc., the first round of Data Engineering interviews tests problem-solving skills.
    It mostly consists of screen-sharing sessions in which candidates are expected to solve multiple SQL and DSA problems, particularly in #python. We have tried to replicate the same setup by asking several good SQL and DSA problems that test the candidate's problem-solving skills.
    If you're preparing for a Data Engineering interview, this is the perfect opportunity to enhance your skills and increase your chances of success.
    The mock interview simulates a real-life scenario and provides valuable insights and guidance.
    You'll get to see how professionals tackle technical questions and problem-solving challenges in a structured and efficient manner.
    By watching this mock interview, you'll learn effective strategies for approaching technical questions and problem-solving scenarios, gain familiarity with the data engineering interview process and format, and enhance your communication skills and ability to articulate your thoughts clearly. You'll also identify areas for improvement, receive expert feedback on your performance, boost your confidence, and reduce nervousness for future interviews.
    This mock interview suits all levels of experience, whether you're a fresh graduate, a career changer, or a seasoned professional looking to improve your interview skills. Don't miss out on this invaluable learning experience! Subscribe to our channel and hit the notification bell to be notified when the mock interview is released. Stay tuned for a deep dive into the world of data engineering.
    Subscribe now and be the first to watch the Data Engineering Mock Interview.
    🔅 To book a Mock interview - topmate.io/ank...
    🔅 LinkedIn - / thebigdatashow
    🔅 Instagram - / ranjan_anku
    🔅 Ankur (Organiser)'s LinkedIn profile - / thebigdatashow
    🔅 Ankur Ranjan (Interviewer)'s LinkedIn profile - / thebigdatashow
    🔅 Pragya Jaiswal (Interviewee)'s LinkedIn profile - / pragya-jaiswal-9661b3192
    Chapters:
    #sql #dataengineering #interview #interviewquestions #bigdata #mockinterview #aws #dsa

Comments • 18

  • @sarathkumar-tr3is
    @sarathkumar-tr3is 3 months ago +1

    def distinct_ind(l, k):
        # assumes the task: is there a duplicate value within index distance k
        seen = {}
        for i in range(len(l)):
            if l[i] in seen and abs(i - seen[l[i]]) <= k:
                return True
            seen[l[i]] = i
        return False

  • @mirzataimoor9632
    @mirzataimoor9632 1 month ago

    Python solution:
    l = [2, 4, 2, 8]
    key = 1
    indices = {}  # value -> list of positions where it appears
    for i, j in enumerate(l):
        if not indices.get(j):
            indices[j] = [i]
        else:
            indices[j] = indices.get(j) + [i]

    def get_abs_val(indices, key):
        # return the index gap for the first value that appears more than once
        for k, v in indices.items():
            if len(v) > 1:
                return abs(v[-1] - v[0])

  • @uttamkumarreddygaggenapall2504
    @uttamkumarreddygaggenapall2504 13 days ago +1

    How does Spark benefit from creating another stage for a wide transformation? 38:02

    • @TheBigDataShow
      @TheBigDataShow 13 days ago

      In Apache Spark, creating another stage can provide several benefits that improve the efficiency and performance of data processing. Here’s how Spark benefits from creating additional stages:
      1. Parallelism and Task Granularity: By breaking down a job into more stages, Spark can create smaller tasks that are easier to distribute across the cluster. This increased granularity allows Spark to parallelize the workload more effectively, utilizing more resources and reducing the time to complete the job.
      2. Optimized Shuffling: When Spark creates a new stage, it often indicates that a shuffle operation is necessary (e.g., after a wide transformation like groupByKey or join). By managing shuffles explicitly through stages, Spark can optimize the data shuffling process, including sorting and partitioning data in an efficient manner, which helps in reducing network and disk I/O overhead.
      3. Fault Tolerance: Spark’s fault tolerance is based on the lineage of RDDs (Resilient Distributed Datasets). When a stage is completed, its results are stored and can be reused. If a node fails, Spark can recompute only the affected stages rather than the entire job. This staged execution ensures that recomputation is limited to the failed partitions, improving fault tolerance.
      4. Pipelining and Task Scheduling: By dividing a job into multiple stages, Spark allows for pipelining of operations within a stage. Operations that do not require shuffling (narrow transformations) can be combined into a single task, reducing the overhead of task scheduling and improving execution efficiency.
      5. Resource Management: Stages help in managing and allocating cluster resources more effectively. By breaking down the job into smaller units of work (stages and tasks), Spark can allocate resources dynamically based on the workload of each stage. This enables better utilization of the cluster resources, avoiding situations where some nodes are idle while others are overloaded.
      6. Caching and Reuse: When a stage's output is cached or persisted, subsequent stages that depend on this data can access it directly from memory or disk rather than recomputing it. This caching mechanism reduces the computational overhead and speeds up the execution of subsequent stages that rely on the same dataset.
      In summary, creating another stage in Spark helps optimize data processing by improving parallelism, managing data shuffling efficiently, enhancing fault tolerance, enabling pipelining, and better resource management. This staged approach allows Spark to execute complex workflows more efficiently, leveraging the full power of distributed computing.
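      For illustration, here's a minimal PySpark sketch (assuming a local pyspark installation; names like "stage-demo" are just placeholders) showing how narrow transformations are pipelined into one stage while a wide transformation like reduceByKey forces a shuffle and hence a new stage:

      from pyspark.sql import SparkSession

      # local session just for the demo
      spark = SparkSession.builder.master("local[*]").appName("stage-demo").getOrCreate()
      sc = spark.sparkContext

      rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)
      mapped = rdd.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: pipelined into the same stage
      reduced = mapped.reduceByKey(lambda x, y: x + y)   # wide: shuffle boundary, new stage

      # the lineage shows the shuffle that splits the job into two stages
      print(reduced.toDebugString().decode())
      print(reduced.collect())
      spark.stop()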

    • @uttamkumarreddygaggenapall2504
      @uttamkumarreddygaggenapall2504 13 days ago +1

      @@TheBigDataShow Looks like a Google Gemini answer

    • @TheBigDataShow
      @TheBigDataShow 13 days ago +1

      @@uttamkumarreddygaggenapall2504 Yes, but I think it answers your question. I didn't feel like typing it out, and Gemini has explained it well.
      I have checked and verified the answer, and it looks correct. Try reading every point; if you still have questions, do ask in the comment section.

  • @CctnsHelpdesk
    @CctnsHelpdesk 3 months ago +2

    Parquet stores the file in a hybrid format, not a purely column-based one (80% true)..
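    For illustration, a small sketch (assuming pyarrow is installed; "stores.parquet" is just a placeholder filename) that makes the hybrid layout visible: the file is split horizontally into row groups, and inside each row group the data is stored column by column as column chunks:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # write a tiny table forced into two row groups
    table = pa.table({"store_id": [2, 4, 2, 8], "sales": [10, 20, 30, 40]})
    pq.write_table(table, "stores.parquet", row_group_size=2)

    meta = pq.ParquetFile("stores.parquet").metadata
    print(meta.num_row_groups)            # horizontal slices (row groups)
    print(meta.row_group(0).num_columns)  # column chunks inside one row group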

  • @tejachillapalli8812
    @tejachillapalli8812 3 months ago +2

    def contains_nearby_duplicate(nums, k):
        # assumes the task: is there a duplicate value within index distance k
        dictionary = dict()
        for index, value in enumerate(nums):
            if value in dictionary and abs(dictionary[value] - index) <= k:
                return True
            dictionary[value] = index
        return False

  • @regilenemariano9244
    @regilenemariano9244 1 month ago +1

    Thank you for this interview! Keep it going; thank you.

  • @VishalSharma-lz6ky
    @VishalSharma-lz6ky 3 months ago +2

    Awesome mock interview.
    And the last question was very good:
    how does it save time if you are reading from disk?

  • @mohitbhandari1106
    @mohitbhandari1106 3 months ago +2

    I think the first SQL can be done using GROUP BY as well, instead of a window function.

  • @DE_Pranav
    @DE_Pranav 3 months ago +3

    great questions, thank you for this video

  • @sarathkumar-tr3is
    @sarathkumar-tr3is 3 months ago +1

    1. SQL solution:
    SELECT name FROM (
        SELECT e.name, d.department_name,
               DATEDIFF(day, e.hire_date, p.promotion_date) AS day_count,
               RANK() OVER (PARTITION BY d.department_name
                            ORDER BY DATEDIFF(day, e.hire_date, p.promotion_date) DESC) AS rnk
        FROM employee e
        JOIN promotions p ON e.employee_id = p.employee_id
        JOIN departments d ON e.department_id = d.department_id
    ) A
    WHERE rnk = 1;

  • @sandeepmodaliar6980
    @sandeepmodaliar6980 3 months ago +1

    The Python program is an interesting one. Assuming a value in the list is a store ID and k is the distance or proximity within which another store with the same ID shouldn't exist, we can store a list as the dictionary value holding count, start_indx, end_indx. When the count reaches 2, we check the diff, i.e., whether end_indx - start_indx is within k (see the sketch below).
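    A rough Python sketch of this idea (hypothetical function and variable names; assuming the check is end_indx - start_indx <= k):

    def too_close(store_ids, k):
        info = {}  # store ID -> [count, start_indx, end_indx]
        for idx, sid in enumerate(store_ids):
            if sid not in info:
                info[sid] = [1, idx, idx]          # first sighting
            else:
                info[sid][0] += 1                  # bump count
                info[sid][1] = info[sid][2]        # previous index becomes the start
                info[sid][2] = idx                 # current index is the end
                if info[sid][2] - info[sid][1] <= k:
                    return True                    # same store ID within distance k
        return False

    print(too_close([2, 4, 2, 8], 1))  # False - the two 2s are two positions apart
    print(too_close([2, 4, 2, 8], 2))  # True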

  • @sarathkumar-tr3is
    @sarathkumar-tr3is 3 months ago

    2. SQL solution:
    SELECT name FROM (
        SELECT e.name,
               DATEDIFF(day, p.promotion_date, l.leave_start) AS d_diff
        FROM employee e
        JOIN promotions p ON e.employee_id = p.employee_id
        JOIN leaves l ON e.employee_id = l.employee_id
    ) A
    WHERE d_diff = 1;

    • @kiranmudradi3927
      @kiranmudradi3927 3 months ago

      I think d_diff should be >= 1. Let's say an employee got promoted on a date that falls on a Friday and takes leave from the following Monday; the d_diff is greater than 1, so that record won't be counted, right? Just sharing my thoughts on an edge case.

    • @sarathkumar-tr3is
      @sarathkumar-tr3is 3 months ago

      @@kiranmudradi3927 hey thanks for covering that

  • @yashbhosle3582
    @yashbhosle3582 2 months ago

    SELECT e.name, d.department_name, DATEDIFF(p.promotion_date, e.hire_date) AS longest_time
    FROM employee e
    JOIN department d ON e.dept_id = d.dept_id
    JOIN promotion p ON e.employee_id = p.employee_id
    JOIN (SELECT e2.dept_id, MAX(DATEDIFF(p2.promotion_date, e2.hire_date)) AS max_time
          FROM employee e2
          JOIN promotion p2 ON e2.employee_id = p2.employee_id
          GROUP BY e2.dept_id) m
      ON m.dept_id = e.dept_id AND DATEDIFF(p.promotion_date, e.hire_date) = m.max_time
    ORDER BY longest_time DESC;