Good project learning experience Ankur. It took me around 10 hours to debug and write code even after watching you step by step. Nice way to explain complex logic.
Great job! This is the best way to learn. The ten hours you spent will always help you write production-ready pipelines. Debugging is an art that requires patience. Merely following the steps won't help as much as implementing them yourself after seeing the steps. This is the true way of learning and ensures that you won't forget the code flow.
Don't forget to check out our channel's community section/tab. I have created over 1000 Data Engineering questions for practicing and improving your skills.
Please share your GitHub code.
@@muhammadsamir2243 Please check the description of the video. You will be able to find links to all the notebooks in the form of HTML files. You will be able to import them into any Python notebook editor. Open the HTML files in Chrome; it will give you the import option.
Also, please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
After completing the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
@@TheBigDataShow sure
Really amazing end-to-end DE project, learned a lot in these 3 hours
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
After completing the project, try creating a GitHub repository and share the repo link along with your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. It will help you grow your network and find a job by showcasing your skills through a full Data Engineering project.
Thanks for the tutorial.
It would help greatly if, at the start of the video, you could let the audience know the prerequisites for this tutorial (and the expected level of knowledge: basic or expert), and which topics you cover from the basics (so the audience knows what they need to know about the tool beforehand).
Could you please add them to the description of this video?
It took many hours, but I finally did it line by line. It's really a quality project. Thank you.
We need more projects like this, but the data sources should be SQL Server or Postgres.
Pipeline 30% - Pyspark 50% - SQL 10% - Design Pattern 10%
So, is this project good or bad? 😃 I hope it includes most of the things that you are trying to learn.
This video is amazing and explains in depth, but I have a query: Why do we use design patterns in the ETL? Is it important in real-world projects? If yes, how can I improve my skill sets according to this?
Just completed this project after a lot of debugging. Got to learn about the factory design pattern.
Is this pattern typically used in production environments? Thank you Ankur for creating such a quality project!
Yes, a lot. Try learning the builder, singleton, and companion patterns, and low-level design next.
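For anyone curious, a minimal sketch of the factory idea in PySpark terms (class and function names here are illustrative, not the exact ones from the video):

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

class DataSource:
    """Abstract reader: each subclass knows how to load one format."""
    def __init__(self, path: str):
        self.path = path

    def get_dataframe(self) -> DataFrame:
        raise NotImplementedError

class CSVDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        return spark.read.csv(self.path, header=True, inferSchema=True)

class ParquetDataSource(DataSource):
    def get_dataframe(self) -> DataFrame:
        return spark.read.parquet(self.path)

def get_data_source(data_type: str, path: str) -> DataSource:
    """The factory: map a string tag to the concrete reader."""
    sources = {"csv": CSVDataSource, "parquet": ParquetDataSource}
    if data_type not in sources:
        raise ValueError(f"Unsupported data type: {data_type}")
    return sources[data_type](path)

# The caller never touches the concrete classes directly.
customer_df = get_data_source("csv", "dbfs:/FileStore/tables/customers.csv").get_dataframe()
```

Adding a new source format then only needs a new subclass and one dictionary entry; none of the calling code changes.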
All your content is well thought out and grounded in a solid knowledge base. If you could enhance the first thirty minutes or so of your course to show what the E, T & L sections would look like WITHOUT classes, one would better appreciate why your tutorial highlights class usage.
Hi Ankur, could you please suggest a bigger dataset and a problem to solve along the lines of this example? It would be a great help. Thanks in advance.
Excellent elaborations with hands-on practice.
Great video Ankur, thanks for explaining the factory pattern. One thing I would like to request: how to build a framework for a big data pipeline from scratch, including the folder structure and design patterns. If you can cover this, that would be great; this concept is kind of a black box, at least for me.
Thank you for the time and patience you put into preparing this video. This will definitely help many.
Thank you for your kind words 🙏🙏
Just completed this amazing project 😍
Can I add this to my portfolio?
Hello, I have watched the whole video and coded it, but I am getting some errors. Do you have the entire code, so I can make changes accordingly?
Please check the description
This video is amazing and explains in depth, but I have a query: Why do we use design patterns in the ETL? Is it important in real-world projects? If yes, how can I improve my skill sets according to this?
@@hafizadeelarif3415 Low-level design is important in every piece 🧩 of code, be it ETL or ELT. Working with larger volumes and a variety of sources requires good knowledge of LLD and HLD.
Maintaining code written with good LLD is much easier than maintaining code without it.
@@TheBigDataShow sure!
How can it be improved?
Good learning experience.
Can you please make a video on unit testing for this project? It will be really helpful.
This is a great demonstration; appreciate the team's effort in putting together an awesome end-to-end project.
Thank you Shouvik.
I have also started one playlist with the name "Kafka for Data Engineers" Do check it out in your free time
After creating the project, please create a GitHub repository and share the link to the repository as well as your LinkedIn profile. I will give a shout out to your profile on LinkedIn. This will help you grow your network and showcase your skills as a full Data Engineering project, which can help you in finding a job.
Thank you for doing this project; it is quite an enriching experience for learning. I would love to see more videos of this kind in the future. Keep up the great work!
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Excited to learn and implement real-time, Thanks #The_Big_Data_show
We are just as excited as you to release it.
You can solve more than 1000 Data Engineering questions that I have created on my Community page/tab/section of our RUclips channel.
I have collected all those questions from different interviews that my friends have attended in recent times.
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Appreciate your great effort in sharing your knowledge, brother! 👍
Great explanation, but I have a small concern about the datasets having very little data.
What’s the size? Columns x rows?
@@pranav283 I mean rows
This Channel is simply amazing 😍 Keep coming up with great content on Data Engineering like this
Sure
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
Thanks for your kind words. Once you complete the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
I was searching for something like this for a long time. Thank you for putting this together. Already learning a lot from you. I would love to connect with you.
Thank you for your kind words Pradeep. You can connect with me on TopMate by scheduling one call from my calendar 🗓️ there. You can find the link 🖇️ in the description of the video.
@@TheBigDataShow Will do thanks !!
After you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Thanks for your effort, but since this is a big data project, shouldn't you use a large file to show the Spark techniques you're using?
@@SAURABHKUMAR-uk5gg The more you work in the Big Data space, the more you will realise that if you follow good coding practices, it doesn't matter even with Big Data. Distributed compute engines like Apache Spark, combined with well-written code, will always work like a charm.
Running code on Big Data during a recorded RUclips demonstration is a waste of time.
One just needs to be a smart developer to work even with Big Data.
@@TheBigDataShow - yeah that’s what even I was wondering. Thanks for the prompt response, I just asked as a lot of interviewers ask what optimisations would you do when you scale from a 100 GB to 5-600 GB of data. I already know the basics and understand that it depends on a lot of factors and there is no definite answer. A big thanks to you for showing the factory design, I have just started and already into one hour of the video. This will surely help me a lot for my interviews, thanks for spreading your knowledge in the community!!
Thanks for these videos. But I think in real life we would be processing a very large amount of data, so it would be great if you could make a video on processing large amounts of data with all the optimisation techniques we can use. Thanks in advance.
Thank you for the kind words first, but there is a need to understand that Apache Spark mostly behaves the same with large volumes of data.
For learning, it is better to stick to fundamentals, and it's not true that optimization techniques like broadcast JOIN, salting, skewness handling, etc. can only be applied to large data.
These are techniques that can be implemented with any volume of data. One just has to keep an open mind when implementing them. There is no need to memorise them by watching; just implement them, and even in real-world work you will be pretty comfortable.
I hope you will understand this and start implementing instead of waiting for large data.
I have not chosen a large dataset for this demonstration because with a big dataset every run would take Spark more time, which would increase the length of the video.
To learn a technology like Apache Spark, one has to keep their imagination open and not memorise everything by watching a demo. Better to implement.
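To make that concrete, here is a minimal sketch of one such technique, a broadcast join, written exactly the same way whether the tables are tiny or huge (paths and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("dbfs:/FileStore/tables/orders")          # the large side
products = spark.read.csv("dbfs:/FileStore/tables/products.csv",
                          header=True, inferSchema=True)              # the small side

# The hint ships the small table to every executor, so the large table
# is never shuffled. The code is identical for 1K rows or 1B rows.
joined = orders.join(broadcast(products), on="product_id", how="inner")
joined.show(5)
```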
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Thanks so much for your help
You are using a dictionary; won't it be slow if the CSV files have a huge volume of data? What should we use instead of dictionaries if the record count is 250 million?
Hi Ankur, very excited to go through the video. Also, are you planning to implement it on AWS as well? That would be helpful.
Yes, stay tuned
Thank you for your kind words. I am already working on one video involving AWS.
Once you finish the video and complete the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
@@TheBigDataShow That sounds amazing, sure will do soon.
What Python learning resource do you recommend for learning the class concepts used with PySpark?
Getting this in the first pipeline:
AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'
Any suggestion would be appreciated. Thank you
@@maazahmedansari4334 Please share some more of your code snippets for debugging. And have you created a GitHub repo for it?
Will I be able to switch into data engineering after watching and practicing the project? Will I be able to tell my interviewer that I did this project at my current company?
Yes, but you have to work hard and learn all the concepts. Just completing one project will not get you a job. You have to learn multiple technologies and frameworks to get into the Data Engineering domain.
Can you do even a small project using Kafka?
Give me some time. I am already planning it, but I currently have less time due to the early days of my startup. Please give me some time; I will upload it.
I have already uploaded some of the Kafka videos. Please check the "Kafka for Data Engineers" playlist
Can we get the next project on real-time data using Kafka or something like that?
Already planning this.
On YouTube there are some projects, but they are very simple. Please plan one complex project with a proper problem statement and solution. It is a request. 😊
This was our first end-to-end project. Some more complex projects are already in the pipeline.
@@TheBigDataShow Thanks
Hey Ankur, thanks for the great information.
I had one issue pop up: the initial run command to run other notebooks is not working for me. I am using the exact same command and file name, and all my notebooks are under the appleAnalysis folder. Can you please suggest a solution? For now, I am running the entire notebook code before the main file as a workaround.
Check whether all the notebooks are attached to the same cluster, and then check the command `%run "./name_of_notebook"`.
I have provided all my notebooks in the description of the video, exported as HTML. Could you try importing those and then compare your code? If the issue persists, kindly let us know.
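For reference, a minimal sketch of the pattern (the notebook name is illustrative). Note that in Databricks, `%run` must be the only command in its cell:

```python
# Cell 1 - nothing else in this cell. The path is relative to the
# current notebook's folder, so both notebooks must sit in appleAnalysis.
%run "./transformation_notebook"

# Cell 2 - classes/functions defined in the other notebook are now in scope.
df = Transformation().transform(input_dfs)  # illustrative names
```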
Try adding the run commands in different cells and it will be resolved; even I was facing the same issue.
@@LalitSharma-up5hl 👏👏
Please find the link to all the input files.
drive.google.com/drive/folders/1G46IBQCCi5-ukNDwF4KkX4qHtDNgrdn6?usp=sharing
Please let me know if you can access it or not.
Hi Ankur. I am not able to download the files
Thanks for doing such videos❤
My pleasure 😊
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
The notebook files are in HTML format. What purpose does that serve?
I see this when trying to upload files using DBFS file browser
Missing credentials to access AWS bucket
Excited to complete this
Great 🤞
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
How do we schedule the pipeline? Thanks. What do we use in industry to schedule jobs?
Mostly, Data Engineers use Airflow or Astronomer (the enterprise version of Airflow). In the Databricks environment, people also use Workflows.
Workflows are not available in the community version of Databricks.
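For a flavour of what that looks like, here is a minimal Airflow sketch that runs a Databricks notebook daily (the connection ID, cluster spec, and notebook path are assumptions you would replace with your own):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="apple_analysis_daily",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                        # Airflow 2.4+ style
    catchup=False,
) as dag:
    run_pipeline = DatabricksSubmitRunOperator(
        task_id="run_apple_analysis",
        databricks_conn_id="databricks_default",   # assumes a configured connection
        new_cluster={
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
        notebook_task={"notebook_path": "/Users/you@example.com/appleAnalysis/main"},
    )
```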
Excited to watch
We are also very excited to release it. I hope my hard work pays off and many aspiring Data Engineers create their Data Engineering project after watching it.
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
Hey Ankur, Anshu this side. First of all, thanks for your amazing effort. I'm a little confused about the source files (the extraction part) you explained in the video. We used sources like CSV, Parquet, and Delta tables, but those are just the file types in which the data is kept, so what is the actual source of the data? For example, if we have some ABC database and I export the data into CSV, Parquet, or other file formats, my data source would still be the ABC database. Is that the right way to think about it?
@Ankur
This is a great course, and I was following it very well until 54:02, when you started moving from one notebook to another and it confused me too much. My apple analysis notebook isn't running; you debugged the code without showing us how, and I'm stuck there. I'm sad I had to abandon the course. How can I reach out to you? I need assistance. I have explored every channel I have, and the code is still not working.
@@mekings0422 Please let us know what your bug is. Debugging is an art, and it comes with a lot of persistence.
I have also attached my notebook in the description. Kindly check it, it might help.
Do let us know the problem too in the comment section. The community might help you here.
A 'str' object has no attribute 'write' error happens when I try to write output to DBFS. Please help.
Can we do this in the community edition?
I have used the community version only. So all the things explained in the video are using the Databricks community version only.
@@TheBigDataShow thanks
Can you please do an incremental project tutorial, maybe with Structured Streaming for batch processing?
Hi, one question: why do we need to convert the DataFrame to a dictionary when we can pass the DataFrame directly to the transform function?
@@omkarshinde4792 The DataFrame is not converted into a dictionary; instead, we have created a dictionary of DataFrames.
Using a dictionary of DataFrames is a much cleaner way to return multiple DataFrames.
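A minimal sketch of the idea (names are illustrative): the extract step returns several DataFrames keyed by name, and the transform step picks the ones it needs.

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

def extract() -> dict:
    # Each value is a (lazy) DataFrame; the dict just gives each one a name,
    # which is tidier than returning a long positional tuple.
    return {
        "transactions": spark.read.csv("dbfs:/FileStore/tables/transactions.csv",
                                       header=True, inferSchema=True),
        "customers": spark.read.csv("dbfs:/FileStore/tables/customers.csv",
                                    header=True, inferSchema=True),
    }

def transform(inputs: dict) -> DataFrame:
    # Look inputs up by key instead of relying on argument order.
    return inputs["transactions"].join(inputs["customers"], on="customer_id")

result_df = transform(extract())
```

Since DataFrames are lazy, the dictionary only holds references to query plans, not the data itself, so its size does not depend on the row count.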
@@TheBigDataShow Got it. Thanks.
Hey Ankur bhai, big thanks for this project. I was eagerly waiting for a project video from your channel. I hope this helps experienced candidates explain it as a real-time project in interviews.
Thank you for your kind words :)
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
I'm already learning from it. Unbelievable work done by you; creating 1000+ MCQs is a challenging and tedious task, but thank you so much Ankur bhai for creating this series. I'm 100% sure that no one on RUclips has created this many MCQs.
Thanks again, and hats off to you for this incredible work.
@@codjawan Thank you. Keep motivating us and we will keep making valuable content
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Getting an exception while loading to DBFS:
'NoneType' object has no attribute 'write'
Any suggestions?
Thanks
I also got the same error. If you fixed it, please let me know.
Continue with the video and you will be able to understand this error. The community version of Databricks kills DBFS after every cluster is closed.
You will have to recreate it again.
@@TheBigDataShow I found and fixed the issue.
@@gowthamm.s7745 Try to explain it here, Gowtham. It might help many.
Appreciate your efforts.. keep it up ❤
Thanks a lot 😊
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Hello, I am getting an error reading the delta table on the default schema at 01:21:50:
IllegalArgumentException: Path must be absolute: default.customer_delta_table_persist. Please help me through that.
Workflows are not getting executed. Could you please tell us how to resolve the issue?
Please send the error. Also, have you checked the internet for ways to resolve the issue?
Excited to learn
We are also excited just like you to release the full Apache Spark End-to-end pipeline. Please click on the bell icon to not miss the notification before the start of the live premiere. It will go live at 2:30 PM IST.
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Kudos to your efforts
I am facing an issue while signing up for the Databricks account. This error is coming up: "an error has occurred. please try again later".
Has anyone faced a similar kind of issue? Please help.
Sir, is your full PySpark course available?
Not full course till now but we are releasing Apache Spark interview questions one by one. You can find an initial video in this playlist.
ruclips.net/video/NhYGVUuUVFg/видео.html&pp=gAQBiAQB
@@TheBigDataShow I tell you, even a full course wouldn't teach this much, and yet I still want a course.
Sir, in the first pipeline I am getting an error that a 'str' object has no attribute 'write'.
Share the code snippet where you are getting the error. And have you searched Stack Overflow for it?
Sir, can you please reply to my question? @@TheBigDataShow
Can you please show the deployment as well?
Let's try doing it in the next video.
At 2:00:54 you copied a delta file path. When did you create that?
Thanks for sharing!!!
My pleasure!!
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link with me, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Do we really need to import SparkSession? Because Databricks automatically creates a spark session for us.
Databricks has an inbuilt `spark` object, so we can use the spark object directly, unless you are working outside of Databricks (on-premises clusters, local machines, and other cloud services).
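A minimal sketch of the difference; `getOrCreate()` is safe in both places because it reuses an existing session if one is already running:

```python
from pyspark.sql import SparkSession

# On Databricks this simply returns the built-in session.
# On a local machine or on-prem cluster it builds a new one.
spark = (
    SparkSession.builder
    .appName("apple-analysis")   # illustrative app name
    .getOrCreate()
)

print(spark.version)
```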
Good effort; however, please minimize the usage of words like "perfect" and "well and good" after every sentence.
@@AnjaliH-wo4hm will try to improve
Hi! There is an error in the file path: while giving the table name of the customer delta table, it is asking me to give an absolute file name after I give the delta table name, at 1:22. Please help me out.
Hi Sruti
Could you please point out the timestamp in the video?
In the community version of Databricks, the Delta table is deleted once your cluster is auto-deleted after some time. I have explained this in the later part of the video.
You can always start a brand-new cluster and create the Delta table again, because after every auto-delete of the cluster the Delta table is gone.
You can create the Delta table either from the Databricks UI or from a notebook.
You might have to delete the original files behind the Delta table while recreating it.
If you move forward in the video, I have demonstrated all these steps.
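A minimal sketch of the recreate step, assuming the default managed-table warehouse path and that `customer_df` is the DataFrame loaded from the customer file (adjust both to what your workspace shows):

```python
# Remove any stale files left behind the old managed table.
dbutils.fs.rm("dbfs:/user/hive/warehouse/customer_delta_table_persist", recurse=True)

spark.sql("DROP TABLE IF EXISTS default.customer_delta_table_persist")

# Recreate the managed Delta table from the customer DataFrame.
customer_df.write.format("delta").mode("overwrite") \
    .saveAsTable("default.customer_delta_table_persist")
```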
I have already created a new cluster and created the delta table from there, but it is still showing me this type of error: (IllegalArgumentException: Path must be absolute: default.customer_delta_table_persistt). The timestamp is 1:22.
@@srutishriyasahu1556 No worries, move ahead in the video; you will be able to solve it. The customer table is only used after our first problem statement, which is near the 1 hour 45 min mark. I have demonstrated how to solve it. Don't worry, move forward with the demonstration.
Ok thank you sir 😊
Appreciate your efforts, thank you.
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
@@TheBigDataShow Thank you for your reply. I will do that; it will help me with interview preparation. Thank you so much again, as you are putting a lot of effort into creating videos with high-quality content.
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
@@TheBigDataShow I will do that
Where can we learn PySpark from scratch to advanced with Databricks?
Learn with wafastudies on RUclips.
Sir, I will do the complete project today itself.
Yes, I have created a more-than-3-hour video demonstrating two pipelines using Apache Spark today. After the live stream, you can find the video in our PySpark Practice - Tutorial playlist.
And I am not Sir 😄 I am just Ankur. Only Ankur is fine.
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
I am getting a 'NoneType' object has no attribute 'write' error. Please support me in addressing this error.
You should return the DataFrame in the transform notebook. Maybe you are not returning the DataFrame after join.show(), because I faced this issue as well.
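That diagnosis matches the error: `.show()` only prints rows and returns `None`, so whatever is assigned from it has no `.write`. A minimal sketch of the wrong and right versions (function and table names are illustrative):

```python
from pyspark.sql import DataFrame

def transform(transactions_df: DataFrame, customers_df: DataFrame) -> DataFrame:
    joined_df = transactions_df.join(customers_df, on="customer_id")

    # Wrong: `return joined_df.show()` returns None, so the caller later does
    # None.write...  ->  'NoneType' object has no attribute 'write'
    joined_df.show()    # fine for eyeballing, but never return its result
    return joined_df    # right: hand the DataFrame itself back to the loader

# Loader side:
# transform(...).write.format("delta").mode("overwrite") \
#     .saveAsTable("default.transformed_output")   # illustrative table name
```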
I want to execute this project, but I am facing an issue while signing up: "Error occurred". Please help if anyone else has faced this issue.
@@prasanthikalyampudi7993 Are you getting the error while logging in or signing up to Databricks?
@@TheBigDataShow during signup to databricks using community edition
@@TheBigDataShow I am getting this error during signup.
@@prasanthikalyampudi7993 What is the error? Check the Databricks community page.
You might be using the wrong URL for signup. Try checking the description of the RUclips video for the correct link.
Exciting
Thank you for your kind words. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Could you please provide the dataset links?
drive.google.com/drive/u/0/mobile/folders/1G46IBQCCi5-ukNDwF4KkX4qHtDNgrdn6?usp=sharing
Link for the dataset. If you face any access issues, do mention it in a comment.
Please check the link that Nisha has shared, and let us know whether it is accessible or not.
Thank you for your kind words & hope you were able to access the datasets. Once you finish the project, please create a GitHub repository and share the link to the repository, along with a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
Great 👍
Thank you for your kind words. This motivates me to make more and better videos. Please check the community section/tab of the channel. We have created and collected over 1000 of the most asked Data Engineering questions. We have made all these questions in the form of MCQs so that you can solve them and learn from them.
Search our Channel Name - The Big Data Show, on RUclips -> Go to the channel -> Then click on the community tab
Thanks
Please break down the video into topics.
Done, please check and try to complete it
Okay, thank you.
Same here
Please click on the bell 🔔 icon so you don't miss the notification before the start of the video. We are as excited as you to make this video live.
Thanks for your kind words. After completing the project, please create a GitHub repository and share the link to the repository, as well as a link to your LinkedIn profile. I will give a shout-out to your profile on LinkedIn. This will help you expand your network and showcase your skills through a complete Data Engineering project, which can assist you in finding a job.
I replied to my previous question but it doesn't seem to be visible, so posting again.
Getting this in the first pipeline:
AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'
Any suggestion would be appreciated. Thank you
Please find the code I am trying to follow along with here:
github.com/maaz-ahmed-ansari/apple-product-analysis/tree/main
The 2nd pipeline is working as expected. Still racking my brain over the 1st pipeline. Can someone suggest how to resolve the above error?
Thanks