First - that was an awesome walkthrough, Mar. Second - would be nice to see more clustering workloads. Like koalas, instead of pandas. Working on a single node is acceptable/tolerable. But to future-proof our skills we'll have to work with parallelization. Aka Spark/Dask/Ray. I reckon you can use a free account to run your Jupyter notebooks from a company-we-all-know-that-uses-spark-as-core-engine, something to do with bricks...
I use parquet bc it's so much faster and takes up less disk space AND saves data types, which is really useful if you're optimizing dtype for speed/storage. And it's compatible with R/Power BI/etc. When I'm starting from CSV though this is an awesome tip, thanks! 🙏 I'm shocked at the speed increase!
@@PythonSimplified I love it! Can't wait to watch. I am currently using your Qt5 tutorial to help with my chatbot with gpt-3.5 where the user can select a character and the bot will emulate chat with that character. It's super fun!
The reason is - we're dealing with a much much smaller file! we reduced it from 11GB to 600MB so it's not nearly as challenging to load as the previous one 😃
you should check out timeit:
from timeit import timeit
n = 10  # number of runs to average over; assumes a main() function is already defined
result = timeit(stmt="main()", globals=globals(), number=n)
print(f"Execution time is {result/n} seconds")
good afternoon, when will you come here in Brazil to visit us?, I work in an Arab restaurant when you come let us know, you are very charismatic, thanks for the videos, we learn a lot from you
Definitely a fantastic incentive! I've just landed in the Middle East to visit family, but would love to go on an actual vacation to a place I've never been before! Brazil is certainly on my to-go list 😉 Thank you so much for the lovely comment, and if I'm ever in Brazil - you can count me in for a big shawarma plate!! 😃
I wonder how to use AI in order to build an app to model a sim living on a plot of land in a specific geographical location. So, you'd enter the geographical location, the size of the plot of land, and the current state that land is in. The app tells you what the yield of a combination of crops and plants will be. As the app develops it will be able to model whole communities living off grid. This would be a very interesting project to work on. Plenty of scope for continual development as more and more data is fed into the app. It would have possibilities for graphics and animations to really simulate the meteorological conditions of a given location and compute all practical aspects required in order to live in that given region. It would be a big project but I'm sure one that many programmers would want to work on.
Thank you for your help Mariya! ☺️ On the occasion of women's day, I would like to wish you all the best, noblest and most beautiful! :) Please keep developing such great content on your youtube channel! ❤️ 🙏
We've saved it after disposing of 35 columns, so the dataset was already far smaller before we re-saved it 😉 In addition, I believe that Pandas is optimizing the data type of each column before exporting your DataFrame, so there should be a boost of efficiency there as well 🙂
@@PythonSimplified ah, I expected the dropping of not necessary columns, but I did not know that pandas can optimize data exporting. (I think this feature is new, and will be included in pandas 2.0 or so I heard.) Excited for new pandas though.
Can this be done using json as input as well? Actually I saw your SQLite video too. The problem I'm really trying to crack is I have a dataset where each record(300 million records in total) is a nested json object. I have the option to convert it into SQLite3 db but I am not sure how we can store something like the 'Categories' field in your dataset which has an indefinite number of array elements within it. End goal is to write an app which can send SQL queries to get filtered results from the database. Perhaps you can advise.
Hi Tom!! 😃 I've checked with data.columns (somewhere around the middle of the tutorial, I believe). Otherwise - I wouldn't even have the slightest clue as this dataset won't open with Excel (and probably its Mac equivalent too 😉). The software completely collapses and my only way of accessing the contents of this file was via Python 🙃
@Python Simplified Correct me if I'm wrong, but `df.columns` comes AFTER you have created the df. But creating that df is resource consuming because of the size of the dataset. My approach to getting to know my data would be to create a much smaller df using `nrows`. Something like `dummy_df = pandas.read_csv('large_data.csv', nrows=50)`. This will consist of only 50 rows, but I'm more interested in the column names. `dummy_df.columns` would now give me the column names that I can use in chunks. 😀
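A minimal sketch of the same idea, pushed one step further: with nrows=0 pandas reads only the header row. The in-memory CSV text below is a made-up stand-in for the large file, so the column names are assumptions.

```python
import io

import pandas as pd

# Made-up stand-in for a large CSV file on disk.
csv_text = "title,final_price,url\na,1,x\nb,2,y\n"

# nrows=0 parses the header row only, so no data rows are loaded.
header_only = pd.read_csv(io.StringIO(csv_text), nrows=0)
column_names = header_only.columns.tolist()
print(column_names)
```

With a real file, the io.StringIO(...) argument would simply be the file path.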
The biggest challenge from my perspective is the lack of privacy. We don't notice how much of our data we voluntarily provide to all kinds of services/software. This data is not always used in our favour and is very often sold to other services/software of which we are unaware 😉
What if my project requires loading the data as it is into the oracle database? I have done these tasks of loading 7 to 10 Million records into the database using Chunks or you can say by batches. I am not sure if there is any convenient way to reduce the time.
If you have a CUDA compatible GPU - try opening it with cuDF pandas. I have a tutorial of how to set it up and you can use a regular read_csv() command to read via GPU rather than CPU (if the dataset is compatible, of course): ruclips.net/video/9KsJRyZJ0vo/видео.htmlsi=hnHA2gW4GzDBykDH I hope it helps! Otherwise - try a library called Polars or other Pandas alternatives :)
Hey ma'am, I am a beginner in the field of programming and I have a doubt: I am really bad at maths, so if I want to do AI or ML in future, do I need to learn maths?
Hello I really like your tutorials. I have a LARGE json file(22GB) and I can not open it with pandas read_json. I will be really thankful if you make similar tutorial for json files.
The exact same techniques will work with read_json as well 😃 You can combine the usecols and chunksize parameters to load the dataset bit by bit, no need for a special tutorial as it's not really different from read_csv 😉
@@PythonSimplified thank you for your response, but I am getting this error: TypeError: read_json() got an unexpected keyword argument 'usecols' , and I can not see usecols in the documentation for read_json.
Aha! You're right, usecols is not a parameter of read_json()! 😱 The problem with JSON files is that each one is structured differently and requires a great deal of customization. My suggestion: use chunksize to have a look inside the individual items of your file (as far as I know, chunksize only works together with lines=True, so the file needs to hold one JSON object per line), which should give you a table-like structure for each chunk. From there - you can call the .drop() method on each chunk to dispose of unnecessary columns and then save the much smaller chunks into a new csv file (using the code example I shared in the pinned comment up top 😉) I hope it helps! It's hard to tell without seeing the actual structure of your JSON file and it's something you can only find out after successfully loading it 🙃
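A rough sketch of that suggestion, assuming the JSON file is line-delimited (one object per line), since read_json only accepts chunksize together with lines=True. The file names and column names here are made up for illustration.

```python
import os
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "data.jsonl")   # hypothetical input file
csv_path = os.path.join(workdir, "smaller.csv")   # hypothetical output file

# Tiny line-delimited JSON stand-in: one object per line.
with open(json_path, "w", encoding="utf-8") as handle:
    handle.write('{"title": "a", "final_price": 1, "seller": "x"}\n')
    handle.write('{"title": "b", "final_price": 2, "seller": "y"}\n')
    handle.write('{"title": "c", "final_price": 3, "seller": "z"}\n')

keep = ["title", "final_price"]  # columns we want in the smaller file

reader = pd.read_json(json_path, lines=True, chunksize=2)
for i, chunk in enumerate(reader):
    # Selecting the wanted columns plays the role of usecols here.
    chunk[keep].to_csv(
        csv_path,
        mode="w" if i == 0 else "a",  # first chunk creates the file
        header=(i == 0),              # column names are written once
        index=False,
    )

result = pd.read_csv(csv_path)
print(result)
```

A single large JSON array (not line-delimited) would need a different approach, for example converting it to one object per line first.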
It would have been complete if you had shown how to save to a new file when the loading was done in chunks.
That actually was the original plan!!! I had it filmed and eventually, I edited it out! 😀😀😀
The reason is - after experimenting on 3 different computers I've noticed it was a major waste of time!
Saving the data in chunks is much much slower! And you also need to make sure to set header=False, otherwise - each chunk inside the CSV will begin with the column names. The way around it is to store the column names in advance and then use the for loop to add each individual chunk without its headers.
If you're still interested - try the following:
import pandas as pd
columns = ["final_price", "image_url", "title", "url", "categories"]
# write the header row once; mode="w" starts a fresh file on every run
pd.DataFrame(columns=columns).to_csv("modified_data.csv", mode="w", encoding="utf-8", index=False)
data = pd.read_csv("bd_amazon.csv", chunksize=50000, usecols=columns)
for chunk in data:
    # usecols keeps the file's own column order, so reorder to match the header
    chunk[columns].to_csv("modified_data.csv", mode="a", encoding="utf-8", index=False, header=False)
This should do the trick (but much slower, in more lines of code and a much less elegant syntax 😉)
I hope it helps! 😀
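A slightly different sketch of the same chunk-by-chunk save, where the first chunk writes the header itself so no separate header step is needed. The tiny CSV and the copy_in_chunks helper are made up for illustration; they are not from the video.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["final_price", "title"]  # hypothetical subset of columns to keep


def copy_in_chunks(source, target, columns, chunksize):
    """Copy selected columns from source to target one chunk at a time."""
    reader = pd.read_csv(source, usecols=columns, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        chunk[columns].to_csv(
            target,
            mode="w" if i == 0 else "a",  # first chunk creates the file
            header=(i == 0),              # only the first chunk writes column names
            index=False,
            encoding="utf-8",
        )


# Tiny stand-in for the real dataset, so the sketch runs anywhere.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source.csv")
target_path = os.path.join(workdir, "modified_data.csv")
pd.DataFrame(
    {"title": list("abcde"), "seller": list("vwxyz"), "final_price": [1, 2, 3, 4, 5]}
).to_csv(source_path, index=False)

copy_in_chunks(source_path, target_path, COLUMNS, chunksize=2)
result = pd.read_csv(target_path)
print(result)
```

Whether this is fast enough for an 11GB file is something to measure on your own machine.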
@@PythonSimplified Thank you for the explanation. I'm sure a lot of people interested in the subject will appreciate it.
@@PythonSimplified
Can i draw you, I am a painter and I am sure you will like it.
@@hasanzurqa911 Python is beautiful !!
@@PythonSimplified this guy is amazing! Thank you PythonSimplified
You can time things in a Jupyter notebook using an inbuilt magic command. If you wanna time a single line statement, prefix it with %time. If you wanna time the whole cell put %%time at the start. Similarly there are magic commands %timeit and %%timeit which will run your code multiple times and report timing statistics across the runs (recent IPython versions show the mean and standard deviation rather than a single fastest time).
WHAT SORCERY IS THIS, PHSOPHER??? 🤩🤩🤩
Just tried it and it is absolutely incredible!!! I had no idea we can do that and I've been using Jupyter since my very first print statement in Python!!! 😱
Folks, you must try the following:
%time data = pd.read_csv("data.csv",usecols=["final_price", "image_url", "url", "title", "categories"])
Nothing short of magic!! I see you as Gandalf now 🧙‍♂️🧙‍♂️🧙‍♂️
How about testing Pandas VS Polars?
@@PythonSimplified do we have a similar sort of magic command in the VS code?😂🤔
@@vishaldas6346 You can run jupyter notebooks also in vscode
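For plain Python scripts (no notebook, no magic commands), a rough equivalent can be sketched with the standard library. The work() function is a placeholder, not anything from the video.

```python
import time
import timeit


def work():
    """Placeholder workload; swap in the call you want to measure."""
    return sum(i * i for i in range(100_000))


# Rough equivalent of %time: one run, wall-clock seconds.
start = time.perf_counter()
work()
elapsed = time.perf_counter() - start
print(f"single run: {elapsed:.4f} s")

# Rough equivalent of %timeit: 3 repeats of 5 calls each.
runs = timeit.repeat(work, number=5, repeat=3)
best_per_call = min(runs) / 5
print(f"best of 3 repeats: {best_per_call:.4f} s per call")
```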
Good tip. Just tried it on Carnets Plus on iPad, and it worked.
Your enthusiasm shines through as usual. I know there have been some difficult times since the introduction of Openai etc, but you must not stop doing what you’re doing because thousands of people are relying on you, your wonderful teaching skills and your python (amongst other) knowledge. Thank you.
Thank you so much Shane!!! 😃
I definitely spent way too much time going over the comments of my ChatGPT vlog, and my only regret is that I only focused on the job aspect rather than the whole doomsday package 🤪 hahahaha
The reason why I was off the radar was not related to the vlog though, I was finishing up a rough semester in university and just landed in the Middle East to visit my family. I'll be gone for a bit longer, but will come back with a brand new AI Simplified series! 🥳🥳🥳
The first video is all about the question of "can computers think?" and it's a huge tribute to Alan Turing! So please stay tuned 😉
Great vid! Are you going to do any vids on Natural Language Processing (NLP), with tools like spaCy, NLTK, Gensim, CoreNLP?
Thank you for your great video. But perhaps for 15 GB data it's better to use Polars instead of Pandas. It has a similar syntax to Pandas so you don't find yourself on a different planet and it uses Rust code for faster execution. It is particularly suitable for processing large data sets, as it has built-in support for multi-threaded and multi-core processing.
I need to get more into Data Science and Machine Learning processes and such videos help me a lot. Thanks for that
Absolutely! The chunking solution goes hand in hand with the Machine Learning batching 😉
When we load data into a neural network (or other type of models) we load it in batches rather than all at once!
If you'd like to see a specific example in the realm of ML, I have a special beginner friendly tutorial covering it:
⭐ Machine Learning Databases and How to Access Them:
ruclips.net/video/8z2oLfK2sIc/видео.html
This gives you a really nice introduction to the PyTorch framework as well 🙂
It's part of a nice AI and ML Simplified playlist which you may find helpful on your exciting new journey:
ruclips.net/p/PLqXS1b2lRpYTpUIEu3oxfhhTuBXmMPppA
(I'm actually working on a new AI Simplified series, starting very soon on this channel so definitely stay tuned! 😀)
TUVM = Very Timely and Helpful !
Thank you David! Super happy to hear! 🙂
Saving in pickle or feather format instead of csv will be much faster and use less memory.
Absolutely! Thank you for the awesome tip Prakash! 😀
Pickled data is fantastic in terms of reading speed! it's a bit limited in terms of readability (and as a result - security! as you wouldn't necessarily notice any shenanigans inside what seems to be normal data😉)
It's a great solution if you don't mind a slight learning curve and as we are the ones who pickle our files - security is not a problem 😀
If anyone is curious about pickling, here's the documentation:
docs.python.org/3/library/pickle.html
And please let me know in the replies below if you'd like to see a simplified tutorial about the pickle module 🙂
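For the curious, a minimal sketch of a pickle round trip with pandas. The tiny DataFrame and the file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"title": ["a", "b"], "final_price": [9.99, 19.5]})

path = os.path.join(tempfile.mkdtemp(), "modified_data.pkl")
frame.to_pickle(path)            # binary file, not human-readable
restored = pd.read_pickle(path)  # only unpickle files you created or trust

print(restored.dtypes)  # column data types survive the round trip
```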
I am just learning python and found some of your videos. You are very good and very clear. I had issues installing Anaconda on my Windows 11 computer. It was very slow and crashing most of the time. I have good hardware so that was not the issue. I might try it again as this Jupyter looks good. Thank you for your time and efforts making these videos.
For me personally, the best channel out there to learn python!!! I am not kidding! Thank you so much! ❤
Thank you so much for the incredible feedback!!! Super happy you like my tutorials! 😁😁😁
Great presentation- your explanations are the best!!
Your videos are extremely didactic and easy to understand, they are the most beautiful and elegant projects on youtube! Congratulations.
My approach when downloading very large csv files, is to use
data = requests.get(url,stream=True).iter_lines()
That returns an iterable to the data, but doesn't start downloading it at this stage.
The first row will be the headings, so get that with something like
headings = next(data).decode("utf-8").split(",")
Then loop over the body of the data either with a for loop, list comprehension, or multiprocessing.Pool().map()
and dump each line into a database, then do queries on the database to analyse it.
Or, if it isn't quite so big, then put it in a numpy array and work on it from there.
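A rough sketch of this streaming approach, assuming an SQLite database as the target. The load_lines helper and the sample rows are made up, and a small list of byte lines stands in for requests.get(url, stream=True).iter_lines() so the sketch runs without a network connection.

```python
import csv
import sqlite3


def load_lines(lines, connection, table="records"):
    """Insert CSV rows from an iterable of byte lines into an SQLite table."""
    # Skip empty keep-alive lines and decode the rest lazily.
    decoded = (line.decode("utf-8") for line in lines if line)
    reader = csv.reader(decoded)  # handles quoted commas, unlike str.split(",")
    headings = next(reader)       # the first row holds the column names
    columns = ", ".join(f'"{name}"' for name in headings)
    placeholders = ", ".join("?" for _ in headings)
    connection.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    connection.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    connection.commit()
    return headings


# Stand-in for requests.get(url, stream=True).iter_lines()
sample = [b"title,price", b'"Mug, large",4.50', b"Pen,1.20"]
db = sqlite3.connect(":memory:")
headings = load_lines(sample, db)
rows = db.execute("SELECT title, price FROM records").fetchall()
print(rows)
```

Values arrive as text here; converting them to ints, floats or dates would happen before the insert.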
Thank you for the awesome tips, Katrina!! You're a rockstar!! 🤩🤩🤩
This approach reminds me of loading data with C++!
It looks like it gives you full control over your data and doesn't leave any grey areas for libraries like Pandas to fill in. I LOVE IT!!!
Do you find that multiprocessing works better than multithreading when it comes to loading/storing data?
I've heard from online folks that multiprocessing is time-costly, but it's highly recommended for CPU-intensive tasks... which I find a bit confusing! hahaha I wonder what's your take on that 🙂
(sorry, I haven't had a chance to follow up on our ChatGPT conversation. The comment section there got absolutely insane after a short while. It got substantially more comments than any one of my videos... I wasn't expecting that at all! 😅)
@@PythonSimplified I guess it depends. Maybe if all your data is strings, that isn't particularly CPU intensive. I am usually downloading files where most of the columns need to be converted to ints and floats, and generally at least one of them needs to be converted to a date, and it needs more than one CPU thread to keep up with the maximum download speed I can do.
@functools.lru_cache can sometimes help as well, but not always, it depends on the data.
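A small sketch of where lru_cache could help, assuming a date column with many repeated values in ISO format. Both the parse_date helper and the format are assumptions for illustration.

```python
from datetime import datetime
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_date(text):
    """Parse an ISO date string; repeated values are served from the cache."""
    return datetime.strptime(text, "%Y-%m-%d").date()


raw = ["2023-03-08", "2023-03-08", "2023-03-09", "2023-03-08"]
parsed = [parse_date(value) for value in raw]

info = parse_date.cache_info()
print(info.hits, info.misses)
```

If nearly every value is unique, the cache mostly adds overhead, which matches the "it depends on the data" caveat.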
Another thing I've found is that multiprocessing can be really slow to get up and running on Windows, but it is much faster on FreeBSD and Linux. If you are using a Windows computer, put your Python/Jupyter in WSL 2, then amend your Jupyter config file to run a Windows instance of the browser.
Thanks a lot, Maria, you are doing a great job. May God bless you and your efforts.
We need a clear roadmap to data science and machine learning in long-form videos, if you will.
i love your clarity.
Super happy to hear, Dimitrios! Thank you! 😀
I really would suggest using Polars instead of pandas when working with big files... it can be 7 times faster, use timeit to measure the difference. When handling that amount of data, every minute counts. I love pandas but Polars is WAY faster. Cheers. Nice tutorial though! You bought the full set? How did you get access to the full dataset?
Beautifully done! Please never stop making these videos :)
Wanted to ask if you are gonna do another hackathon this summer, I really enjoyed the one from last year!
Wow! very helpful video. I was dealing with this problem. Love your videos thanks.
Great video! If I'm going to apply these skills I'll just look up the syntax 😄.
Great video.
Very helpful !
But how would you handle 6 to 10 gb of data but in 11 000 + xml files?
Hi Mariya, I always wanted to know more about giant datasets and Python. Thank you. I am looking forward to more simplified Python.😉😉
Thank you, another great tutorial. Once we have the dataset we can search (SQL type search, any indexing?) inside the data? Maybe a follow on tutorial?
I really was surprised how fast Python loaded that huge dataset!
me too! In contrast to Excel - my new PC was actually able to load it as is! Excel on the other hand collapses almost immediately, regardless of the computer system 🙃
The most beautiful voice on youtube, thank you for the well narrated and produced content ;)
Very interesting. Please do more videos on handling pandas dataframes + tkinter. Logical operations, comparisons, filtering, unique values etc. etc. , and how to include the result in a frame on the window.
Hi! Love your videos! I learn so much, and in an easier way! Question, which laptop would you recommend to use as an entry level programmer-data-analyst. Mac? Windows? Or Linux? I dont mind Ubuntu GUI. Thank you!
Welcome
I have an off topic question.
What language was Civilization VI developed in?
Please answer the question as soon as possible, thank you.
My personal record is the Citibike data: 50 csv files containing 300+ million rows. I imported it into Excel using Power Query. Yep, it did take a few minutes.
Could you please make a video on how to quickly add those csv files to a SQL table?
I have a bunch of these already 😉
⭐️ SQLite Basics:
ruclips.net/video/Ohj-CqALrwk/видео.html
Please don't forget to add connection.commit() so that all inserted values are actually saved to the table, it's in the description but was omitted in the video.
⭐️ Webscraping Databases:
ruclips.net/video/MkGQmZoMuRM/видео.htmlsi=Q_Z3jFfoLnF54Bxb
You'll find exactly what you're looking for in this video, csv to SQL, just skip the webscraping part 😃
Cheers!
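As a quick taste before watching, here is a minimal sketch of loading a CSV into an SQLite table chunk by chunk with pandas, including the connection.commit() call. The table name, file name and columns are made up for illustration.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Tiny stand-in CSV so the sketch runs anywhere.
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
pd.DataFrame({"title": ["a", "b", "c"], "final_price": [1.5, 2.0, 3.25]}).to_csv(
    csv_path, index=False
)

connection = sqlite3.connect(":memory:")
for chunk in pd.read_csv(csv_path, chunksize=2):
    # if_exists="append" creates the table on the first chunk, then adds rows
    chunk.to_sql("products", connection, if_exists="append", index=False)
connection.commit()  # make sure the inserted rows are saved

row_count = connection.execute("SELECT COUNT(*) FROM products").fetchone()[0]
total = connection.execute("SELECT SUM(final_price) FROM products").fetchone()[0]
print(row_count, total)
```

For a database file on disk, ":memory:" would be replaced with a file name.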
Many thanks. I'll watch them now.
Thank you for this informative video. Do you think you can make a video on machine learning where we take a dataset and train and test a model to predict some event? Please it will help us many a lot. I really enjoy the way you explain these concepts. Keep up the good work!
I'm working on a brand new AI series as we speak 😉
In the meanwhile you can check out my old AI Simplified playlist:
ruclips.net/p/PLqXS1b2lRpYTpUIEu3oxfhhTuBXmMPppA
I have 2 videos showcasing the entire training + neural network making process (N-gram modelling):
⭐ Build a Neural Network with Pytorch - StoryTeller PART 1:
ruclips.net/video/mzbJd0NhW2A/видео.html
⭐Train a Neural Network - StoryTeller PART 2:
ruclips.net/video/GTyTG3XzPq8/видео.html
I must warn you though - it's a very stupid neural network as I haven't optimized it 😅
These were some of the first tutorials I've ever filmed, so I wasn't particularly good in terms of explaining. The new series will be much much better 😃
@@PythonSimplified thank you so much
Would you mind doing a follow-up on this, where we do decide to iterate through the batches? (ie with the amazon example app, how would we best implement batch processing in this context). Thanks. Your vids are great.
Thank you for this very useful video!
It's so refreshing seeing code that runs smoothly right after you write it and hit run. My code almost always throws an error after I type it.
Very clearly explained and with your usual enthusiasm, keep it up! :)
This was beautiful, just in time for my work. Keep it up!
A delightful presentation. thanks.
Awesome. I wonder if loading them into chunks and appending them into a single DataFrame leads to a faster FULL DataFrame that we can work with altogether
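One way to try this out, sketched on a tiny in-memory CSV (made up for illustration): collect the chunks and join them with a single pd.concat call. Keep in mind that the final DataFrame still has to fit in memory, and whether this beats a plain read_csv is something to time yourself.

```python
import io

import pandas as pd

# Made-up stand-in for a large CSV file on disk.
csv_text = "title,final_price\na,1\nb,2\nc,3\nd,4\ne,5\n"

chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

# One concat call at the end is cheaper than growing a DataFrame chunk by chunk.
full = pd.concat(chunks, ignore_index=True)
print(full.shape)
```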
Hi Mariya,
I have a question regarding programming logic, algorithm and data structures:
What do you recommend for a beginner who wants to learn programming logic, algorithms and data structures? I am totally confused on what I should be doing and what are the best resources to use...
I know Java syntax, but problem solving is a show stopper for me and I don't know what to study, what to do or what resources to use to improve...
Please help! I am a desperate beginner
Thank you!
( I am learning Java atm, but I love your videos so I follow you! :D)
Any tutorial Python, Kivy and bluetooth ?
Awesome! We can also use Dask or worker-based distributed approaches. Perhaps a follow-up for you?
have you considered a voice career? like singer or actress, your voice is deep and clear, I like it very much. Thank you for all your work, you make a lot for python learners like myself
Great information and knowledge! and Love your energy!
Thank you so much :)
I’d love to see a version of this video but for geospatial data. Manipulating such data is often complicated by the geospatial aspect. Anyway, great video!
Thanks for the suggestion and for the lovely comment! 🙂
I can't say that I'm an expert in mapping and coordinates... but if I stumble upon a nice geospatial dataset, I'll definitely explore it and see if I understand it well enough to film a tutorial about it 😉
thanks for sharing this content, I learn a lot from you, keep going! 💪
Hoping one of the following videos will be on processing giant datasets with vectorization, using apply() instead of for loops, etc. 🤞
Haven't had a chance to explore it yet, thank you so much for suggesting! 😃
My guess is - call .apply() on each individual chunk using my code from minute 10:25, it should speed things up quite a bit 😉
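A rough sketch of the per-chunk idea, not the code from the video: the final_price column, the sample values and the 50% discount are all hypothetical. It computes the same result twice, once with .apply() and once with a vectorized column operation, so the two can be compared. On large chunks the vectorized form is usually the faster one, because it avoids a Python function call per row.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file.
csv_text = "final_price\n10.0\n20.0\n30.0\n40.0\n50.0"

processed = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Row-by-row version: .apply() calls the lambda once per value.
    chunk["discounted_apply"] = chunk["final_price"].apply(lambda p: p * 0.5)
    # Vectorized version: a single operation on the whole column.
    chunk["discounted_vec"] = chunk["final_price"] * 0.5
    processed.append(chunk)

# Combine the processed chunks into one DataFrame.
result = pd.concat(processed, ignore_index=True)
```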
You should have a look at the Python lib datatable. It was designed especially for huge datasets and is orders of magnitude faster than pandas. It can also do memory mapping to process data that does not fit into RAM.
I loved the new intro so much!
First - that was awesome walkthrough, Mar.
Second - would be nice to see more clustering workloads. Like koalas, instead of pandas. Working on a single node is acceptable/tolerable. But to future-proof our skills we'll have to work with parallelization. Aka Spark/Dask/Ray.
I reckon you can use free acc to run your jup notebooks from a company-we-all-know-that-uses-spark-as-core-engine, something to do with bricks...
Beautifully explained
I use parquet bc it's so much faster and takes up less disk space AND saves data types, which is really useful if you're optimizing dtype for speed/storage. And it's compatible with R/Power BI/etc. When I'm starting from CSV though this is an awesome tip, thanks! 🙏 I'm shocked at the speed increase!
parquet with polars ;)
I was wondering if you were going to do a gui app series on pyside6 or qt design? Love the work!!
Next on the menu - a brand new AI Simplified series 😉
I'll post a few GUI projects in between, but my main focus at the moment is there 🙂
@@PythonSimplified I love it! Can't wait to watch. I am currently using your Qt5 tutorial to help with my chatbot with gpt-3.5 where the user can select a character and the bot will emulate chat with that character. It's super fun!
Can you give me an explanation why after saving and opening the data as a new file, instead of taking long to load , it took so little time?
The reason is - we're dealing with a much much smaller file! we reduced it from 11GB to 600MB so it's not nearly as challenging to load as the previous one 😃
Save it to parquet files, it is the fastest method to save and store large datasets 😎
I'm still waiting for your first English course
you should check out timeit
from timeit import timeit
n = 10  # number of runs; main() is assumed to be defined in your own script
result = timeit(stmt="main()", globals=globals(), number=n)
print(f"Execution time is {result / n} seconds")
good afternoon, when will you come here in Brazil to visit us?, I work in an Arab restaurant when you come let us know, you are very charismatic, thanks for the videos, we learn a lot from you
Definitely a fantastic incentive! I've just landed in the middle east to visit family, but would love to go on an actual vacation to a place I've never been before! Brazil is certainly on my to-go list 😉
Thank you so much for the lovely comment, and if I'm ever in Brazil - you can count me in for a big shawarma plate!! 😃
@@PythonSimplified It will be a pleasure to receive your visit, when you are here in Brazil, I can't wait.
I wonder how to use AI in order to build an app to model a sim living on a plot of land in a specific geographical location. So, you'd enter the geographical location, the size of the plot of land, and the current state that land is in. The app tells you what the yield of a combination of crops and plants will be. As the app develops, it will be able to model whole communities living off grid. This would be a very interesting project to work on. Plenty of scope for continual development as more and more data is fed into the app. It would have possibilities for graphics and animations to really simulate the meteorological conditions of a given location and compute all practical aspects required in order to live in that given region. It would be a big project but I'm sure one that many programmers would want to work on.
Could you talk about custom tkinter ?
I would wear a t-shirt that says "the chunk comes as-is" :D haha. Another great video! Thanks for your hard work!
Thank you for your help Mariya! ☺️ On the occasion of women's day, I would like to wish you all the best, noblest and most beautiful! :) Please keep developing such great content on your youtube channel! ❤️ 🙏
@Python Simplified Can you apply this python code inside of SharePoint to overcome the 5k record limit?
thank you for such an amazing video!
what did you do? what sorcery is this? how does saving it under a different name improve memory and processing?
please explain.
We've saved it after disposing of 35 columns, so the dataset was already much smaller (roughly 600MB instead of 11GB) before we re-saved it 😉
In addition, I believe that Pandas is optimizing the data type of each column before exporting your DataFrame, so there should be a boost of efficiency there as well 🙂
@@PythonSimplified ah, I expected the dropping of not necessary columns, but I did not know that pandas can optimize data exporting. (I think this feature is new, and will be included in pandas 2.0 or so I heard.) Excited for new pandas though.
Decoding problem with the utf-8 codec. I had to add encoding='latin-1'.
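A small sketch of the fix described above. The file and its contents are made up for illustration: a CSV is written with Latin-1 encoding, the default UTF-8 read is attempted, and then the file is read again with encoding="latin-1". One caveat worth knowing: Latin-1 maps every possible byte to a character, so it never raises a decoding error, but it can silently produce wrong characters if the file was actually saved in some other encoding.

```python
import os
import tempfile
import pandas as pd

# Write a tiny Latin-1 encoded CSV (hypothetical data) to a temp file.
path = os.path.join(tempfile.mkdtemp(), "latin1_sample.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("title,final_price\ncafé,3.5\n")

# Try the default first (read_csv assumes utf-8 unless told otherwise).
try:
    pd.read_csv(path)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the matching encoding reads the file correctly.
data = pd.read_csv(path, encoding="latin-1")
print(utf8_failed, data["title"][0])
```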
You're way smarter than me ..God Bless.. 🤙
Channel active after long time ❤️🙂
I like your knowledge and tutorials
Thank you so much! Super happy to hear! 😀😀😀
I have 6 million rows of health data and Jupyter crashes. Thank you for a solution with chunks
Can this be done using json as input as well? Actually I saw your SQLite video too. The problem I'm really trying to crack is I have a dataset where each record(300 million records in total) is a nested json object. I have the option to convert it into SQLite3 db but I am not sure how we can store something like the 'Categories' field in your dataset which has an indefinite number of array elements within it. End goal is to write an app which can send SQL queries to get filtered results from the database. Perhaps you can advise.
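One common way to handle an array field of unknown length in SQLite is a separate child table with one row per array element. The sketch below is only an illustration of that idea: the table names, column names and sample records are hypothetical, and with 300 million records you would stream the input and insert in batches rather than hold everything in a list.

```python
import sqlite3

# Hypothetical records; in practice these would be streamed from the JSON input.
records = [
    {"title": "Kettle", "final_price": 25.0, "categories": ["Home", "Kitchen"]},
    {"title": "Novel", "final_price": 9.5, "categories": ["Books"]},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT, final_price REAL)"
)
conn.execute(
    "CREATE TABLE product_categories ("
    "product_id INTEGER REFERENCES products(id), category TEXT)"
)

for record in records:
    cur = conn.execute(
        "INSERT INTO products (title, final_price) VALUES (?, ?)",
        (record["title"], record["final_price"]),
    )
    # One row per array element, so the list can have any length.
    conn.executemany(
        "INSERT INTO product_categories (product_id, category) VALUES (?, ?)",
        [(cur.lastrowid, category) for category in record["categories"]],
    )
conn.commit()

# Filter products by category with a plain SQL join.
rows = conn.execute(
    "SELECT p.title FROM products p "
    "JOIN product_categories pc ON pc.product_id = p.id "
    "WHERE pc.category = ?",
    ("Kitchen",),
).fetchall()
category_count = conn.execute("SELECT COUNT(*) FROM product_categories").fetchone()[0]
```

The alternative is to store the whole array as JSON text in a single column, which is simpler to load but harder to filter with ordinary SQL.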
What font are you using in your terminal? (part where you activate environment in anaconda)
Just the default font that Anaconda comes with 😃
How do you know the names of columns before creating the dataframe?
Hi Tom!! 😃
I've checked with data.columns (somewhere around the middle of the tutorial, I believe).
Otherwise - I wouldn't even have the slightest clue as this dataset won't open with Excel (and probably their MAC equivalent too 😉). The software completely collapses and my only way of accessing the contents of this file was via Python 🙃
@Python Simplified Correct me if I'm wrong, but `df.columns` comes AFTER you have created the df. But creating that df process is resource consuming because of the size of the dataset. My approach to getting to know my data would be to create a much smaller df using `nrows`. Something like,
dummy_df = pandas.read_csv('large_data.csv', nrows=50)
This will contain only 50 rows, but I'm more interested in the column names.
dummy_df.columns would now give me the column names that I can use in chunks. 😀
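A variation on the same trick, sketched with a made-up in-memory CSV: if only the column names are needed, nrows=0 makes read_csv parse just the header line and return an empty DataFrame. With a real file you would pass its path instead of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file.
csv_text = "final_price,image_url,title\n9.99,http://example.com/a.png,Widget\n"

# nrows=0 parses only the header line, so no data rows are loaded.
header_only = pd.read_csv(io.StringIO(csv_text), nrows=0)
column_names = header_only.columns.tolist()
print(column_names)
```

The resulting list can then be trimmed down and passed to usecols when loading the full file.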
can we combine kivy with flask to use kivy program as a web app
Great video! As usual. Thanks!
Thank you so much Marcin!! 😃 😃 😃
(Present and) Future challenges concern big data.
The biggest challenge from my perspective is the lack of privacy. We don't notice how much of our data we voluntarily provide to all kinds of services/software. This data is not always used in our favour and very often sold to other services/software of which we are unaware of 😉
Please Mariya this is awesome
Can you do something on python data structures and algorithms...🙏
You are just amazing brain with beauty 😍.
Thank you so much Jk!!! 😀😀😀
What if my project requires loading the data as it is into the oracle database? I have done these tasks of loading 7 to 10 Million records into the database using Chunks or you can say by batches. I am not sure if there is any convenient way to reduce the time.
Is there no code completion and code inspection/parameter preview available for jupyter? (sorry, I´m python beginner ;-) )
good data crunching, we just need quantum computers now so we can work on data that hasn't been created in this universe yet.
or even better - created in a parallel universe 😉
The new intro is nice :)
Not working with a 50 GB dataset. Is there any other alternative? I'm trying to import the file in Kaggle and it keeps on crashing
If you have a CUDA compatible GPU - try opening it with cuDF pandas. I have a tutorial of how to set it up and you can use a regular read_csv() command to read via GPU rather than CPU (if the dataset is compatible, of course):
ruclips.net/video/9KsJRyZJ0vo/видео.htmlsi=hnHA2gW4GzDBykDH
I hope it helps! Otherwise - try a library called Polars or other Pandas alternatives :)
@@PythonSimplified I am using Kaggle on a MacBook with 8GB of RAM. I'll try this one and get back to you. Thank you for the reply. You're inspiring many.
Do you have like a Python tutorial crash course series
Hi you are the best 👏👏👏🌻🌻🌻
Thank you so much Alexandro! 😃
Great video!!! As usual ! 🏆🏆🏆
hey ma'am, I am a beginner in the field of programming and I have a doubt
I am really bad at maths, so if I want to do AI or ML in the future, do I need to learn maths?
I'm working with >54,000,000 records. The ETL process from Oracle to SQL Server takes 8 hours.. 😅
Thanks!
Thank you so much Rickey! I really appreciate it! 😃 😃 😃
I can't download the dataset, can anyone help me?
Use f-string instead of concatenating your strings!
One is not better than the other, it's just a matter of personal preference 🙃
wow video was uploaded 3 months ago, the dataset was 2.3m records & now its 77.9m , that's too much data
A pleasure to watch, I've gone through half the channel. Killer smile))))
WHERE IS THIS JUMP SCARE SOUND! I GOT SCARED JESUS 0:03
Hello I really like your tutorials. I have a LARGE json file(22GB) and I can not open it with pandas read_json. I will be really thankful if you make similar tutorial for json files.
The exact same techniques will work with read_json as well 😃
You can combine the usecols and chunksize properties to load the dataset bit by bit, no need for a special tutorial as it's not really different from read_csv 😉
@@PythonSimplified thank you for your response, but I am getting this error: TypeError: read_json() got an unexpected keyword argument 'usecols' , and I can not see usecols in the documentation for read_json.
aha! you're right, usecols is not a property of read_json()! 😱
The problem with JSON files is that one is structured differently from the other and requires a great level of customization. My suggestion: use chunksize to have a look inside the individual items of your file, and try combining it with orient='columns' to get a table-like structure for each chunk.
From there - you can call the .drop() method on each chunk to dispose of unnecessary columns and then save the much smaller chunks into a new csv file (using the code example I shared in the pinned comment up top 😉)
I hope it helps! it's hard to tell without seeing the actual structure of your JSON file and it's something you can only find out after successfully loading it 🙃
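A hedged sketch of the chunk-then-drop approach for JSON. One important assumption: in pandas, read_json only accepts chunksize together with lines=True, which means the file has to be in JSON Lines format (one JSON object per line). A single huge nested JSON document would first need converting to that layout. The sample data and the "extra" column are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical JSON Lines data: one JSON object per line.
jsonl_text = (
    '{"title": "A", "final_price": 1.0, "extra": "x"}\n'
    '{"title": "B", "final_price": 2.0, "extra": "y"}\n'
    '{"title": "C", "final_price": 3.0, "extra": "z"}\n'
)

# chunksize is only accepted together with lines=True.
reader = pd.read_json(io.StringIO(jsonl_text), lines=True, chunksize=2)

kept = []
for chunk in reader:
    # Discard the unneeded column from each chunk before keeping it.
    kept.append(chunk.drop(columns=["extra"]))

smaller = pd.concat(kept, ignore_index=True)
print(smaller.shape)
```

With a real file, each trimmed chunk could be appended to a CSV instead of collected in a list, so the full dataset never has to sit in memory at once.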
@@PythonSimplified Thank you, I will try it.
will it be about duckdb?
What happened with the custom GPT???
Where are you ?
Super thanks
var1="Very Useful Topic."
var2=" Thanks you Mariya. (\/)"
var3=var1.replace(".",var2) ; )
Wow
You are making magic
Just keep doing these videos 🎉❤…..;) bye