Importing a Pandas Dataframe to a Database in Python [For Your Data Science Project]

  • Published: 16 Sep 2024

Comments • 105

  • @IrakliChitishvili
    @IrakliChitishvili 3 years ago +6

    How the hell is this channel not more popular?

    • @stratascratch
      @stratascratch  3 years ago +1

      I know right?! I guess not many people like to watch me code...

    • @IrakliChitishvili
      @IrakliChitishvili 3 years ago +4

      @@stratascratch I think it's just a matter of time before your channel and the work you do take off.

    • @flipcase
      @flipcase 3 years ago +1

      I second that!

  • @toilinginobscurity3091
    @toilinginobscurity3091 2 years ago +3

    There's really no rhyme or reason in the universe. This channel and these tutorials are gold, but instead I am bombarded by ML channels playing with toy datasets and branding themselves as all-you-need clickbait. You, Sir, are a saint.

    • @stratascratch
      @stratascratch  2 years ago

      haha =) I know exactly what you mean. Too many of those channels around.

  • @prateek2159
    @prateek2159 3 years ago +18

    Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for this particular series, because there's literally no one on YouTube with such guidance on DS projects and I have been looking for one for a very long time. My placements are just 12 months away and I really want to make a full-stack data science project. Thank you.

    • @stratascratch
      @stratascratch  3 years ago +6

      Thanks for the kind words! I have created a series of these videos, from importing a pd df to a db to grabbing data from an API. I think most of the videos in the series should get you started on grabbing and managing data. Are there other types of skills you're interested in?

  • @andrealves3104
    @andrealves3104 A month ago

    I was having weird issues inserting data with the pandas to_sql method into a PostgreSQL database on AWS while passing it through a SQLAlchemy engine.
    For now, adding it manually through a SQL query while iterating row by row in pandas worked fine, both for my project's purposes and for the video's goal.
    All in all, thank you very much for the overall explanation.
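
    A minimal sketch of that row-by-row fallback, assuming a psycopg2 connection and a simple "videos" table; the credentials, table, and column names here are placeholders, not the video's exact schema:

        import pandas as pd
        import psycopg2

        df = pd.DataFrame({
            "video_id": ["abc123"],
            "video_title": ["Sample video"],
            "view_count": [100],
        })

        conn = psycopg2.connect(host="your-host", dbname="your-db",
                                user="your-user", password="your-password")
        curr = conn.cursor()

        query = """INSERT INTO videos (video_id, video_title, view_count)
                   VALUES (%s, %s, %s);"""

        # iterate over the DataFrame and insert one row at a time
        for _, row in df.iterrows():
            curr.execute(query, (row["video_id"], row["video_title"],
                                 row["view_count"]))

        conn.commit()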

  • @Networkprofessor
    @Networkprofessor A year ago +2

    Just what I needed! This will take you to a whole new level. Thank you.

    • @stratascratch
      @stratascratch  A year ago

      You are welcome. Thank you for the nice comment. 😀

  • @jaquelinerazzo
    @jaquelinerazzo 2 years ago

    I already played the video back like 4 times just to watch the ads, because I don't know how else to thank you! Great channel, great video. :)

  • @awds121
    @awds121 2 years ago +1

    Hey Nate! Great video! I have some feedback. First, the videos are very helpful and you should definitely continue doing more like this. Here are some ideas: how to schedule scripts to run on AWS; create tables, analyze data, and create a dashboard. Second, the playlist that includes this video is unordered. It is not hard to figure out which video should be watched first, but it doesn't hurt to sort it in the order you want the videos to be watched. Third, AWS doesn't include PostgreSQL in the free tier (at least not anymore). Just a heads up. Keep up the good work!

  • @RenatoCosta1
    @RenatoCosta1 A year ago +1

    Hey Nate, amazing stuff! I think it would be great if you could do two more videos covering the topics you've mentioned in your video about the only project one would need to get a job in data science. I guess the next videos would be about machine learning modelling, and the last one on how to deploy it online. That would be awesome! Thanks anyway

    • @stratascratch
      @stratascratch  A year ago +2

      We're planning to publish 1 data project a month! When they are released, it'd be great if you could provide feedback. For example, did we dive deep enough into the technical aspects of the project? Did we dive too deep? Is the video too long or too short? etc. This will help us iterate on the videos so we can provide something of value for all viewers. Thanks for the feedback!

  • @zynkers401
    @zynkers401 2 years ago

    Great video Nate! You really have the spirit for teaching. The vocabulary and content are easy to understand and follow. I hope StrataScratch has continued exponential growth.

    • @stratascratch
      @stratascratch  2 years ago +1

      Thanks for the kind words! And thanks for watching!

  • @PATRICKCHUAD
    @PATRICKCHUAD 2 years ago

    Thanks Nate for your video on Pandas. I will apply this to my project in Python. Thanks!

    • @stratascratch
      @stratascratch  2 years ago +1

      Great! Happy to help, thanks for watching!

  • @javierjdaza
    @javierjdaza 3 years ago +2

    Brother, I never miss one video. This is pure gold!! Keep doing this please. Can I connect with you on LinkedIn?
    Thanks bro, you're the best

    • @stratascratch
      @stratascratch  3 years ago

      Thanks so much for watching! And glad you like my videos =) Yes, please connect with me on LinkedIn. You can search for StrataScratch and I should pop up (I think?). If not, let me know.

    • @javierjdaza
      @javierjdaza 3 years ago

      @@stratascratch I can't find you. :(

    • @stratascratch
      @stratascratch  3 years ago

      @@javierjdaza Sorry about that. Here you go! linkedin.com/in/nathanrosidi

  • @its_me7363
    @its_me7363 3 years ago

    A much-awaited video for me... thanks for your time and effort, Nate

    • @stratascratch
      @stratascratch  3 years ago

      Thanks for watching! I just updated the description with a link to the GitHub repo that has the final version of this project. I'd definitely take a look at that and follow along with the notebook. I felt a bit too rushed on this video trying to explain all the concepts. But if there's anything to take away from this video, it's (1) how to properly structure your functions and (2) how to solve for memory & performance issues when potentially dealing with millions and billions of rows of data.

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch How do industry people manage large data stored in databases while exploring it? What are the best practices for data scientists or analysts in these cases?

    • @stratascratch
      @stratascratch  3 years ago +1

      @@its_me7363 The dbs are optimized to handle large amounts of data with speed. You definitely won't be using any Postgres or MySQL dbs; you'll be using Hive or Greenplum or Snowflake. Also, depending on what you are trying to do, you will likely move away from pandas and onto PySpark for some work. All of this is taught when you join the company. Most people know that not all entry-level DSs have used these tools.
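
      A minimal PySpark sketch of the kind of shift he's describing; the file name and columns are illustrative, not from the video:

          from pyspark.sql import SparkSession

          # a Spark session replaces the implicit single-machine pandas runtime
          spark = SparkSession.builder.appName("videos").getOrCreate()

          # reads can be distributed across a cluster instead of one machine
          sdf = spark.read.csv("videos.csv", header=True, inferSchema=True)

          # the same kind of groupby you'd do in pandas, at cluster scale
          sdf.groupBy("channel_id").count().show()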

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch Are these dbs similar to PostgreSQL? And what do you mean by moving away from pandas... does industry not use pandas in its projects?

    • @stratascratch
      @stratascratch  3 years ago

      @@its_me7363 Yes, the dbs are similar to Postgres. There are just minor differences in syntax, but it's easy to learn. The industry uses pandas, but when dealing with large amounts of data you want to work with something built to handle that scale, and that's PySpark.

  • @JeffersonCanedo
    @JeffersonCanedo 2 years ago +1

    I'm so glad I found this video. Thank you

  • @quadrialli3715
    @quadrialli3715 A year ago

    Great and detailed explanation

  • @jessinthamathew3032
    @jessinthamathew3032 3 years ago +1

    Hi Nate! Really appreciate your content, its relevance, and quality. So much to learn with each video! I wanted to ask why you opted for PostgreSQL specifically? Can I use MS SQL Server too?

    • @stratascratch
      @stratascratch  3 years ago +2

      I'm just used to Postgres, and it's open source. MS SQL is fine to learn as well. In the future, the platform will be able to handle both db engines, but currently it only uses Postgres. Thanks for watching!

  • @codingstyle9480
    @codingstyle9480 2 years ago +11

    Hi, thank you for the video. I have a question: how would you go about updating your database in real time without manually running the code? Perhaps we could design the code so that it is triggered when something changes in the source data, to insert or update data automatically in real time (perhaps with some delay, of course).

    • @muhammedmarong6645
      @muhammedmarong6645 2 years ago

      Hey, did you figure that out?

    • @pashlenenaidoo3695
      @pashlenenaidoo3695 2 years ago

      I would also like to know if you have achieved this

    • @stratascratch
      @stratascratch  2 years ago +1

      You would need something like Airflow or Jenkins to schedule the trigger.

    • @caioriet458
      @caioriet458 2 years ago

      I would like to know if you solved this problem!

    • @stratascratch
      @stratascratch  2 years ago +1

      @@caioriet458 Yes, I left a comment. You'll need to use a scheduler like Airflow in order to keep updating your db without manually running the code. The scheduler will run it for you.
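
      A minimal sketch of what that scheduling might look like in Airflow (assuming Airflow 2.x; the dag_id, interval, and update_db function are illustrative):

          from datetime import datetime, timedelta

          from airflow import DAG
          from airflow.operators.python import PythonOperator

          def update_db():
              # pull the latest source data and run the insert/update logic here
              ...

          with DAG(
              dag_id="refresh_videos_db",
              start_date=datetime(2024, 1, 1),
              schedule_interval=timedelta(hours=1),  # re-run every hour
              catchup=False,
          ) as dag:
              PythonOperator(task_id="update_db", python_callable=update_db)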

  • @bodhid3091
    @bodhid3091 2 years ago +1

    Good video. A couple of questions:
    Why do you store the new rows in another dataframe rather than calling insert_into_table directly?
    Any reason you don't use PostgreSQL's INSERT ... ON CONFLICT option to create an upsert, combining the insert and update into one call?
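
    For reference, the upsert this comment describes looks roughly like this; a sketch assuming a psycopg2 cursor curr and connection conn, and that video_id is the primary key (names are illustrative):

        # insert a row, or update the metrics if video_id already exists
        upsert = """
            INSERT INTO videos (video_id, video_title, view_count)
            VALUES (%s, %s, %s)
            ON CONFLICT (video_id)
            DO UPDATE SET video_title = EXCLUDED.video_title,
                          view_count  = EXCLUDED.view_count;
        """
        curr.execute(upsert, ("abc123", "Sample video", 100))
        conn.commit()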

  • @TripPiper
    @TripPiper 2 years ago

    If your connection to AWS fails, try creating a new inbound security group rule that allows all access.
    But also learn how to create a connection with TCP rules or some other form of security, because in production you'd never have inbound connections open to all access.

    • @denyseperezdevera7188
      @denyseperezdevera7188 A year ago

      Hello! Do you have any idea why it returns "database "database-1" does not exist"?

  • @GauravKumar-xl1fs
    @GauravKumar-xl1fs 3 years ago +1

    Please make more videos in this series, and on pipelines

    • @stratascratch
      @stratascratch  3 years ago

      Glad you liked the videos. Will do if there are more views!

  • @williamausenka9251
    @williamausenka9251 2 years ago +1

    Great video! One question: what ML implementation would you do with that data? I'm trying to come up with something, but no success so far.

  • @wisjnujudho3152
    @wisjnujudho3152 2 years ago +1

    Hi Nate, is there any alternative to AWS RDS, a cloud database server that doesn't require a credit card even for the free tier? Alternatively, is it possible (maybe) to use Heroku Postgres as the database? Thx

    • @stratascratch
      @stratascratch  2 years ago +1

      Hi Wisjnu, I find this article helpful.
      You might want to check it out: www.lastweekinaws.com/blog/10-free-cloud-databases-you-should-consider-and-1-you-shouldnt/

  • @MartinoxxHD
    @MartinoxxHD 2 years ago +1

    Nice vid, I actually learned new things! One question: what's the point of creating a temporary DF for the new videos and then UPDATE-ing them in a new for loop? Can't we just UPDATE and INSERT the values in the same for loop?

    • @stratascratch
      @stratascratch  2 years ago +1

      I hold the new videos in a temp df so that I only need to upload the new videos or replace existing videos with new metrics; I'm trying to stay away from updating all videos, for performance. Then I do the update and insert all at once, but only for the new vids. Hope that makes sense. Thanks for watching.
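
      A rough sketch of that split, assuming a psycopg2 connection conn and a DataFrame df of freshly pulled videos (the identifiers are illustrative, not the video's exact code):

          import pandas as pd

          # ids already present in the table
          existing_ids = set(
              pd.read_sql("SELECT video_id FROM videos;", conn)["video_id"]
          )

          # rows to INSERT (brand new) vs rows to UPDATE (metrics changed)
          new_vid_df = df[~df["video_id"].isin(existing_ids)]
          update_df = df[df["video_id"].isin(existing_ids)]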

  • @hogthebuilder
    @hogthebuilder 2 years ago

    2:20 in and I know that this is about to be the video for me.

  • @noobshady
    @noobshady 2 years ago

    That was really helpful.

  • @skyace173
    @skyace173 2 years ago

    Hey Nate! Great video! How much time would you expect to finish this in if it were a project at work?

    • @stratascratch
      @stratascratch  2 years ago

      This is probably a 1-2 day project depending on what is being asked of me. But it shouldn't take too long if you already know what you're doing.

  • @elextures
    @elextures 5 months ago

    I'm having a problem with the channel ID. In CHANNEL_ID, I put my own channel, but each time I request a response, I get a different channel

  • @khairulhakimi508
    @khairulhakimi508 2 years ago

    I'm new to this field, and I have one noob question. Why do people (DAs, DEs, DSs) use Python (pandas DataFrames) for data ingestion to transfer to SQL? Why not just upload the CSV into SSMS using a query or the import flat file wizard?

    • @stratascratch
      @stratascratch  2 years ago +1

      It's basically to allow processing or manipulation of the column names and data, if you want to do that.
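
      A small sketch of the kind of manipulation he means, run before the data ever hits the database (the file and column names are illustrative):

          import pandas as pd

          df = pd.read_csv("videos.csv")

          # normalize messy spreadsheet headers into SQL-friendly names
          df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

          # coerce a column to the type the table expects, fixing bad values
          df["view_count"] = pd.to_numeric(df["view_count"],
                                           errors="coerce").fillna(0)

          # then load with to_sql, or row-by-row inserts as in the video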

  • @suprithm3802
    @suprithm3802 2 years ago

    I have a CLOB data type; I am not able to update the data, but the append function is working fine

  • @ruthfussee5503
    @ruthfussee5503 2 years ago

    Hi Nate, thank you for this video, super useful; I was scanning the internet for such a solution but could not find it. I am wondering: what if there are many columns (e.g. 100)? Wouldn't it be inefficient to type in row['column1'], ..., row['column100']? Is there a way we can put the column information in a tuple, list, or dictionary and pass it through, like after SET or in vars_to_update? (See the sketch below.)
    Another issue is when the Excel column names have spaces (different from the SQL columns); how do I write the code then, as I keep getting syntax errors?
    Lastly, in the real world the Excel files sometimes don't follow the template we stipulated; how should I design a validation check that is useful?
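
    One way to avoid typing each column by hand, sketched with illustrative names (a cursor curr and DataFrame df as in the video, and vars_to_update as a plain Python list):

        # build the SET clause from a list of column names
        vars_to_update = ["video_title", "view_count", "like_count"]
        set_clause = ", ".join(f"{col} = %s" for col in vars_to_update)
        query = f"UPDATE videos SET {set_clause} WHERE video_id = %s;"

        # spaces in spreadsheet headers can be fixed before this point:
        # df.columns = df.columns.str.replace(" ", "_")

        for _, row in df.iterrows():
            values = [row[col] for col in vars_to_update] + [row["video_id"]]
            curr.execute(query, values)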

  • @denyseperezdevera7188
    @denyseperezdevera7188 A year ago

    Hi Nate, thank you so much for these amazing resources! Do you know why, when I try to connect to the database, I get "OperationalError: FATAL: database "database-1" does not exist"?

    • @stratascratch
      @stratascratch  A year ago +1

      Do you have a db? Did you create one for the project? You need to add your credentials.

    • @denyseperezdevera7188
      @denyseperezdevera7188 A year ago +1

      Just solved my issue! I had switched the DB instance identifier with the DB username.

    • @artyomashigov
      @artyomashigov 18 days ago

      @@denyseperezdevera7188 Oh thanks, I had the same issue, so my username and dbname are the same

  • @andrewraad5626
    @andrewraad5626 2 years ago

    Hey Nate, thanks for this video. How do you deal with foreign keys? My CSV file has strings instead of the foreign keys needed to upload to a specific table that uses foreign key constraints. What do you suggest?

    • @stratascratch
      @stratascratch  2 years ago

      I have no idea how to fix your situation. I would just change the strings to FKs. It seems some data manipulation will need to be done for your issue.
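
      A sketch of that manipulation, assuming a hypothetical "categories" lookup table and a database connection conn (all names are illustrative):

          import pandas as pd

          # build a name -> id map from the referenced table
          categories = pd.read_sql(
              "SELECT category_id, category_name FROM categories;", conn
          )
          id_map = dict(zip(categories["category_name"],
                            categories["category_id"]))

          # swap the CSV's strings for the foreign key ids before inserting
          df["category_id"] = df["category_name"].map(id_map)
          df = df.drop(columns=["category_name"])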

  • @acsrr4288
    @acsrr4288 2 years ago +1

    What library should we use to load data to an Azure cloud DB?

    • @stratascratch
      @stratascratch  2 years ago

      You can use various libraries and frameworks to connect.

    • @acsrr4288
      @acsrr4288 2 years ago

      @@stratascratch Can you recommend a couple for us to explore? Thx

  • @yashgupte3044
    @yashgupte3044 2 years ago

    Getting a connection timed out error while connecting to the RDS instance. I have allowed all traffic in the inbound rules of the security group.

    • @stratascratch
      @stratascratch  2 years ago

      Hi Yash, this link could help you troubleshoot: aws.amazon.com/premiumsupport/knowledge-center/rds-oracle-connection-errors/

    • @jakubjankovic4144
      @jakubjankovic4144 2 years ago

      Have you managed to solve it? I am having the same issue

    • @llamashockz
      @llamashockz 2 years ago

      Having the same issue for hours; nothing seems to work to fix it

    • @jakubjankovic4144
      @jakubjankovic4144 2 years ago

      @@llamashockz I ended up using ElephantSQL instead of AWS to host the db and it resolved the issue for me.

  • @mcmoodoo
    @mcmoodoo 2 years ago

    Why not just use pandas' to_sql("table_name", engine), which will create the table based on the DataFrame object? You can also pass a dtype dictionary to specify the exact types for the columns, like varchar and integer.

    • @stratascratch
      @stratascratch  2 years ago +1

      Yes you can do that. It doesn't always work the way I want it to when I use to_sql() and I often like to have QA checks along the way so my method helps with that. But there are multiple ways to do what I'm doing. I would choose the option that allows you to get your work done!
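
      For readers who want that route, a minimal to_sql sketch with an explicit dtype map (the engine URL, table, and columns are placeholders):

          import pandas as pd
          from sqlalchemy import create_engine
          from sqlalchemy.types import Date, Integer, String

          df = pd.DataFrame({
              "video_id": ["abc123"],
              "view_count": [100],
              "upload_date": [pd.Timestamp("2024-01-01")],
          })

          engine = create_engine("postgresql://user:password@host:5432/dbname")

          df.to_sql(
              "videos",
              engine,
              if_exists="append",  # or "replace" to recreate the table
              index=False,
              dtype={"video_id": String(50),
                     "view_count": Integer(),
                     "upload_date": Date()},
          )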

  • @reality8221
    @reality8221 2 years ago

    Hi Nate, thanks for the great videos. Wouldn't it be simpler to insert it as part of the first for loop, instead of creating a df and then iterating through it once again? Would like to know your thoughts.

    • @stratascratch
      @stratascratch  2 years ago

      My thinking about the temp df was to only update the new videos and metrics, and leave the rest alone. I think that's what I did in this project. Hope that makes sense.

  • @SensiStarToaster
    @SensiStarToaster 2 years ago

    Nice video. But writing SQL directly gets messy to maintain; why not use something like SQLAlchemy to separate your Python code from having to worry about the details of the specific SQL implementation, and make it independent of the type of database manager used?

    • @stratascratch
      @stratascratch  2 years ago +1

      Yes, you can definitely do that. I haven't used SQLAlchemy but have seen the documentation and agree with you. It could be a much better solution.
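
      A minimal sketch of that engine-independent approach with SQLAlchemy Core (assuming SQLAlchemy 1.4+; the URL, table, and columns are placeholders, and switching databases is mostly a matter of changing the URL):

          from sqlalchemy import (Column, Integer, MetaData, String, Table,
                                  create_engine, insert)

          engine = create_engine("postgresql://user:password@host:5432/dbname")

          metadata = MetaData()
          videos = Table(
              "videos", metadata,
              Column("video_id", String(50), primary_key=True),
              Column("video_title", String(200)),
              Column("view_count", Integer),
          )
          metadata.create_all(engine)  # create the table if it doesn't exist

          # SQLAlchemy renders the right dialect for whatever engine is used
          with engine.begin() as conn:
              conn.execute(insert(videos), [
                  {"video_id": "abc123", "video_title": "Sample",
                   "view_count": 100},
              ])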

  • @ShahnazMalik.
    @ShahnazMalik. 2 years ago

    Inserting or updating a table with millions of records in a SQL database takes hours to complete. What is the best solution? Please advise.
    Thank you.

    • @stratascratch
      @stratascratch  2 years ago +1

      That's normal. Sometimes jobs just take hours to complete. I would optimize for ensuring that the process doesn't break, and if it does, that you don't lose all the work. One way to do this is to batch the updates/inserts so that if the process does break halfway through, you at least have updated/inserted half the records.
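
      A sketch of that batching idea with psycopg2's execute_values (the credentials, table, data, and batch size are illustrative):

          import psycopg2
          import psycopg2.extras

          conn = psycopg2.connect(host="your-host", dbname="your-db",
                                  user="your-user", password="your-password")
          curr = conn.cursor()

          rows = [("abc123", "Sample video", 100)]  # illustrative data

          BATCH_SIZE = 10_000
          for start in range(0, len(rows), BATCH_SIZE):
              batch = rows[start:start + BATCH_SIZE]
              psycopg2.extras.execute_values(
                  curr,
                  "INSERT INTO videos (video_id, video_title, view_count) "
                  "VALUES %s;",
                  batch,
              )
              conn.commit()  # commit per batch so a failure keeps earlier work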

  • @Digital-Light
    @Digital-Light A year ago

    I'm getting this error: "DatatypeMismatch: column "upload_date" is of type date but expression is of type double precision
    LINE 3: ... VALUES('mTL23Gd-T3g','Engagement_Ring_Animation','NaN'::flo..." Please help

    • @stratascratch
      @stratascratch  A year ago +1

      The error is saying that you have a dtype mismatch in the upload_date column. So clean up the values in the column and get it to a dtype that the db is expecting.
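
      A sketch of that cleanup in pandas; the tiny df here is illustrative, and the recipe turns NaN/NaT into None so the driver writes SQL NULL instead of 'NaN'::float:

          import pandas as pd

          df = pd.DataFrame({"upload_date": ["2024-01-01", float("nan")]})

          # coerce to real datetimes; unparseable values become NaT
          df["upload_date"] = pd.to_datetime(df["upload_date"],
                                             errors="coerce")

          # convert NaN/NaT to None before inserting
          df = df.astype(object).where(pd.notnull(df), None)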

    • @Digital-Light
      @Digital-Light A year ago +1

      @@stratascratch Thank you for the reply! I love your videos.

  • @amadysvlog3449
    @amadysvlog3449 2 years ago

    Great video! Thank you! But a question:
    How do I get access to the Amazon RDS free databases? I created an account on AWS but can't find any database-yt.

    • @stratascratch
      @stratascratch  2 years ago +1

      You'll need to create your own db. That's a whole topic by itself, but you can try it by going into AWS and creating a db in the RDS service.

  • @Davidkiania
    @Davidkiania 2 years ago

    Hi Nate, any chance you might show us how to run the script automatically? Is it a cron job, or how does that work? Thanks

    • @stratascratch
      @stratascratch  2 years ago

      Yeah, it's essentially a cron/scheduled job. I use Airflow at work but I've seen others use Jenkins.

  • @andrew6233
    @andrew6233 2 years ago

    Newbie question: what is the database manager you show at 11:00? Thank you so much for these videos

    • @stratascratch
      @stratascratch  2 years ago +1

      It's DataGrip!

    • @stratascratch
      @stratascratch  2 years ago +1

      It's my favorite BTW. I hate the in-browser db managers.

    • @andrew6233
      @andrew6233 2 years ago

      @@stratascratch Thanks! It looks really nice + convenient to work with compared to what I've been using.

  • @hoanglam2814
    @hoanglam2814 2 years ago

    Hi Nate, thanks for your videos. I have to retrieve data from almost 100 APIs, and your videos help a lot. But normally it takes me almost 10 hours to update all the data/records each day.
    Yesterday I came across some articles about webhooks, which mentioned that webhooks push data, whereas we have to pull data from an API. What do you think about using webhooks to update data and insert new data into the table? It would only retrieve data when events happen, instead of retrieving all the data like pulling through an API.
    For the API, I think I can use it to retrieve old data from last year.

    • @stratascratch
      @stratascratch  2 years ago +1

      How many records do you have where it would take 10 hrs a day? If it takes that long, I wouldn't implement this method. Seems like you have a lot of data. I would use Airflow and set up DAGs to get those jobs scheduled.
      Without an understanding of what type of data you're trying to move and for what purpose, I can't tell you if webhooks are the right way.

    • @hoanglam2814
      @hoanglam2814 A year ago

      @@stratascratch I retrieve about 40 thousand records each day; these are orders from a CRM system. My company helps around 100 clients manage their business, and each of them has one account in this CRM provided by a third party (this means 100 API sources). Actually, we are building an analytics system to work on it (including the ETL process and a dashboard), but until then, I have to collect data from the 100 APIs one by one. So I tried to implement your method and it works.
      But, yeah, like you said, it takes too much time. I also did research on Airflow, but I don't know what the difference is between using Airflow and using a cron tool like Power Automate/Zapier. For me, Power Automate/Zapier is much easier to learn.

    • @stratascratch
      @stratascratch  A year ago +1

      @@hoanglam2814 The difference between Airflow and something like Zapier is that Zapier is good for low-volume jobs that a non-technical person can set up. Airflow is the industry standard and can help manage huge databases and data lakes.
      For your use case, an automated ETL process where you collect data from APIs seems like it should work just fine for 40K records. Just make sure you automate the pulls since there are 100 of them. You can try webhooks if you'd like; I just don't have any experience using webhooks for your use case.

    • @hoanglam2814
      @hoanglam2814 A year ago

      @@stratascratch Thank you for your short and concise answer :)) it is really helpful!

  • @suprithm3802
    @suprithm3802 2 years ago

    The code is not updating; it's only appending the new values

  • @hardikvig2723
    @hardikvig2723 A year ago

    Where do I get the name of the database??

    • @stratascratch
      @stratascratch  11 months ago

      It's your db that you need to set up, so it can be any name you give it.

  • @nqomuleya
    @nqomuleya 2 years ago

    InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
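
    That pyodbc error usually means the connection string names no DSN and no DRIVER=. A sketch of a working SQL Server connection, with placeholder server, database, and credentials; the driver name must match one actually installed locally:

        import pyodbc

        print(pyodbc.drivers())  # list the ODBC drivers actually installed

        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=your-server;DATABASE=your-db;"
            "UID=your-user;PWD=your-password"
        )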