Solve Data Science Tasks In Python (Saving Hours) (automating CSV file imports to a database)

  • Published: 27 Oct 2024

Comments • 106

  • @stratascratch
    @stratascratch  3 years ago +10

    Part 2 where I fully automate the notebook is here ruclips.net/video/TDwy1lSjEZo/видео.html

  • @sahej97
    @sahej97 3 years ago +6

    Love this series of real-world, on-the-job tasks

  • @prateek2159
    @prateek2159 3 years ago +9

    Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For Your Data Science Project", and I really want you to continue making videos for that series because there's literally no one on RUclips with such guidance on DS projects, and I've been looking for one for a very long time. My placements are just 12 months away and I really want to make a full-stack data science project. Thank you.

  • @mickyelmb
    @mickyelmb 2 years ago +2

    Excellent tutorial. I was able to follow along in Jupyter with no issues. Thank you for posting it.

  • @brentsimpson3791
    @brentsimpson3791 1 year ago +1

    Mate! I'm a newbie python 'programmer' and this video has to be one of the most useful practical examples I've found. It'll take me a while to absorb it all but THANK YOU!😊
    Brent

  • @hameedmulani21
    @hameedmulani21 1 year ago +2

    One of the best tutorials I have ever seen

  • @stratascratch
    @stratascratch  3 years ago +3

    Timestamps:
    Intro: (0:00)
    Outlining steps & importing Python libraries: (2:23)
    Reading your CSV file as a pandas dataframe: (4:31)
    Cleaning your table name: (6:03)
    Cleaning your column names: (10:25)
    Generating SQL query to create database table: (12:40)
    Connecting to your database: (19:04)
    Generating SQL queries to import CSV file to database: (21:53)
    Double checking file was imported correctly: (30:50)
    Conclusion: (31:47)

  • @austincelestine1925
    @austincelestine1925 2 years ago +1

    Holy cow, I wish I had come across this earlier. It helped a lot with importing these CSV files into MySQL. The 2nd video was great too, except my formatting was different for my job. Great work!!! Absolutely love your content, it really helps!

    • @stratascratch
      @stratascratch  2 years ago +1

      Thank you and glad you found this vid useful!

  • @lettalkaboutit
    @lettalkaboutit 1 year ago +1

    Your video is fantastic and it helped me, but here is where I struggle, and I don't know if you can help with the following: my entire CSV file needs to go into the results table, and the dataframe created from the CSV has columns that contain foreign keys. In other words, the CSV is a mix of values from different tables, and each column of my dataframe corresponds to a table. My use case is:
    1/ First step: loop through each column and compare its values with the values in the corresponding table in the database; if a value does not exist, create a SQL INSERT query to add it. Once all new values are in the database, move on to step 2.
    2/ Second step: fetch the values from each corresponding table in the database, replace the data in each column with the table ID (foreign key), and then send the whole dataframe to the database.
    Can you help me achieve this please?
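
    A rough sketch of those two steps, assuming psycopg2 and purely hypothetical table and column names (customers, products, results):

    import pandas as pd
    import psycopg2

    # Hypothetical connection details -- replace with your own.
    conn = psycopg2.connect(host="localhost", dbname="mydb", user="postgres", password="secret")
    cursor = conn.cursor()

    df = pd.read_csv("results.csv")

    # Hypothetical mapping: dataframe column -> lookup table it references.
    lookup_tables = {"customer_name": "customers", "product_name": "products"}

    for col, table in lookup_tables.items():
        # Step 1: insert any values not yet present in the lookup table.
        cursor.execute(f"SELECT name FROM {table}")
        existing = {row[0] for row in cursor.fetchall()}
        for value in set(df[col]) - existing:
            cursor.execute(f"INSERT INTO {table} (name) VALUES (%s)", (value,))
        conn.commit()

        # Step 2: swap each value for its foreign-key id.
        cursor.execute(f"SELECT name, id FROM {table}")
        df[col] = df[col].map(dict(cursor.fetchall()))

    # df now holds foreign keys and can be inserted into the results table.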

  • @alanamantinodasilva4362
    @alanamantinodasilva4362 2 years ago +2

    Awesome. Thank you for this content.

    • @stratascratch
      @stratascratch  2 years ago +1

      You're welcome. Glad you liked it.

  • @zynkers401
    @zynkers401 3 years ago

    This series is the best! Thanks Nate

  • @luzestrada9536
    @luzestrada9536 1 year ago +1

    You are the best! 💯 Thank you so much for sharing this valuable information 👏

  • @arpitakar3384
    @arpitakar3384 1 year ago +1

    I loved it man, that's a serious effort bro

  • @gregNFL
    @gregNFL 2 years ago

    10/10 Clear, concise.

  • @TheSwil325
    @TheSwil325 3 years ago

    I really appreciate the quality of your videos! 💯

    • @stratascratch
      @stratascratch  3 years ago

      Thanks so much! Will keep doing more!

  • @exili
    @exili 3 years ago +1

    Extremely helpful video and a very useful tool, glad I found this on Reddit.

    • @stratascratch
      @stratascratch  3 years ago

      I'm glad you found it useful. You should also take a look at part 2 where I talk about automating the entire process (ruclips.net/video/TDwy1lSjEZo/видео.html). Part 3 will come out soon and will cover using software development techniques to polish off the code and share with your team.

  • @Davidkiania
    @Davidkiania 2 years ago

    This is really good thanks. You’ve saved me lots of manual hours.

  • @smsyrandomacts1725
    @smsyrandomacts1725 3 years ago

    Simply amazing knowledge transfer series. Great job Nate 👍

  • @akshayjadiya
    @akshayjadiya 3 years ago +2

    Great Video Nate! Can you make a video playlist on how to get started with cloud services or provide relevant reading material for that here in comments?

    • @stratascratch
      @stratascratch  3 years ago +4

      Absolutely! I'd start simple and try to create a database on AWS first. If you're not going to be creating apps, there's no reason to use the other services like EC2. Here's a nice overview on how to create one (medium.com/edureka/rds-aws-tutorial-for-aws-solution-architects-eec7217774dd). In reality it probably won't be so simple, but a lot of learning this stuff is trial and error. My videos always seem simple, but there are actually hours and hours of troubleshooting that never make it into the video. Maybe I should add a few minutes of "outtakes" showing all the errors I get in my next video =)
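
      Once the RDS instance exists, connecting from Python is the same kind of psycopg2 call used in the video; a minimal sketch with a hypothetical endpoint and credentials (copy the real values from your RDS console):

      import psycopg2

      conn = psycopg2.connect(
          host="mydb-instance.abc123xyz.us-east-1.rds.amazonaws.com",  # hypothetical endpoint
          port=5432,
          dbname="postgres",
          user="postgres",
          password="your_password",
      )
      cursor = conn.cursor()
      cursor.execute("SELECT version();")
      print(cursor.fetchone())
      conn.close()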

  • @cristiansoto7581
    @cristiansoto7581 1 year ago

    Hi Nate, thank you for this valuable information. Could you make more videos about this?
    I'm subscribed to StrataScratch and it's amazing.

  • @radhakrishnanaik5656
    @radhakrishnanaik5656 2 years ago

    Hey Nate, your way of explaining is really very good. I would like to request a video on fetching a database table into a CSV file with Python code, in the form of a pipeline.

  • @gambu4810
    @gambu4810 2 years ago +3

    Hi Nate. Your work is so good, I can't believe my eyes. Can you please let me know how to make a connection to SQL Server instead? It would really help if I knew how to do it in SQL Server Management Studio instead of Postgres.

    • @ihhdina
      @ihhdina 8 months ago +1

      I also need help with importing CSV to SQL.

  • @rj_nelson_97
    @rj_nelson_97 1 year ago +1

    Thanks again for the tutorial:
    I ran this code:
    # Upload to database
    SQL_STATEMENT = """
    COPY fast_food_data FROM STDIN WITH
    CSV
    HEADER
    DELIMITER AS ','
    """
    cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    print('File copied to database')
    I received this error:
    InFailedSqlTransaction Traceback (most recent call last)
    Input In [34], in ()
    1 # Upload to database
    3 SQL_STATEMENT = """
    4 COPY fast_food_data FROM STDIN WITH
    5 CSV
    6 HEADER
    7 DELIMITER AS ','
    8 """
    ---> 10 cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    11 print('File copied to database')
    InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
    Currently stuck. Bummer. I'll keep working it.
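
    For what it's worth, that error usually means an earlier statement on the same connection already failed, so Postgres ignores everything else until the transaction is rolled back. A minimal recovery sketch, reusing the conn/cursor from the tutorial and a hypothetical file name:

    conn.rollback()   # clear the failed transaction first

    SQL_STATEMENT = """
    COPY fast_food_data FROM STDIN WITH
        CSV
        HEADER
        DELIMITER AS ','
    """
    with open("fast-food-data-2020.csv") as my_file:   # hypothetical file name
        cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
    conn.commit()     # make the COPY permanent
    print("File copied to database")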

  • @RedShipsofSpainAgain
    @RedShipsofSpainAgain 3 years ago +1

    Great video Nate. Few suggestions:
    6:58 It is clearer to use a variable name like "file_name" rather than the more ambiguous name "file".
    8:31 To improve readability, consider putting each .replace() call on a separate line followed by the line-continuation backslash. And perhaps consider consolidating all these replace statements into a single replace() call using, say, a regex.
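
    A minimal sketch of that second suggestion, collapsing the chained .replace() calls into one regex pass (the exact characters handled in the video may differ):

    import re

    def clean_col_name(col: str) -> str:
        """Lowercase, turn whitespace/dashes/slashes into underscores, drop other punctuation."""
        col = col.strip().lower()
        col = re.sub(r"[\s\-/\\]+", "_", col)   # whitespace, dashes, slashes -> _
        return re.sub(r"[^0-9a-z_]", "", col)   # drop anything else non-alphanumeric

    print(clean_col_name("Order Date ($)"))     # -> order_date_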

    • @stratascratch
      @stratascratch  3 years ago +1

      Thanks so much for your input. I'll keep those in mind in my next python video. Keep the feedback coming. Also, if you have any requests on topics, I'd be happy to try to cover them on this channel.

    • @RedShipsofSpainAgain
      @RedShipsofSpainAgain 3 years ago +1

      @@stratascratch Happy to help and offer constructive feedback! Admittedly my advice is opinionated, but I've found that adopting details like these really helps to prevent headaches when debugging and when developing your code further. Sort of like a set of "best practices". Love your videos, btw. Keep up the great work Nate.

    • @stratascratch
      @stratascratch  3 years ago +1

      @@RedShipsofSpainAgain Thanks so much for the kind words. I'm hoping to get into more python code but it takes sooo long to record compared to SQL stuff. Hope to have another python series in Jan or Feb. Your "best practices" are great advice and I'll keep these in mind as I build my solutions out.

  • @sd0n0158
    @sd0n0158 2 years ago +1

    Do you know if there is a workaround for Redshift without using an S3 bucket? It doesn't allow COPY FROM STDIN.

    • @stratascratch
      @stratascratch  2 years ago

      No, I don't know of such a workaround. Have you tried asking on AWS re:Post or Stack Overflow?

  • @StefanoVerugi
    @StefanoVerugi 1 year ago +2

    As a beginner I find this video useful and well explained, thanks for posting it.
    To remove non-alphanumeric characters, can't you use re.sub from the re module?
    Like: print(re.sub("[^a-zA-Z.0-9]+", "", string))
    You get more or less the same result; is anything wrong with this method?
    cheers

    • @stratascratch
      @stratascratch  1 year ago +2

      Thanks for the kind words. Yes, you can use re.sub just fine!

    • @cubano100pct
      @cubano100pct 1 year ago

      @@stratascratch There is also the step of collapsing whitespace into a single _ character, then removing all other characters that aren't a-z, 0-9 or _:
      tablename = re.sub(r'[\x00-\x1F\s\-]+', '_', filename.lower())
      tablename = re.sub(r'[^0-9a-z_]*', '', tablename)

  • @komaldiwe5667
    @komaldiwe5667 2 years ago +1

    Hi Nate, how can we do cursor.copy_expert(sql=SQL_STATEMENT, file=my_file) for MySQL?

    • @stratascratch
      @stratascratch  2 years ago

      Have you tried MySQL Connector/Python?
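
      MySQL drivers have no copy_expert; a rough equivalent with mysql-connector-python is LOAD DATA LOCAL INFILE, assuming the server permits local_infile and using hypothetical file, table, and credential names:

      import mysql.connector

      conn = mysql.connector.connect(
          host="localhost", user="root", password="secret",
          database="mydb", allow_local_infile=True,
      )
      cursor = conn.cursor()
      cursor.execute("""
          LOAD DATA LOCAL INFILE 'customer_contracts.csv'
          INTO TABLE customer_contracts
          FIELDS TERMINATED BY ','
          OPTIONALLY ENCLOSED BY '"'
          IGNORE 1 LINES
      """)
      conn.commit()
      conn.close()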

  • @javierjdaza
    @javierjdaza 3 years ago

    How the fuck does this video only have 1,000+ views? You deserve more visibility man, you're a fucking genius and great at explaining things. Keep going, you've got a new sub.

    • @stratascratch
      @stratascratch  3 years ago

      I guess not a lot of people like to see me code =) But, honestly, thanks for the kind words. I'll keep trying to code and provide thoughtful explanations to my solutions and approaches. This python series took a really long time and I had probably like 20 pages of notes that I covered across the 3 videos.

  • @prabirkumarsahoo6368
    @prabirkumarsahoo6368 1 year ago

    Could you tell me where you copied the 'AWS connection string' from?

  • @sureshkumarramachandran6993
    @sureshkumarramachandran6993 3 years ago +1

    Great job!

  • @adithya5604
    @adithya5604 2 years ago +1

    Thank you so much 🎉

  • @yassw1995
    @yassw1995 2 years ago +1

    Hi Nate, this video is amazing and concise, thank you so much. However, I'm trying to learn this using as few advanced Python libraries as possible for a class, so I can't use pandas or the like. I'm limiting myself to the mysql.connector and csv libraries. For this reason, MySQLCursor does not accept the "cursor.copy_expert" syntax. What would be a good alternative method to insert the contents of the CSV file we saved during that step into the SQL database?
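
    A minimal sketch using only the csv module and mysql.connector, inserting the rows with executemany (table, file, and credential names are hypothetical):

    import csv
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="root", password="secret", database="mydb")
    cursor = conn.cursor()

    with open("customer_contracts.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                              # first row holds the column names
        placeholders = ", ".join(["%s"] * len(header))
        sql = f"INSERT INTO customer_contracts ({', '.join(header)}) VALUES ({placeholders})"
        cursor.executemany(sql, list(reader))              # insert every remaining row

    conn.commit()
    conn.close()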

  • @chyang0107
    @chyang0107 2 years ago

    Hello Nate, I appreciate your valuable work! Could you provide your insights on these two questions? First, what is 'STDIN' within the SQL statement """Copy Customer_contract from STDIN With ...""" around (28:51)?
    Second, if I would like to insert values into the table from an Excel file, would my SQL statement be
    """
    Copy Customer_contract from STDIN With
    Excel
    Header
    """
    Am I on the right track?
    Thanks,
    Hank
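
    For reference, STDIN just means psycopg2 streams the file's contents to the server over the client connection, and COPY only understands text, CSV, or binary input, so there is no 'Excel' option. A minimal sketch that converts the sheet first, reusing the conn/cursor from the tutorial (file and table names are hypothetical):

    import pandas as pd

    # Convert the Excel sheet to CSV first; COPY cannot read .xlsx directly.
    df = pd.read_excel("customer_contract.xlsx")        # needs openpyxl installed
    df.to_csv("customer_contract.csv", index=False)

    with open("customer_contract.csv") as f:
        cursor.copy_expert(
            sql="COPY customer_contract FROM STDIN WITH CSV HEADER DELIMITER ','",
            file=f,
        )
    conn.commit()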

  • @AndyLeal-pz7xw
    @AndyLeal-pz7xw 1 year ago

    This is incredibly helpful. I have a question, I'm using paramiko to connect to a SFTP to read csv files from there, how would I go about setting up the cwd to the sftp server? I can only navigate to local directories currently.

    • @stratascratch
      @stratascratch  1 year ago

      Thanks for the kind words! Is there a way to navigate to the directory of the SFTP server? It might be a permissions issue that you can have your IT dept resolve. If they are able to allow you to access the SFTP server via your script, then you're all set. Otherwise, you'll need to manually move the files.

    • @AndyLeal-pz7xw
      @AndyLeal-pz7xw 1 year ago

      @@stratascratch Thanks for the quick feedback! I checked permissions and everything seemed fine. I ended up switching to pysftp and it worked on the first try! keep up the great work!
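
      A minimal sketch of reading a CSV straight off an SFTP server with pysftp, with hypothetical host, credentials, and paths (host-key handling omitted):

      import pandas as pd
      import pysftp

      with pysftp.Connection("sftp.example.com", username="user", password="secret") as sftp:
          with sftp.cd("/reports"):                         # change the remote working directory
              sftp.get("sales_2023.csv", "sales_2023.csv")  # download to the local directory

      df = pd.read_csv("sales_2023.csv")
      print(df.head())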

  • @junaidmalik9593
    @junaidmalik9593 2 years ago

    Fantastic video, thanks Nate, you're awesome.
    I just struggled understanding the col_str part at 17:50 of the video, maybe because I'm new to this!

  • @Jay-eh5rw
    @Jay-eh5rw 2 years ago

    Thank you very much. Great help:)

  • @CawlCawl1
    @CawlCawl1 3 years ago +2

    Amazing video - thank you! I am using MS SQL Server Management Studio, thus have to use pyodbc, and get the error "'pyodbc.Cursor' object has no attribute 'copy_expert'" on the line "cursor.copy_expert(SQL_STATEMENT, file=my_file)". Is there any alternative code I can use since I'm using pyodbc?

    • @stratascratch
      @stratascratch  3 years ago +1

      If you're going to use pyodbc, you'll probably need to INSERT your values row by row rather than using something like copy_expert(), which takes the entire file and does the INSERT all at once. Here's the documentation on INSERTs for pyodbc (docs.microsoft.com/en-us/sql/connect/python/pyodbc/step-3-proof-of-concept-connecting-to-sql-using-pyodbc?view=sql-server-ver15). Every database is different and requires a different Python library. Hope this helps.
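
      A minimal sketch of that row-by-row approach with pyodbc, assuming a hypothetical SQL Server connection string and table/column names:

      import csv
      import pyodbc

      conn = pyodbc.connect(
          "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=mydb;UID=sa;PWD=secret"
      )
      cursor = conn.cursor()

      with open("customer_contracts.csv", newline="") as f:
          reader = csv.reader(f)
          next(reader)                                     # skip the header row
          cursor.executemany(
              "INSERT INTO customer_contracts (customer_id, contract_value) VALUES (?, ?)",
              list(reader),
          )

      conn.commit()
      conn.close()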

  • @darshpancholi4919
    @darshpancholi4919 2 years ago

    I use a MacBook M1 and I am unable to import the psycopg2 library in my Jupyter notebook. I have tried almost all the solutions provided on YouTube and Stack Overflow. Can someone help me out with this?

  • @sai6837
    @sai6837 2 years ago +1

    Hi, I am using MySQL... what should I use instead of copy_expert?

    • @stratascratch
      @stratascratch  2 years ago

      I don't use MySQL that often in these situations, but shouldn't execute() or executemany() work?

    • @sai6837
      @sai6837 2 years ago +1

      @@stratascratch no worries I tried and it worked

  • @BelowIce
    @BelowIce 2 years ago

    Love this video. Can you share why you copy the dataframe to a csv, and then import the csv to the database? Is it possible to directly reference the values in the dataframe and use those to populate the insert statements?

    • @stratascratch
      @stratascratch  2 years ago +1

      You can definitely just upload the df to the database. I used a function that required a csv to import to a db. But there are other approaches like you mentioned that would likely work as well.
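
      A minimal sketch of that direct approach using pandas.DataFrame.to_sql with a SQLAlchemy engine (connection string, file, and table names are hypothetical):

      import pandas as pd
      from sqlalchemy import create_engine

      df = pd.read_csv("customer_contracts.csv")

      # Format: postgresql+psycopg2://user:password@host:port/dbname -- fill in your own details.
      engine = create_engine("postgresql+psycopg2://postgres:secret@localhost:5432/mydb")

      # Creates (or replaces) the table and inserts the dataframe in one call.
      df.to_sql("customer_contracts", engine, if_exists="replace", index=False)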

  • @sasongkosan8832
    @sasongkosan8832 3 years ago

    Very Good ,, :) Nice tutorial ...

  • @sergeitokarev6033
    @sergeitokarev6033 2 years ago

    If someone has problems with connecting to the database, just edit your VPC Security Group and allow all IPs. It helped in my case.

    • @stratascratch
      @stratascratch  2 years ago

      Wonderful suggestion!

    • @tacy91
      @tacy91 2 years ago

      Yea. Please can you help? How do I edit it? I've been trying since yesterday. Thanks

  • @CodingNinja-lt8ge
    @CodingNinja-lt8ge 1 year ago

    Hi Nate,
    Thanks for the wonderful explanation. I tried to create the SQL table with your data and code using a Jupyter notebook. The code runs fine but it failed to create the table in the Postgres database, and I am wondering what the reason could be. In pgAdmin I found this under the SQL tab. I would appreciate your hints to help me figure out why I am not getting the desired result even though the code runs:
    Database: aw
    -- DROP DATABASE IF EXISTS aw;
    CREATE DATABASE aw
    WITH
    OWNER = postgres
    ENCODING = 'UTF8'
    LC_COLLATE = 'English_Finland.1252'
    LC_CTYPE = 'English_Finland.1252'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;

    • @stratascratch
      @stratascratch  1 year ago

      Maybe you need to commit the transaction to the db? That's a Postgres thing you have to do.
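
      A minimal sketch of that commit step with psycopg2 (connection details are hypothetical); without conn.commit(), the CREATE TABLE is rolled back when the connection closes and never shows up in pgAdmin:

      import psycopg2

      conn = psycopg2.connect(host="localhost", dbname="aw", user="postgres", password="secret")
      cursor = conn.cursor()

      cursor.execute("CREATE TABLE IF NOT EXISTS customer_contracts (customer_id int, contract_value float);")
      conn.commit()          # persist the table

      # Alternatively, run every statement in autocommit mode:
      # conn.autocommit = True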

  • @thelatecomer4245
    @thelatecomer4245 2 years ago

    Hey, can you give a solution for exporting a table from SQL Server to a local file with a multi-character delimiter?

  • @martithon
    @martithon 3 years ago

    [SOLVED] At 16:10 I'm getting a ', ' KeyError and have no idea how to solve it. Any tips?
    Edit: I had misplaced the closing bracket of the join function.

  • @its_me7363
    @its_me7363 3 years ago

    Thanks for these amazing, helpful videos, something different from the other channels. Can we use VS Code to connect to and open the AWS database and write queries? I don't want to download another piece of software for the database.

    • @stratascratch
      @stratascratch  3 years ago +1

      I think you can use vscode to connect to a db but I've never used that platform before so I don't know how to. But seeing as how vscode is for developers, it's likely you can use it.

    • @its_me7363
      @its_me7363 3 years ago

      @@stratascratch I am trying to use an extension in VS Code named SQLTools, but it is not able to connect to the AWS RDS PostgreSQL database I created as you showed.

    • @stratascratch
      @stratascratch  3 years ago +1

      @@its_me7363 You might need to find another platform that allows you to connect to an aws rds. Another reason could be that the database you created isn't configured right. There could be a firewall in place or ports closed. Try to see if you can connect to the rds through another way.

  • @MichelUNITED
    @MichelUNITED 3 years ago +1

    Hello, I ran this code: col_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(dta.columns, dta.dtypes.replace(replacements)))
    The error received was that it cannot compare types ndarray(dtype=object) and str. Please help.

    • @stratascratch
      @stratascratch  3 years ago

      Check out the notebook here: github.com/Strata-Scratch/csv_to_db_automation Hopefully that helps you troubleshoot.

    • @stratascratch
      @stratascratch  3 years ago +1

      But the error sounds like you have an object dtype which is expected but you're trying to compare that to a string. It leads me to believe that your "col_str" might be wrong? Or your zip() code? Hard to say without really seeing your code in its entirety.

    • @MichelUNITED
      @MichelUNITED 3 years ago

      @@stratascratch Hello, thank you for the reply. I did exactly what you did; I followed your lines one after the other.
      datype_str = ", ".join("{} {}".format(n, d) for (n, d) in zip(dta.columns, dta.dtypes.replace(replacements)))
      This is the line. Sorry, but I don't see any kind of comparing going on.

    • @stratascratch
      @stratascratch  3 years ago

      @@MichelUNITED I believe there might be something wrong with the dta values or column headers. You should also check your replacements array too to make sure you're getting the same output as I did. See here for info stackoverflow.com/questions/53478932/cannot-compare-types-ndarraydtype-int64-and-str

    • @MichelUNITED
      @MichelUNITED 3 years ago

      @@stratascratch Can I send you my code to have a look at, please? I hope you see this. I am working on something related to your video.
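
      For anyone hitting the same error, a minimal sketch of the dtype-to-SQL-type mapping being discussed; casting the dtypes to strings with .astype(str) before .replace() avoids the ndarray-vs-str comparison problem (the exact replacement values used in the video may differ):

      import pandas as pd

      df = pd.read_csv("customer_contracts.csv")     # hypothetical file

      # Map pandas dtypes (as strings) to SQL column types.
      replacements = {
          "object": "varchar",
          "float64": "float",
          "int64": "int",
          "datetime64[ns]": "timestamp",
          "timedelta64[ns]": "varchar",
      }

      col_str = ", ".join(
          "{} {}".format(name, dtype)
          for name, dtype in zip(df.columns, df.dtypes.astype(str).replace(replacements))
      )
      print(col_str)    # e.g. "customer_id int, contract_value float"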

  • @full_bladder
    @full_bladder 2 years ago

    You did not hide your password. If it is an important db, or a password you have used before for other accounts, change it.

  • @jerrywang1550
    @jerrywang1550 2 years ago +1

    The connecting-to-AWS part is difficult.

    • @stratascratch
      @stratascratch  2 years ago

      I hear you, could be tricky.

    • @tacy91
      @tacy91 2 years ago

      Yeah. I've been trying for 2 days. Is there any solution for that? The error: could not connect to server: Connection timed out.

    • @tacy91
      @tacy91 2 years ago

      @Andrew Levinton Yeah I did....I don't know what else to do. Thanks for your reply🙏

  • @rj_nelson_97
    @rj_nelson_97 1 year ago +2

    Good video. I'm presently looking for something that will process a massive dataset with thousands of rows of data, then upload that data into a database. I've yet to run across a tutorial that processes and uploads a massive dataset into a database. I'm running into some interesting errors.
    1st Code Attempt: df = pd.read_csv('fast-food-data-2020.csv')
    1st Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    2nd Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', encoding='UTF-8')
    2nd Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 52676: invalid continuation byte
    3rd Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1')
    3rd Error: DtypeWarning: Columns (12,13,14,15,16,17,18,19,20,21,22,24,25,26,27,28,29,30,37,49,52) have mixed types. Specify dtype option on import or set low_memory=False.
    df = pd.read_csv('fast-food-data-2020.csv', encoding='latin-1')
    4th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', error_bad_lines=False, index_col=False, dtype='unicode')
    4th Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    5th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', low_memory=False)
    5th Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 20: invalid continuation byte
    6th Code Attempt: df = pd.read_csv('fast-food-data-2020.csv', engine='python')
    6th Code Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3524: invalid continuation byte
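
    For what it's worth, byte 0xe9 is 'é' in Latin-1/Windows-1252, so the file is most likely not UTF-8 at all. A minimal sketch that tries a couple of encodings and forces string dtypes to silence the mixed-type warning (only the file name comes from the comment above; the rest is an assumption):

    import pandas as pd

    def read_messy_csv(path):
        """Try UTF-8 first, then fall back to Windows-1252 / Latin-1."""
        for enc in ("utf-8", "cp1252", "latin-1"):
            try:
                # dtype=str avoids the mixed-type DtypeWarning on wide files.
                return pd.read_csv(path, encoding=enc, dtype=str)
            except UnicodeDecodeError:
                continue
        raise ValueError(f"Could not decode {path} with any of the tried encodings")

    df = read_messy_csv("fast-food-data-2020.csv")
    print(df.shape)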

  • @limichelle6895
    @limichelle6895 3 years ago

    Hi, I have a question. How can I upload only the newest CSV file in a folder? Thanks

    • @stratascratch
      @stratascratch  3 years ago

      Sorry, not sure what you mean?

    • @limichelle6895
      @limichelle6895 3 years ago

      @@stratascratch Hi, thanks for getting back to me. So I have a folder that gets a new CSV file every single day. I would like to upload only the newest data to MySQL. Can you help me? Thanks

    • @stratascratch
      @stratascratch  3 years ago +1

      @@limichelle6895 You can just upload it to Colab every day. That's the manual way. Otherwise, you could automate it by specifying the name of the CSV file in your code. If there's a new CSV file every day, I would make sure that file has the current date at the end of its name. Then have your code find the file name with the current date. The script will find the file and import it.

    • @limichelle6895
      @limichelle6895 3 years ago

      And my CSV files come with a new date every day. For example 280721, 290721 and so on.
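
      A minimal sketch of the approach described above, assuming the files sit in one folder and the DDMMYY date (e.g. 280721) is the file name:

      from datetime import datetime
      from pathlib import Path

      import pandas as pd

      folder = Path("daily_csvs")                    # hypothetical folder

      def file_date(path):
          # File names like 280721.csv -> datetime(2021, 7, 28)
          return datetime.strptime(path.stem, "%d%m%y")

      newest = max(folder.glob("*.csv"), key=file_date)
      df = pd.read_csv(newest)
      print(f"Loaded {newest.name} with {len(df)} rows")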

  • @huseyindansman7415
    @huseyindansman7415 3 years ago

    It's really slow to insert a lot of big data into an MS SQL Server db using the pyodbc library. Any idea you have on how to do it faster would be really useful for me.

    • @stratascratch
      @stratascratch  3 years ago

      I feel like it's always a slow process no matter what db engine. There are engines that allow you to multi-thread the processes but otherwise, batch processing is the only thing that comes to mind. There's usually a parameter setting for batch loading in whatever python library you're using to import your data

    • @huseyindansman7415
      @huseyindansman7415 3 years ago

      @@stratascratch Thank you so much for your reply, but I found a way to make it faster using a different parameter; I'm now able to insert 16 thousand rows within 20 or 30 seconds.
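
      The parameter isn't named above, but pyodbc's fast_executemany flag is a common way to get that kind of speedup against SQL Server; a minimal sketch with hypothetical connection details and table:

      import pyodbc

      conn = pyodbc.connect(
          "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=mydb;UID=sa;PWD=secret"
      )
      cursor = conn.cursor()
      cursor.fast_executemany = True   # send parameters in batches instead of one round trip per row

      rows = [(i, f"name_{i}") for i in range(16000)]
      cursor.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)
      conn.commit()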

  • @arpittrivedi6636
    @arpittrivedi6636 2 years ago +1

    What about Excel?

    • @stratascratch
      @stratascratch  2 years ago

      That can also work. But for now, we can use Python.

  • @mariusbotezatu6365
    @mariusbotezatu6365 1 year ago

    Try the same thing with 1M rows in the CSV file :)

    • @stratascratch
      @stratascratch  1 year ago +1

      It'll break! You'll need to import in batches. I think there's a parameter for batching in one of the import functions in python. =)
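
      A minimal sketch of that batching idea using pandas' chunksize parameter, streaming the file into the database piece by piece (file/table names and connection string are hypothetical):

      import pandas as pd
      from sqlalchemy import create_engine

      engine = create_engine("postgresql+psycopg2://postgres:secret@localhost:5432/mydb")

      # Read and upload 50,000 rows at a time instead of holding 1M rows in memory.
      for i, chunk in enumerate(pd.read_csv("big_file.csv", chunksize=50_000)):
          chunk.to_sql("big_table", engine, if_exists="append", index=False)
          print(f"chunk {i} uploaded ({len(chunk)} rows)")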

  • @bennguyen1313
    @bennguyen1313 7 months ago

    What if the data file is not in a perfect tabular / csv format? For example, I have a large text file like this:
    #Time (HHH:MM:SS): 002:34:02
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    #Time (HHH:MM:SS): 002:34:03
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?
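
    A rough sketch of one way to pull those trailing data words out without pre-cleaning the file: skip every line that starts with '#', split on whitespace, and keep the trailing fields. How many of the last tokens correspond to DT00-DT07 depends on the real column layout, so the slice below is an assumption to adjust:

    import numpy as np

    dt_rows = []
    with open("bus_monitor_log.txt") as f:          # hypothetical file name
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):    # skip blank lines and the repeating headers
                continue
            tokens = line.split()
            dt_rows.append(tokens[-7:])             # trailing data words; widen to [-8:] if DT00-DT07 is 8 wide

    dt_array = np.array(dt_rows)                    # one row of hex strings per record
    print(dt_array)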