7 Tips To Structure Your Python Data Science Projects

  • Published: 1 Jul 2024
  • In this video, I’ll cover 7 tips to streamline the structure of your Python data science projects. With the right setup and thoughtful software design, you'll be able to modify and enhance your projects more efficiently.
    Check out Taipy here: github.com/avaiga/taipy.
    The Cookiecutter: github.com/drivendata/cookiec...
    👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis
    💻 ArjanCodes Blog: www.arjancodes.com/blog
    ✍🏻 Take a quiz on this topic: www.learntail.com/quiz/okaeaa
    Try Learntail for FREE ➡️ www.learntail.com/
    🎓 Courses:
    The Software Designer Mindset: www.arjancodes.com/mindset
    The Software Architect Mindset: Pre-register now! www.arjancodes.com/architect
    Next Level Python: Become a Python Expert: www.arjancodes.com/next-level...
    The 30-Day Design Challenge: www.arjancodes.com/30ddc
    🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
    👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!
    Social channels:
    💬 Discord: discord.arjan.codes
    🐦Twitter: / arjancodes
    🌍LinkedIn: / arjancodes
    🕵Facebook: / arjancodes
    📱Instagram: / arjancodes
    ♪ Tiktok: / arjancodes
    👀 Code reviewers:
    - Yoriz
    - Ryan Laursen
    - Dale Hagglund
    🎥 Video edited by Mark Bacskai: / bacskaimark
    🔖 Chapters:
    0:00 Intro
    0:50 Tip #1: Use a common structure
    1:55 Tip #2: Use existing libraries
    4:59 Tip #3: Log your results
    5:55 Tip #4: Use intermediate data representations
    8:09 Tip #5: Move reusable code to a shared editable package
    9:24 Tip #6: Move configuration to a separate file
    11:45 Tip #7: Write unit tests
    14:09 Final thoughts
    #arjancodes #softwaredesign #python
    DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Комментарии (Comments) • 122

  • @ArjanCodes
    @ArjanCodes  7 months ago

    👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis

    • @digiryde
      @digiryde 7 months ago +2

      You talk about unit tests in several videos, and I agree completely. The problem is that for most developers, unit testing is still a box of voodoo that they hope they got right.
      How would you feel about doing a series of videos (or one big one) that goes from simple unit tests to writing a unit test "package" for your hypothetical banking system? Starting with how to discover and define what needs to be tested, how to write those tests, how to capture the output in a to-do tool, then turn that into a development board, and finally how to automate it for every build where the tests need to be run (different companies define "need" differently).
      Testing before deployment is one of the most important tools that the average developer does not use as effectively as they should, if at all.
      Thank you for the great content!

    • @dylanloader3869
      @dylanloader3869 7 months ago

      @@digiryde I would love an Arjan take on this process as well. If you're looking for a decent introduction to share (since it sounds like you have a good understanding yourself) I would recommend: "Coding Habits for Data Scientists" by David Tan, it's a playlist on youtube.

    • @digiryde
      @digiryde 7 months ago +1

      @@dylanloader3869 "since it sounds like you have a good understanding yourself"
      When it comes to me knowing anything, the one thing I think I know is that there is nothing I don't have more to learn about. :)

  • @andrewglick6279
    @andrewglick6279 8 months ago +39

    If you use notebooks, I _highly_ recommend enabling autoreload. I find myself using notebooks / VS Code interactive sessions frequently. One of my biggest frustrations with notebooks was that if I changed a function, I would have to rerun that cell every time to update the function definition. It was also less conducive to separating my code out into submodules (which are quite convenient). It was a total game changer for me to add "%load_ext autoreload" and "%autoreload 2" to my Jupyter startup commands. In a way, this workflow promotes putting functions in submodules, because any time you call a function, it will reload that file with any changes you have made.
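    A minimal sketch of how this can look in practice (the module and function names below are hypothetical):
    # First cell of the notebook (or add these magics to your IPython startup config)
    %load_ext autoreload
    %autoreload 2

    # Edits to my_project/features.py are now picked up without restarting the kernel
    from my_project.features import build_features  # hypothetical module and function
    features = build_features(raw_df)  # raw_df loaded in an earlier cell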

    • @henrivlot
      @henrivlot 7 months ago

      woah, that's actually great! Never knew this was a thing.

  • @mikefochtman7164
    @mikefochtman7164 8 months ago +14

    One thing I learned long before I started learning Python: bugs are inevitable, and when one happens I ask myself, "How did this get past my unit tests?" So I go back and modify the test suite to catch the bug. Not quite 'test-driven development', but really helpful with any sort of iterative development or refactoring of a project.

  • @robertcannon3190
    @robertcannon3190 8 months ago +53

    I rewrote cookiecutter and turned it into a programmable configuration language. It's called `tackle` and I am almost done with the last rewrite before I start promoting it. It does everything cookiecutter does plus a ton of other features, like making modular code generators (e.g. a license module you can integrate into your code generator), flow control within the prompts (e.g. if you select one option then expose a subset of other options), and schema validation (not really related to cookiecutter, but it is the main use case of the tool). It's more comparable to Jsonnet, CUE, and Dhall, but you can easily make code generators like cookiecutter as well. Almost done with it (a 4-year-long personal project that I've rewritten 5 times) and hopefully it gets traction.

    • @Krigalishnikov
      @Krigalishnikov 8 months ago +1

      How do I follow for updates?

    • @robertcannon3190
      @robertcannon3190 8 months ago +3

      @@Krigalishnikov When I'm done with this last rewrite I'll definitely be writing articles. I should also start doing the Twitter thing at some point (not one for social media). I set up a Discord channel a while back but again, haven't promoted it, so it is dead. Since it is a language, it needs really good tutorials and examples, so those will be coming soon. I code-generated the API docs with the tool, and I'm trying to replace tox and make with it as well, so all those pieces should make for some cool examples. I also use it all across my own stack for managing Kubernetes templates, so that will come with its own press. Any recommendations for how to promote it are welcome though.

    • @bimaadi6194
      @bimaadi6194 8 months ago

      @robertcannon3190 github link?

  • @DaveGaming99
    @DaveGaming99 8 months ago +3

    Love these videos! I found your channel from your Hydra config tutorial and all of your videos have been full of invaluable knowledge that I've already been using in my projects at work! Thank you and keep it up!

    • @ArjanCodes
      @ArjanCodes  7 months ago

      Thank you for the kind words, Dave! Will do.

  • @williamduvall3167
    @williamduvall3167 7 months ago +9

    The fastest data storage I have found with Python is Arrow, but I usually use CSV or JSON, although I have used quite a few databases. Also, I have been slowly learning tip number 5 over the past 10 years or so. Once I force myself to make code that others can use, I find myself being much more proficient, and I can see why the top coders generally seem to make tools that others can use in one way or another. Thanks!

  • @thomasbrothaler4963
    @thomasbrothaler4963 8 months ago +8

    This was a really great video. I have been a data scientist for over 2 years and it was great to see that I have already developed a habit of using some of these points (probably because of other videos of yours 😂❤), but I also learned something new!
    Could you do the same thing for data engineering?? That would be awesome!

  • @Vanessa-vz5cw
    @Vanessa-vz5cw 8 months ago +3

    Great video! As a data scientist myself, I would love to see you work through an example that uses something like MLFlow. It's a very common tool at every DS-job I've worked since it's open source but also part of many managed solutions. Specifically, I'd love to see how you build an MLFlow client, how you structure experiments/runs and when/where you feel it is best in the flow of an ML pipeline to log to/load from MLFlow. Most MLFlow tutorials I've seen are notebook based, which is great for learning the basics but there isn't much guidance out there on how to structure a project that leverages it.

  • @JeremyLangdon1
    @JeremyLangdon1 8 months ago +8

    I'm a big fan of validating data I'm bringing in via Pandera. It is like defining a contract, and if the data coming in breaks that contract (data types, column names, nullability, validation checks, etc.) I want to know about it BEFORE I start processing it. I also use Pandera typing heavily to define my function arguments and return types, to make it clear that the data going in and out of my functions validates against a schema, which is way better than the generic "pd.DataFrame", e.g.:
    def myfunc(df: DataFrame[my_input_schema]) -> DataFrame[my_output_schema]:
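    A fuller sketch of that pattern, assuming a recent pandera version and hypothetical schema/column names:
    import pandera as pa
    from pandera.typing import DataFrame, Series

    class InputSchema(pa.DataFrameModel):
        customer_id: Series[int] = pa.Field(ge=0)
        amount: Series[float] = pa.Field(nullable=False)

    class OutputSchema(pa.DataFrameModel):
        customer_id: Series[int]
        total_amount: Series[float]

    @pa.check_types
    def aggregate(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
        # Validation runs at the function boundary, so bad data fails fast.
        out = df.groupby("customer_id", as_index=False)["amount"].sum()
        return out.rename(columns={"amount": "total_amount"})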

  • @sergioquijanorey7426
    @sergioquijanorey7426 7 months ago +16

    Biggest tip is to combine Python scripts with notebooks. Notebooks allow for fast and visual exploration of the data / problem. Then move pieces of code to a ./lib folder and use it from there. And start adding tests. Most of the time you are performing the same operations on the data, so you can end up with a shared lib across multiple projects. And that is very handy when starting a new project.

  • @philscosta
    @philscosta 7 months ago

    As always, thank you for the great video!
    I'd love to see more DS content, as well as some content on mlflow and dvc.

  • @haldanesghost
    @haldanesghost 8 months ago +3

    Official request for a full Taipy video 🙌

  • @FlisB
    @FlisB 8 months ago +2

    Great video. I feel data science projects are rarely examined in terms of design/structure quality. I hope to see more videos about it in the future. Perhaps one on writing tests; I sometimes lack ideas about how to test data science code.

  • @obsoletepowercorrupts
    @obsoletepowercorrupts 8 months ago +1

    Some good points in the video to get people thinking about different aspects. Scalability leads to a temptation for some to use multiple APIs and thereby an API management tool which in turn costs time and increases the probability of a ML library being used like a sledgehammer to crack a nut _(especially if time lack inclines a planner to be avoidant of dependency tree challenges)._ No matter the scalability of a software system _(whilst not seeing it as something to be seen as "regardless" of scale)_ databasing to keep track can become a bottleneck, and so retries and dead letter queues are worth it. Your mention of workflows is wise and jumping from one database to another _(e.g. MySQL to Postgres)_ is very likely to incur thoughts spent on workflows for that very task. You can optimise all you like _(which is noble)_ but these days people are more incentivised to "build-things" and so somebody might pip that "optimiser person" to the post by throwing computer horsepower at the challenge, thereby forcing something to be big rather than scalable in ways other than unidirectionally. Logs tend to mean hash tables. There are advantages to storing in a database like the choices available for DHT versus round robin. Environmental variables can be for ENV metadata to set up a virtual machine. If you own that server, like you suggest, it's an extra thing to secure _(for example against IGMP snooping)._ Containers and sandboxes are an extra layer of security rather than a replacement for security. Multiple BSD jails for example can be managed with ansible for instance.
    My comment has no hate in it and I do no harm. I am not appalled or afraid, boasting or envying or complaining... Just saying. Psalms23: Giving thanks and praise to the Lord and peace and love. Also, I'd say Matthew6.

  • @rafaelagd0
    @rafaelagd0 7 months ago

    Amazing video! I am very happy to notice that these are the bits of advice I have been pushing around to my work environments. I hope these things become the norm soon.

    • @ArjanCodes
      @ArjanCodes  7 months ago

      Glad you enjoyed the content, Rafael! I hope so too :)

  • @chazzwhu
    @chazzwhu 7 months ago +2

    Great vid! Would love to see your approach to doing a data project, e.g. download data, use Airflow to process it, train a model, and host it via an API endpoint.

  • @TheSwissGabber
    @TheSwissGabber 8 months ago +2

    Tip #6 is such a great tip. Here's how I do it:
    - a YAML file with the defaults, shipped with the source code
    - (maybe a user-/machine-dependent YAML file for some test system)
    - a YAML file with the data to override any values in evaluation
    YAML is great because it's human readable and has comments. So you can tell the user "put a config.yaml with the keys X, Y and Z next to the data and off you go".
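    A minimal sketch of that layering, assuming PyYAML and hypothetical file names (later files override earlier top-level keys, shallow merge):
    from pathlib import Path
    import yaml

    def load_config(*paths: str) -> dict:
        """Merge YAML files left to right; missing files are simply skipped."""
        config: dict = {}
        for path in paths:
            p = Path(path)
            if p.exists():
                config.update(yaml.safe_load(p.read_text()) or {})
        return config

    config = load_config("defaults.yaml", "machine.yaml", "config.yaml")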

  • @KA3AHOBA94
    @KA3AHOBA94 8 months ago

    Thank you for the good videos. Will there be any examples of design patterns using a GUI as an example?

  • @ChrisTian-uw9tq
    @ChrisTian-uw9tq 1 month ago

    I have had a semi-emotional experience listening to this :D I have been absolutely solitary on a project these last 8 months. I decided at the start that instead of dealing with it with my old knowledge, Excel and SQL, I would tackle it while learning Python along the way, with GPT.
    Hearing these tips, and realizing I got to them just through trial and error or logic! The feeling is that I have not done so badly, but there is also lots to learn, because this is just the base. And I for sure got a foot in the door, even if just slightly.
    Tip 1 - Common structure - I totally tried to apply a common structure, but then under some external stresses and pressures I crumbled, and it was visible in my code thereafter :/ Next time this is a must.
    Tip 2 - Existing libraries - learning with GPT can be limiting. Specifically asking GPT for alternative libraries for the same solution, asking for pros and cons, etc., helps you address as many issues as possible with fewer packages, or with the perfect package directly. GPT didn't point that out as much as it could, so it's on you. But learning in more depth now, after identifying which packages I use all over, would help me understand what's under the hood.
    E.g. a pipeline tool - when I started, I wanted something like this. The head dev in the office said nothing like it exists. So I made a non-GUI module of functions for handling data load and export for CSV, XLSX, and SQL - so to see Taipy... uff, can't wait.
    Tip 3 - Log results - as in the Excel days, copy after copy, different folders, filenames - to track every transformation so that, exactly as you said, you can backtrack for verification - plus it makes it easier to answer external questions as to how and why.
    Tip 4 - that was just forced along the way, given different data sources and data formats - I just refused to hard-code sources, so I dedicated a phase just to data load and to how outputs anywhere in the whole project will be presented, stored, and displayed - whether in memory or SQL or CSV - and logging where and when everything is stored for easier recall downstream... it was like a matrix.
    Tip 5 - exactly: once you are done with code in a Jupyter notebook, pop it into Spyder and call it, whether as imported functions or, as I learned last week, by running .py files and having the results loaded into the Jupyter notebook - but I don't know what else can be done - the notebook is the sandbox, but .py files are the aim.
    Tip 6 - exactly again, it just had to happen - anywhere you start seeing something repeat, think about the context, the occurrence, the variations - pop them in a centralized place for easier management - like lookup tables in Excel :) Just trial and error - but to think of it in advance is a game changer for such bigger projects...
    Tip 7 - No idea - biggest wow moment for me so far, because exactly now I am asked to hand over the code I wrote, and I am certain I put a number of test break points in the code, with comments for what change I made and why, so the break point condition can be reverted to a different approach - I didn't want to delete any of my test points, but I had a feeling it would look terrible handing that over... and I am sure the way you talk about it is much neater than what I have formulated, so if there is an 'official approach' I would gladly revamp mine, because I literally just used user-changeable boolean variables such as sample_mode = True to trigger df.head(1000) throughout the script, for example...
    Next week I find out which team they move me to, as I just came in to clean data and present results, and instead I now have code to take any address, validate it, and proceed to secondary data point evaluations to consolidate data across foreign sources, to spot what is missing and shouldn't be, plus validation on more granular data, in this case ownership and tax rate. It's the most intense bit of work I have done alone - every single step needed something created and addressed - no consistent normalization anywhere in a single field or data point, crucial or otherwise. In the end, whatever you do, it's only as good as the data you get, but the stories and scenarios of the data that can be told afterwards... if more people were interested in that and understood the influence of data diligence, it would be like a civilization booster in awareness :D

  • @luukvanbeusichem7652
    @luukvanbeusichem7652 7 months ago

    Love these Data Science videos

  • @loic1665
    @loic1665 8 months ago

    Great video, and very good advice! I'll be giving a course about software development next semester and I think some of the points you talked about are worth mentioning!

    • @ArjanCodes
      @ArjanCodes  7 months ago

      I'm glad it was helpful! Good luck on your course next semester :)

  • @askii3
    @askii3 6 months ago

    One thing I've begun doing is using a flatter directory hierarchy and using SQLite to catalog file paths along with useful, project-specific metadata. This way I can write a SQL query to pre-filter data, fetching only the relevant Parquet file paths to pass into Dask for reading and analysis.
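    A minimal sketch of that catalog idea, with hypothetical table and column names:
    import sqlite3
    import dask.dataframe as dd

    conn = sqlite3.connect("catalog.db")
    paths = [row[0] for row in conn.execute(
        "SELECT path FROM datasets WHERE experiment = ? AND created_at >= ?",
        ("exp_42", "2024-01-01"),
    )]
    conn.close()

    # Only the pre-filtered Parquet files are handed to Dask.
    ddf = dd.read_parquet(paths)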

  • @NostraDavid2
    @NostraDavid2 7 months ago +3

    A .py file can be a notebook too! Just add "# %%" to create a cell - VS Code will detect it automatically!
    Downside: the output isn't saved like it is with a regular notebook.
    Upside: the code is easier to test, no visual fluff, etc.
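    A tiny sketch of such a "percent-format" script, which VS Code treats as interactive cells:
    # %%
    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3]})

    # %% Each "# %%" marker starts a new runnable cell
    df.describe()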

    • @askii3
      @askii3 6 months ago

      I love these interactive sessions! Another plus of these is auto-reloading of imported functions that I just edited without re-running the whole script. Just add this at the top to enable reloading edited functions:
      %load_ext autoreload
      %autoreload 2

  • @spicybaguette7706
    @spicybaguette7706 7 months ago +1

    As an intermediate data storage format, I typically use duckdb because I'm quite comfortable with SQL, and duckdb allows me to query large sets of data very quickly
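    A minimal sketch of that workflow, assuming the duckdb package and a hypothetical file layout:
    import duckdb

    con = duckdb.connect("intermediate.duckdb")
    means = con.execute(
        "SELECT category, avg(value) AS mean_value "
        "FROM read_parquet('data/*.parquet') GROUP BY category"
    ).df()  # back to a pandas DataFrame for further analysis
    con.close()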

  • @d_b_
    @d_b_ 8 months ago +9

    In defense of notebooks, the exploratory aspect they allow makes it really nice and quick to find problems within the data. You can look deeper at the objects or dataframes at the point where execution halts. If there were a way to combine this ability with a well-structured set of scripts, it would be fantastic.

    • @joewyndham9393
      @joewyndham9393 8 months ago +4

      Have you used an IDE with a good debugger, like Pycharm? You can set breakpoints to interrogate data, you can evaluate expressions etc

    • @harry8175ritchie
      @harry8175ritchie 7 months ago

      @@joewyndham9393 This isn't a bad idea, but notebooks are great for exploratory data analysis for a few reasons: they combine markdown + Python, allowing you to explain analyses in detail, see various plots on the fly, and separate analyses by cells. However, notebooks are also an environment that gets messy quickly. Sharing notebooks has been essential during my career, specifically for research / exploratory projects that require explaining your analyses and thinking along the way.

    • @rfmiotto
      @rfmiotto 7 months ago +2

      I used to be a notebook user myself, but over time I adopted pure Python scripts, because you can structure your code better using principles of clean code and clean architecture (plus the benefits of having a linter and a formatter). You made a valid point, though, which is the ability to inspect object values with ease in notebooks. I overcome this with the VS Code debugger. Having well-structured code makes it easy to inspect variables in their particular scope. And also, I believe there might be some VS Code extension that helps display the variables in a more friendly way...

    • @FortunOfficial
      @FortunOfficial 7 months ago +1

      @@rfmiotto I had a similar path to yours. I tried notebooks for a while and loved the quick analysis. Especially with PySpark it's handy, since you don't always have to start the runtime context again, which takes a couple of seconds. BUT somehow notebooks nudge me towards bad practices, such as putting all operations into the global scope instead of using functions. The VS Code debugger is pretty decent; it is a good replacement for the lost interactivity.

    • @isodoubIet
      @isodoubIet 7 months ago

      What works best for me is to write a script and then run that script in a repl like ipython using %run. That way I still get to do things interactively, persist data sets in memory etc but still not deal with any of the annoyances of notebooks.

  • @joshuacantin514
    @joshuacantin514 8 months ago

    Regarding Jupyter notebooks, a lot of things still work when opening the notebook in VS code, such as code formatters. You just may need to trigger it specifically in each cell (Alt+Shift+f for the black formatter, I believe).

  • @Soltaiyou
    @Soltaiyou 8 months ago

    Great content as always. I’ve used csv, json, pickle, parquet, and sql files. I would argue there is no “standard” data science project. Once you get past boilerplate stats, you’ll inevitably have to write ad hoc functions to match the idiosyncrasies of your data either for deeper analysis or visualizations.

  • @ivansavchuk7956
    @ivansavchuk7956 7 months ago

    Thanks sir, great video! Can you recommend a book on software design?

  • @Andrumen01
    @Andrumen01 8 months ago +2

    I started doing Test Driven Development and it has saved me from more than one huge headache! Good advice!

  • @ErikS-
    @ErikS- 8 months ago +2

    Arjan can now fill the dutch city of Tilburg with his 200k subs! Impressive since he only passed the city of Breda (150k) a couple of months ago!
    Congratulations Arjan!

    • @ErikS-
      @ErikS- 8 months ago

      And next up will of course be Eindhoven!
      The dutch city where Royal Philips was founded and now having around 250k inhabitants!

    • @ArjanCodes
      @ArjanCodes  8 months ago +1

      Thanks Erik. It’s nuts when you think about what those numbers actually mean!

  • @ArpadHorvathSzfvar
    @ArpadHorvathSzfvar 7 months ago

    I use CSV many times 'cause it's simple and compact! I've also used parquet when I wanted to be sure about the data types when I've loaded back the data.

  • @TomatePerita
    @TomatePerita 7 months ago +3

    It's a shame the pipeline suggestion came from a sponsor, as it would be very interesting to see you compare different tools like Snakemake and Nextflow. It's a very niche field, and choosing a tool is difficult since you have to commit a lot.
    Great video though, would love to see that pipeline video eventually.

  • @teprox7690
    @teprox7690 8 months ago +1

    Thanks for the content. Quick feedback on the changing camera angles, it may look nice but it disturbs the flow of thought. Thank you very much.

    • @Michallote
      @Michallote 8 months ago

      I agree, perhaps it's simply not well executed

  • @abdelghafourfid8216
    @abdelghafourfid8216 7 months ago

    For caching and storing intermediate results, the fastest formats I've tried are msgspec for JSON-like data and Feather for table-like data.

  • @MicheleHjorleifsson
    @MicheleHjorleifsson 7 months ago

    BTW, Jupyter notebooks in VS Code with Git are nice, as you get the simplicity of a notebook, performance data, and versioning.

  • @drewmailman1965
    @drewmailman1965 8 months ago

    For Tip 5, nbdev from fastai is a great package for exporting cells from a Jupyter notebook to a script. From my notes, YMMV:
    At the top of the notebook:
    #| default_exp folder_name_if_desired.file_name
    Per cell to be exported:
    #| export
    To export, add the following (same cell):
    import nbdev
    # For the current directory path:
    nbdev.export.nb_export("Notebook Name.ipynb", "./")

  • @shouldb.studying4670
    @shouldb.studying4670 8 months ago

    That fight to the death line caught me mid sip and now I need to clean my monitor 😂

  • @riptorforever2
    @riptorforever2 8 months ago

    6:50 To query a 'JSON file' or a collection of 'JSON files', there is the TinyDB lib.
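    A minimal sketch of that idea, with hypothetical fields:
    from tinydb import TinyDB, Query

    db = TinyDB("runs.json")  # the JSON file becomes a small document store
    db.insert({"model": "rf", "accuracy": 0.91})

    Run = Query()
    good_runs = db.search(Run.accuracy > 0.9)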

  • @walterppk1989
    @walterppk1989 8 months ago

    For config, I like to have a centralised config.py with a Config class. It'll have attributes that try to get env vars, with a default: environ.get('myvar', 'some_sensible_default').
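    A minimal sketch of that pattern (the variable names are hypothetical):
    from os import environ

    class Config:
        DATA_DIR: str = environ.get("DATA_DIR", "./data")
        DB_URL: str = environ.get("DB_URL", "sqlite:///local.db")
        N_JOBS: int = int(environ.get("N_JOBS", "4"))

    config = Config()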

  • @viniciusqueiroz2713
    @viniciusqueiroz2713 8 months ago

    I constantly use the Parquet data format. It makes loading data WAY faster. In Python, it works just like CSV (e.g., using pandas, instead of read_csv() you use read_parquet()). It is bundled with an intelligent way of compressing repeated values, so it has a much smaller on-disk footprint compared to CSV and JSON. It stores data in a columnar fashion, so if you only need some columns for one project and other columns for another, you can avoid loading unwanted columns into memory. It also works well with big data environments (such as Apache Spark).
    Having a smaller on-disk footprint means you can transfer it to other people easily as well, and store it in cloud solutions at a lower cost.
    And honestly, as a data scientist, you would almost never open the CSV or JSON file and check it yourself. 99% of the time we use a library like pandas or software like Tableau to visualize and work with the data. So being human-readable is not really an advantage for data scientists, as it is for backend and frontend developers.
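    A minimal sketch of the Parquet round-trip described above, assuming pandas with a Parquet engine (e.g. pyarrow) installed and hypothetical paths/columns:
    import pandas as pd

    df = pd.read_csv("raw/events.csv")
    df.to_parquet("intermediate/events.parquet", index=False)

    # Later, load only the columns this analysis actually needs.
    subset = pd.read_parquet("intermediate/events.parquet", columns=["user_id", "timestamp"])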

  • @philscosta
    @philscosta 7 months ago

    It would be great to hear some ideas on how to write good tests for data-related projects. Lately I've been using syrupy to write regression/snapshot tests to at least ensure that the results don't change when I do a refactor. However, this is not very robust. A challenge with all of that is creating and managing good test data.

  • @Will29295
    @Will29295 16 days ago

    Good overview. Would really appreciate some physical examples.

  • @MicheleHjorleifsson
    @MicheleHjorleifsson 7 months ago +3

    Data formats: Parquet and pickle, because they are lightweight and easily adapted to pipelines.
    Request: would love to see a Taipy project video :)

  • @murilopalomosebilla2999
    @murilopalomosebilla2999 8 months ago

    Great content!!

    • @ArjanCodes
      @ArjanCodes  7 months ago

      Thank you so much, happy you're enjoying the content!

  • @ringpolitiet
    @ringpolitiet 8 months ago

    Polars scales great. Read the CSV and query, lazily if needed. Parquet for intermediate file system storage, polars.write_database if needed. "If you have to ask, polars is enough".
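    A minimal sketch of that flow, assuming a recent Polars version and hypothetical paths/columns:
    import polars as pl

    totals = (
        pl.scan_csv("raw/events.csv")            # lazy: nothing is read yet
          .filter(pl.col("amount") > 0)
          .group_by("user_id")
          .agg(pl.col("amount").sum().alias("total"))
          .collect()                             # the query actually runs here
    )
    totals.write_parquet("intermediate/totals.parquet")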

  • @DeltaXML_Ltd
    @DeltaXML_Ltd 7 months ago

    Great video 😁

    • @ArjanCodes
      @ArjanCodes  7 months ago

      Thank you so much!

  • @slavrine
    @slavrine 7 months ago

    Can you go over using unit tests in data science projects?
    My team does not actively use them; we don't find them useful when we add new features and change processing code very quickly.

  • @buchi8449
    @buchi8449 8 months ago +1

    A useful list of tips, but I have additional tips we can derive by combining these.
    Tip 1 + tip 6: Use a common way of externalizing configurations.
    If each project externalizes configurations differently, for example one uses a YAML file and another uses a .env file, it will be a nightmare for other people, particularly for engineers working on deployment and operations.

    • @buchi8449
      @buchi8449 8 months ago

      Tip 4 + tip 7: Implement data science logic as a pure function.
      In other words, don't persist intermediate data in the same code where the data science logic is implemented. The same goes for reading input data.
      Implement DS logic as a pure function taking a pandas DataFrame and other parameters as input and returning a pandas DataFrame as output, for example. File I/O should be done in separate code that calls this function.
      This separation of data science logic and file I/O makes unit testing of data science code easier.
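      A minimal sketch of that separation (the transform is hypothetical):
      import pandas as pd

      def add_discount(df: pd.DataFrame, rate: float) -> pd.DataFrame:
          """Pure data science logic: no file I/O, easy to unit test."""
          out = df.copy()
          out["discounted"] = out["price"] * (1 - rate)
          return out

      def run_pipeline(in_path: str, out_path: str) -> None:
          """File I/O lives at the edges; the logic above stays pure."""
          df = pd.read_parquet(in_path)
          add_discount(df, rate=0.1).to_parquet(out_path, index=False)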

    • @lamhintai
      @lamhintai 8 months ago

      I seem to run into this issue with advice suggesting storing DB credentials in .env to avoid leakage via version control. I prefer YAML for other things (non-credentials/secrets), mostly due to its structure. And especially reading with dotenv seems to be a bit messy when using type hints (it can theoretically return None, so warnings all over the place downstream).
      But this means I'm using both formats and not a centralized config…

  • @blooberrys
    @blooberrys 7 months ago

    Can you do an in depth logging video? :)

  • @isodoubIet
    @isodoubIet 7 months ago

    The general point that you can learn from libraries is of course very good, and I've used sklearn as a design guide many times. That said, pandas specifically is probably a bad example for that purpose since a lot of its design is weird/messed up. For example, the default settings for reading and loading dataframes from csvs are all wrong -- it should be the case that if you write a dataframe and then read it, you should get the same thing -- saving stuff to disk should round-trip, right? Well, with pandas it doesn't, both because of indexing issues, and because it rounds stuff by default. Lots of little corner things in pandas like that, so IMHO while it's a powerful package that I probably couldn't live without, it's one best used _after_ you've acquired the relevant domain knowledge, not as a teaching tool for it.

  • @user-ml5em9eo2e
    @user-ml5em9eo2e 5 months ago

    I've been working on a project where files contain raw NumPy arrays and pandas DataFrames. I was really struggling to save the data to simple files, as pandas is slow and I didn't know what to do with the acquisition parameters. I first made my own serialiser and deserialiser to store the data in huge JSON files (using orjson for performance), but then I stumbled upon pydantic. It's relatively easy to implement compatibility for NumPy and pandas, and it's still JSON files, but the class objects are very compact, and using ABCs it's easy to create inheritance and apply these traits. Now I'm looking at alternatives to make these files compatible outside of Python, using either or both of SQL and HDF5. I'm quite surprised that this is not a solved problem. I found xarray, which could work, but it would need a complete rewrite. So yeah, maybe another day.

  • @alivecoding4995
    @alivecoding4995 7 months ago

    Is Taipy running completely locally, or does it use their web services?

  • @askii3
    @askii3 6 months ago

    I love using Parquet as an intermediate format then use Dask to read them and do lazy processing as much as possible until the pipeline is forced out of Dask.

  • @pj18739
    @pj18739 4 months ago

    When would I rather use Taipy than Dagster?

  • @suvidani
    @suvidani 8 months ago +1

    Use schemas to describe and validate your data.

  • @s.yasserahmadi7846
    @s.yasserahmadi7846 7 months ago

    Which video should I watch first?! There's no ordering in this playlist; it's confusing.

  • @MicheleHjorleifsson
    @MicheleHjorleifsson 7 months ago

    A different approach to config variables: use a MyConfig class and store it in a pickle file. This way the file isn't clear text when stored.

  • @MrPennywise1540
    @MrPennywise1540 7 months ago

    I have Python code that uses Tkinter to make a GUI. This code edits images. Now I'm developing my webpage with Django, and I want to run the GUI code in my page. I hoped I could solve it with PyScript, but it's not compatible. I'm at a dead end. Can someone give me advice?

  • @barmalini
    @barmalini 7 months ago

    It might be a silly question in this context, but does anyone know of a similar quality channel with a focus on Java?
    Arjan is such a great educator that I am genuinely considering switching to Python, but it's hard because I must learn Java too.

  • @joshuacantin514
    @joshuacantin514 8 months ago

    HDF5 seems to be a rather useful data format, for both metadata and data.

  • @ali-om4uv
    @ali-om4uv 8 months ago

    Redo the video and add data version control (DVC). That is a must once you work in an organisation. It has rudimentary pipelining for model training as well. Everybody should know MLflow. And... avoid tools like Kubeflow if you do not have sufficient manpower to run it.

  • @abomayeeniatorudabo8203
    @abomayeeniatorudabo8203 8 months ago

    Notebooks are great for experiments.

  • @joewyndham9393
    @joewyndham9393 8 months ago +2

    Can someone outline for me what benefits notebooks have over IDE development? I've recently switched from doing data science with an IDE in a typical software dev environment to using Databricks notebooks (due to a job change). I honestly can't see any benefit, but I can see a lot of drawbacks. In an IDE like Pycharm I can rapidly create experiments, I can visualise data AND I can write clean safe software. Notebooks put so many obstacles in the way of good development. What am I missing?

    • @machoo55
      @machoo55 8 months ago

      In an IDE, isn't it slower when one has multiple longish steps in a pipeline and has to rerun everything each time as one iterates? I'd be keen to learn how you get around that.

    • @sukawia
      @sukawia 8 months ago +1

      @@machoo55 Tip #4 can get you pretty far in many cases. You can even set it up to work like a cache (create a decorator that saves the output of the function in a file, and next time it loads directly from it; then you can put the decorator on your preprocess, load_data, etc. functions).
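      A minimal sketch of that caching decorator (paths and the preprocess step are hypothetical):
      import functools
      from pathlib import Path
      import pandas as pd

      def cache_parquet(path: str):
          def decorator(func):
              @functools.wraps(func)
              def wrapper(*args, **kwargs) -> pd.DataFrame:
                  p = Path(path)
                  if p.exists():
                      return pd.read_parquet(p)  # reuse the stored intermediate result
                  df = func(*args, **kwargs)
                  p.parent.mkdir(parents=True, exist_ok=True)
                  df.to_parquet(p, index=False)
                  return df
              return wrapper
          return decorator

      @cache_parquet("cache/preprocessed.parquet")
      def preprocess(raw_path: str) -> pd.DataFrame:
          return pd.read_csv(raw_path).dropna()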

    • @joewyndham9393
      @joewyndham9393 8 months ago +3

      @@machoo55 Yeah that's a reasonable concern, but I think it's overcome by good coding practices like abstraction, separation of concerns, and good personal process.
      Let me flip the question and ask why you need to run steps A, B, C and D in a pipeline to be able to write step E? I'm assuming your answer might be that you need to know what the data looks like at step E. Then my question would be, why don't you know what is coming out of steps A, B, C and D?
      What I'm getting at here is that well structured, clean code makes it easy to understand what goes in and what comes out of functions and classes. So you can write huge amounts of good code without actually pushing any real data through it.
      I also want to ask you about what you are doing when you are "iterating"? Are you debugging? Are you trying new things in your model? Or are you doing both of these at once? If you are trying to do both at once, then I can see why you like notebooks. They really encourage you to bounce around your code slipping changes in here and there.
      And this is one issue I have with notebooks and the way lots of data scientists work - they don't separate the different tasks they are doing. If I'm writing code, that's all I'm doing. If I'm debugging I'm only debugging. And if I'm extending or modifying a model, it's only after finishing the first version.
      This point gets me to the debugger. In Pycharm I can use the debugger to pause at any moment in the pipeline, to see all the variables, to evaluate expressions, etc. So the functionality you actually want is there. But, it is only there at the right time - when you are debugging. And it encourages you to write cleanly, because debugging is waaaay easier in tightly written functions and classes with limited namespaces

    • @machoo55
      @machoo55 8 months ago

      Thanks for the suggestions! Oftentimes, early in projects (I work in a domain that isn't well established), I do need to see what the data is like to decide on the choice and order of steps. After that, I refactor everything into a proper class-based pipeline. But the helpful things suggested here have given me ideas as to how I might start with, and stay in, an IDE.

    • @joewyndham9393
      @joewyndham9393 7 months ago

      @@machoo55 I agree it is always important to do an ad hoc scan of your data, and for that step, you're not necessarily writing code that will live on in your codebase, so you can relax the rules of clean code a fair bit. But in my opinion you can do that in a good IDE which allows inspection of variables and interactive plots. You also get all of the other super productive tools of the IDE. Happy coding!

  • @walterppk1989
    @walterppk1989 8 months ago +3

    My tip for teams that run many ML pipelines in particular: don't make many projects based on a single cookiecutter. Instead, nest all of your pipelines in a monolithic repo which is ONLY in charge of the ML code, and separate the projects by subfolders. That way, you don't have to maintain many different CI/CD pipelines and Docker images (they tend to be large, but that's the dependencies, not the application code).

    • @ali-om4uv
      @ali-om4uv 8 months ago +2

      That's good and horrible advice at the same time.

  • @hubstrangers3450
    @hubstrangers3450 8 months ago

    Thank you. Could you please return to LLMs for a short series on MemGPT, OS, and function calls (YT, v=rxjsbUiuOFo, robot-to-robot interaction)? If time permits, it would be great to see a demo and the thought process: how futuristic is the scenario, and will it be cost-effective on an on-prem or cloud platform? Thank you.

  • @truniolado
    @truniolado 3 months ago

    Hey man, where tf did you get that amazing t-shirt?

  • @knolljo
    @knolljo 8 months ago +2

    A bear t-shirt and no mention of polars?

  • @scottmiller2591
    @scottmiller2591 7 months ago

    Taipy seems to have abandoned the pipeline as a user concept - it no longer appears in the docs. I assume it's still in the mechanism, but no longer explicitly exposed. Rather, the emphasis seems to be on building GUIs with data nodes. My experience with graphical programming like this has been that they are extremely difficult to review, as one has to unfold a lot of nodes to actually get to the code - maybe they've gotten around this somehow, or maybe they want you to assume their code is foolproof, a bad sign.

  • @user-cr3ti1vj6f
    @user-cr3ti1vj6f 7 months ago

    Aryan codes? Based.

  • @TheSwissGabber
    @TheSwissGabber 8 months ago

    Pandas... every time I come back to an old (6M+) project it does not work because they changed the API. That never happened with any other library (NumPy, Matplotlib, SciPy, etc.). So I would only use pandas if it REALLY benefits you; otherwise you'll have a guaranteed refactor in 6 months...

    • @suvidani
      @suvidani 8 months ago +4

      Each project should have its own defined environment; then this should not be a problem.

    • @Jugular1st
      @Jugular1st 8 months ago

      ... And if your abstraction is good it should only impact a small part of your code.

    • @isodoubIet
      @isodoubIet 7 months ago

      What's amazing is that despite all the breaking changes, pandas still has a bad api with wrong defaults. Very useful, but not a library anyone should emulate.

  • @alexloftus8892
    @alexloftus8892 2 months ago

    As a professional data scientist, I disagree with a lot of this advice. In exploratory analysis, you are often writing one-off notebooks that nobody will read or reuse. Spending the extra time to write tests in this situation is wasted effort. A good middle ground is including `assert` statements in your functions to make sure they're doing what you think they're doing.
    Pull code you're going to reuse out of your notebooks, and write tests for it then.
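    A small sketch of that middle ground, with hypothetical column names:
    import pandas as pd

    def join_users(orders: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
        merged = orders.merge(users, on="user_id", how="left")
        assert len(merged) == len(orders), "join unexpectedly changed the row count"
        assert merged["email"].notna().all(), "some orders have no matching user"
        return merged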

  • @abomayeeniatorudabo8203
    @abomayeeniatorudabo8203 8 months ago

    You are wearing a pandas shirt.

  • @juancarlospizarromendez3954
    @juancarlospizarromendez3954 8 months ago

    As always, start from scratch. It's from zero, nothing, etc.

  • @lukekurlandski7653
    @lukekurlandski7653 8 months ago +47

    Tip Number 0: Don't use Notebooks

    • @slayerdrum
      @slayerdrum 7 months ago +9

      I think they are fine as long as you use them for what they are most suitable for: Exploratory analysis with text. Not for creating production-ready code (which often is not the way a project starts anyway).

    • @sergeys.8830
      @sergeys.8830 5 months ago +1

      Why?

    • @dinoscheidt
      @dinoscheidt 2 months ago

      Exactly.

    • @EdwinCarrenoAI
      @EdwinCarrenoAI 24 days ago

      They are really useful for Proof of Concepts, Exploratory analysis, or basically to test an idea. But, they are not a good idea for deployments and production code.

    • @Michallote
      @Michallote 8 days ago

      @@sergeys.8830 They are completely non-reusable and inefficient. Each run of a cell changes the metadata of the notebook, which is terrible for version control. Cell outputs make the file stupidly large, and functions defined in notebooks are not usable anywhere else in the codebase.

  • @prison9865
    @prison9865 8 months ago

    By the time you said what I can do with Taipy, I had already lost interest. Perhaps tell people what Taipy can do for them first, and then how to install it and so on...

  • @MarkTrombonee
    @MarkTrombonee 2 months ago

    Wow! The top comments sound AI-generated. Usernames like name+numbers, and long text comments that no one asked for.

  • @ardenthebibliophile
    @ardenthebibliophile 7 months ago

    For point 5, I've started to get our teams to think: reusable code/functions go in .py files; analyses go in .ipynb. They love Jupyter notebooks, and this helps keep them more readable. Bonus: functions that are used a lot can be packaged more easily.