7 Tips To Structure Your Python Data Science Projects

  • Published: 27 Nov 2024

Comments • 127

  • @ArjanCodes
    @ArjanCodes  Год назад +1

    👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis

    • @digiryde
      @digiryde Год назад +2

      You talk about unit tests in several videos, and I agree completely. The problem is that for most developers, unit testing is still a box of voodoo that they hope they got right.
      How would you feel about doing a series of videos (or one big one) that goes from simple unit tests to writing a unit test "package" for your hypothetical banking system? Starting with how to discover and define what needs to be tested, then how to write those tests, capture the output in a to-do tool, turn that into a development board, and finally automate the tests for every build where they need to be run (different companies define "need" differently).
      Testing before deployment is one of the most important tools that the average developer does not use as effectively as they should, if at all.
      Thank you for the great content!

    • @dylanloader3869
      @dylanloader3869 Год назад

      @@digiryde I would love an Arjan take on this process as well. If you're looking for a decent introduction to share (since it sounds like you have a good understanding yourself) I would recommend: "Coding Habits for Data Scientists" by David Tan, it's a playlist on youtube.

    • @digiryde
      @digiryde Год назад +1

      @@dylanloader3869 "since it sounds like you have a good understanding yourself"
      When it comes to me knowing anything, the one thing I think I know is that there is nothing I don't have more to learn about. :)

  • @andrewglick6279
    @andrewglick6279 Год назад +46

    If you use notebooks, I _highly_ recommend enabling autoreload. I find myself using notebooks / VS Code interactive sessions frequently. One of my biggest frustrations with notebooks was that if I changed a function, I would have to rerun its cell every time to update the function definition. It was also less conducive to separating my code out into submodules (which are quite convenient). It was a total game changer for me to add "%load_ext autoreload" and "%autoreload 2" to my Jupyter startup commands. In a way, this workflow promotes putting functions in submodules, because any time you call a function, it will reload that file with any changes you have made.

    • @henrivlot
      @henrivlot Год назад

      woah, that's actually great! Never knew this was a thing.

  • @mikefochtman7164
    @mikefochtman7164 Год назад +17

    One thing I learned long before I started learning Python: bugs seem to be inevitable, and when one happens I ask myself, "How did this get past my unit tests?" So I go back and modify the test suite to catch the bug. Not quite test-driven development, but really helpful with any sort of iterative development or refactoring of a project.

  • @robertcannon3190
    @robertcannon3190 Год назад +56

    I rewrote cookiecutter and turned it into a programmable configuration language. It's called `tackle` and I am almost done with the last rewrite before I start promoting it. It does everything cookiecutter does plus a ton of other features, like making modular code generators (e.g. a license module you can integrate into your code generator), flow control within the prompts (e.g. if you select one option, then expose a subset of other options), and schema validation (not really related to cookiecutter, but it is the main use case of the tool). It's more comparable to jsonnet, cue, and dhall, but you can easily make code generators like cookiecutter as well. Almost done with it (a 4-year-long personal project that I've rewritten 5 times) and hopefully it gets traction.

    • @Krigalishnikov
      @Krigalishnikov Год назад +1

      How do I follow for updates?

    • @robertcannon3190
      @robertcannon3190 Год назад +3

      @@Krigalishnikov When I'm done with this last rewrite I'll definitely be writing articles. I should also start doing the Twitter thing at some point (not one for social media). I set up a Discord channel a while back but, again, haven't promoted it, so it is dead. Since it is a language, it needs really good tutorials and examples, so those will be coming soon. I code-generated the API docs with the tool, and I'm trying to replace tox and make with it as well, so all those pieces should make for some cool examples. I also use it all across my own stack for managing Kubernetes templates, so that will come with its own press. Any recommendations for how to promote it are welcome, though.

    • @bimaadi6194
      @bimaadi6194 Год назад

      @robertcannon3190 github link?

  • @ChrisTian-uw9tq
    @ChrisTian-uw9tq 6 месяцев назад +1

    I have had a semi-emotional experience listening to this :D I have been absolutely solitary on a project these last 8 months, deciding at the start that instead of dealing with it with my old knowledge (Excel and SQL), I would tackle it while learning Python along the way, with GPT.
    To hear these tips and realize that, just through trial and error or logic, I got to them myself! The feeling is that I have not done so badly, but there is also a lot to learn, because this is just the base. And I have for sure got a foot in the door, even if just slightly.
    Tip 1 - Common structure - I totally tried to apply a common structure, but then under some external stresses and pressures I crumbled, and it was visible in my code thereafter :/ Next time this is a must.
    Tip 2 - Existing libraries - learning with GPT can be limiting. Specifically asking GPT for alternative libraries for the same solution, and for pros and cons etc., helps you immediately address as many issues as possible with fewer packages, or with the perfect package directly. GPT didn't point that out as much as it could, so it's on you. But learning in more depth now, after identifying which packages I use all over, would help me understand what's under the hood.
    E.g. a pipeline tool - when I started, I wanted something like this. The head dev guy in the office said nothing like it exists, so I made a non-GUI module of functions for handling data load and export for CSV, XLSX and SQL - then to see Taipy... oof, can't wait.
    Tip 3 - Log results - as in the Excel days: copy after copy, different folders, filenames - tracking every transformation so that, exactly as you said, you can backtrack for verification - plus it makes it easier to answer external questions as to how and why.
    Tip 4 - that was just forced along the way, given different data sources and data formats - I simply refused to hard-code sources, so dedicating a phase just to data loading, and to how outputs anywhere in the whole project will be presented, stored and displayed - whether in memory, SQL or CSV - and logging where and when everything is stored for easier recall downstream... it was like a matrix.
    Tip 5 - exactly: once you are done with code in a Jupyter notebook, pop it into Spyder and call it, whether as imported functions or, as I learned last week, by running .py files and having the results loaded into the notebook - but I don't know what else can be done - the notebook is a sandbox, but .py files are the aim.
    Tip 6 - exactly again, it just had to happen - anything you start seeing repeat, think about the context, the occurrences, the variations, and pop it in a centralized place for easier management - like lookup tables in Excel :) Just trial and error - but to think of it in advance is a game changer for bigger projects...
    Tip 7 - No idea - the biggest wow moment for me so far - because right now I am being asked to hand over the code I wrote, and I know I put a number of little test break points in the code, with comments on what I changed and why, so the break-point condition can be reverted to a different approach. I didn't want to delete any of my test points, but I had a feeling it would look terrible handing that over... and I am sure the way you talk about it is much neater than what I have formulated, so if there is an 'official approach' I would gladly revamp mine, because I literally just used user-changeable boolean variables such as sample_mode = True to trigger df.head(1000) throughout the script, for example...
    Next week I find out which team they move me to, as I just came in to clean data and present results, and instead I now have code that takes any addresses, validates them and proceeds to secondary data-point evaluations to consolidate data across foreign sources, to spot what is missing and shouldn't be, plus validation on more granular data, in this case ownership and tax rate. It's the most intense bit of work I have done alone - every single step needed something created and addressed - no consistent normalization anywhere in a single field or data point, crucial or otherwise. The reality, even in the end, is that whatever you do, it's only as good as the data you get; but as for the stories and scenarios of data that can be told afterwards, if more people were interested in that and understood the influence of data diligence, it would be like a civilization booster in awareness :D

  • @TheSwissGabber
    @TheSwissGabber Год назад +2

    #6 is such a great tip. Here's how I do it:
    - a YAML file shipped with the source code for the defaults
    - (maybe a user-/machine-dependent YAML file for some test system)
    - a YAML file next to the data to override any values for that evaluation
    YAML is great because it's human-readable and has comments. So you can tell the user "put a config.yaml with the keys X, Y and Z next to the data and off you go".
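    For illustration, a minimal sketch of that layering (the file names are examples; it is a shallow merge in which later files override earlier keys):
    from pathlib import Path
    import yaml  # PyYAML

    def load_config(*paths: str) -> dict:
        """Merge YAML files left to right; later files override earlier keys."""
        config: dict = {}
        for path in paths:
            file = Path(path)
            if file.exists():
                config.update(yaml.safe_load(file.read_text()) or {})
        return config

    config = load_config("defaults.yaml", "machine.yaml", "config.yaml")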

  • @williamduvall3167
    @williamduvall3167 Год назад +10

    The fastest data storage I have found with Python is Arrow, but I usually use CSV or JSON, although I have used quite a few databases. Also, I have been slowly learning tip number 5 over the past 10 years or so. Once I force myself to make code that others can use, I find myself being much more proficient, and I can see why the top coders generally seem to make tools that others can use in one way or another. Thanks!

  • @sergioquijanorey7426
    @sergioquijanorey7426 Год назад +18

    My biggest tip is to combine Python scripts with notebooks. Notebooks allow for fast and visual exploration of the data / problem. Then move pieces of code to a ./lib folder and use it from there, and start adding tests. Most of the time you are performing the same operations on the data, so you can end up with a shared lib across multiple projects, which is very handy when starting a new project.

  • @obsoletepowercorrupts
    @obsoletepowercorrupts Год назад +1

    Some good points in the video to get people thinking about different aspects. Scalability leads to a temptation for some to use multiple APIs and thereby an API management tool which in turn costs time and increases the probability of a ML library being used like a sledgehammer to crack a nut _(especially if time lack inclines a planner to be avoidant of dependency tree challenges)._ No matter the scalability of a software system _(whilst not seeing it as something to be seen as "regardless" of scale)_ databasing to keep track can become a bottleneck, and so retries and dead letter queues are worth it. Your mention of workflows is wise and jumping from one database to another _(e.g. MySQL to Postgres)_ is very likely to incur thoughts spent on workflows for that very task. You can optimise all you like _(which is noble)_ but these days people are more incentivised to "build-things" and so somebody might pip that "optimiser person" to the post by throwing computer horsepower at the challenge, thereby forcing something to be big rather than scalable in ways other than unidirectionally. Logs tend to mean hash tables. There are advantages to storing in a database like the choices available for DHT versus round robin. Environmental variables can be for ENV metadata to set up a virtual machine. If you own that server, like you suggest, it's an extra thing to secure _(for example against IGMP snooping)._ Containers and sandboxes are an extra layer of security rather than a replacement for security. Multiple BSD jails for example can be managed with ansible for instance.
    My comment has no hate in it and I do no harm. I am not appalled or afraid, boasting or envying or complaining... Just saying. Psalms23: Giving thanks and praise to the Lord and peace and love. Also, I'd say Matthew6.

  • @Lirim_K
    @Lirim_K 2 месяца назад

    I could comment something like this on all your videos, but it's too time-consuming, so: you are easily one of the best Python teachers I've known. My data science game has skyrocketed ever since I found your channel. I actually feel like a professional now as opposed to a newbie. From the bottom of my heart: thank you!
    PS: If you ever have time, I'd appreciate an end-to-end, rigorous machine learning workflow where you cover environment setup, folder structure, OOP concepts, Pydantic and deployment. I've looked for a video like this among yours but haven't been able to find one.

    • @ArjanCodes
      @ArjanCodes  2 месяца назад

      You're most welcome! I'm glad to hear the channel has helped you so much 💪.

  • @Vanessa-vz5cw
    @Vanessa-vz5cw Год назад +5

    Great video! As a data scientist myself, I would love to see you work through an example that uses something like MLflow. It's a very common tool at every DS job I've worked, since it's open source but also part of many managed solutions. Specifically, I'd love to see how you build an MLflow client, how you structure experiments/runs, and when/where in the flow of an ML pipeline you feel it is best to log to / load from MLflow. Most MLflow tutorials I've seen are notebook-based, which is great for learning the basics, but there isn't much guidance out there on how to structure a project that leverages it.
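    For reference, a minimal sketch of script-based (non-notebook) MLflow logging; the tracking URI, experiment name, parameters and metric value below are placeholders:
    import mlflow

    mlflow.set_tracking_uri("http://localhost:5000")  # placeholder: local tracking server
    mlflow.set_experiment("churn-model")              # placeholder experiment name

    def train_and_log(params: dict) -> None:
        with mlflow.start_run(run_name="baseline"):
            mlflow.log_params(params)
            # ... fit and evaluate the model here ...
            mlflow.log_metric("val_auc", 0.87)        # placeholder value

    train_and_log({"n_estimators": 200, "max_depth": 6})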

  • @askii3
    @askii3 11 месяцев назад +1

    One thing I've begun doing is using a flatter directory hierarchy and using SQLite to catalog file paths along with useful, project-specific metadata. This way I can write a SQL query to pre-filter data, fetching only the relevant Parquet file paths to pass into Dask for reading and analysis.
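    A minimal sketch of that catalog pattern (the database, table and column names are assumptions):
    import sqlite3
    import dask.dataframe as dd

    conn = sqlite3.connect("catalog.db")
    paths = [
        row[0]
        for row in conn.execute(
            "SELECT path FROM files WHERE dataset = ? AND year >= ?", ("sales", 2022)
        )
    ]
    conn.close()

    ddf = dd.read_parquet(paths)  # only the pre-filtered files are read
    totals = ddf.groupby("region")["amount"].sum().compute()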

  • @JeremyLangdon1
    @JeremyLangdon1 Год назад +8

    I'm a big fan of validating data I'm bringing in via Pandera. It is like defining a contract, and if the data coming in breaks that contract (data types, column names, nullability, validation checks, etc.) I want to know about it BEFORE I start processing it. I also use Pandera typing heavily to define my function arguments and return types, to make it clear that the data going in and out of my functions validates against a schema, which is way better than the generic "pd.DataFrame", e.g.:
    def myfunc(df: DataFrame[my_input_schema]) -> DataFrame[my_output_schema]:
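    A slightly fuller sketch of that pattern, assuming a recent Pandera version (the schema fields are made-up examples):
    import pandera as pa
    from pandera.typing import DataFrame, Series

    class InputSchema(pa.DataFrameModel):
        customer_id: Series[int] = pa.Field(ge=0)
        amount: Series[float] = pa.Field(nullable=False)

    class OutputSchema(pa.DataFrameModel):
        customer_id: Series[int]
        total_amount: Series[float]

    @pa.check_types  # validates the argument and the return value at runtime
    def summarise(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
        out = df.groupby("customer_id", as_index=False)["amount"].sum()
        return out.rename(columns={"amount": "total_amount"})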

  • @thomasbrothaler4963
    @thomasbrothaler4963 Год назад +8

    This was a really great video. I have been a data scientist for over 2 years and it was great to see that I have already developed a habit of using some of these points (probably because of your other videos 😂❤), but I also learned something new!
    Could you do the same thing for data engineering? That would be awesome!

  • @spicybaguette7706
    @spicybaguette7706 Год назад +1

    As an intermediate data storage format, I typically use duckdb because I'm quite comfortable with SQL, and duckdb allows me to query large sets of data very quickly
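    For anyone curious, a minimal sketch of that workflow (the file, table and column names are assumptions):
    import duckdb

    con = duckdb.connect("intermediate.duckdb")
    # Persist an intermediate result once...
    con.execute(
        "CREATE TABLE IF NOT EXISTS clean_events AS "
        "SELECT * FROM 'data/events_*.parquet'"
    )
    # ...then query it quickly in later steps, straight into a pandas DataFrame.
    top_users = con.execute(
        "SELECT user_id, count(*) AS n FROM clean_events "
        "GROUP BY user_id ORDER BY n DESC LIMIT 10"
    ).df()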

  • @Andrumen01
    @Andrumen01 Год назад +2

    I started doing Test Driven Development and it has saved me from more than one huge headache! Good advice!

  • @haldanesghost
    @haldanesghost Год назад +2

    Official request for a full Taipy video 🙌

  • @NostraDavid2
    @NostraDavid2 Год назад +3

    A .py file can be a notebook too! Just add "# %%" to create a cell - VS Code will detect it automatically!
    Downside: the output isn't saved the way it is with a regular notebook.
    Upside: the code is easier to test, no visual fluff, etc.
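    A minimal example of such a file (VS Code shows "Run Cell" above each # %% marker):
    # analysis.py - runs top to bottom as a plain script, but is cell-based in VS Code.
    # %%
    import pandas as pd

    df = pd.DataFrame({"x": range(5), "y": [v ** 2 for v in range(5)]})
    # %%
    print(df.describe())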

    • @askii3
      @askii3 11 месяцев назад

      I love these interactive sessions! Another plus of these is auto-reloading of imported functions that I just edited without re-running the whole script. Just add this at the top to enable reloading edited functions:
      %load_ext autoreload
      %autoreload 2

  • @drewmailman1965
    @drewmailman1965 Год назад

    For Tip 5, nbdev from fastai is a great package for exporting cells from a Jupyter notebook to a script. From my notes, YMMV:
    At the top of the notebook:
    #| default_exp folder_name_if_desired.file_name
    In each cell to be exported:
    #| export
    To run the export, add the following to a cell:
    import nbdev
    # For the current directory path:
    nbdev.export.nb_export("Notebook Name.ipynb", "./")

  • @DaveGaming99
    @DaveGaming99 Год назад +3

    Love these videos! I found your channel from your Hydra config tutorial and all of your videos have been full of invaluable knowledge that I've already been using in my projects at work! Thank you and keep it up!

    • @ArjanCodes
      @ArjanCodes  Год назад

      Thank you for the kind words, Dave! Will do.

  • @d_b_
    @d_b_ Год назад +9

    In defense of notebooks, the exploratory aspect that it allows makes it really nice and quick to find problems within the data. You can look deeper at the objects or dataframes where it halts. If there is a way to combine this ability with a well structured set of scripts, it would be fantastic

    • @joewyndham9393
      @joewyndham9393 Год назад +4

      Have you used an IDE with a good debugger, like Pycharm? You can set breakpoints to interrogate data, you can evaluate expressions etc

    • @harry8175ritchie
      @harry8175ritchie Год назад

      @@joewyndham9393 This isn't a bad idea, but notebooks are great for exploratory data analysis for a few reasons: they combine Markdown + Python, allowing you to explain analyses in detail, see various plots on the fly, and separate analyses by cells. However, notebooks are also an easy environment to get messy in quickly. Sharing notebooks has been essential throughout my career, specifically for research / exploratory projects that require explaining your analyses and thinking along the way.

    • @rfmiotto
      @rfmiotto Год назад +2

      I used to be a notebook user myself, but over time I adopted pure Python scripts, because you can structure your code better using principles of clean code and clean architecture (plus the benefits of having a linter and a formatter). You made a valid point, though, which is the ability to inspect object values with ease in notebooks. I overcome this with the VS Code debugger. Having well-structured code makes it easy to inspect variables in their particular scope. There might also be some VS Code extension that helps display variables in a friendlier way...

    • @FortunOfficial
      @FortunOfficial Год назад +1

      @@rfmiotto I had a similar path to you. I tried notebooks for a while and loved the quick analysis. Especially with PySpark it's handy, since you don't always have to start the runtime context again, which takes a couple of seconds. BUT somehow notebooks nudge me toward bad practices, such as putting all operations into the global scope instead of using functions. The VS Code debugger is pretty decent; it is a good replacement for the lost interactivity.

    • @isodoubIet
      @isodoubIet Год назад

      What works best for me is to write a script and then run that script in a REPL like IPython using %run. That way I still get to do things interactively, persist data sets in memory, etc., but don't have to deal with any of the annoyances of notebooks.

  • @QueirozVini
    @QueirozVini Год назад

    I constantly use the Parquet data format. It makes loading data WAY faster. In Python, it works just like CSV (e.g., using pandas, instead of read_csv() you use read_parquet()). It is bundled with an intelligent way of compressing repeated values, so it has a much smaller on-disk footprint compared to CSV and JSON. It stores data in a columnar fashion, so if you only need some columns for one project and other columns for another, you can avoid retrieving unwanted columns into memory. It also works well with big data environments (such as Apache Spark).
    Having a smaller on-disk footprint means you can transfer it to other people easily as well, and store it in cloud solutions at a lower cost.
    And honestly, as a data scientist, you would almost never open the CSV or JSON file and check it yourself. 99% of the time we use a library like pandas or software like Tableau to visualize and work with the data. So being human-readable is not really an advantage for data scientists the way it is for backend and frontend developers.
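    For comparison, a minimal sketch (requires a Parquet engine such as pyarrow; the file and column names are examples):
    import pandas as pd

    df = pd.DataFrame({"user": ["a", "b", "a"], "amount": [10.0, 12.5, 3.0]})
    df.to_parquet("events.parquet")  # columnar and compressed on disk
    amounts = pd.read_parquet("events.parquet", columns=["amount"])  # load only the columns you need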

  • @buchi8449
    @buchi8449 Год назад +1

    Useful list of tips, and here are additional ones we can derive by combining them.
    Tip 1 + tip 6: Use a common way of externalizing configuration.
    If each project externalizes configuration differently - for example, one uses a YAML file and another uses a .env file - it will be a nightmare for other people, particularly for the engineers working on deployment and operations.

    • @buchi8449
      @buchi8449 Год назад

      Tip 4 + tip 7: Implement data science logic as pure functions.
      In other words, don't persist intermediate data in the same code where the data science logic is implemented. The same goes for reading input data.
      Implement DS logic as a pure function taking a pandas DataFrame and other parameters as input and returning a pandas DataFrame as output, for example. File I/O should be done in separate code that calls this function.
      This separation of data science logic and file I/O makes unit testing of data science code easier.
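      A minimal sketch of that separation (the column names are made up):
      import pandas as pd

      def add_daily_returns(prices: pd.DataFrame) -> pd.DataFrame:
          """Pure transformation: no file I/O, so it is trivial to unit test."""
          out = prices.copy()
          out["return"] = out["close"].pct_change()
          return out

      def run(input_path: str, output_path: str) -> None:
          """Thin wrapper that does all the I/O around the pure logic."""
          prices = pd.read_csv(input_path)
          add_daily_returns(prices).to_parquet(output_path)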

    • @lamhintai
      @lamhintai Год назад

      I seem to run into this issue with some advice suggesting storing DB credentials in a .env file to avoid leakage via version control. I prefer YAML for other things (non-credentials/secrets), mostly due to its structure. And reading with dotenv in particular seems to be a bit messy when using type hints (it can theoretically return None, so warnings show up everywhere downstream).
      But this means I'm using both formats and not a centralized config…

  • @TomatePerita
    @TomatePerita Год назад +4

    It's a shame the pipeline suggestion came from a sponsor, as it would be very interesting to see you compare different tools like Snakemake and Nextflow. It's a very niche field, and choosing a tool is difficult since you have to commit a lot.
    Great video though; would love to see that pipeline video eventually.

  • @chazzwhu
    @chazzwhu Год назад +2

    Great vid! Would love to see your approach to doing a data project, e.g. download data, use Airflow to process it, train a model and host it via an API endpoint.

  • @ErikS-
    @ErikS- Год назад +2

    Arjan can now fill the Dutch city of Tilburg with his 200k subs! Impressive, since he only passed the city of Breda (150k) a couple of months ago!
    Congratulations, Arjan!

    • @ErikS-
      @ErikS- Год назад

      And next up will of course be Eindhoven!
      The Dutch city where Royal Philips was founded, now with around 250k inhabitants!

    • @ArjanCodes
      @ArjanCodes  Год назад +1

      Thanks Erik. It’s nuts when you think about what those numbers actually mean!

  • @FlisB
    @FlisB Год назад +2

    Great video. I feel data science projects are rarely examined in terms of design/structure quality. I hope to see more videos about it in the future. Perhaps one on writing tests; I sometimes lack ideas about how to test data science code.

  • @s.yasserahmadi7846
    @s.yasserahmadi7846 Год назад +1

    Which video should I watch first?! There's no ordering in this playlist; it's confusing.

  • @rafaelagd0
    @rafaelagd0 Год назад

    Amazing video! I am very happy to notice that these are the bits of advice I have been pushing in my work environments. I hope these things become the norm soon.

    • @ArjanCodes
      @ArjanCodes  Год назад

      Glad you enjoyed the content, Rafael! I hope so too :)

  • @ArpadHorvathSzfvar
    @ArpadHorvathSzfvar Год назад

    I use CSV a lot of the time 'cause it's simple and compact! I've also used Parquet when I wanted to be sure about the data types when loading the data back.

  • @ringpolitiet
    @ringpolitiet Год назад

    Polars scales great. Read the CSV and query, lazily if needed. Parquet for intermediate file system storage, polars.write_database if needed. "If you have to ask, polars is enough".
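    A minimal sketch of that flow, assuming a recent Polars version (the file and column names are made up; older versions spell group_by as groupby):
    import polars as pl

    lazy = pl.scan_csv("events.csv")  # nothing is read yet
    result = (
        lazy.filter(pl.col("amount") > 0)
        .group_by("user_id")
        .agg(pl.col("amount").sum())
        .collect()  # the query actually runs here
    )
    result.write_parquet("user_totals.parquet")  # intermediate storage on disk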

  • @MicheleHjorleifsson
    @MicheleHjorleifsson Год назад +2

    Data formats: Parquet and pickle, because they are lightweight and easily adapted to pipelines.
    Request: Would love to see a Taipy project video :)

  • @MicheleHjorleifsson
    @MicheleHjorleifsson Год назад +1

    BTW, Jupyter notebooks in VS Code with Git are nice, as you get the simplicity of a notebook plus performance data and versioning.

  • @Soltaiyou
    @Soltaiyou Год назад

    Great content as always. I’ve used csv, json, pickle, parquet, and sql files. I would argue there is no “standard” data science project. Once you get past boilerplate stats, you’ll inevitably have to write ad hoc functions to match the idiosyncrasies of your data either for deeper analysis or visualizations.

  • @luukvanbeusichem7652
    @luukvanbeusichem7652 Год назад

    Love these Data Science videos

  • @philscosta
    @philscosta Год назад

    As always, thank you for the great video!
    I'd love to see more DS content, as well as some content on mlflow and dvc.

  • @shouldb.studying4670
    @shouldb.studying4670 Год назад

    That fight to the death line caught me mid sip and now I need to clean my monitor 😂

  • @teprox7690
    @teprox7690 Год назад +1

    Thanks for the content. Quick feedback on the changing camera angles: it may look nice, but it disturbs the flow of thought. Thank you very much.

    • @Michallote
      @Michallote Год назад

      I agree, perhaps it's simply not well executed

  • @loic1665
    @loic1665 Год назад

    Great video, and great advice! I'll be giving a course on software development next semester, and I think some of the points you talked about are worth mentioning!

    • @ArjanCodes
      @ArjanCodes  Год назад

      I'm glad it was helpful! Good luck on your course next semester :)

  • @Will29295
    @Will29295 5 месяцев назад

    Good overview. Would really appreciate some physical examples.

  • @joshuacantin514
    @joshuacantin514 Год назад

    Regarding Jupyter notebooks, a lot of things still work when opening the notebook in VS Code, such as code formatters. You just may need to trigger them specifically in each cell (Alt+Shift+F for the Black formatter, I believe).

  • @barmalini
    @barmalini Год назад

    It might be a silly question in this context, but does anyone know of a similar quality channel with a focus on Java?
    Arjan is such a great educator that I am genuinely considering switching to Python, but it's hard because I must learn Java too.

  • @isodoubIet
    @isodoubIet Год назад

    The general point that you can learn from libraries is of course very good, and I've used sklearn as a design guide many times. That said, pandas specifically is probably a bad example for that purpose since a lot of its design is weird/messed up. For example, the default settings for reading and loading dataframes from csvs are all wrong -- it should be the case that if you write a dataframe and then read it, you should get the same thing -- saving stuff to disk should round-trip, right? Well, with pandas it doesn't, both because of indexing issues, and because it rounds stuff by default. Lots of little corner things in pandas like that, so IMHO while it's a powerful package that I probably couldn't live without, it's one best used _after_ you've acquired the relevant domain knowledge, not as a teaching tool for it.

  • @abdelghafourfid8216
    @abdelghafourfid8216 Год назад

    For caching and storing intermediate results, the fastest formats I've tried are msgspec for JSON-like data and Feather for table-like data.

  • @philscosta
    @philscosta Год назад

    It would be great to hear some ideas on how to write good tests for data-related projects. Lately I've been using syrupy to write regression/snapshot tests, to at least ensure that the results don't change when I do a refactor. However, this is not very robust. A challenge with all that is creating and managing good test data.

  • @askii3
    @askii3 11 месяцев назад

    I love using Parquet as an intermediate format then use Dask to read them and do lazy processing as much as possible until the pipeline is forced out of Dask.

  • @riptorforever2
    @riptorforever2 Год назад

    6:50 To query a JSON file or a collection of JSON files, there is the TinyDB library.

  • @alivecoding4995
    @alivecoding4995 Год назад

    Is Taipy running completely locally, or does it use their web services?

  • @ivansavchuk7956
    @ivansavchuk7956 Год назад

    Thanks sir, great video! Can you recommend a book on software design?

  • @KA3AHOBA94
    @KA3AHOBA94 Год назад

    Thank you for the good videos. Will there be any examples of design patterns applied to a GUI?

  • @walterppk1989
    @walterppk1989 Год назад

    For config, I like to have a centralised config.py with a Config class. It'll have attributes that try to get env vars, each with a default, e.g. os.environ.get('myvar', 'some_sensible_default').
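    A minimal sketch of that pattern (the variable names and defaults are examples):
    # config.py
    import os

    class Config:
        DATA_DIR: str = os.environ.get("DATA_DIR", "./data")
        DB_URL: str = os.environ.get("DB_URL", "sqlite:///local.db")
        SAMPLE_MODE: bool = os.environ.get("SAMPLE_MODE", "false").lower() == "true"

    config = Config()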

  • @pj18739
    @pj18739 9 месяцев назад

    When would I rather use Taipy than Dagster?

  • @WilliamWatkins-o6z
    @WilliamWatkins-o6z 10 месяцев назад

    I've been working on a project where files contain raw NumPy arrays and pandas DataFrames. I was really struggling to save the data to simple files, as pandas is slow and I didn't know what to do with the acquisition parameters. I first made my own serialiser and deserialiser to store the data in huge JSON files (using orjson for performance), but then I stumbled upon Pydantic. It's relatively easy to implement compatibility for NumPy and pandas; the files are still JSON, but the class objects are very compact, and using ABCs it's easy to create inheritance and apply these traits. Now I'm looking at alternatives to make these files compatible outside of Python, using either or both of SQL and HDF5. I'm quite surprised that this is not a solved problem. I found xarray, which could work, but it would need a complete rewrite. So yeah, maybe another day.

  • @slavrine
    @slavrine Год назад

    Can you go over using unit tests in data science projects?
    My team does not actively use them. We don't find them useful when we add new features and change processing code very quickly.

  • @walterppk1989
    @walterppk1989 Год назад +3

    My tip for teams that run many ML pipelines in particular: don't make many projects based on a single cookiecutter. Instead, nest all of your pipelines in a monolithic repo which is ONLY in charge of the ML code, and separate the projects into subfolders. That way, you don't have to maintain many different CI/CD pipelines and Docker images (they tend to be large, but that's the dependencies, not the application code).

    • @ali-om4uv
      @ali-om4uv Год назад +2

      That's good and horrible advice at the same time.

  • @blooberrys
    @blooberrys Год назад

    Can you do an in depth logging video? :)

  • @lisaaldridge8756
    @lisaaldridge8756 Месяц назад

    Thanks

  • @DeltaXML_Ltd
    @DeltaXML_Ltd Год назад

    Great video 😁

  • @MicheleHjorleifsson
    @MicheleHjorleifsson Год назад

    A different approach to config variables: use a MyConfig class and store it in a pickle file. This way the file isn't clear text when stored.

  • @suvidani
    @suvidani Год назад +1

    Use schemas to describe and validate your data.

  • @joewyndham9393
    @joewyndham9393 Год назад +2

    Can someone outline for me what benefits notebooks have over IDE development? I've recently switched from doing data science with an IDE in a typical software dev environment to using Databricks notebooks (due to a job change). I honestly can't see any benefit, but I can see a lot of drawbacks. In an IDE like Pycharm I can rapidly create experiments, I can visualise data AND I can write clean safe software. Notebooks put so many obstacles in the way of good development. What am I missing?

    • @MobsterSam
      @MobsterSam Год назад

      In an IDE, isn't it slower when one has multiple longish steps in a pipeline and has to rerun everything each time as one iterates? I'd be keen to learn how you get around that.

    • @sukawia
      @sukawia Год назад +1

      @@MobsterSam Tip #4 can get you pretty far in many cases. You can even set it up to work like a cache: create a decorator that saves the output of the function to a file and loads directly from it next time, then put the decorator on your preprocess, load_data, etc. functions.
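      For illustration, a minimal sketch of such a cache decorator, assuming pickle-friendly return values and a hypothetical cache path:
      import functools
      import pickle
      from pathlib import Path

      def cache_to_file(path: str):
          def decorator(func):
              @functools.wraps(func)
              def wrapper(*args, **kwargs):
                  file = Path(path)
                  if file.exists():
                      return pickle.loads(file.read_bytes())  # reuse the stored result
                  result = func(*args, **kwargs)
                  file.parent.mkdir(parents=True, exist_ok=True)
                  file.write_bytes(pickle.dumps(result))  # store it for next time
                  return result
              return wrapper
          return decorator

      @cache_to_file("cache/clean_data.pkl")
      def load_and_clean_data():
          ...  # expensive load/preprocess step goes here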

    • @joewyndham9393
      @joewyndham9393 Год назад +3

      @@MobsterSam Yeah that's a reasonable concern, but I think it's overcome by good coding practices like abstraction, separation of concerns, and good personal process.
      Let me flip the question and ask why you need to run steps A, B, C and D in a pipeline to be able to write step E? I'm assuming your answer might be that you need to know what the data looks like at step E. Then my question would be, why don't you know what is coming out of steps A, B, C and D?
      What I'm getting at here is that well structured, clean code makes it easy to understand what goes in and what comes out of functions and classes. So you can write huge amounts of good code without actually pushing any real data through it.
      I also want to ask you about what you are doing when you are "iterating"? Are you debugging? Are you trying new things in your model? Or are you doing both of these at once? If you are trying to do both at once, then I can see why you like notebooks. They really encourage you to bounce around your code slipping changes in here and there.
      And this is one issue I have with notebooks and the way lots of data scientists work - they don't separate the different tasks they are doing. If I'm writing code, that's all I'm doing. If I'm debugging I'm only debugging. And if I'm extending or modifying a model, it's only after finishing the first version.
      This point gets me to the debugger. In Pycharm I can use the debugger to pause at any moment in the pipeline, to see all the variables, to evaluate expressions, etc. So the functionality you actually want is there. But, it is only there at the right time - when you are debugging. And it encourages you to write cleanly, because debugging is waaaay easier in tightly written functions and classes with limited namespaces

    • @MobsterSam
      @MobsterSam Год назад

      Thanks for the suggestions! Oftentimes early in projects (I work in a domain that isn't well established) I do need to see what the data is like to decide on the choice and order of steps. After that, I refactor everything into a proper class-based pipeline. But the helpful things suggested here have given me ideas as to how I might start with, and stay in, an IDE.

    • @joewyndham9393
      @joewyndham9393 Год назад

      @@MobsterSam I agree it is always important to do an ad hoc scan of your data, and for that step, you're not necessarily writing code that will live on in your codebase, so you can relax the rules of clean code a fair bit. But in my opinion you can do that in a good IDE which allows inspection of variables and interactive plots. You also get all of the other super productive tools of the IDE. Happy coding!

  • @murilopalomosebilla2999
    @murilopalomosebilla2999 Год назад

    Great content!!

    • @ArjanCodes
      @ArjanCodes  Год назад

      Thank you so much, happy you're enjoying the content!

  • @joshuacantin514
    @joshuacantin514 Год назад

    HDF5 seems to be a rather useful data format, for both metadata and data.

  • @ali-om4uv
    @ali-om4uv Год назад

    Redo the video and add data version control (DVC). That is a must once you work in an organisation. It has rudimentary pipelining for model training as well. Everybody should know MLflow. And... avoid tools like Kubeflow if you do not have sufficient manpower to run them.

  • @ScinDBad
    @ScinDBad 4 месяца назад

    Where did you get that T-shirt? 😅

  • @MrPennywise1540
    @MrPennywise1540 Год назад

    I have Python code that uses Tkinter to make a GUI. The code edits images. Now I'm developing my webpage with Django, and I want to run the GUI code in my page. I hoped I could solve it with PyScript, but it's not compatible. I'm at a dead end. Can someone give me some advice?

  • @yickysan
    @yickysan Год назад

    Notebooks are great for experiments.

  • @scottmiller2591
    @scottmiller2591 Год назад

    Taipy seems to have abandoned the pipeline as a user concept - it no longer appears in the docs. I assume it's still in the mechanism, but no longer explicitly exposed. Rather, the emphasis seems to be on building GUIs with data nodes. My experience with graphical programming like this has been that they are extremely difficult to review, as one has to unfold a lot of nodes to actually get to the code - maybe they've gotten around this somehow, or maybe they want you to assume their code is foolproof, a bad sign.

  • @knolljo
    @knolljo Год назад +2

    A bear t-shirt and no mention of polars?

  • @hubstrangers3450
    @hubstrangers3450 Год назад

    Thank you. Could you please return to LLMs for a short series on MemGPT, OS and function calls (YT, v=rxjsbUiuOFo, robot-to-robot interaction)? If time permits, could you come up with a demo and the thought process: how futuristic is the scenario, and will it be cost-effective on an on-prem or cloud platform? Thank you.

  • @truniolado
    @truniolado 8 месяцев назад

    hey man, where tf did you get that amazing t-shirt?

  • @PeterT-i1w
    @PeterT-i1w Год назад

    Aryan codes? Based.

  • @alexloftus8892
    @alexloftus8892 7 месяцев назад

    As a professional data scientist, I disagree with a lot of this advice. With exploratory analysis, you are often writing one-off notebooks that nobody will read or reuse. Spending the extra time to write tests in this situation is wasted effort. A good middle ground is including `assert` statements in your functions to make sure they're doing what you think they're doing.
    Pull the code you're going to reuse out of your notebooks, and write tests for it then.
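    For example, a minimal sketch of such assertions (the merge and column names are made up):
    import pandas as pd

    def merge_labels(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
        merged = features.merge(labels, on="user_id", how="left")
        # Cheap sanity checks instead of a full test suite for throwaway analysis code:
        assert len(merged) == len(features), "merge changed the row count"
        assert merged["label"].notna().all(), "some rows are missing labels"
        return merged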

  • @lukekurlandski7653
    @lukekurlandski7653 Год назад +55

    Tip Number 0: Don't use Notebooks

    • @slayerdrum
      @slayerdrum Год назад +11

      I think they are fine as long as you use them for what they are most suitable for: Exploratory analysis with text. Not for creating production-ready code (which often is not the way a project starts anyway).

    • @sergeys.8830
      @sergeys.8830 10 месяцев назад +1

      Why?

    • @dinoscheidt
      @dinoscheidt 7 месяцев назад

      Exactly.

    • @EdwinCarrenoAI
      @EdwinCarrenoAI 5 месяцев назад

      They are really useful for proofs of concept, exploratory analysis, or basically to test an idea. But they are not a good idea for deployments and production code.

    • @Michallote
      @Michallote 5 месяцев назад

      @@sergeys.8830 They are completely unreusable and inefficient. Each run of a cell changes the notebook's metadata, which is terrible for version control. Cell outputs make the file stupidly large, and functions defined in notebooks are not usable anywhere else in the codebase.

  • @yickysan
    @yickysan Год назад

    You are wearing a pandas shirt.

  • @juancarlospizarromendez3954
    @juancarlospizarromendez3954 Год назад

    As always, start from scratch. It's from zero, nothing, etc.

  • @prison9865
    @prison9865 Год назад

    By the time you said what I can do with Taipy, I was already not interested. Perhaps tell people what Taipy can do for them first, and then how to install it and shit...

  • @TheSwissGabber
    @TheSwissGabber Год назад

    pandas... every time I come back to an old (6M+) project it does not work because they changed the API. That has never happened with any other library (numpy, matplotlib, scipy, etc.). So I would only use pandas if it REALLY benefits you; otherwise you'll have a guaranteed refactor in 6 months...

    • @suvidani
      @suvidani Год назад +4

      Each project should have its own defined environment, this should not be a problem.

    • @Jugular1st
      @Jugular1st Год назад

      ... And if your abstraction is good it should only impact a small part of your code.

    • @isodoubIet
      @isodoubIet Год назад

      What's amazing is that despite all the breaking changes, pandas still has a bad api with wrong defaults. Very useful, but not a library anyone should emulate.

  • @MarkTrombonee
    @MarkTrombonee 7 месяцев назад

    Wow! The top comments sound AI-generated: usernames like name+numbers, and long text comments that no one asked for.

  • @ardenthebibliophile
    @ardenthebibliophile Год назад

    For point 5, I've started to get our teams to think: reusable code/functions go in .py files; analyses go in .ipynb. They love Jupyter notebooks, and this keeps things more readable. Bonus: functions that are used a lot can be packaged more easily.