After working in BI/Data analytics for the last decade, the lack of software engineering standards for database code is my biggest pet peeve. Anyone who has had to debug a 5,000+ line stored procedure will agree that the current standards for managing database code can bring a grown person to tears.
Humans tend to struggle to understand code as it gets older and larger. I remember the Boeing 777 had a dangerous bug somewhere in its 200,000 lines of code (if I recall correctly), and it took them years to find it, delaying the release of the plane and costing them serious money. Somebody where you worked should have abstracted the stored procedure into smaller layers long before this point. Of course, they didn't.
It sometimes brings two grown persons to tears - the one who has to debug and the one who has to pay for it.
I refactor so that a class is about 120 lines and a method is about 5 lines, and usually every class balances a slim public interface, a bigger inner interface, and a small group of variables. Not that managers understand it, or that the younger developers get an idea of what I am doing there. But I do it anyway, even if they fire me. There is no alternative to good code. Speaking of databases: you can glue such historic concepts very well with modern code, but it needs some engineering. Fowler explains how.
As a Data Scientist now developing an MLOps platform, I've been looking forward to this video for a while. I'm watching avidly :)
As a Data Scientist... yes. So many folks in the industry don't think of what they're doing as software engineering, and so ~93% of models (if I remember the statistic correctly) don't even make it to production, let alone stay there for any reasonable amount of time!
Also, I would love more content in this vein. Looking at Data work from the software engineering side, especially in terms of designing systems, has been super helpful in improving my thinking.
There is certainly more learning and experimentation in data science and engineering than in 'normal' software development, so it is kind of normal that many models do not reach production. But even if 93% of models do not reach production, there needs to be a way to tell (e.g. failed models kept in source control history), otherwise someone will forget they failed and lose time experimenting on them again. The model versions, the input variables/model parameters, and even the output model performance metrics all need to be versioned (a small sketch follows below this thread).
Well, it is better that they do not reach production than that they reach production unfit for it :)
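To make the versioning point above concrete, here is a minimal sketch in plain Python (no particular MLOps tool assumed) that records the parameters, the identity of the training data, and the resulting metrics for every experiment, failed or not. The `runs/` directory and the `record_run` helper are invented for the illustration.

```python
import hashlib
import json
import time
from pathlib import Path

RUNS_DIR = Path("runs")  # illustrative location, committed alongside the code


def record_run(params: dict, metrics: dict, data_path: str) -> Path:
    """Write one immutable record per experiment: parameters, data hash, metrics."""
    data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()[:12]
    record = {
        "timestamp": time.strftime("%Y%m%dT%H%M%S"),
        "params": params,          # input variables / hyperparameters
        "data_sha256": data_hash,  # which snapshot of the data was used
        "metrics": metrics,        # output performance, even for failed models
    }
    RUNS_DIR.mkdir(exist_ok=True)
    out = RUNS_DIR / f"run_{record['timestamp']}_{data_hash}.json"
    out.write_text(json.dumps(record, indent=2))
    return out


# Example: a "failed" experiment still leaves a trace that can live in source control.
# record_run({"max_depth": 3, "learning_rate": 0.1}, {"auc": 0.52}, "data/train.csv")
```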
Wow, very timely video for me. I'm a data engineer at an academic institution which suffers from "legacy data". Managing information is difficult - but given that storage and compute costs are decreasing, people don't bother. I think technical debt (or "data debt"?) is going to bite us in the arse!
I'm going to rewatch this video a couple more times! Cheers.
A Software Engineer who works at a game publisher is not called a Game Engineer. A Software Engineer developing a payment system at a bank is not called a Money Engineer. But interestingly it seems like Data Engineers and their work are treated outside the realm of Software Engineering with all its principles and practices.
I studied data science to completion before moving to software engineering. To be fair, these concepts and traps were brought up many times in the subject matter - remember, most of the lecturers were battle-hardened software engineers before becoming data scientists. The question therefore remains: why do they struggle to do this in practice? Part of the problem is the size of the datasets; part of the problem was the gold rush in this area from 2016-2019; part of the problem is the individuals concerned; and part of the problem is that Dave's experience is not representative of what every data department is doing. The field of Data Engineering exists to solve this problem, and firms that hire these people will naturally avoid these problems. All that said, every point in the video is golden.
JavaScript/web dev here, and one of my biggest goals has been to version control my DB just as much as my code. For any other JS devs out there, I can highly recommend Prisma as a reasonable JS/SQL interface with version-controlled migrations, etc. Nothing like being able to check out a commit from last year and have my dev DB replay seeds and all migrations. That with containerized development... Truly feels like the future.
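As a rough, tool-agnostic illustration of the replayable-migrations idea (this is not Prisma, just a Python/SQLite sketch of the same principle, with the migration list and file name invented for the example):

```python
import sqlite3

# Ordered, append-only list: each entry corresponds to one committed migration.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN created_at TEXT",
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id))",
]


def migrate(db_path: str = "dev.db") -> None:
    """Replay any not-yet-applied migrations, in order, against the dev database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    applied = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, statement in enumerate(MIGRATIONS[applied:], start=applied + 1):
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
    conn.close()


if __name__ == "__main__":
    # Checking out an older commit shrinks MIGRATIONS; a fresh dev DB replays to match.
    migrate()
```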
Thank you for saying this and giving me something to point at to prove I'm not crazy and out of touch for raising these concerns.
I'm a Data Engineer. We source control all our tooling, have dev and prod environments and data is backed up. Building data applications seems different to software engineering in some areas, testing and design particularly. We can get very generic requirements to combine data from several large (up to multi-billion event) sources for different systems in varying structures with little to no documentation; that requires a certain design approach.
To respond to the point about different contexts of data: if you understand the grain (lowest level of detail) of a measurement event, it can then belong to an infinite number of contexts (sets) and behave the same way (a small sketch follows after this comment). "Those who control the grain control the universe!" A frequent request is to answer questions across these different contexts of data: do books on farming incur greater delivery costs, etc.
You won't see 'real' live data engineering demonstrated online, as you can't show what real enterprise data actually looks like. I never get asked at work how many people got off the Titanic.
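A rough pandas sketch of the 'grain' point above (the event table and its columns are invented): one table at the lowest grain of a delivery event, grouped into two different contexts without changing the underlying data.

```python
import pandas as pd

# One fact table at the lowest grain: one row per delivery event (invented data).
events = pd.DataFrame({
    "order_id":      [1, 2, 3, 4],
    "category":      ["farming", "farming", "fiction", "fiction"],
    "region":        ["north", "south", "north", "south"],
    "delivery_cost": [5.40, 6.10, 3.20, 3.90],
})

# The same grain answers questions in different contexts just by grouping differently.
cost_by_category = events.groupby("category")["delivery_cost"].mean()
cost_by_region = events.groupby("region")["delivery_cost"].sum()

print(cost_by_category)  # e.g. do books on farming incur greater delivery costs?
print(cost_by_region)
```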
Nothing more dreaded than a data scientist telling me that their notebook is ready to go to prod. Cheekiness aside, the software development mindset is growing in the data space albeit at a slow pace.
Most of this is very accurate. The "steel thread" approach works for an R&D team with no existing system or hard deliverable date. If you need to show results on an existing system over time, a "steel thread" is a bad idea. Start at the end. Have a feature path for deliverables and how they change over time. Work backward to make those happen. Often the requirements vary as the PMs and others see results. Building the thread from end to end will create a lot of throwaway work and time gaps where stakeholders can't provide feedback.
I typically call the steel thread a walking skeleton. I don't remember where I got that term from, but it's a pretty good way to explain to someone how to get a minimal system architecture running.
I think it may be a Dan North expression. Dan's good at those sticky phrases. In the olden days we used to call them "Architecturally significant use cases" which sounds rather pompous and less descriptive 😎
Notebooks without backups? Okay, some people like to live dangerously...
I think the most important thing was a little bit rushed in this video: the lack of consideration of the value of data. We store data because we think it can be meaningful - so use nondestructive filters (see the small sketch after this comment). The problem one wants to solve is one reason for that data - so try to understand it; the solution then comes almost for free if you understand it well enough. Most people in the business just care about linking some APIs together and are happy with classification rates, without critical reflection. The creation of systems is very often not the core component.
That's what I see as the biggest problems, but I am a mathematician (in applied mathematics), not a (fully fledged) software engineer.
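On the 'nondestructive filters' point, a minimal pandas sketch (the columns and the validity rule are invented) that records a judgement as data instead of deleting rows:

```python
import pandas as pd

readings = pd.DataFrame({"sensor": ["a", "b", "c"], "value": [12.0, -999.0, 8.5]})

# Destructive: the questionable rows are gone for good.
# readings = readings[readings["value"] > 0]

# Nondestructive: keep every row and record the judgement as a column.
readings["is_valid"] = readings["value"] > 0
clean_view = readings[readings["is_valid"]]

print(readings)    # full history, including what was filtered out and why
print(clean_view)  # the view most analyses will actually use
```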
The main cause of that is that notebooks ideally should have some form of specialized version control. Git is problematic because it cannot automatically differentiate the code lines from the execution results. That results in frequent conflicts and makes everything painful.
I think he means the dataset on which the notebook was based has not been kept as a copy - maybe it is 22TB etc. Of course the notebooks themselves are backed up.
@@michaelnurse9089 But that would be a false claim, wouldn't it? Redundant devices have been used for data storage for at least 25 years now. The more I think about it, the more I have to disagree with the claims in the video anyway; first of all, data science is not "new", it predates computer science by a huge margin and was then considered to be "statistics". Then, a Jupyter notebook really is a good hint that the code inside is barely reusable; one should strive not to have too many code lines in them anyway, but more comments. If code is reusable, it is better to write good old Python modules (or R, or whatever), and I think most have included that in their workflow. This video really isn't up to par with his usual quality.
Love it! I landed in a great place, and right now I'm on a project where we are implementing a data-vault-style warehouse but also setting up data governance and an SDLC for the data team. I'm the BA on the team, but still happy as a clam.
Not sure I fully agree with this. The whole point of DB normalization (used for decades) is to avoid imposing structure on data as much as possible, leaving it up to subsequent tools to make the required joins to transform the data into useful information sets. In my experience, that normalization process has been extremely effective, but it does require good skills to recognize core relationships that can survive the test of changing requirements. Those solid core relationships are often to be found in physical world relationships.
I understand what you're saying, but building something because the data at the source might not be the same as the data at the destination after transferring - that is handled by the transfer protocol, and the risk that something is lost (e.g. copying something from a network drive to a local drive) is negligible. Of course, keeping a copy of the source data before manipulating it is more than best practice.
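A small illustration of the normalization point a couple of comments up, using Python's built-in sqlite3 (the tables are invented): the core relationship is stored once, and a later query joins it into whatever information set a question needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# No reporting structure is imposed on the stored data; the join builds it on demand.
query = """
    SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS spend
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""
for row in conn.execute(query):
    print(row)
```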
Where did you get that tshirt?
I'm loving your content Dave. Looking forward to buying your book.
Awesome, thank you!
The problems in data handling... it's not surprising, as you pointed out. But it is... as you pointed out! LOL Just gotta love those paradoxes🤣 The complexity arises from simpler things, and ironically it's the simpler things that we "ignore". I face that mentality a lot. I like the "steel thread" concept; it's been my approach, I just didn't have a name for it until now... but I'm dealing with "not practiced here", just like the "not invented here" syndrome. Good topic, thanks for bringing it up.
For my personal tiny home projects, I have a general-purpose library (several files, several subjects) that serves several of these tiny projects. I don't usually use version control, because I'm always changing bits of the lib. Projects that I don't modify for months tend to get bugs from obsolescence with respect to that main lib. So, do you think that, for each of the projects, I should keep a copy of the current state of the lib so that they never become obsolete?
Use git submodules maybe? I like them for the same purpose you describe. It's gits within gits, gitception. Have separate repos for your different libraries, add them as submodules to your repo, and then whenever you check out one of your main repos, the last commit you were on for each library repo is still there, still working, waiting for you to pull new changes when you're ready.
@@jangohemmes352 Git is super problematic with notebooks because it does not understand which lines are code and which are execution results (they are intermixed in the file). So when you pull over one of your own versions that was run on other data, it nearly always results in a HUGE number of conflicts and you spend half your day trying to fix it.
We need a version control tool that knows how to handle notebooks safely; until then we need to believe that everyone using the notebook remembered to erase every single execution cell before a commit, and that is, well, like believing in Santa Claus.
@@tiagodagostini We're sidetracking the thread now, but to give you some idea of how to version control notebooks in git: use a pre-commit hook to clean the outputs of the notebooks and commit only the code part. There are a few ready-to-use open source solutions out there (a minimal sketch follows below this thread).
I think for small home projects Docker (or even full virtual machines) is a great technology - whether for Data Science or Software Engineering.
@@antoruby Yes there are, but the more pieces you need to make a system work, the more chances something will go wrong. I really wish git had something embedded for that, in the sense that it ignores certain things in merges.
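For the pre-commit approach mentioned a couple of replies up, here is a minimal sketch of the cleaning step in plain Python, assuming the standard .ipynb JSON layout (a ready-made tool such as nbstripout does the same job more robustly):

```python
import json
import sys


def strip_outputs(path: str) -> None:
    """Remove execution results so only code and markdown cells reach the commit."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1)
        f.write("\n")


if __name__ == "__main__":
    for notebook in sys.argv[1:]:  # e.g. invoked from a git pre-commit hook
        strip_outputs(notebook)
```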
100% true having worked in the domain
Amazing video
I hope lots of people watch this - these ideas need some traction
Most of these problems arise from poor or no training in programming. Most data scientists are just scientists who also know some Python or Java, and that is quite unfortunate, because those languages produce bad habits. The first language should be something from the functional family: they have rigid rules and enforce a structured, compositional programming style. They also make one constantly aware of side effects, scopes, and data flow. Then the ideas of versioning, configuration transactions (that can be rolled back), and model responsibility separation come to mind naturally.
absolutely genius.
Data Mesh seems to be a bit of consultancy-speak that means its progenitor can appear to be a thought leader in some way. The literature is very low quality for a Thoughtworks article.
Great shirt!
Nailed it 👍
Yep, you got it to the point. A lot of stuff about version control gets lost, i.e. not taught at universities. Scary!
Did you know that Neil Armstrong spelled backwards is “Gnorts Mr Alien”?
Hi Dave,
Could you maybe do a video about how the approach taken by the engineers who fixed the Oroville Dam spillway failure compares to a DevOps or Continuous Delivery approach? Their approach immediately reminded me of what I have learned on your channel, and it is what made them so successful. You can see a summary here ruclips.net/video/ekUROM87vTA/видео.html
Theorizing without any practical example. Low quality lecture.
Critiquing without any practical example, low quality feedback 🤣
I thought it was a great video