I'm a simple engineering student and a modest python user but those dynamic histograms sent chills down my spine
Check out plotly and the source code in the description!
The built-in dataclass also has default_factory for defining mutable default values.
😬 oops, thanks for pointing this out! I should have been more careful when I made the feature matrix.
Furthermore, unlike slots support, this has been available since the initial release.
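For anyone curious, a minimal sketch of default_factory (the ShoppingCart class is just a made-up example):

```python
from dataclasses import dataclass, field

@dataclass
class ShoppingCart:
    # a mutable default like [] must go through default_factory,
    # otherwise every instance would share the same list object
    items: list[str] = field(default_factory=list)

cart = ShoppingCart()
cart.items.append("apple")
print(cart.items)  # ['apple']
```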
Hey James, I just wanted to sincerely congratulate you for both the quality content and humor in your videos, amazing work!!
Thank you very much for your kind words and support!
I actually have a large ongoing project where I used namedtuples early on, with typing stored in a second tuple, then refactored to NamedTuple using the built-in typing (which simplified storing the typing separately), and finally to dataclasses after seeing your video on that. It fit my application perfectly as I needed the flexibility of being able to modify the dataclass. If only I had known about dataclasses to start with. :) attr class also sounds interesting for my needs, I need to check that out.
I wish he covered classes extending from namedtuple, one of my favorite pre-attr methods...
This is a great breakdown. I’ve had to explain this so many times to team members, now I’ll refer people to this video!
2:06 got me .. "real life". Those air quotes are heavy.
So when _are_ you going to explain slots? I have no idea what those are
Gulp, I feel the pressure.
@@mCoding yeah, I don't know what you're talking about either D:
That was the Google search I made right after this video 😂 I am intrigued
Classes usually use a dictionary to store instance variables. If you define `__slots__ = 'var1', 'var2'`, your class can only set the attributes named in slots.
@@Elijah_Lopez sir you're a legend
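Building on that explanation, a minimal sketch of what __slots__ looks like (Point is just an illustrative name):

```python
class Point:
    # attributes live in fixed slots instead of a per-instance __dict__,
    # which saves memory and rejects attribute names not listed here
    __slots__ = ("x", "y")

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
p.x = 3.0    # fine, "x" is a declared slot
# p.z = 4.0  # AttributeError: 'Point' object has no attribute 'z'
```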
I've used most of these, it turns out.
Started with the class that repeats everything. I then used the dict to try and make things slightly more convenient, but that was only feasible in very limited circumstances.
Then I discovered named tuples, but... they are tuples. Wasn't a huge fan.
Then, finally, I came across attr. That was a huge revelation and I absolutely loved it. Finally something decent.
And then dataclasses were introduced to the standard library and I basically switched to using those. attr can do more, sure - but the dataclasses are easier to use and don't need the dependency. Unless I actually need the power of attr, I'll just use these.
Depends on the data, tuples can be very suitable.
For instance, I have to consume a YAML file containing a HUGE sequence of geo-coordinates (lat/long). For this kind of data, the kind that you read, keep in memory, and must not change, tuples are perfectly suitable, use less memory, and are very fast.
And NamedTuple, just like other classes, can have methods defined within. So for instance I can write a distance_to() method which will calculate the great-circle distance between one geo-coordinate and another.
If you need mutability, though, of course tuple just won't cut it.
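A rough sketch of that idea, assuming a haversine-style great-circle distance (the Coordinate class is made up):

```python
import math
from typing import NamedTuple

class Coordinate(NamedTuple):
    lat: float  # degrees
    lon: float  # degrees

    def distance_to(self, other: "Coordinate", radius_km: float = 6371.0) -> float:
        # great-circle distance via the haversine formula
        phi1, phi2 = math.radians(self.lat), math.radians(other.lat)
        dphi = math.radians(other.lat - self.lat)
        dlam = math.radians(other.lon - self.lon)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * radius_km * math.asin(math.sqrt(a))

berlin, paris = Coordinate(52.52, 13.405), Coordinate(48.8566, 2.3522)
print(berlin.distance_to(paris))  # roughly 878 km
```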
Great thanks so much for this video! I am starting to study pydantic and I haven't been made aware of these differences. This is a huge help and I wish more people would explain these important differences when telling others they should use this or that.
this is really the best Python channel on RUclips. I've learned more on this channel than all others combined
Wow thank you!
Basically, one very strong rule of thumb is: If you need immutability and you can validate the data on your own, NamedTuple will _always_ be the best, hands down.
Pydantic 2.0’s just out, built around a Rust core. They claim up to 50x perf improvement so some of this might be changed. Still, kudos for covering v1’s overhead.
Great point! Maybe ill have to do an update video!
@@mCoding Looking forward to seeing how it compares to what you presented here :)
Also would have loved thoughts on TypedDict which mirrors NamedTuple for dictionaries, giving type hinting and string key checking
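For reference, a minimal sketch of TypedDict (the Server name is made up):

```python
from typing import TypedDict

class Server(TypedDict):
    host: str
    port: int

s: Server = {"host": "example.com", "port": 8080}
# s["prot"] = 80  # a static type checker flags the misspelled key;
#                   at runtime it is still just a plain dict
```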
I've often struggled with ways people define hyperparameters and inputs to neural networks in open source code. This video definitely helped me in my choice going forward.
At first you got my interest, after the "Presenting with meaningless example" you got my attention. Awesome video once again!
Yeah pydantic is so great for parsing/serializing JSON data.
I've been using it. But for simple data, I use built in dataclass
I really enjoyed the text comment/annotation overlays in this video. they both add useful background info and give the video a more relaxed vibe without distracting from the main points! :D
This is serious pro stuff. I started using dataclass thanks to your vid, then YT pushed another vid for pydantic and I was like bleh. Luckily this vid set me straight. Now I understand each one's use case. THANKS
Glad it helped!
Type hint gang. I came from a C background, so duck typing felt like a godsend. Then I got into Rust and realized how much time I was spending debugging code because it was duck-typable. (Let's not forget Rust's awesome toolchain compared to Python's... well, yeah.) Hell, I was using i_var, etc. just because it made it easier to reason about and not have to backtrack, which is when I first started wondering about it... Didn't fully click until I made the switch though.
1:24 "Will I ever explain slots?" One week later...
Thank you so much for your explanations, James!
Please keep the onscreen comments coming! Adds the perfect amount of fun to an informative topic "cries in mypy" 😁
Great video, thanks 😊
Iʼm starting a new Python project and Iʼm using `attrs` because of this video. Otherwise I would have used `namedtuple`, because I think without your videos I somehow would have missed even `dataclass`.
A few arguments for dict gang:
* Everybody knows how it works and what the syntax is.
* Many libraries use it as inputs or outputs.
* If used as an interface it is easy to change it without breaking things.
* It is trivial to load and store them in json or send them over the network.
I generally don't like speed comparisons of python code since that should never be the bottleneck of your program (you are using the wrong language if it is) but it is nice to know that dicts are fast af. I also haven't had any problems with reliability. But I guess that's partially due to my vscode extensions checking what I'm typing and giving suggestions.
But I have to agree with you that the syntax is quite ugly compared to accessing elements with a dot. Though I don't think it would be too bad if the language added similar syntax for that (basically any kind of shorthand for ["string"]). But I guess . would be ambiguous, and most other characters already have a meaning, so maybe a double dot? some_dict..some_element
The problem with dicts is that you cannot check types at compile time. You can store different data types in one field at any point during runtime. Also, static analyzers cannot predict the type of a dict element, so IDEs (like PyCharm) cannot help you with suggestions, especially with the methods available for each field. In your IDE, you use an AI-based extension that predicts data types, but that is not a static analyzer.
This is timely. I've got a program I'm writing now, and am using dictionaries. I still think that formatting my data as nested dictionaries is the best representation of that data. Also, the original data format is actually defined as a dictionary.
@@Alex-uh6qh This is true, but I think by picking python as your programming language you've already given up on being able to easily track the types of objects. With all the type hinting and other type-related stuff you are still quite far from the type information you have in languages like C++.
I do not use data classes nearly enough. This is good motivation to change that.
Really glad to see slots supported in dataclasses now. When you have a lot of instances of one class slots can save a ton of memory
This was added in 3.10?
@@mishikookropiridze That's more of a statement than a question isn't it
@@falxie_ It is a statement and hence you can assign it a boolean value.
Thx for the explanation! But it seems at this point Python is contradicting its own Zen: "There should be one-- and preferably only one --obvious way to do it."
IMHO one should always prefer immutability. The difference between creating a new instance and using a setter can usually be ignored. If performance is that critical, maybe one shouldn't choose Python in the first place.
Wow this is amazing man thanks for putting this together
4:36 There's TypedDict to consider too tho. (As in the type safety thing. You could type both dict and tuple and use a static type checker. If you use PyCharm and single quotes, accessing data by key is also not typo prone.)
Why single quotes? There's no difference between single quotes and double quotes.
@@PanduPoluan Idk, ask PyCharm (and also VS Code I just found out) out.
To clarify: not typo prone => there's IntelliSense / auto completion.
@@maimee1 well personally I don't find any difference between using single quotes and double quotes. But then again I always use double quotes because Black enforces that.
I have used almost all of these, so I can say this is a fantastic summary of the various options.
The apischema package is a good middle ground between Pydantic and dataclasses. It allows you to do the same runtime validation on dataclasses if you need to and has the same features as well as a GraphQL schema generator. It also performs validation faster than Pydantic.
Never used that one, thanks for sharing!
I am the owner of a backend project at my company and I use only pydantic; as we perform multiple API calls, the validations are essential, and it integrates really well with fastapi.
Best Python channel on RUclips. Thank you. If you used neovim too, it would be out of this world 😅
Thank you! Not a neovim user although vim was my main editor for a while there.
Great video! As always, the topic is well explained and I learnt something new
The on-screen comments are really fun, I hope you'll put more of this in the future videos!
Great stuff!
That would be even better if you'll make a follow up video about serialization of those objects and libs that can help.
Often it's required to send tuple/dataclass/etc. data over Kafka, to a DB, or save it as JSON, etc.
Include 'marshmallow' lib in the vid as well!
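Until such a video exists, a minimal sketch of round-tripping a dataclass through JSON with only the standard library (Point is a made-up example; libraries like marshmallow or pydantic add validation and schema features on top of this):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Point:
    x: float
    y: float

payload = json.dumps(asdict(Point(1.0, 2.0)))  # '{"x": 1.0, "y": 2.0}'
restored = Point(**json.loads(payload))        # Point(x=1.0, y=2.0)
```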
Wonderful last seconds, but wonderful video too!
I feel like I should go to casinos more often because I have no idea what slots are :)
Great video! Typehint gang
I think in the past, because I've been lazy, I've used tuples but, because I don't hate myself, I had constants for which index was which. I don't recommend this, but it gives you the speed of tuples with some of the naming power of namedtuple.
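Presumably something like this (the field names are invented):

```python
# index constants give plain tuples at least a little readability
X, Y, Z = 0, 1, 2

point = (1.0, 2.0, 3.0)
print(point[X], point[Z])  # 1.0 3.0
```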
your videos are always the best
In current versions of attrs, you only need to assign fields to an `attrs.ib` if you need any per-field option beyond a default.
Otherwise you can use regular variable declarations like dataclasses does.
(You might need to use the "next-gen" API, I can't remember at the moment)
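A small sketch of that modern style, assuming a recent attrs version with the attrs.define API:

```python
import attrs

@attrs.define
class Point:
    # plain annotations work like dataclasses; attrs.field is only needed
    # for per-field options such as validators, converters, or factories
    x: float
    y: float = 0.0
    tags: list[str] = attrs.field(factory=list)

p = Point(1.0)
```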
If you care about correctness, I would argue for NamedTuple. The fact that it's immutable is a feature, not a bug.
Immutability is definitely a feature, but mutability is also a feature. As always you should choose based on what is most appropriate for your problem.
Creating a new tuple still looks just as fast as modifying a value in a dict, interesting
Yeah that was the biggest surprise for me, but I guess it kinda makes sense since a tuple can be implemented as a thin wrapper around raw memory, but a dict has to do hashing and such.
However, you'll likely end up needing to get the tuple's values in order to instantiate a new one. So performing a function similar to dict setattr would be at a significant cost.
Hi mCoding, this is a pretty good video, it would be really cool if you could maybe do an updated version? Specifically I would like to see the addition of TypedDict.
Already know this is gonna be helpful and instructive!
For sure 🙂
OK, on type hints. I probably need a kick in the you-know-where, but I can't get it to play nicely with a few packages and features I need. I frequently use numpy, and a lot of functions don't really care about receiving a single number or a full array of numbers. They may not even care if the number is a float or an int, but let's focus on floats here. The return value typically has the same shape as the main input, but may be a single number.
How do I set up type hinting for numpy arrays? How do I set type hinting up for polymorphism?
I haven't looked into it extensively, but I'm aware there is a numpy.typing module which includes an ArrayLike type for anything that can be converted into an array, including scalars. you might want to look into the module documentation.
specifying the dimensions of an array in python's type hinting system is generally difficult however, so I'm not sure there's a way to incorporate that information in your annotations.
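A small sketch of that suggestion, assuming a recent numpy with the numpy.typing module:

```python
import numpy as np
import numpy.typing as npt

def normalize(values: npt.ArrayLike) -> npt.NDArray[np.float64]:
    # ArrayLike accepts scalars, lists, and existing arrays alike
    arr = np.asarray(values, dtype=np.float64)
    return arr / arr.max()

print(normalize([1, 2, 4]))       # [0.25 0.5  1.  ]
print(normalize(np.arange(5.0)))  # [0.   0.25 0.5  0.75 1.  ]
```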
Use TypeVar.
For instance, here's a made up function:
T = TypeVar("T")
def makelist(n: int, item: T) -> list[T]:
return [item for _ in range(n)]
Thanks! This is exactly what I needed.
Very informative video, thanks James!
It's also possible to use namedtuple like this:
from collections import namedtuple
T = namedtuple('T', 'n f s')
"As well as BaseModel, pydantic provides a dataclass decorator which creates (almost) vanilla python dataclasses with input data parsing and validation."
Yeah that's cool and whatnot, but have you ever tried this? class D(dict): __getattr__=dict.__getitem__; __setattr__=dict.__setitem__; __delattr__=dict.__delitem__
Lol no i never considered that :)
@@mCoding absolutely should, it's so easy and error-prone, practically a cheeseburger of python.
I started using pydantic bc it allows specifying a conversion function to try to cast input values to the desired type. I didn't realize how much of a performance hit that library incurs, or that attrs can do this also but much more cheaply. I guess I should switch to attrs.
I didn't specifically compare times for when you are doing conversions. Make sure to time your use case yourself since Pydantic may still be faster if you are doing conversions.
@@mCoding Ah, thanks for the heads up. That probably accounts for a decent chunk of the time difference, as I think pydantic is always going to try to do basic type casting on all values when instantiating new instances, which surely comes with some overhead, particularly when you supply a custom function for it.
There's also TypedDict (since 3.8) with type safety
TypedDict is actually just a dict at runtime, its value is only for static typing.
What library was used to generate the graph? It looks nice.
Plotly, It's easily the most interactive Visualization Library. It's as simple as matplotlib.
Yep, plotly express specifically. Check out the code on github! Link in desc.
@@the_crypter That's a bad measure of simplicity
For me Pydantic is great for prototyping and the losses are acceptable for the sake of always being informed in detail about data errors. It even enables you to skip writing early tests because of that.
Still the charts are extremely useful to show that Pydantic can be an important target in optimization.
An excellent point. This is Python after all, raw speed is not usually what we optimize for and paying some extra runtime cost for data validation when it "shouldn't" be needed may be worth it depending on the situation.
Is it though? 5 microseconds for creation, 9 and 400 ns for getting and setting, and that was before pydantic v2, enhanced by Rust.
Unless we're doing several thousands of those, are we really concerned about those numbers in Python? Especially when writing an API where the network latency and DB queries could easily reach the 100ms mark in good conditions.
@@heroe1486 I agree that it would be great to see this video with Pydantic V2 performance included. They made amazing progress.
I agree with the rest you said as well. You asked and answered your question. Like I said, it can be an important part, not everywhere, but it's good to have it on your checklist.
thank you for the thorough explanation
How did u measure memory usage?
Awesome comparison! What did you create the interactive graph at the end with? Looks like a nicer version of matplotlib.
Plotly
Good comparison, thanks!
man, you really hammered down on this issue. No need to watch anything else.
@mCoding are you REALLY sure you've measured memory footprint correctly? What was your test methodology? The difference between NamedTuple/dataclass/class is supposed to be quite different from what you've shown (they do differ but not THAT much).
According to this video (it's in russian, but code is clearly visible): youtube /tsEG0WM3m_M?t=60 :
1. The author uses the pympler.asizeof() function instead of built-in ones since it's the only right way to measure *FULL* memory consumption of a given object. I personally re-tested it (generated HUGE collections, taking literally gigabytes of RAM) - and yes, the built-in ones were returning some ridiculous results, not even close to the actual RAM taken by the python interpreter.
2. According to his tests, the difference is actually like this (on 1k instances):
2:05 - dict = ~1.2 MB
3:44 - dataclass = ~1 MB
5:04 - namedtuple = ~720 KB
5:54 - typed NamedTuple = also ~720 KB
It's hard to say whether the way I counted things is the "correct" way because it depends on what you wanted to count, but the numbers are approximately the same with pympler vs the getsize method I used. The order of which classes use the most memory is exactly the same with either method. The main difference between what pympler does vs what I did is that pympler tries to account for object alignment. pympler assumes that all Python objects are 8-byte aligned and no packing is done (hence why the pympler answers are all multiples of 8), counting padding bytes in the total size count. On the opposite end my getsize assumes all objects are optimally packed together, not including padding bytes in the total size. The truth is probably somewhere in the middle and also an implementation detail that could change at any moment. But, in any case, I wouldn't call either method the "correct" one, they are both good estimates and their difference is pretty small.
Also note that depending on the way you do your tests the data can make a big difference in how much space is actually used. For example (1,1) uses less memory than (1,2) because the 1 objects in the first tuple are the same.
pympler
0: dataclass (slots) - 168 bytes
1: plain class (slots) - 168 bytes
2: tuple - 176 bytes
3: NamedTuple - 176 bytes
4: namedtuple - 176 bytes
5: attr class (slots) - 176 bytes
6: dataclass - 432 bytes
7: plain class - 432 bytes
8: attr class - 432 bytes
9: dict - 512 bytes
10: SimpleNamespace - 552 bytes
11: pydantic - 560 bytes
method i used in video
0: dataclass (slots) - 162 bytes
1: plain class (slots) - 162 bytes
2: tuple - 170 bytes
3: NamedTuple - 170 bytes
4: namedtuple - 170 bytes
5: attr class (slots) - 186 bytes
6: dataclass - 408 bytes
7: plain class - 408 bytes
8: attr class - 408 bytes
9: dict - 488 bytes
10: SimpleNamespace - 528 bytes
11: pydantic - 536 bytes
@@mCoding thanks for such a detailed response.
> For example (1,1) uses less memory than (1,2)
Obviously, when you do performance tests, you need to intentionally break those under-the-hood optimisations. Back then, when I was checking the examples from the aforementioned video myself, I used the simplest values for items I could think of. iirc, each class (simple class, dataclass, dict, set, list, tuple and various types of named tuples) had just 3 values:
1. an int, unique for each item (and I know that int is internally optimised up to 256 or so, but that's negligible relative to the total number of items I had for the test - iirc, it was about millions, tens of millions or something of that order).
2. the same int, converted to a string, padded with random ASCII characters to make all the strings of equal length (used random characters instead of zeroes - just to be sure).
3. a float in [0, 100.0] range - also unique for each item.
And to be the most precise, as I said, I kept increasing the number of items until the total collection size reached above 1 GB. Each measurement attempt was done in a separate python session. And that's the thing I'm interested in the most when I asked about your methodology. With your method - did you just create a single instance and measure it, or did you generate a big enough number of them, measure the total consumption and divide it by the number of items? I mean, a single item difference might be 168 bytes vs 162. But if you have a tuple with a million dataclass instances vs the same tuple type storing the same million items with the same underlying data, but the items themselves are NamedTuples now, my results were very different from what you've shown. At the end of the day, it doesn't matter that each individual instance is reported about the same. What matters is that when you have a ton of them, and the only varying factor is the type of an item, you should count the total difference as overhead. You won't use just a single instance of that dataclass/namedtuple in your program. So I don't know the theory behind it, but in practice my own tests gave the same results that the russian guy reports in the video. And dataclass vs NamedTuple were nowhere near the 162 vs 170 numbers you provide.
Speaking of which, I have no idea how it's even possible for a dataclass to take less memory than a named tuple or even the simplest tuple.
So, could you disclose your methodology?
To be clear: I'm not attacking, I really want to know the actual difference in various types of data containers. I'm just concerned that the numbers you provide conflict with basically everything I ever heard on the subject and with my own synthetic tests.
I wish there was a channel like this for lua
Hello, if you didn't know, I decided to use the dataclass Python decorator as my handle
Haha you are gonna get a lot of accidental mentions with a handle like that!
Great video! But I feel like the onscreen comments were a bit distracting
Still prefer dataclass since there is no need to install additional packages :)
Could you please explain why I shouldn't assign attributes to an instance of an empty class? 4:50
I'm a bit confused... do you have any idea why SimpleNamespace's get is so horribly slow? I mean, it's a hash lookup anyway.
@mCoding. Could you redo this video with Pydantic 2.0? I get what you are saying about @dataclass being used in internal applications but sometimes you don't know for sure if it won't eventually be serialized into JSON, so pydantic is something I choose if I'm not sure. I want to know if the new 2.0 with Rust implementation has gotten the speed into the same ballpark as the other options.
Hmm, perhaps. While a rust implementation under the hood may improve performance, I suspect that it will not change the qualitative picture very much. Pydantic is slower primarily because it is fundamentally doing more work, namely validation and conversion, whereas the other options do neither validation nor conversion.
@@mCoding Maybe a good video might be how to use Dataclass and Pydantic together.
I think in my case half of my projects are with FastAPI which I love and it depends on Pydantic. I've seen too many videos that compare Pydantic with Dataclasses (yours included) and have come to think of them in the same category. Since I'm already working with Pydantic in half my projects I've just gotten very comfortable with them.
Knowing the performance hit puts a slightly different spin on the situation so maybe Dataclasses should be used for all internal-only data that won't be parsed. So then maybe just wrap a Dataclass in a field of a Pydantic class when you need to parse it.
I'll keep this in mind myself in the future.
Pydantic + Dataclasses would be an interesting video for me if you solicit ideas.
What are you using for the visualizations at the end?
Thank you. This information is really useful
It is possible to use @dataclass(init=False) and a custom __init__() for parsing purposes. With slots for sure ;)
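Perhaps something along these lines (a sketch; the raw-dict parsing is a made-up use case, and slots=True assumes Python 3.10+):

```python
from dataclasses import dataclass

@dataclass(init=False, slots=True)
class Point:
    x: float
    y: float

    def __init__(self, raw: dict) -> None:
        # custom parsing instead of the generated __init__
        self.x = float(raw["x"])
        self.y = float(raw["y"])

p = Point({"x": "1.5", "y": 2})  # Point(x=1.5, y=2.0)
```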
thank you man! You helped me a lot!
What tool are you using for your bar chart?
Plotly express! It can export to html you can share in your browser without python even installed.
Hi
Excellent Video! I was wondering what would be the right choice if I wanted to use the created class in a jit compiled numba function? As far as I have seen, namedtuples seem to be most suitable?
I think you need a class that is serializable. namedtuple and NamedTuple are serializable by default.
@mCoding How do you measure execution speed in a repeatable way? I was trying to measure performance, but for the same setup I got scores differing a lot (by more than a few percent). The code was purely in Python, no external sources, no IO, but the differences were still very noticeable.
For this video I believe I used timeit since they are tiny snippets, and the timing code is available in the github repository in the description. Timing measurements may vary drastically depending on things such as your CPU and version of Python, which is why it is always best to verify the timings for your own setup!
@@mCoding Also with Intel's franken-CPU having "P" cores and "E" cores, it will be a gamble.
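For reference, a minimal timeit sketch along those lines (the dataclass being timed is made up):

```python
import timeit

setup = """
from dataclasses import dataclass

@dataclass
class P:
    x: int
    y: int
"""

# repeat() runs several independent trials; taking the minimum reduces OS/CPU noise
times = timeit.repeat("P(1, 2)", setup=setup, number=100_000, repeat=5)
print(min(times) / 100_000, "seconds per instantiation")
```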
Great video, thanks so much!
And thank you for watching!
Accessing a dictionary's values by key is its primary purpose... it's only error-prone if you are ignorant of the pass-by rules of the value's type.
#teamdict
I have my own class decorator that returns dataclass(cls), but I get no type hints this way. Is there a way to fix it?
My 2¢ on attrs vs dataclasses:
* For application code, you can do whatever you want, but you should consider the cost of taking a dependency. This cost varies depending on the nature of your application and how you build/test/package/etc. it, so there's no one-size-fits-all answer here.
* For library code, don't take a dependency unless you absolutely have to, because you will be forcing it on all of your clients. dataclasses exists, so attrs is firmly in the "not absolutely necessary" bucket, and libraries should not depend on it in most cases.
Really liked the visualization. Is that plotly?
Yep! See the code to produce it on GitHub!
@@mCoding will check that out
would have been cool to compare pydantic with the validation turned off for fairness sake :D
Perfect video. Congrats
Which software/package/language are you using for the graphs UI in the end?
Plotly! See the source code in the description if you would like to see the exact code i use to generate the plots.
Also dataclasses can be "frozen" so they are not modified, which to me is better than pydantic's BaseModel
4:43 "SimpleNamespace is just like object, except it allows you to set attributes on it at runtime whereas object doesn't "
Maybe I don't get something, but objects do allow you to set attributes (= instance variable, right?) at runtime:
```python-repl
>>> class A:
...     pass
...
>>> a = A()
>>> a
<__main__.A object at 0x...>
>>> a.a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'A' object has no attribute 'a'
>>> a.a = 5
>>> a.a
5
```
Glad to see you are watching so many of my videos! Yes classes let you set attributes at runtime. What I'm referring to here is that objects whose type is literally object, as in "x = object()", cannot have attributes set on them. If you try "x.a = 0" you get an error!
@@mCoding OK, I see now. Thanks.
Yes your videos are really interesting and useful. So I decided to watch them all. But I guess I should spend less time watching videos and more time writing code. :)
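To spell that difference out, a minimal sketch:

```python
from types import SimpleNamespace

ns = SimpleNamespace()
ns.a = 0     # fine, SimpleNamespace accepts arbitrary attributes

x = object()
# x.a = 0    # AttributeError: 'object' object has no attribute 'a'
```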
What is the use case for slots?
Pydantic is always the answer. Since you can use the built-in dataclasses within Pydantic, if needed. But if you work in corporate, then yeah, dataclasses is the only answer 😅
Do you mind explaining why using dict is error-prone? Doesn't seem trivial to me.
Unless you define a TypedDict, you might accidentally mistype a key, resulting in a KeyError.
Personally I find that the “potential typo” issue is overstated. I have 20 years of python experience and it’s never been a serious source of errors. Code that isn’t easily understood, like when you use a mess of nested classes instead of a simple data structure with a dictionary at its root, however, has caused me a ton of problems and really hard to debug situations.
Can't you throw your python code through some optimizer to convert everything to a tuple wherever possible? Your source code would still be your own readable code, but the optimized code that comes from that will be more optimized for speed and memory usage. Best of both worlds!
You don’t like dictionaries for this, for reasons that (to me) come down to basically, “I want my IDE to autocomplete for me.” If you remove that consideration dict ends up being the choice with the most flexibility and least work. This is especially true if you need to serialize the data and send it to other systems. You can’t easily jsonify a dataclass!
Really, my POV comes down to “why are you using python for that?” It’s not a language that’s made for large systems where this stuff matters, it’s made for rapid prototyping and applications where speed of implementation is your biggest factor.
If you’re in a situation where the extra work to use dataclasses is worth it, why are you using python at all?
Okay, so first, dataclass has no type checking; with attrs you must give a validator with validator=, so the notation n: int alone does not work.
These foreign library classes are horrible.
How do I do it? It is no problem to write a class defined as in pydantic, read kwargs and set args with type checking on init and methods, including checking of collection item types like List (which I am almost sure these libraries are not doing), but with the nice ability to turn it all off with one command once debugging is done.
Hi, it seems like you are new to Python. The notation x: int is not supposed to be something checked at runtime, these hints are completely ignored at runtime as this would be a huge (think 10x) performance penalty, which is shown in the graphs in the video. Most type errors can be found by static analyzers, which is who the x: int is for. The only case when you need to do runtime checking is when you don't know the types ahead of time. The most common situation this happens is parsing since you don't know what data you are going to read in next, and this is why pydantic purposefully pays the cost of runtime type checking.
@@mCoding hey, I am not new. It is function field types that are ignored; here is a declaration of a static class field (n) equal to a class (int): n = int
I thought that in THESE tools, for example @dataclass, the notation SHOULD typecheck, because why else do we write the dataclass construction like that? And I just checked entering a float instead of an int, and it works smoothly.
So my solution would be: retaining the @dataclass syntax, which I found kind of convenient because it retains order and there is no need to specify all arguments as named, create default type-checkers and turn them on, and if you want a custom checker, you can write something like a = lambda x: 0
Which tool are you using to create the interactive bar charts?
Plotly express
What about list?
if i have a very large set of data, with different types of data like multiple timeseries, single character/digit variables etc, should i use dataclasses to store them? and if so how? do i pickle classes? right now i'm using pandas for everything. thanks for the video
I may be in the extreme minority here, but IMO dataclasses are not a good fit in most situations, but particularly here where you have large sets of nested data. Just stick with dict or pandas.
Your videos are so good that I believe you could create a good intermediate-advanced python course. Just saying
And here I was thinking I was fancy by bundling data into a dictionary vs lots of variables! (Stupid Dunning-Kruger effect)
Give yourself enough time and you’ll come back to the wisdom of simply using dictionaries instead of complex nested objects.
I see you import modules inside functions. Any particular reason?
This was just to make it easier to see which imports were needed for which examples.
Okay, at 5:55, one of my recent headaches is reading a god-damn XML file. I hate the guts out of it. I have to parse everything as it is always in string format. xml.etree is great but I still have to manually input every string and rename classes.
Are you using R ggplot for the plot?
I'm using plotly!
Well, firstly, your hair was so great tho
If only python-box was included. It provides both dictionary style and dot style access
dataclasses video with slots and inheritance (super_init)