Bytes and encodings in Python

Python and Pandas with Reuven Lerner

103K views

Published: Dec 2, 2024

Comments • 98

@ReuvenLerner 5 years ago ⁺¹
Hey, there's a newer and more accurate/complete video now available: ruclips.net/video/5zMmc1wmZ_0/видео.html . That video is part of my "Python standard library" series, to which I'll be adding a few videos every week.
@ben3295 1 year ago ⁺¹
This was a wonderful way to present this topic--concise and clear. I learned a lot here.
@ReuvenLerner 1 year ago
I'm delighted to hear that it helped!
@MsMJehan 2 years ago ⁺¹
Best explanation of the topic I've found yet; I really appreciate your taking the time to explain this.
@ReuvenLerner 2 years ago
So happy to hear it; thanks!
@leamon9024 5 years ago ⁺⁹
Simple and to the point. Thanks!!!
@robertbush9805 2 years ago ⁺²
2:39 - is it a characteristic of the encoding you used that you did s[0] but it printed the character that looks like a 'w' and is actually last in the string?
@ReuvenLerner 2 years ago ⁺¹
Hebrew is written right-to-left! Because I'm using Jupyter (in my browser), it correctly displays the word from right to left, with the first character on the right and the final character on the left. So yes, s[0] will be the rightmost character, "shin," which does indeed look like a w, now that you mention it!
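A small sketch of that answer, assuming the string in the video is the Hebrew word שלום (as other comments here suggest). Indexing follows the logical order of the characters, whatever direction they are drawn in:

```python
# Assumption: the string from the video is the Hebrew word "shalom".
s = "שלום"

first = s[0]                       # logical first character, drawn rightmost
n_chars = len(s)                   # counts characters, not bytes
n_bytes = len(s.encode("utf-8"))   # each of these letters takes 2 bytes in UTF-8

print(first, n_chars, n_bytes)
```

The same string is 4 characters long but 8 bytes long once encoded as UTF-8, which is the distinction several questions further down run into.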
@adilshamji4035 4 years ago
Thank you! Couldn't have hoped for a better explanation.
@takshpatel8109 2 years ago ⁺¹
Best explanation for this topic. Thank you.
@bensmith9253 4 years ago ⁺¹
This was FANTASTIC.
Subbed!! Can't wait to watch the next one.
@nocontentnoname5922 4 years ago ⁺¹
Helped a lot, thanks.... anyone else watch this clip in covid lockdown?
@OlafBeardfoot 3 years ago ⁺¹
Thank you so much, this video helped me with a problem haunting me for a week!
@ReuvenLerner 3 years ago
Delighted to hear it!
@shinej11 2 years ago ⁺¹
Thank you very much for such a clear explanation of this topic.
@andracoisbored 8 months ago ⁺¹
Interesting. I actually understand all this explanation. Pretty proud of myself 🙂🙂🦆🦆
@ReuvenLerner 8 months ago
Excellent!
@tymothylim6550 3 years ago ⁺¹
Thank you very much for this video, Reuven! It was very educational and I really enjoyed it! :)
@manojkumar-jd1dk 1 year ago ⁺¹
Hi, I am working on creating a mainframe EBCDIC file which has COMP-3 (packed decimal) columns and regular data-type columns. I am encoding the columns other than COMP-3 to cp037. I am trying to work out how to convert data to COMP-3 and encode it to cp037 like the other columns. Please suggest an approach.
@ReuvenLerner 1 year ago
I don't really know, sorry - I know that there are a ton of encodings you can use, declared at docs.python.org/3/library/codecs.html, but I've never touched EBCDIC, and certainly don't know about comp-3. I hope you can find some help on this!
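For readers with the same question, here is a rough sketch, not a vetted mainframe routine. It assumes the usual packed-decimal layout (one decimal digit per 4-bit nibble, with a final sign nibble of 0xC for positive and 0xD for negative) and ignores the fixed field widths a real COBOL layout would impose. The helper name `pack_comp3` is made up for illustration. The point it tries to show is that a packed field is raw binary and is not passed through the cp037 codec at all:

```python
def pack_comp3(value: int) -> bytes:
    """Pack an integer as packed decimal: one digit per nibble, sign nibble last."""
    sign = 0x0D if value < 0 else 0x0C          # assumed convention: C = +, D = -
    nibbles = [int(d) for d in str(abs(value))] + [sign]
    if len(nibbles) % 2:                        # pad to a whole number of bytes
        nibbles.insert(0, 0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

text_field = "HELLO".encode("cp037")   # character column: goes through the codec
number_field = pack_comp3(-12345)      # packed column: raw bytes, no codec

record = text_field + number_field
print(record.hex())
```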
@abderrahmanedjerourou6369 3 years ago ⁺¹
Thank you. Simple explanation.
@NiklasAndersson7 6 years ago ⁺³
Great tutorial. Thanks a lot!
@erick5171 2 years ago ⁺¹
Thank you. Best explanation. Loved it.
@ReuvenLerner 2 years ago ⁺¹
Delighted to hear you enjoyed it!
@dexdevlon8941 5 years ago ⁺²
0:24 No, you're wrong, there is a reason the binascii library exists...
@angelicamatch211 6 years ago ⁺¹
Sorry, I just went through the video again in full, and I'm not sure why it's able to print the 2-byte characters if you set them to a variable. From around 5 minutes in until the end is where it gets confusing. I'm not sure if I should try setting it to a variable as you did, and put that in my code at the lines where I get the "can't decode byte" error in cp1252. Why is a cp1252 file involved in my script's decode issues anyway? Thanks, but I'm still very lost on this, even though I feel I'm actually understanding the different encoding schemes.
@MilanKarakas 3 years ago ⁺¹
I'm still getting an error when I try to save the file with file.write(number). From 1 I get 0x31, which is ASCII for "1", but I want to save 0x01.
@ReuvenLerner 3 years ago
The "write" method takes a string, not a number. (If you open with "wb", then it can take a bytestring.) So make sure that you're passing a string to the method when you try to write to the file.
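A minimal sketch of the "wb" route mentioned in that answer. The file name is hypothetical and a temporary directory is used so nothing real is overwritten:

```python
import os
import tempfile

number = 1
path = os.path.join(tempfile.mkdtemp(), "example.bin")  # hypothetical file name

with open(path, "wb") as f:        # "wb" = write bytes, no text encoding involved
    f.write(bytes([number]))       # a single byte with the value 0x01

with open(path, "rb") as f:
    raw = f.read()

as_text = str(number).encode("ascii")  # what writing the text "1" stores: 0x31
print(raw, as_text)
```

`bytes([number])` only works for values from 0 to 255; for larger integers, `int.to_bytes` lets you choose the width and byte order.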
@use_FongSha 2 years ago ⁺¹
Excuse me sir, I have a question: what method can I use to convert b'\x00 \x00' to a string, so that it becomes '00'?
@ReuvenLerner 2 years ago
You're probably best off with a comprehension that goes through each of the integers in a bytestring and turns it into a string, then joins them together. There isn't, so far as I know, any method that'll do this for you.
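The comprehension approach might look like the sketch below. It assumes the space in the middle of the bytestring is a separator to be skipped, which the question does not actually say. For hexadecimal output specifically, `bytes.hex()` has been a built-in method since Python 3.5:

```python
data = b"\x00 \x00"   # the bytes from the question: 0, 32 (space), 0

# Assumption: the space (byte 32) is a separator and should be dropped.
digits = "".join(str(n) for n in data if n != 0x20)

hex_form = data.hex()  # built-in alternative: two hex digits per byte
print(digits, hex_form)
```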
@jaikumardas5760 6 years ago ⁺²
Nice tutorial. Love it.
@angelicamatch211 6 years ago ⁺¹
I am still trying to print a character in the 2-byte range in Python 2, for example an Arabic character (which for some reason I can't copy and paste here, I don't know why). It's the U+0626 character, and I can't encode it so that it will print. I thought you could do .encode('utf-8') or u(and the U+0626) and it would print, but I keep getting "unsupported character".
@believer9746 3 years ago ⁺¹
Sir, how do I find the number of bytes in Python (an attribute of a Series)? How does the output give values like 32 bytes, etc.? Please explain.
@aaratidhungel2109 5 years ago ⁺²
Sir, how can we convert UTF-8 code into a character?
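One reading of this question, sketched in Python 3: if "UTF-8 code" means the encoded bytes, `bytes.decode` turns them into a string; if it means the Unicode code point number, `chr` does the job. The letter used here is just an example:

```python
raw = b"\xd7\xa9"               # UTF-8 bytes for one Hebrew letter
char = raw.decode("utf-8")      # bytes -> str
same = chr(0x05E9)              # code point number -> str, no bytes involved

print(char, same, hex(ord(char)))
```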
@rememberfact5406 5 years ago ⁺¹
Hello, sir, I need help. I want to remove the non-printable characters from an XML file, because parsing the XML into a Python dictionary object raises an error: xml.parsers.expat.ExpatError: reference to invalid character number: line 19495, column 31
@juuamjskn2420 4 years ago ⁺¹
Try opening the file with an encoding of utf-?
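That particular expat message usually points at a numeric character reference (such as `&#x1F;`) to a character XML 1.0 forbids, rather than at the file's encoding. A possible cleanup step, sketched under that assumption; the function names and the sample document are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

def is_valid_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 character ranges."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")

def strip_invalid_refs(xml_text: str) -> str:
    """Drop numeric character references that XML 1.0 does not allow."""
    def keep_or_drop(match):
        body = match.group(1)
        cp = int(body[1:], 16) if body.startswith("x") else int(body)
        return match.group(0) if is_valid_xml_char(cp) else ""
    return CHAR_REF.sub(keep_or_drop, xml_text)

broken = "<row>bad&#x1F;value &#169; ok</row>"   # made-up sample document
cleaned = strip_invalid_refs(broken)
root = ET.fromstring(cleaned)
print(root.text)
```

Valid references such as `&#169;` are left alone, so only the offending ones disappear.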
@PrashantSharma-ql4yb 3 years ago ⁺¹
Thanks, very valuable.
@mucharlaravivenkatateja2412 3 years ago ⁺¹
Hi, how do I get rid of Unicode errors while reading an Excel sheet in Python?
@ReuvenLerner 3 years ago
If you use Pandas to read your Excel sheets, you should be fine. And if not... then I'm not sure what to say. There are numerous libraries for reading Excel in Python; you almost certainly don't want to be doing it yourself.
@BlenderDumbass 4 years ago ⁺³
Saw שלום, subbed.
@LordBurningStuff 1 year ago ⁺¹
Maybe I'll find this out before you can answer, but how can you use Python to write your own binary encodings? I want to assign 4-bit binary codes to a set of
@ReuvenLerner 1 year ago
I'm sure that you can do this but... OMG, that sounds really hard and annoying. Unless memory is at a *huge* premium, or you're just looking for an intellectual challenge, I would stick with existing standards.
@LordBurningStuff 1 year ago ⁺¹
@@ReuvenLerner The program I'm writing has a bottleneck caused by reading and writing speed. If I can cut those down using custom codes that are 4 bits instead of 8, woop woop, might have a game changer.
@ReuvenLerner 1 year ago
@@LordBurningStuff Ooh, very interesting! You're basically defining a new encoding, which is fine, but (so far as I know) Python is hard-coded to deal with a bunch of others, and can't be extended to use new, custom ones without recompilation. I might be wrong, though,
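For what it's worth, packing 4-bit codes does not need a new codec at all; plain functions over `bytes` are enough. (The standard library also has `codecs.register` for plugging in custom codecs from Python code, so recompilation should not be required.) A sketch with a hypothetical 16-symbol alphabet, since the question does not say what the real symbol set is:

```python
# Hypothetical 16-symbol alphabet; each symbol's 4-bit code is its index.
ALPHABET = "0123456789ABCDEF"

def pack(symbols: str) -> bytes:
    """Store two 4-bit codes per byte; an odd tail is padded with code 0."""
    codes = [ALPHABET.index(ch) for ch in symbols]
    if len(codes) % 2:
        codes.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(codes[::2], codes[1::2]))

def unpack(data: bytes, count: int) -> str:
    """Reverse pack(); count says how many symbols were stored."""
    codes = []
    for byte in data:
        codes += [byte >> 4, byte & 0x0F]
    return "".join(ALPHABET[c] for c in codes[:count])

packed = pack("CAFE1")
print(len(packed), unpack(packed, 5))
```

Five symbols fit in three bytes here. Whether that actually removes an I/O bottleneck depends on the workload, so it is worth measuring before committing to a custom format.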
@erkamtokgoz6130 6 years ago ⁺¹
Thanks a lot. Is there another way?
@andrewhopkins1694 4 years ago
Hi Reuven, I am working on a Python program (open source) that I will use to search the Hebrew text of the Tanakh, about 2.2 MB of data. I have written small programs in Python before. Because the text I have is in Unicode, it looks like I have to learn a bit. Could you recommend any libraries and methods for storing the Hebrew text? I was thinking of SQLite, but maybe that is overkill. I hope to index the location of every word in the text.
@ReuvenLerner 4 years ago
I'm a big fan of PostgreSQL, which handles Unicode and text searches beautifully. But learning to work with it might be overkill for your needs. That said, if you're using Python 3 and any modern version of PostgreSQL, you should be fine, and the Unicode stuff should just work automatically. The best part of using PostgreSQL is that you can do all sorts of text indexing, which I *think* will work with Hebrew.
You might also want to check with Sefaria, which is a public and free version of the Tanakh and other Jewish texts. I don't know if their software is open source, but if it is, you might be able to leverage it.
The coolest part of what you wrote is the fact that the entire Hebrew Bible is only 2.2 MB. Wow, talk about a small book having a big influence!
@andrewhopkins1694 4 years ago
@@ReuvenLerner Thanks for the reply. I am using Python 3.8. I will look into the text you spoke of, because the text I have has a copyright on it, which is a problem for me. Also, I thought I would try SQLite; I have had a little experience with it. I have been able to read and display a Unicode file of the Hebrew text and maintain the Hebrew font, so it's true, as you said: it all seems to work. You will see in the following that I make one reference to Unicode. I cut and pasted the text from a Word doc to a txt file, and it held the Hebrew text formatting when using print(). Andrew
f_name = 'genesis.txt'
# Open in text mode; the encoding argument makes Python decode the file as UTF-8.
with open(f_name, "r", encoding="UTF-8") as f:
    contents = f.read()
print(contents)
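Since SQLite comes up in this thread, here is one way the word-location index might be sketched with the standard library's `sqlite3` module. The table layout and names are assumptions for illustration, and an in-memory database stands in for a real file:

```python
import sqlite3

# Genesis 1:1 without vowel points, as a tiny stand-in for the full text.
text = "בראשית ברא אלהים את השמים ואת הארץ"

conn = sqlite3.connect(":memory:")   # a real program would pass a file path
conn.execute("CREATE TABLE words (word TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [(word, pos) for pos, word in enumerate(text.split())],
)
conn.execute("CREATE INDEX idx_word ON words (word)")

hits = [row[0] for row in
        conn.execute("SELECT position FROM words WHERE word = ?", ("ברא",))]
print(hits)
```

Python strings go into and come out of SQLite as Unicode text, so no manual encoding step is needed.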
@yonTlevin 3 years ago ⁺²
How funny, I was looking for an explanation of something, opened your video, and suddenly you're writing "shalom" in Hebrew. What a fun surprise!
@ReuvenLerner 3 years ago
Good to hear!
@felixdunkel2091 3 years ago
The man despises you though.
@noahn12 2 years ago ⁺¹
Hi. Could you please explain how to compute a CRC32 of data in Python?
@ReuvenLerner 2 years ago ⁺¹
I haven't ever used it myself, but the Python standard library includes a function that calculates crc32: docs.python.org/3/library/zlib.html#zlib.crc32
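A short sketch of that function in use. The input is the conventional nine-digit CRC test string, chosen only because its checksum is widely published:

```python
import zlib

data = b"123456789"              # the conventional CRC test input
checksum = zlib.crc32(data)      # an unsigned integer in Python 3
print(f"{checksum:08x}")

# Large inputs can be fed in pieces by passing the running value back in.
running = zlib.crc32(b"12345")
running = zlib.crc32(b"6789", running)
```

Note that `zlib.crc32` takes bytes, so a string has to be encoded first, for example with `text.encode("utf-8")`.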
@noahn12 2 years ago ⁺¹
@@ReuvenLerner Thank you very much.
@alessioandreoli2145 2 years ago ⁺¹
Super interesting. But I didn't get how I can write a function that can read and properly encode any input from any language... Is that even possible?
@ReuvenLerner 2 years ago ⁺¹
If you use Python strings, with Unicode characters, then yes -- you can work with any language.
@alessioandreoli2145 2 years ago ⁺¹
@@ReuvenLerner So when I read a command's output or a file, I should encode what I read and then decode it when I print the output, or print it as a bytes type?
@ReuvenLerner 2 years ago ⁺¹
@@alessioandreoli2145 When you read a command from the user, or text from a file, the text is in Unicode unless you specify otherwise. That means the text can/will contain any characters known to Unicode, which is basically anything you're likely to get. No bytes are involved. If and when you want, you can encode from Unicode strings into bytes, but there's rarely a need to do this unless you're dealing with binary data or the network.
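One place where bytes do show up is the output of a subprocess, which may be what "command output" meant here. A sketch, assuming the child process emits UTF-8; a child Python interpreter is used so the example runs on any platform:

```python
import subprocess
import sys

# Run a child Python so the sketch does not depend on any shell command.
cmd = [sys.executable, "-c", "print('hello')"]

raw = subprocess.run(cmd, capture_output=True).stdout               # bytes
text = subprocess.run(cmd, capture_output=True, text=True).stdout   # str

decoded = raw.decode("utf-8")    # assumption: the command emits UTF-8
print(type(raw).__name__, type(text).__name__, decoded.strip())
```

So the direction is the reverse of the one in the question: bytes coming in are decoded into strings, and strings are encoded only when bytes have to go out.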
@jeffjefferson2676 4 years ago
Hello, I've got a question: what are the main encodings in use? Like latin-1, utf-8, utf-16, utf-32. Decoding seems hard to do, but is it really, if you just try the mainstream encodings? There's a pretty good chance you can get it done.
@ReuvenLerner 4 years ago ⁺²
Most people are currently using UTF-8, so far as I know. But the world is a big, messy place, and there are a lot of documents in other encodings, too.
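To make the difference between those encodings concrete, here is a small sketch comparing how many bytes one accented character takes in each, and what happens when bytes are decoded with the wrong one:

```python
ch = "\u00e9"   # e with an acute accent

sizes = {enc: len(ch.encode(enc))
         for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# Decoding with the wrong encoding often "works" but yields mojibake.
mojibake = ch.encode("utf-8").decode("latin-1")

# UTF-8 is strict, so a failed decode is a usable hint that data is not UTF-8.
try:
    ch.encode("latin-1").decode("utf-8")
    looks_like_utf8 = True
except UnicodeDecodeError:
    looks_like_utf8 = False
```

Trying UTF-8 first is a reasonable heuristic for that reason, but latin-1 accepts every possible byte, so a successful latin-1 decode proves nothing about the real encoding.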
@nikhilshingadiya7798 3 years ago ⁺¹
What happens if I make a mistake like "b'abcd'"? How do I get out of this?
@ReuvenLerner 3 years ago ⁺¹
If you made a mistake in creating a string or bytestring, then you should re-create it. Bytestrings, like all strings, are immutable in Python, so you cannot fix/change it.
By contrast, if your problem is that you forgot a quote mark (as in your example), then you can just go and fix your Python code, and rerun it! This is one of those things that shows how important it is to test your code before deploying it -- a lesson that we all learn from time to time.
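A brief sketch of the immutability point, along with two ways around it: building a new bytestring, or using the mutable `bytearray` type:

```python
b = b"abcd"
try:
    b[0] = 65                # bytes objects reject item assignment
    mutated = True
except TypeError:
    mutated = False

fixed = b"A" + b[1:]         # build a new bytestring instead
buf = bytearray(b)           # or copy into a mutable bytearray
buf[0] = 65
print(mutated, fixed, bytes(buf))
```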
@nikhilshingadiya7798 3 years ago
@@ReuvenLerner Thanks.
@peaceorazu 1 year ago ⁺¹
How were you able to switch to the Hebrew keyboard?
@ReuvenLerner 1 year ago
On the Mac, you can configure any number of keyboards, and then switch between them. On my computer, for example, I've got US English, Hebrew, and simplified Chinese, and move between them with command-shift-space. Your computer (if it's not a Mac) probably has something similar.
@angelicamatch211 6 years ago ⁺¹
What does <class 'bytes'> in Python 3 mean? I know that in Python 2 <type 'str'> means the data type is string.
@ReuvenLerner 3 years ago
In Python 2, strings contain bytes. If you used ASCII, then that was fine. But if you used Unicode, where a character can have more than one byte component, this wasn't so good. You could, in Python 2, use Unicode strings, which handled such characters.
In Python 3, it was turned around: Strings handle Unicode, and the new "bytes" type handles sequences of bytes without regard to their contents or whether they can be turned into characters. So if something says "class bytes" in Python 3, that means it's just a bunch of bytes. You'll see a "b" before the opening quote, indicating that. And the elements of a "bytes" sequence in Python are bytes, represented by integers.
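In Python 3 terms, the distinction looks roughly like this:

```python
b = b"abc"
s = "abc"
print(type(b), type(s))    # <class 'bytes'> <class 'str'>

first_byte = b[0]          # indexing bytes gives an integer
first_char = s[0]          # indexing a str gives a one-character str
one_byte_slice = b[0:1]    # slicing keeps the bytes type
```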
@angelicamatch211 6 years ago
I have a script written in Python 2 that I want to run in Python 3, but I get an error that Unicode can't decode characters at line 7414 in a cp1252 file. I get what you're saying in this video, but I still don't know how to actually solve it...
@ReuvenLerner 6 years ago
The bytes-vs-string and 2-vs-3 thing is confusing for many people. I'm going to try to put together another video, or maybe an e-mail course, on it. I'll announce it to my blog (blog.lerner.co.il) and mailing list (lerner.co.il/newsletter) when it's ready!
@angelicamatch211 6 years ago
Thank you for the email. I didn't know that such an involved, complex solution was needed. I hope you can post a video, because a paid course is not an option for me. If you need to see the code where I'm having the issue, the related file, and the actual error, I will email you soon. Thank you again, sir.
@ReuvenLerner 5 years ago ⁺¹
@@angelicamatch211 I've uploaded a much more complete and accurate video: ruclips.net/video/5zMmc1wmZ_0/видео.html
@user-dj9zb3bs1b 4 years ago
Can we open any file in 0s-and-1s form, or bit mode?
@ReuvenLerner 4 years ago
You can open any file in binary mode, and get the bytes. But if you're working with text, then you'll likely want to open it in character mode (i.e., the default), so as to get the characters, not the bytes.
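Python has no "bit mode" as such, but the bytes from binary mode can be formatted as 0s and 1s. A sketch using a throwaway file with a hypothetical name:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file
with open(path, "wb") as f:
    f.write("hi".encode("utf-8"))

with open(path, "rb") as f:        # binary mode: read() returns bytes
    data = f.read()

# Each byte is an integer from 0 to 255; format it as 8 binary digits.
bits = " ".join(format(byte, "08b") for byte in data)
print(bits)
```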
@angelicamatch211 6 years ago ⁺¹
When I try your example from the first 2 minutes of the video, it fails in Python 2 with the error "unsupported characters in the input", which, from what I've read online, makes sense, I guess, because Python 2 uses ASCII by default, so it's normal that this is out of range. I used an Arabic character, U+0626, and in Python 3 len(s) gives 8, which makes sense because it's in the 2-byte range, so it's using 2 bytes per character (goes to 32768 in the top left bit of the 2nd byte). (I'm not sure why in your video it still shows 4 as the length, if it's using 2 bytes per character.) So for the UnicodeDecodeError that I mentioned below, this leads me to believe my script needs to be told to encode/decode using the Unicode character set and UTF-8 to treat each character properly. So how can I check what my script is actually doing in Python 3 that's keeping it from working? And, more importantly, how do I set the script so that it runs with Unicode and UTF-8? Can you assist me here? Thank you for the consideration.
@angelicamatch211 6 years ago ⁺¹
I don't get the significance of this: in Python 2, b = b'abc' and type(b) gives <type 'str'>, and the same code in Python 3 gives <class 'bytes'>. Then in Python 2 print(b[0]) gives a, and the same code in Python 3 gives 97. I know this means that in Python 2 you get the actual text and in Python 3 you get the ASCII code number, so how does that play into my UnicodeDecodeError issue? I also read this: everything must be encoded before being written to disk and then decoded to be read by humans, and if you encode something in, let's say, ASCII, then you can decode it because you know it was encoded in ASCII, so you decode it from ASCII back to text. So for my UnicodeDecodeError, I would think that all the encoding was originally done in ASCII, because the script works fine in Python 2. Why, then, is my script trying to decode in cp1252? How can I get it to decode in ASCII in Python 3, if that's the correct logic?
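A likely explanation for the cp1252 errors in these comments, offered as an assumption rather than a diagnosis: when `open()` is called without an `encoding` argument, Python 3 falls back to a platform default, which on many Windows setups is cp1252. If the file actually holds UTF-8, some byte sequences have no cp1252 meaning and decoding fails. The sketch below reproduces that with a throwaway file and shows the fix of naming the encoding explicitly:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as f:
    f.write("שלום")

# Opening with the wrong encoding reproduces the kind of error described above.
try:
    with open(path, encoding="cp1252") as f:
        f.read()
    failed = False
except UnicodeDecodeError as e:
    failed = True
    print(e)

# Naming the real encoding fixes it, whatever the platform default is.
with open(path, encoding="utf-8") as f:
    text = f.read()
print(len(text))
```

This also bears on the length question: `len` of a Python 3 string counts characters (4 here), while the UTF-8 encoded form of the same string is 8 bytes long.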
@GgGg-zh3pl 2 years ago ⁺²
I'm watching the video and I see you write "shalom", and suddenly I'm all happy.
@ReuvenLerner 2 years ago ⁺¹
Always happy to hear from people who can read the Hebrew! All the best, and happy holiday!
@GgGg-zh3pl 2 years ago ⁺²
@@ReuvenLerner Happy holiday! Just a question: are you from Israel, or did you learn Hebrew?
@ReuvenLerner 2 years ago ⁺¹
@@GgGg-zh3pl I grew up in the US and moved to Israel at age 25, in 1995.
@GgGg-zh3pl 2 years ago ⁺²
@@ReuvenLerner Cool! I'm convinced you're a really good programmer, and also one of the best explainers I've seen.
@ReuvenLerner 2 years ago ⁺¹
@@GgGg-zh3pl Thank you very, very much!
@baselkelziye4552 4 years ago
Very useful, thanks.
@noname-sy9vi 4 years ago
How do I convert a DataFrame containing bytes into strings?
@ReuvenLerner 4 years ago ⁺¹
I'm not sure, I thought that pandas used Python strings, removing the problem (in theory).
@noname-sy9vi 4 years ago
Reuven Lerner, I tried this on a DataFrame, but it turns all the bytes like b'mydata' into NaN. The decoding method you showed only applies to a single bytes object, not to bytes in a DataFrame, right?
@olagerman 6 years ago ⁺¹
Why not use b.decode()?
@ReuvenLerner 3 years ago
Because I've learned a lot since I made this video. :-)
@clcartlidge 4 years ago ⁺¹
Thanks,
@chandansingh-he3bs 4 years ago
How do I convert .bin to .txt in Python 3?
@ReuvenLerner 3 years ago
I realize that you wrote this comment a while back, but I've just discovered how to see all unanswered comments. :-)
If you want to convert a binary file to a text file, then you could read from the binary file one chunk at a time, and then turn each of those chunks into strings using decode. The thing is, that's assuming that the binary file contains UTF-8 encoded bytes. If not, then you need to figure out what you want to convert, and how. And indeed, that's how many binary files work -- they aren't a straightforward translation to characters. So I'm afraid that there isn't one answer to your question, but Python *can* do such things.
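The chunk-by-chunk approach has one catch: a chunk boundary can fall in the middle of a multi-byte character. The standard library's incremental decoders handle that. A sketch, assuming (as the answer does) that the binary data is UTF-8 text; an in-memory buffer stands in for the .bin file:

```python
import codecs
import io

source = io.BytesIO("שלום, world".encode("utf-8"))   # stand-in for a .bin file
decoder = codecs.getincrementaldecoder("utf-8")()

pieces = []
while chunk := source.read(3):     # tiny chunks, to force split characters
    pieces.append(decoder.decode(chunk))
pieces.append(decoder.decode(b"", final=True))   # flush; raises if input was cut off

text = "".join(pieces)
print(text)
```

With a real file, `source` would be `open(path, "rb")`, the chunk size would be much larger, and `text` could be written out through `open(out_path, "w", encoding="utf-8")`.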
@fernandohood5542 4 years ago ⁺¹
How do I convert "10" to a byte 0A/00001010?
@osbaldotheVtenman 4 years ago ⁺¹
Use the bin() function and it should do the trick.
@ReuvenLerner 3 years ago
If you have the integer 10, and you want to get back a string containing its binary digits, then you can (as @SpaceFace102 mentioned a while back) use the "bin" function: bin(10) will return a string.
That's not quite a byte, though. If you want to get a byte based on the integer 10, you can just... use the integer 10! You can create a bytestring with bytes([10]). That'll return a one-element bytestring with '\n' in it. (It's not really \n, but it looks that way on the screen, etc.)
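Putting the pieces of this thread together, starting from the string "10" as in the question:

```python
n = int("10")                          # the text "10" -> the integer 10

binary_text = bin(n)                   # '0b1010'
padded = format(n, "08b")              # '00001010'
one_byte = bytes([n])                  # b'\n', since byte value 10 is the newline
also_one_byte = n.to_bytes(1, "big")   # same single byte, width stated explicitly
hex_text = one_byte.hex()              # '0a'

print(binary_text, padded, one_byte, hex_text)
```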
@ben_jammin242 1 year ago ⁺¹
s[0] = shin? Does Python know that the language is Hebrew?!
@ReuvenLerner 1 year ago
Python doesn't know or care about what language is being used. That's the beauty of Unicode; it can handle all characters, in all languages. Plus emojis, etc.
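Python does ship with the Unicode character database, though, so while it knows nothing about languages, it can report what Unicode says about an individual character. A small sketch using the `unicodedata` module, assuming the word from the video is שלום:

```python
import unicodedata

ch = "שלום"[0]                              # assumed to be the word from the video

name = unicodedata.name(ch)                 # the character's official Unicode name
direction = unicodedata.bidirectional(ch)   # 'R' marks a right-to-left character
print(name, direction)
```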
@omereli1062 6 years ago ⁺¹
שלום
