Hey, there's a newer and more accurate/complete video now available: ruclips.net/video/5zMmc1wmZ_0/видео.html . That video is part of my "Python standard library" series, to which I'll be adding a few videos every week.
This was wonderful way to present this topic--concise and clear. I learned a lot here.
I'm delighted to hear that it helped!
Best explanation of the topic I found yet, really appreciate your time to explain this.
So happy to hear it; thanks!
Simple and to the point. Thanks!!!
2:39 - is it a characteristic of the encoding you used that you did s[0] but it printed the character that looks like a 'w' and is actually last in the string?
Hebrew is written right-to-left! Because I'm using Jupyter (in my browser), it correctly displays the word from right to left, with the first character on the right and the final character on the left. So yes, s[0] will be the rightmost character, "shin," which does indeed look like a w, now that you mention it!
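A small sketch of the point above, assuming the word on screen is שלום (the variable name s comes from the question; the other names are only illustrative). Indexing follows the logical order in which the characters are stored, not the visual order in which they are drawn.

```python
s = 'שלום'   # stored in logical order: shin, lamed, vav, final mem

first = s[0]    # the first character typed; drawn rightmost on screen
last = s[-1]    # the final character; drawn leftmost on screen

print(first, last)
print(len(s))                    # number of characters
print(len(s.encode('utf-8')))    # number of bytes once encoded as UTF-8
```

The two lengths differ because each of these Hebrew letters takes two bytes in UTF-8, while len() on a string counts characters.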
Thank you! Couldn't have hoped for a better explanation.
Best explanation for this topic. Thank you.
This was FANTASTIC.
Subbed!! Can't wait to watch the next one.
Helped a lot, thanks.... anyone else watch this clip in covid lockdown?
Thank you so much, this video helped me with a problem haunting me for a week!
Delighted to hear it!
Thank you very much for such clear explanation on this topic.
Interesting. I actually understand all this explanation. Pretty proud of myself 🙂🙂🦆🦆
Excellent!
Thank you very much for this video, Reuven! It was very educational and I really enjoyed it! :)
Hi, I am working on creating a mainframe EBCDIC file which has COMP-3 columns (packed decimal) and regular data type columns. I am encoding the columns other than COMP-3 to cp037. I am trying to work out how to convert data to COMP-3 and encode it to cp037 like the other columns. Please suggest an approach.
I don't really know, sorry - I know that there are a ton of encodings you can use, declared at docs.python.org/3/library/codecs.html, but I've never touched EBCDIC, and certainly don't know about comp-3. I hope you can find some help on this!
Thank you. Simple explanation.
Great tutorial. Thanks a lot!
thank you. best explanation. loved it.
Delighted to hear you enjoyed it!
0:24 No, you're wrong, there is a reason the binascii library exists...
Sorry, I just went through the video again in full, and I'm not sure why it's able to print out the 2-byte characters if you set them to a variable. Also, from around 5 minutes in until the end is where it gets confusing. I'm not sure if I should try the method of setting it to a variable as you did, and put that in my code at the lines where I get the "can't decode byte" error in cp1252. Why is cp1252 involved in my script's decode issues anyway? Thanks, but I'm still very lost on this, even though I feel I'm actually understanding the different encoding schemes.
I'm still getting an error when I try to save to a file with file.write(number). From 1 I am getting 0x31, which is ASCII for "1", but I want to save 0x01.
The "write" method takes a string, not a number. (If you open with "wb", then it can take a bytestring.) So make sure that you're passing a string to the method when you try to write to the file.
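A minimal sketch of the difference, assuming the goal is a file containing the single byte 0x01. The file name number.bin and the variable names are made up for illustration.

```python
import os
import tempfile

number = 1

# Hypothetical file name, placed in a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), 'number.bin')

# Text mode: write() needs a str, so the digit "1" is stored as byte 0x31
with open(path, 'w', encoding='ascii') as f:
    f.write(str(number))
with open(path, 'rb') as f:
    as_text = f.read()

# Binary mode: write() needs bytes; bytes([1]) is the single byte 0x01
with open(path, 'wb') as f:
    f.write(bytes([number]))
with open(path, 'rb') as f:
    as_binary = f.read()

print(as_text, as_binary)
```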
Excuse me sir, I have a question: what method can I use to convert b'\x00 \x00' to a string, so that it becomes '00'?
You're probably best off with a comprehension that goes through each of the integers in a bytestring and turns it into a string, then joins them together. There isn't, so far as I know, any method that'll do this for you.
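Here is a rough sketch of such a comprehension. The question doesn't say what should happen to the space byte in the middle, so the three variants below are guesses at what was wanted, and the variable names are illustrative.

```python
data = b'\x00 \x00'   # three bytes: 0, 32 (a space), 0

# Each element of a bytestring is an integer, so convert each one and join
as_decimal = ''.join(str(byte) for byte in data)

# The same idea, formatting each byte as two hexadecimal digits
as_hex = ''.join(f'{byte:02x}' for byte in data)

# Skipping the space byte, in case '00' was the result wanted
zeros_only = ''.join(str(byte) for byte in data if byte != 0x20)

print(as_decimal, as_hex, zeros_only)
```

For the hexadecimal form specifically, data.hex() gives the same string as the second comprehension.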
Nice tutorial. Love it.
I am still trying to print a character in the 2-byte range in Python 2, for example an Arabic character (which for some reason I can't copy and paste here, I don't know why). It's the U+0626 character, and I can't encode it so that it will print. I thought you could do .encode('utf-8'), or u (and the U+0626), and it would print, but I keep getting "unsupported character".
Sir, how do I find the number of bytes in Python (the attribute value of a series)? How does the output give values like 32 bytes, etc.? Please explain.
Sir, how can we convert UTF-8 code into a character?
Hello, sir, I need help. I want to remove the non-printable characters from an XML file, because parsing the XML into a Python dictionary object shows an error: xml.parsers.expat.ExpatError: reference to invalid character number: line 19495, column 31
Try opening the file with an encoding of utf-?
Thanks, very valuable.
Hi, how do I get rid of Unicode errors while reading an Excel sheet in Python?
If you use Pandas to read your Excel sheets, you should be fine. And if not... then I'm not sure what to say. There are numerous libraries for reading Excel in Python; you almost certainly don't want to be doing it yourself.
saw שלום subbed
maybe i'll find this out before you can answer, but how can you use python to write your own binary encodings? I want to assign 4-bit binary codes to a set of
I'm sure that you can do this but... OMG, that sounds really hard and annoying. Unless memory is at a *huge* premium, or you're just looking for an intellectual challenge, I would stick with existing standards.
@@ReuvenLerner The program I'm writing has a bottleneck caused by reading and writing speed. If I can cut those down using custom codes that are 4 instead of 8 bits, woop woop, I might have a game changer.
@@LordBurningStuff Ooh, very interesting! You're basically defining a new encoding, which is fine, but (so far as I know) Python is hard-coded to deal with a bunch of others, and can't be extended to use new, custom ones without recompilation. I might be wrong, though,
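For what it's worth, packing 4-bit codes doesn't have to go through Python's codec machinery at all; two plain functions over bytes are enough. The sketch below is only one way to do it: the symbol table, the function names pack and unpack, and the choice to return the symbol count alongside the packed bytes are all assumptions made for illustration.

```python
# Hypothetical 4-bit code table: at most 16 symbols fit in 4 bits
SYMBOLS = 'ACGT-N'
CODE = {symbol: index for index, symbol in enumerate(SYMBOLS)}

def pack(text):
    """Pack two 4-bit codes into each byte; return (bytes, symbol count)."""
    codes = [CODE[ch] for ch in text]
    if len(codes) % 2:
        codes.append(0)          # pad the final byte with a zero nibble
    packed = bytes((hi << 4) | lo for hi, lo in zip(codes[::2], codes[1::2]))
    return packed, len(text)

def unpack(packed, count):
    """Reverse pack(); count tells us whether the last nibble is padding."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)      # high nibble
        codes.append(byte & 0x0F)    # low nibble
    return ''.join(SYMBOLS[code] for code in codes[:count])

packed, count = pack('GATTACA')
print(packed.hex(), unpack(packed, count))
```

Seven symbols fit in four bytes here instead of seven. Whether that actually helps with an I/O bottleneck would need measuring, since the packing itself costs CPU time.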
thanks a lot. Is there another way?
Hi Reuven, I am working on a Python program (open source) that I will use to search the Hebrew text of the Tanakh, about 2.2 MB of data. I have written small programs in Python before. Because the text I have is in Unicode, it looks like I have to learn a bit. Could you recommend any libraries and methods for storing the Hebrew text? I was thinking of SQLite, but maybe that is overkill. I hope to index the location of every word in the text.
I'm a big fan of PostgreSQL, which handles Unicode and text searches beautifully. But learning to work with it might be overkill for your needs. That said, if you're using Python 3 and any modern version of PostgreSQL, you should be fine, and the Unicode stuff should just work automatically. The best part of using PostgreSQL is that you can do all sorts of text indexing, which I *think* will work with Hebrew.
You might also want to check with Sefaria, which is a public and free version of the Tanakh and other Jewish texts. I don't know if their software is open source, but if it is, you might be able to leverage it.
The coolest part of what you wrote is the fact that the entire Hebrew Bible is only 2.2 MB. Wow, talk about a small book having a big influence!
@@ReuvenLerner Thanks for the reply. I am using Python 3.8. I will look into the text you spoke of, because the text I have has a copyright on it, which is a problem for me. Also, I thought I would try SQLite; I have had a little experience with it. I have been able to read and display a Unicode file of the Hebrew text and maintain the Hebrew font. So it's true, as you said: it all seems to work. You will see in the following that I make one reference to Unicode. I cut and pasted the text from a Word doc to a txt file, and it held the Hebrew text formatting when using print(). Andrew
f_name = 'genesis.txt'
# Open as text, decoding the file's bytes as UTF-8; "with" closes the file afterwards
with open(f_name, "r", encoding="UTF-8") as f:
    contents = f.read()
print(contents)
How funny, I was looking for an explanation of something, went into your video, and suddenly you write "shalom" in Hebrew. What a fun surprise!
Good to hear!
The man despises you though.
Hi. Could you please explain how to find the CRC32 of data in Python?
I haven't ever used it myself, but the Python standard library includes a function that calculates crc32: docs.python.org/3/library/zlib.html#zlib.crc32
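A short sketch of zlib.crc32 in use. The input b'123456789' is the conventional test string for CRC-32; the variable names are only illustrative.

```python
import zlib

data = b'123456789'

# crc32 works on bytes, so encode any str first; it returns an unsigned int
checksum = zlib.crc32(data)
print(f'{checksum:#010x}')

# For large inputs, feed chunks and pass the running value along
running = 0
for chunk in (b'1234', b'56789'):
    running = zlib.crc32(chunk, running)
print(running == checksum)
```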
@@ReuvenLerner thank you very much.
Super interesting. But I didn't get how I can write a function that can read and properly encode any input from any language... Is that even possible?
If you use Python strings, with Unicode characters, then yes -- you can work with any language.
@@ReuvenLerner So when I read a command's output or a file, should I encode what I read and then decode it when I print the output, or print it as a bytes type?
@@alessioandreoli2145 When you read a command from the user, or text from a file, the text is in Unicode unless you specify otherwise. That means the text can/will contain any characters known to Unicode, which is basically anything you're likely to get. No bytes are involved. If and when you want, you can encode from Unicode strings into bytes, but there's rarely a need to do this unless you're dealing with binary data or the network.
Hello, I have a question: what are the main encodings in use? Ones like latin-1, utf-8, utf-16, utf-32? Decoding seems hard to do, but is it really, if you just stick to the mainstream encodings? There is a pretty big chance that you can get it done.
Most people are currently using UTF-8, so far as I know. But the world is a big, messy place, and there are a lot of documents in other encodings, too.
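To make the difference between these encodings concrete, here is a small sketch that encodes one character four ways. The character é is an arbitrary choice for illustration.

```python
text = 'é'   # one character, U+00E9

# The same character becomes different bytes under each encoding
for encoding in ('latin-1', 'utf-8', 'utf-16-le', 'utf-32-le'):
    print(encoding, text.encode(encoding))

# Decoding with the wrong encoding does not always fail; it can silently
# produce the wrong characters instead
wrong = text.encode('utf-8').decode('latin-1')
print(wrong)
```

That last step is why guessing an encoding is risky: nothing raises an error, but the text is no longer what was written.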
What happens if I make a mistake like "b'abcd'"? How do I get out of this?
If you made a mistake in creating a string or bytestring, then you should re-create it. Bytestrings, like all strings, are immutable in Python, so you cannot fix/change it.
By contrast, if your problem is that you forgot a quote mark (as in your example), then you can just go and fix your Python code, and rerun it! This is one of those things that shows how important it is to test your code before deploying it -- a lesson that we all learn from time to time.
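A brief sketch of what "immutable" means in practice, and two ways around it. The replacement byte 65 (an uppercase A) is just an example.

```python
b = b'abcd'

try:
    b[0] = 65                 # bytes objects do not support item assignment
except TypeError as error:
    print('Cannot modify:', error)

# Build a new bytestring instead of changing the old one
fixed = b'A' + b[1:]

# If in-place edits are really needed, bytearray is the mutable counterpart
mutable = bytearray(b)
mutable[0] = 65
print(fixed, bytes(mutable))
```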
@@ReuvenLerner thanks
How were you able to change to the Hebrew keyboard?
On the Mac, you can configure any number of keyboards, and then switch between them. On my computer, for example, I've got US English, Hebrew, and simplified Chinese, and move between them with command-shift-space. Your computer (if it's not a Mac) probably has something similar.
What does <class 'bytes'> in Python 3 mean? I know that <type 'str'> in Python 2 means the data type is string.
In Python 2, strings contained bytes. If you used ASCII, then that was fine. But if you used Unicode, where a character can have more than one byte component, this wasn't so good. You could, in Python 2, use Unicode strings, which handled such characters.
In Python 3, it was turned around: Strings handle Unicode, and the new "bytes" type handles sequences of bytes without regard to their contents or whether they can be turned into characters. So if something says "class bytes" in Python 3, that means it's just a bunch of bytes. You'll see a "b" before the opening quote, indicating that. And the elements of a "bytes" sequence in Python are bytes, represented by integers.
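A quick Python 3 sketch of the points above, using b'abc' as an arbitrary example value.

```python
b = b'abc'

print(type(b))      # the bytes type
print(b[0])         # indexing gives an integer
print(b[0:1])       # slicing gives a bytes object
print(list(b))      # all the byte values as integers

s = b.decode('ascii')           # bytes -> str
print(type(s), s)
print(s.encode('ascii') == b)   # str -> bytes round trip
```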
I have a script written in Python 2 that I want to run in Python 3, but I get an error: Unicode can't decode characters at line 7414 in the cp1252 file. I get what you're saying in this video, but I still don't know how to actually solve it...
The bytes-vs-string and 2-vs-3 thing is confusing for many people. I'm going to try to put together another video, or maybe an e-mail course, on it. I'll announce it to my blog (blog.lerner.co.il) and mailing list (lerner.co.il/newsletter) when it's ready!
Thank you for the email. I didn't know that such an involved, complex solution was needed. I hope you can post a video, because a paid course is not an option for me. If you need to see the code where I'm having the issue, the related file, and the actual error, I will email you soon. Thank you again, sir.
@@angelicamatch211 I've uploaded a much more complete and accurate video: ruclips.net/video/5zMmc1wmZ_0/видео.html
Can we open any file in 0s-and-1s form, or bit mode?
You can open any file in binary mode, and get the bytes. But if you're working with text, then you'll likely want to open it in character mode (i.e., the default), so as to get the characters, not the bytes.
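A minimal sketch of the two modes side by side. The file name greeting.txt and its contents are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

# Hypothetical file, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('שלום')

# Binary mode: read() returns bytes, exactly as stored on disk
with open(path, 'rb') as f:
    raw = f.read()

# Text mode: read() returns str, decoded with the encoding you name
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# The "0s and 1s" view is just each byte formatted in base 2
bits = ' '.join(f'{byte:08b}' for byte in raw)

print(raw, text, bits, sep='\n')
```

If the encoding argument is left out, text mode falls back to a platform-dependent default, which on many Windows installations is cp1252. That is a common source of "can't decode byte" errors that mention cp1252, so naming the encoding explicitly is usually the safer choice.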
When I try your example from the first 2 minutes of the video, it fails in Python 2 with the error "unsupported characters in the input", which from my online research makes sense, I guess, because Python 2 uses ASCII by default, so it's normal that this is out of range. I used Arabic characters like U+0626, and in Python 3 len(s) gives 8, which makes sense because it's in the 2-byte range, so it's using 2 bytes per character (goes to 32768 in the top left bit of the 2nd byte). (I'm not sure why in your video it still shows 4 as the length in characters, if it's using 2 bytes per character.) So for the UnicodeDecodeError that I mentioned below, it leads me to believe my script needs to be told to encode/decode using Unicode character sets and UTF-8, to treat each character properly. So how can I check what my script is actually doing in Python 3 that's not making it work? But more importantly, I would say: how do I set the script so that it's running Unicode and UTF-8? Can you assist me here? Thank you for the consideration.
I don't get the significance of this: in Python 2, b = b'abc' and type(b) gives type 'str', and the same code in Python 3 gives class 'bytes'. Then in Python 2, print(b[0]) gives a, and the same code in Python 3 gives 97. I know this means that in Python 2 you get the actual text and in Python 3 you get the ASCII number code. So how does that play into my UnicodeDecodeError issue? I also read this: everything must be encoded before being written to disk and then decoded to be read by humans, and if you encode something in, let's say, ASCII, then you can decode it because you know it was encoded in ASCII, and so you decode it from ASCII back to text. But for my Unicode decode error, I would think that all the encoding was originally done in ASCII, because the script works fine in Python 2. So why is my script trying to decode in cp1252? How can I get it to decode in ASCII in Python 3, if that's the correct logic?
I'm watching the video and I see you writing "shalom," and suddenly I'm all happy.
Always happy to hear from people who can read the Hebrew! All the best, and happy holiday!
@@ReuvenLerner Happy holiday! Just one question: are you from Israel, or did you learn Hebrew?
@@GgGg-zh3pl I grew up in the US, and moved to Israel at age 25, in 1995.
@@ReuvenLerner Cool! I'm convinced that you're a really good programmer, and also one of the best explainers I've seen.
@@GgGg-zh3pl Thank you very, very much!
very useful, thanks
How do I convert a dataframe containing bytes into strings?
I'm not sure, I thought that pandas used Python strings, removing the problem (in theory).
Reuven Lerner, I tried it for a dataframe, but it turns all the bytes (b'mydata') into NaN. The decoding method you showed is only applicable to a single bytes object, not to bytes in a dataframe, right?
Why not use b.decode()?
Because I've learned a lot since I made this video. :-)
Thanks,
How to convert .bin to .txt in Python 3?
I realize that you wrote this comment a while back, but I've just discovered how to see all unanswered comments. :-)
If you want to convert a binary file to a text file, then you could read from the binary file one chunk at a time, and then turn each of those chunks into strings using decode. The thing is, that's assuming that the binary file contains UTF-8 encoded bytes. If not, then you need to figure out what you want to convert, and how. And indeed, that's how many binary files work -- they aren't a straightforward translation to characters. So I'm afraid that there isn't one answer to your question, but Python *can* do such things.
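A rough sketch of the chunk-at-a-time approach, assuming (as above) that the bytes really are UTF-8 text. One wrinkle: a chunk boundary can fall in the middle of a multi-byte character, so this version uses an incremental decoder from the codecs module, which holds on to a partial character until the rest arrives. The function name decode_in_chunks and the tiny chunk size are made up for illustration.

```python
import codecs
import io

def decode_in_chunks(binary_stream, encoding='utf-8', chunk_size=4096):
    """Read bytes chunk by chunk and return the decoded text."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            break
        # A partial character at the end of a chunk is buffered, not an error
        pieces.append(decoder.decode(chunk))
    # final=True raises an error if the input ended mid-character
    pieces.append(decoder.decode(b'', final=True))
    return ''.join(pieces)

# Stand-in for a file opened with open(path, 'rb')
stream = io.BytesIO('שלום'.encode('utf-8'))
result = decode_in_chunks(stream, chunk_size=3)   # 3 splits 2-byte characters
print(result)
```

In real use, each decoded piece could be written straight to a text file opened with an explicit encoding, rather than collected in a list.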
How do i convert "10" to a byte 0A/00001010?
use the bin() function and it should do the trick
If you have the integer 10, and you want to get back a string containing its binary digits, then you can (as @SpaceFace102 mentioned a while back) use the "bin" function: bin(10) will return a string.
That's not quite a byte, though. If you want to get a byte based on the integer 10, you can just... use the integer 10! You can create a bytestring with bytes([10]). That'll return a one-element bytestring with '\n' in it. (It's not really \n, but it looks that way on the screen, etc.)
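Putting those pieces together in a small sketch, starting from the string "10" as in the question. The variable names are only illustrative.

```python
n = int('10')               # the string "10" becomes the integer 10

print(bin(n))               # a str of binary digits, with a 0b prefix
print(format(n, '08b'))     # zero-padded to eight binary digits
print(format(n, '02X'))     # two hexadecimal digits

one_byte = bytes([n])       # a one-element bytestring
print(one_byte)             # shown with a newline escape, since byte 10 is a newline
print(n.to_bytes(1, 'big') == one_byte)
```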
s[0] = shin? Does python know that the language is Hebrew?!
Python doesn't know or care about what language is being used. That's the beauty of Unicode; it can handle all characters, in all languages. Plus emojis, etc.
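As a small illustration, the standard unicodedata module can report each character's code point and official Unicode name. The word שלום and the emoji are just sample inputs.

```python
import unicodedata

s = 'שלום'

# Each character is just a code point; the name comes from the Unicode
# database, not from any notion of the string's language
for ch in s:
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

emoji = '🙂'
print(f'U+{ord(emoji):04X}', unicodedata.name(emoji))
```

The string itself carries no language tag; s[0] is "shin" only because that is the character stored at index 0.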
Shalom (שלום)