For my spaCy playlist, see: ruclips.net/p/PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo
doc.sents is not working for me in spaCy. Is there an alternative?
@@XYZ-lg3qt It should work. Are you using spacy.blank("en")? If not, what pipeline are you using?
@@python-programming
import spacy
import codecs
f = codecs.open("./Alice.txt" , "r","utf-8")
text = f.read()
chapters = text.split("CHAPTER ")[1:]
chapter1 = chapters[0]
nlp = spacy.load("en_core_web_lg")
doc = nlp(chapter1)
sentences = list(doc.sents)
-> Up to this point it's working now, but 'list(doc.sents)' is not working for me
-> Earlier, the way you opened the text file was also giving me an error, which was as follows:
UnicodeDecodeError Traceback (most recent call last)
Cell In [9], line 5
2 import codecs
3 with open ("./Alice.txt","r") as f:
4 # f = codecs.open("./Alice.txt" , "r","utf-8")
----> 5 text = f.read()
6 chapters = text.split("CHAPTER ")[1:]
7 chapter1 = chapters[0]
@@XYZ-lg3qt You need to say encoding="utf-8" when opening the file.
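For example, a minimal sketch of that (assuming Alice.txt sits next to the script and en_core_web_lg is installed):

import spacy

# Passing encoding="utf-8" to the built-in open() avoids the UnicodeDecodeError
# without needing the codecs module.
with open("./Alice.txt", "r", encoding="utf-8") as f:
    text = f.read()

chapters = text.split("CHAPTER ")[1:]
chapter1 = chapters[0]

nlp = spacy.load("en_core_web_lg")  # its parser is what sets sentence boundaries
doc = nlp(chapter1)
sentences = list(doc.sents)
print(len(sentences), sentences[0])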
Amazing series. I've learnt so much in so little time. Thank you.
Thanks! No problem! Glad I could help!!
Thank you, this is really helpful.
Glad it was helpful!
Awesome tutorial Sir.
Sir, can you make a video about textcat multi-labelling? I can't find one on RUclips. Thank you 😅
Good to know! I will do that today or next week.
Can I also please know if there is a way to check the accuracy of this splitting? Just like how a neural model's accuracy score can be printed?
If I'm feeding in a large text document, then I would need a way to be sure about the correctness of the splitting.
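One simple way (a sketch, not from the video): hand-label the sentence starts for a small sample and compare them with what spaCy predicts. The sample text below is just for illustration; swap in a passage from your own document.

import spacy

# Gold sentences you have checked by hand for a small sample of the text.
gold_sentences = [
    "Alice was beginning to get very tired.",
    "So she considered in her own mind.",
    "Suddenly a White Rabbit ran by.",
]
sample = " ".join(gold_sentences)
gold_starts = {sample.index(s) for s in gold_sentences}

nlp = spacy.load("en_core_web_sm")
doc = nlp(sample)
pred_starts = {sent.start_char for sent in doc.sents}

# Precision: how many predicted boundaries are real; recall: how many real ones were found.
true_pos = len(pred_starts & gold_starts)
print(f"precision={true_pos / len(pred_starts):.2f} recall={true_pos / len(gold_starts):.2f}")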
I am getting this error... any reason why?
Expected a string or 'Doc' as input, but got: .
What if the text has dots in between, like for instance U.K or Rs.? This piece of code is treating those as the end of a sentence. Could you please let me know how to handle this?
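In case it helps, one possible approach (a sketch, not from the video): register the abbreviation as a tokenizer special case so its period stays attached to the token; a rule-based sentencizer then won't split there. This assumes "Rs." isn't already covered by the default English exceptions; the parser in en_core_web_sm/lg usually handles common abbreviations on its own.

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Keep "Rs." as one token so the sentencizer does not treat its period as a sentence end.
nlp.tokenizer.add_special_case("Rs.", [{ORTH: "Rs."}])

doc = nlp("The ticket costs Rs. 500 per person. It is cheaper online.")
for sent in doc.sents:
    print(sent.text)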
I've checked many of the sentences and found that even after the replace operation to remove single and double line breaks, some of the sentences are still not complete. Probably need to load the large model for this, right? e.g. sentence 27. Another thing I noted: when I'm replacing only the double line breaks, most of the sentences are correctly identified, but when replacing the single line breaks too along with the double ones, many of the sentences break before they are complete. Is it really necessary to have the single line breaks removed as well? How do I address this? Could you please help?
This is interesting. I read your comment a few times and I cannot figure out why this would have occurred. Are you using the text I linked to in this series?
@@python-programming Yes, I'm using the Alice in Wonderland text, but the link you provided was not working. So I downloaded this - drive.google.com/file/d/1t5J19VyMXBDXK78AqSPcuLMfauNrMLdo/view?usp=sharing - and cleaned it up like the one you're using.
@@rahuldey6369 Thanks for letting me know that the link was dead. Here is the same text but with an updated link. I've just updated the description too. www.gutenberg.org/files/11/11-0.txt
@@python-programming Thank you for your kind reply. It's working better now
@@rahuldey6369 Yay! I am glad.
Hi, may peace be upon you. I am a student of computer science. I am watching your videos and practicing along with you. I have downloaded everything, but it is still giving me this error:
OSError: [E050] Can't find model 'em_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
Would you please help me with that? I will be very thankful to you.
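For reference, a sketch of the usual fix (assuming the name in your code matches the one in the error): the package is "en_core_web_sm" (en, not em), and it has to be downloaded once before spacy.load() can find it.

# From a terminal: python -m spacy download en_core_web_sm
# Or from Python:
import spacy

spacy.cli.download("en_core_web_sm")  # one-time download
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)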
Hi! How are you doing?
How can I activate multiprocessing and use doc.sents to get sentences?
doc = nlp.pipe(StrEncoded.encode("Latin-1","replace").decode("Latin-1"),n_process=4, batch_size=2000)
Because only when using .pipe do I have the option to use n_process and batch_size... but I can't use doc.sents to separate sentences 😔
And the processing is too slow without this... (using the code below)
doc = nlp(StrEncoded.encode("Latin-1","replace").decode("Latin-1"))
for sent in doc.sents:
    #print(sent.text)
    text.append(sent.text)
return text
Thanks for the help!!!
Hi! I'm on the go at the moment, so I am providing a link to another person's video. He works at spaCy. When you use nlp.pipe() you get back a stream of docs, so call the result docs instead of doc. Then iterate over each one, i.e. for doc in docs, and grab each sentence from each doc.
ruclips.net/video/OoZ-H_8vRnc/видео.html
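A minimal sketch of that pattern (the texts list here is hypothetical; nlp.pipe() expects an iterable of strings rather than one big string):

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["First document. It has two sentences.",
         "Second document. It also has two."]

sentences = []
# nlp.pipe() yields one Doc per input text; iterate over the docs, then their sentences.
# Note: with n_process > 1 you may need an if __name__ == "__main__": guard on Windows/macOS.
for doc in nlp.pipe(texts, n_process=4, batch_size=2000):
    for sent in doc.sents:
        sentences.append(sent.text)
print(sentences)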
@@python-programming Thank you so much! I'll tell you if I solve the problem!
Is the model trained by you, or is it pre-integrated in Python?
My sentence index ends up different than yours at the end: the result of the code above and print(sentences[2]) is "There was nothing so _very_ remarkable in that;....". I'm pretty sure that I edited the file the same way you did, so that the very first line is "CHAPTER I. Down the Rabbit-Hole" with a blank line between that and the next sentence. However, when I print(sentences[0]), the title appears on the same line as the sentence, e.g. "I. Down the Rabbit-Hole Alice was beginning....". Any idea why this would be?
+1
It's because 'sentences = list(doc.sents)' is not working for some reason.
If anyone finds an answer, please reply to me.
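One common cause (an assumption, since the pipeline isn't shown here): doc.sents only works if some component has set sentence boundaries, so a bare spacy.blank("en") pipeline raises an error until you add a sentencizer, while en_core_web_sm/lg get boundaries from their parser. A quick diagnostic sketch:

import spacy

nlp = spacy.load("en_core_web_lg")  # or whichever pipeline you are using
print(nlp.pipe_names)               # should include "parser" (or "senter"/"sentencizer")

if not any(p in nlp.pipe_names for p in ("parser", "senter", "sentencizer")):
    nlp.add_pipe("sentencizer")

doc = nlp("One sentence here. Another one here.")
print(list(doc.sents))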
That's basically paragraph segmentation.
def clean_whitespace(text):
    return ' '.join(text.split())

def clean_whitespace(text):
    import regex as re
    return re.sub(r'\s+', ' ', text.strip())
Either of these is better than that ugly line 4 nonsense:
text = f.read().replace("
", " ").replace("
", " ")
Could just replace it with:
text = f.read()
text = clean_whitespace(text)
or
text = clean_whitespace(f.read())
The main benefit would be readability and dealing with consecutive, leading, or trailing spaces which your code does not do.
Haven't seen other videos here, but I'll probably avoid it now after witnessing that.
I was mainly looking for some parallelization ideas using a CUDA context with SpaCy (which they don't seem to support well).
Fair point. Thanks for leaving this better code here for others.