For my spaCy playlist, see: ruclips.net/p/PL2VXyKi-KpYvuOdPwXR-FZfmZ0hjoNSUo
doc.sents is not working for me in spaCy. Is there an alternative?
@@XYZ-lg3qt It should work. Are you using spacy.blank("en")? If not, what pipeline are you using?
@@python-programming
import spacy
import codecs
f = codecs.open("./Alice.txt" , "r","utf-8")
text = f.read()
chapters = text.split("CHAPTER ")[1:]
chapter1 = chapters[0]
nlp = spacy.load("en_core_web_lg")
doc = nlp(chapter1)
sentences = list(doc.sents)
-> Up to this point it's working now, but 'list(doc.sents)' is not working for me
-> Earlier, the way you opened the text file was also giving me an error, which was as follows:
UnicodeDecodeError Traceback (most recent call last)
Cell In [9], line 5
2 import codecs
3 with open ("./Alice.txt","r") as f:
4 # f = codecs.open("./Alice.txt" , "r","utf-8")
----> 5 text = f.read()
6 chapters = text.split("CHAPTER ")[1:]
7 chapter1 = chapters[0]
@@XYZ-lg3qt You need to say encoding="utf-8" when opening the file.
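For example, a minimal sketch of that (assuming Alice.txt sits next to the script and en_core_web_lg is installed):

import spacy

# Passing encoding="utf-8" to the built-in open() avoids the UnicodeDecodeError
# without needing the codecs module.
with open("./Alice.txt", "r", encoding="utf-8") as f:
    text = f.read()

chapters = text.split("CHAPTER ")[1:]
chapter1 = chapters[0]

nlp = spacy.load("en_core_web_lg")  # its parser is what sets sentence boundaries
doc = nlp(chapter1)
sentences = list(doc.sents)
print(len(sentences), sentences[0])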
Amazing series. I've learnt so much in so little time. Thank you.
Thanks! No problem! Glad I could help!!
Thank you, this is really helpful.
Glad it was helpful!
Awesome tutorial Sir.
Sir, can you make a video about textcat multi-labelling? I can't find one on RUclips. Thank you 😅
Good to know! I will do that today or next week.
Can I also please know if there is a way to check the accuracy of this splitting? Just like how a neural model's accuracy score can be printed?
If I'm feeding in a large text document, then I would need a way to be sure about the correctness of the splitting.
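One simple way (a sketch, not from the video): hand-label the sentence starts for a small sample and compare them with what spaCy predicts. The sample text below is just for illustration; swap in a passage from your own document.

import spacy

# Gold sentences you have checked by hand for a small sample of the text.
gold_sentences = [
    "Alice was beginning to get very tired.",
    "So she considered in her own mind.",
    "Suddenly a White Rabbit ran by.",
]
sample = " ".join(gold_sentences)
gold_starts = {sample.index(s) for s in gold_sentences}

nlp = spacy.load("en_core_web_sm")
doc = nlp(sample)
pred_starts = {sent.start_char for sent in doc.sents}

# Precision: how many predicted boundaries are real; recall: how many real ones were found.
true_pos = len(pred_starts & gold_starts)
print(f"precision={true_pos / len(pred_starts):.2f} recall={true_pos / len(gold_starts):.2f}")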
I am getting this error... any reason why?
Expected a string or 'Doc' as input, but got: .
What if the text has dots in between, like for instance U.K or Rs.? This piece of code is treating those as the end of a sentence. Could you please let me know how to handle this?
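In case it helps, one possible approach (a sketch, not from the video): register the abbreviation as a tokenizer special case so its period stays attached to the token; a rule-based sentencizer then won't split there. This assumes "Rs." isn't already covered by the default English exceptions; the parser in en_core_web_sm/lg usually handles common abbreviations on its own.

import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Keep "Rs." as one token so the sentencizer does not treat its period as a sentence end.
nlp.tokenizer.add_special_case("Rs.", [{ORTH: "Rs."}])

doc = nlp("The ticket costs Rs. 500 per person. It is cheaper online.")
for sent in doc.sents:
    print(sent.text)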
I've checked many of the sentences and found that even after the replace operation to remove single and double line breaks, some of the sentences are still not complete. Probably need to load the large model for this, right? e.g. sentence 27. Another thing I noted: when I'm replacing only the double line breaks, most of the sentences are correctly identified, but when replacing the single line breaks too along with the double ones, many of the sentences break before they are complete. Is it really necessary to have the single line breaks removed as well? How do I address this? Could you please help?
This is interesting. I read your comment a few times and I cannot figure out why this would have occurred. Are you using the text I linked to in this series?
@@python-programming Yes, I'm using the Alice in Wonderland text, but the link you provided was not working. So I downloaded this - drive.google.com/file/d/1t5J19VyMXBDXK78AqSPcuLMfauNrMLdo/view?usp=sharing - and cleaned it up like the one you're using.
@@rahuldey6369 Thanks for letting me know that the link was dead. Here is the same text but with an updated link. I've just updated the description too. www.gutenberg.org/files/11/11-0.txt
@@python-programming Thank you for your kind reply. It's working better now
@@rahuldey6369 Yay! I am glad.
Hi, may peace be upon you. I am a student of computer science. I am watching your videos and practicing along with you. I have downloaded everything, but it is still giving me this error:
OSError: [E050] Can't find model 'em_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
Would you please help me with that? I will be very thankful to you.
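For reference, a sketch of the usual fix (assuming the name in your code matches the one in the error): the package is "en_core_web_sm" (en, not em), and it has to be downloaded once before spacy.load() can find it.

# From a terminal: python -m spacy download en_core_web_sm
# Or from Python:
import spacy

spacy.cli.download("en_core_web_sm")  # one-time download
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)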
Hi! How are you doing?
How can I activate multiprocessing and use doc.sents to get sentences?
doc = nlp.pipe(StrEncoded.encode("Latin-1","replace").decode("Latin-1"),n_process=4, batch_size=2000)
Because only when using .pipe do I have the option to use n_process and batch_size... but I can't use doc.sents to separate sentences 😔
And the processing is too slow without this... (using the code below)
doc = nlp(StrEncoded.encode("Latin-1","replace").decode("Latin-1"))
for sent in doc.sents:
    #print(sent.text)
    text.append(sent.text)
return text
Thanks for the help!!!
Hi! I'm on the go at the moment, so I am providing a link to another person's video. He works at spaCy. When you use nlp.pipe() you get back a stream of docs, so call the result docs instead of doc. Then iterate over each one, i.e. for doc in docs, and grab each sentence from each doc.
ruclips.net/video/OoZ-H_8vRnc/видео.html
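A minimal sketch of that pattern (the texts list here is hypothetical; nlp.pipe() expects an iterable of strings rather than one big string):

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["First document. It has two sentences.",
         "Second document. It also has two."]

sentences = []
# nlp.pipe() yields one Doc per input text; iterate over the docs, then their sentences.
# Note: with n_process > 1 you may need an if __name__ == "__main__": guard on Windows/macOS.
for doc in nlp.pipe(texts, n_process=4, batch_size=2000):
    for sent in doc.sents:
        sentences.append(sent.text)
print(sentences)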
@@python-programming Thank you so much! I'll tell you if I solve the problem!
Is the model trained by you, or is it pre-integrated in Python?
My sentence index ends up different than yours at the end: the result of the code above and print(sentences[2]) is "There was nothing so _very_ remarkable in that;....". I'm pretty sure that I edited the file the same way you did, so that the very first line is "CHAPTER I. Down the Rabbit-Hole" with a blank line between that and the next sentence. However, when I print(sentences[0]), the title appears on the same line as the sentence, e.g. "I. Down the Rabbit-Hole Alice was beginning....". Any idea why this would be?
+1
It's because 'sentences = list(doc.sents)' is not working for some reason.
If anyone finds an answer, please reply to me.
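One common cause (an assumption, since the pipeline isn't shown here): doc.sents only works if some component has set sentence boundaries, so a bare spacy.blank("en") pipeline raises an error until you add a sentencizer, while en_core_web_sm/lg get boundaries from their parser. A quick diagnostic sketch:

import spacy

nlp = spacy.load("en_core_web_lg")  # or whichever pipeline you are using
print(nlp.pipe_names)               # should include "parser" (or "senter"/"sentencizer")

if not any(p in nlp.pipe_names for p in ("parser", "senter", "sentencizer")):
    nlp.add_pipe("sentencizer")

doc = nlp("One sentence here. Another one here.")
print(list(doc.sents))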
That's basically paragraph segmentation.
def clean_whitespace(text):
    return ' '.join(text.split())

def clean_whitespace(text):
    import regex as re
    return re.sub(r'\s+', ' ', text.strip())
Either of these is better than that ugly line 4 nonsense:
text = f.read().replace("
", " ").replace("
", " ")
Could just replace it with:
text = f.read()
text = clean_whitespace(text)
or
text = clean_whitespace(f.read())
The main benefit would be readability and dealing with consecutive, leading, or trailing spaces which your code does not do.
Haven't seen other videos here, but I'll probably avoid it now after witnessing that.
I was mainly looking for some parallelization ideas using a CUDA context with SpaCy (which they don't seem to support well).
Fair point. Thanks for leaving this better code here for others.