- Videos: 66
- Views: 63,995
Richard Gruss
Joined 16 Aug 2013
Videos
AI Innovation Through Collaboration With Startups
58 views, 8 months ago
Neural Networks 2: MNIST Classification
141 views, 4 years ago
introduction to MGNT 671: Artificial Intelligence and Machine Learning for Managers
95 views, 5 years ago
MGNT 333 Summer 1-- Course Introduction
81 views, 5 years ago
Syllabus Text Analytics: Inclusive Excellence
36 views, 5 years ago
How to do Multi Level Hierarchical Classification
Thanks for the clear explanation. Only the wild clicking around is very confusing; maybe consider going step by step through the code without clicking around. Thank you.
Can I get this code?
Interesting🙌🗣️
Hi guys, for anyone who's looking for the code, you can use the following:

import csv
import os
import pickle
import random
import string
from collections import defaultdict

from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

stop_words = set(stopwords.words('english'))
stop_words.add('said')
stop_words.add('mr')

BASE_DIR = ''  # path left blank in this comment; set it to the folder holding the category subfolders
LABELS = ['business', 'entertainment', 'politics', 'sport', 'tech']


# step 1: flatten the dataset folders into one CSV file
def create_data_set():
    with open('data.csv', 'w', encoding='utf8', newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')
        csv_writer.writerow(['Label', 'Filename', 'Text'])  # write header
        for label in LABELS:
            dir = '%s/%s' % (BASE_DIR, label)
            for filename in os.listdir(dir):
                fullfilename = '%s/%s' % (dir, filename)
                print(fullfilename)
                with open(fullfilename, 'rb') as file:
                    # the replace() arguments were lost when this comment was flattened;
                    # turning newlines into spaces is an assumption
                    text = file.read().decode(errors='replace').replace('\n', ' ')
                csv_writer.writerow([label, filename, text])


# step 2: load the rows as [(label, text), (label, text), ...]
def setup_docs():
    docs = []  # (label, text)
    with open('data.csv', 'r', encoding='utf8', newline='') as datafile:
        reader = csv.reader(datafile)  # handles commas inside the article text
        next(reader, None)  # skip the header row
        for parts in reader:
            # label is in the first column, text in the third column
            doc = (parts[0], parts[2].strip())
            docs.append(doc)
    return docs


def get_tokens(text):
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    return tokens


def clean_text(text):
    # remove punctuation, then convert to lower case
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    return text


def print_frequency_dist(docs):
    tokens = defaultdict(list)
    for doc in docs:
        doc_label = doc[0]
        doc_text = clean_text(doc[1])
        doc_tokens = get_tokens(doc_text)
        tokens[doc_label].extend(doc_tokens)
    for category_label, category_tokens in tokens.items():
        print(category_label)
        fd = FreqDist(category_tokens)
        print(fd.most_common(20))


def get_splits(docs):
    random.shuffle(docs)
    X_train = []  # training documents
    y_train = []  # corresponding training labels
    X_test = []  # test documents
    y_test = []  # corresponding test labels
    pivot = int(.80 * len(docs))
    for i in range(0, pivot):
        X_train.append(docs[i][1])
        y_train.append(docs[i][0])
    for i in range(pivot, len(docs)):
        X_test.append(docs[i][1])
        y_test.append(docs[i][0])
    return X_train, X_test, y_train, y_test


def evaluate_classifier(title, classifier, vectorizer, X_test, y_test):
    X_test_tfidf = vectorizer.transform(X_test)
    y_pred = classifier.predict(X_test_tfidf)
    # use 'weighted' averaging for multiclass
    precision = metrics.precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = metrics.recall_score(y_test, y_pred, average='weighted')
    f1 = metrics.f1_score(y_test, y_pred, average='weighted')
    print("%s\t%f\t%f\t%f\n" % (title, precision, recall, f1))


def train_classifier(docs):
    X_train, X_test, y_train, y_test = get_splits(docs)
    # the object that turns text into vectors/numbers
    vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')
    # creates doc-term matrix
    dtm = vectorizer.fit_transform(X_train)
    # train Naive Bayes classifier
    naive_bayas_classifier = MultinomialNB().fit(dtm, y_train)
    evaluate_classifier("Naive Bayes\tTRAIN\t", naive_bayas_classifier, vectorizer, X_train, y_train)
    evaluate_classifier("Naive Bayes\tTEST\t", naive_bayas_classifier, vectorizer, X_test, y_test)
    # store the classifier
    clf_filename = 'naive_bayas_classifier.pkl'
    with open(clf_filename, 'wb') as f:
        pickle.dump(naive_bayas_classifier, f)
    # also store the vectorizer so we can transform new data
    vec_filename = 'count_vectorizer.pkl'
    with open(vec_filename, 'wb') as f:
        pickle.dump(vectorizer, f)


def classify(text):
    # load classifier
    clf_filename = ''  # file name left blank in this comment; use the classifier file saved by train_classifier()
    with open(clf_filename, 'rb') as f:
        nb_clf = pickle.load(f)
    # load the vectorizer for the new text
    vec_filename = ''  # file name left blank in this comment; use the vectorizer file saved by train_classifier()
    with open(vec_filename, 'rb') as f:
        vectorizer = pickle.load(f)
    # preprocess the text
    processed_text = clean_text(text)
    # make prediction
    pred = nb_clf.predict(vectorizer.transform([processed_text]))
    print(pred[0])


if __name__ == '__main__':
    # create_data_set()
    # docs = setup_docs()
    # print_frequency_dist(docs)
    # train_classifier(docs)

    # deployment in production
    new_doc = "Google showed off some new camera features on the Pixel 4 today"
    classify(new_doc)
    print("Done")
Note that the file paths need to be changed, and there are steps in the video that need to be followed for the program to work. Remember to download the sample BBC dataset as well when you're testing the code.
I've also adjusted parts of the code because of some errors that popped up, but it should still work.
what are the pkl and txt files for? and are they needed for the code to function?
nothing works in jupyter notebook
Thank you so much, Richard. You made my Sunday with this explanation. Excellent video.
I cannot find the source code or the data source :-( so not useful for me
Thanks for taking the time to make this vid. I've been learning Python for a short while, and although I didn't understand everything, it gave good insight into what machine learning does. It doesn't sound so "scary" anymore.
Can you please share the github link for source code?
This is very cool, thank you
import csv
import os
import pickle
import random
import string
from collections import defaultdict

from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

stop_words = set(stopwords.words('english'))
stop_words.add('said')
stop_words.add('mr')

BASE_DIR = 'C:/Users/user/Desktop/bbc/News Articles'
LABELS = ['business', 'entertainment', 'politics', 'sport', 'tech']


def create_data_set():
    # newline='' keeps the csv module from writing blank rows on Windows
    with open('data.csv', 'w', encoding='utf8', newline='') as outfile:
        # one csv writer for the whole file
        csvwriter = csv.writer(outfile)
        for label in LABELS:
            dir = '%s/%s' % (BASE_DIR, label)
            for filename in os.listdir(dir):
                fullfilename = '%s/%s' % (dir, filename)
                print(fullfilename)
                with open(fullfilename, 'rb') as file:
                    # the replace() arguments were lost when this comment was flattened;
                    # turning newlines into spaces is an assumption
                    text = file.read().decode(errors='replace').replace('\n', ' ')
                fields = [label, filename, text]
                # write one row per article
                csvwriter.writerow(fields)
                print(text)
                # outfile.write('%s\t%s\t%s\n' % (label, filename, text))


def setup_docs():
    docs = []  # (label, text)
    # read back the CSV written by create_data_set()
    with open('data.csv', 'r', encoding='utf8', newline='') as datafile:
        for parts in csv.reader(datafile):
            doc = (parts[0], parts[2].strip())
            docs.append(doc)
    return docs


def clean_text(text):
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # convert to lower case
    text = text.lower()
    return text


def get_tokens(text):
    # get individual words
    tokens = word_tokenize(text)
    # remove common words that are useless
    tokens = [t for t in tokens if t not in stop_words]
    return tokens


def print_frequency_dist(docs):
    tokens = defaultdict(list)
    # let's make a giant list of all the words for each category
    for doc in docs:
        doc_label = doc[0]
        doc_text = clean_text(doc[1])
        doc_tokens = get_tokens(doc_text)
        tokens[doc_label].extend(doc_tokens)
    for category_label, category_tokens in tokens.items():
        print(category_label)
        fd = FreqDist(category_tokens)
        print(fd.most_common(20))


def get_splits(docs):
    # scramble docs
    random.shuffle(docs)
    X_train = []  # training documents
    y_train = []  # corresponding training labels
    X_test = []  # test documents
    y_test = []  # corresponding test labels
    pivot = int(.80 * len(docs))
    for i in range(0, pivot):
        X_train.append(docs[i][1])  # index with i so each label is paired with its own document
        y_train.append(docs[i][0])
    for i in range(pivot, len(docs)):
        X_test.append(docs[i][1])
        y_test.append(docs[i][0])
    return X_train, X_test, y_train, y_test


def evaluate_classifier(title, classifier, vectorizer, X_test, y_test):
    X_test_tfidf = vectorizer.transform(X_test)
    y_pred = classifier.predict(X_test_tfidf)
    precision = metrics.precision_score(y_test, y_pred, average='micro')
    recall = metrics.recall_score(y_test, y_pred, average='micro')
    f1 = metrics.f1_score(y_test, y_pred, average='micro')
    print("%s\t%f\t%f\t%f\n" % (title, precision, recall, f1))


def train_classifier(docs):
    X_train, X_test, y_train, y_test = get_splits(docs)
    # the object that turns text into vectors
    vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 3), min_df=3, analyzer='word')
    # create doc-term matrix
    dtm = vectorizer.fit_transform(X_train)
    # train Naive Bayes classifier
    naive_bayes_classifier = MultinomialNB().fit(dtm, y_train)
    evaluate_classifier("Naive Bayes\tTRAIN\t", naive_bayes_classifier, vectorizer, X_train, y_train)
    evaluate_classifier("Naive Bayes\tTEST\t", naive_bayes_classifier, vectorizer, X_test, y_test)
    # store the classifier
    clf_filename = 'naive_bayes_classifier.pkl'
    with open(clf_filename, 'wb') as f:
        pickle.dump(naive_bayes_classifier, f)
    # also store the vectorizer so we can transform new data
    vec_filename = 'count_vectorizer.pkl'
    with open(vec_filename, 'wb') as f:
        pickle.dump(vectorizer, f)


# classify new content
def classify(text):
    # load classifier
    clf_filename = 'naive_bayes_classifier.pkl'
    with open(clf_filename, 'rb') as f:
        nb_clf = pickle.load(f)
    # load the vectorizer for the new text
    vec_filename = 'count_vectorizer.pkl'
    with open(vec_filename, 'rb') as f:
        vectorizer = pickle.load(f)
    pred = nb_clf.predict(vectorizer.transform([text]))
    print(pred[0])


# create_data_set()
# set up the documents
# docs = setup_docs()
# print the word frequency
# print_frequency_dist(docs)
# train the classifier
# train_classifier(docs)

# classify the new content using the pkl files
new_doc = "Transparency International Sri Lanka (TISL) filed a petition in the Supreme Court yesterday (June 12), seeking to intervene in the ongoing Fundamental Rights case (SC/FR/Application No.168/2021) filed by the Center for Environmental Justice (CEJ) and three more petitioners, highlighting the serious allegations of bribery and corruption surrounding the X-Press Pearl disaster. The intervention petition is filed in the public interest. It refers to serious allegations of irregularity, mishandling, sabotage, bribery and corruption surrounding the claim for compensation arising from the X-Press Pearl disaster. Several key points have been raised in the intervention petition: The grave allegations of interference and extraneous pressure surrounding the claim for compensation arising from the X-Press Pearl disaster. The statement by the Justice Minister in Parliament on April 25, 2023, that one Chamara Gunasekara alias Manjusiri Nissanka had received a payment of USD 250 million into a private bank account in connection with the X-Press Pearl disaster. The media statements of Chinthaka Waragoda, who reportedly invented a machine to remove debris which washed ashore after the shipwreck, alleging that he was offered payment to discontinue the use of his machine, to avoid exposing the full extent of the damage caused by the disaster. Questions surrounding the quantum of compensation due to Sri Lanka for the damages caused by MV X-Press Pearl. The freight ship ‘MV X-Press Pearl’ caught fire off the coast of Colombo on 20th May, 2021. It sank a few days later, releasing its cargo of plastic pellets and tons of toxic chemicals into the ocean, causing Sri Lanka’s worst maritime disaster to date. It is alleged that Sri Lankan authorities obtained the assistance of the International Tanker Owners Pollution Federation Limited (ITOPF), a representative of the insurer of the Shipowner, in the post-disaster activities, despite the grave conflict of interest arising from it. TISL has urged that the private parties involved in the X-Press Pearl incident be held accountable, and be made to pay optimal compensation for the damage and pollution caused to the marine and coastal ecology of Sri Lanka, and the payment of compensation for the loss caused to the fishing communities and those engaged in tourism, as well as obtaining compensation under the Marine Pollution Prevention Act. TISL has also highlighted the need to hold anyone guilty of wrongdoing fully accountable. The petition for intervention is to be mentioned for Support in the Supreme Court on Thursday (June 15)."
classify(new_doc)
print("Done...!")
Loved it. It made complete sense. Thanks.
I have a text file with many lines, and I want to classify each line. How can I do it?
Rewatch the video.
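For the question above about classifying every line of a text file, here is a minimal sketch. It assumes you already have a fitted vectorizer and classifier (for example the ones pickled by train_classifier in the code posted in these comments); the function name classify_lines is made up for illustration and is not from the video.

```python
def classify_lines(path, vectorizer, classifier):
    """Return (line, predicted_label) pairs for every non-empty line of a text file."""
    with open(path, encoding="utf8") as infile:
        lines = [line.strip() for line in infile if line.strip()]
    if not lines:
        return []
    # transform() and predict() both take a list, so all lines go through in one batch
    predictions = classifier.predict(vectorizer.transform(lines))
    return list(zip(lines, predictions))


# Possible usage, assuming the two .pkl files from train_classifier() exist:
#   import pickle
#   with open('count_vectorizer.pkl', 'rb') as f:
#       vectorizer = pickle.load(f)
#   with open('naive_bayes_classifier.pkl', 'rb') as f:
#       classifier = pickle.load(f)
#   for line, label in classify_lines('my_lines.txt', vectorizer, classifier):
#       print(label, line, sep='\t')
```

Batching the lines into a single predict call is usually faster than calling classify once per line, since the pickles are loaded only once.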
Thank you for this video. Where can I get the source code, please?
where is the code for this
Were you able to find it? If so, please send me the link.
@vedantvashishth989 I rewrote everything that he put up
Thank you so much this is amazing and so structured
Hi - great video and it worked!! How would I assess or score the accuracy of the final step, classifying some new text?
have you found out how to do it?
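On scoring the final step: accuracy can only be measured against known labels, so one option is to hand-label a few new documents and count the hits; for a single unlabelled text, the class probability from predict_proba (which scikit-learn's MultinomialNB provides) gives a rough confidence instead. The sketch below assumes a fitted vectorizer and classifier; both function names are made up for illustration.

```python
def predict_with_confidence(classifier, vectorizer, text):
    """Return (label, probability) for the most likely class of one document."""
    probabilities = classifier.predict_proba(vectorizer.transform([text]))[0]
    # classes_ lists the labels in the same order as the probability columns
    best_prob, best_label = max(zip(probabilities, classifier.classes_))
    return best_label, float(best_prob)


def accuracy_on_labelled(classifier, vectorizer, texts, true_labels):
    """Fraction of hand-labelled new documents that the model classifies correctly."""
    predictions = classifier.predict(vectorizer.transform(texts))
    correct = sum(1 for pred, truth in zip(predictions, true_labels) if pred == truth)
    return correct / len(true_labels)
```

Naive Bayes probabilities tend to be pushed toward 0 or 1, so treat the number as a ranking signal rather than a calibrated accuracy.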
Wonderful! Thank you so much - detailed and I could follow.
Hello, thank you for your video, but the "create_data_set" function does not work if there are multiple file types (.txt, .doc, .bin, etc.) in the subfolder. The data.txt output file is empty (nothing is written to it).
I just want it to write only the .txt files to the data.txt file. (All the .txt files in the subfolders have the same file name, if that helps.)
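One way to skip everything except .txt files is to filter the directory listing before reading, roughly as sketched here. The helper name list_txt_files is made up for illustration; create_data_set would loop over its result instead of over os.listdir(dir) directly. This only addresses the mixed file types and may not be the cause of the empty output file.

```python
import os


def list_txt_files(directory):
    """Return the full paths of the .txt files directly inside a directory, sorted by name."""
    paths = []
    for filename in sorted(os.listdir(directory)):
        fullname = os.path.join(directory, filename)
        # skip subdirectories and any file that does not end in .txt
        if os.path.isfile(fullname) and filename.lower().endswith('.txt'):
            paths.append(fullname)
    return paths
```

Because the full path is returned, files that share a name across subfolders stay distinguishable.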
Thank you so much sir. I was looking for this kind of tutorial..
Can I get the code, please, sir?
Were you able to find it? If so, please send me the link.
Please make more videos!!
Some people talk about the p-value and some talk about the critical value. Now I am confused.
Great presentation, much appreciated!
Explained super easy ! Thank you !
can i get the code ?
Awesome simple and clean code. Please can we have link to download the code?
Sir, I have a question: how do you prepare a dataset for document-level news text classification?
Hey, thank you for the tutorial it's very helpful. Can you share your code?
Good day, sir. I just wanted to ask: if an independent variable is not significant (it has no explanatory power in the model) but removing it lowers the adjusted R-squared, what does this imply? So far, the only reason I know of is that the t-statistic is greater than one. With this information, what can we infer?
can i get this code?
Were you able to find it? If so, please send me the link.
Does anyone use Excel or CSV data for text classification? Or should I create a .txt file for each and every row of my data?
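Separate .txt files should not be necessary: the rest of the pipeline only needs a list of (label, text) pairs, so a CSV can be read straight into that shape. A minimal sketch, assuming a header row with columns named label and text (both names are assumptions, as is the function name):

```python
import csv


def load_docs_from_csv(path, label_column='label', text_column='text'):
    """Read a CSV with a header row into a list of (label, text) tuples."""
    docs = []
    with open(path, encoding='utf8', newline='') as infile:
        for row in csv.DictReader(infile):
            text = (row.get(text_column) or '').strip()
            if text:  # skip rows with no text
                docs.append((row[label_column], text))
    return docs
```

The returned list can be passed to functions such as get_splits or train_classifier in place of the output of setup_docs. An Excel sheet would first need to be exported to CSV for this sketch to apply.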
Very good tutorial with good explanation. I was able to follow along and also able to run the whole program while watching the video. Thanks.
I am having trouble, Can you please help me :(
Your video seems to be right on point as to what I want to do, but I am very confused. I am confused about the learning model aspect. If I want to just create hard rules for text classification, I do not need the data set training, right?
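That reading seems right: with hard rules there is nothing to train, only keyword lists to maintain by hand. A minimal sketch of such a rule-based classifier follows; the keyword sets and the name classify_by_rules are invented for illustration and do not come from the video.

```python
import string

# Hypothetical hand-written rules: each label maps to a set of trigger words.
RULES = {
    'sport': {'match', 'goal', 'team'},
    'tech': {'software', 'phone', 'camera'},
}


def classify_by_rules(text, rules=RULES, default='unknown'):
    """Pick the label whose keyword set overlaps most with the words in the text."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    words = set(cleaned.split())
    scores = {label: len(words & keywords) for label, keywords in rules.items()}
    best_label = max(scores, key=scores.get)
    # fall back to the default when no rule matched at all
    return best_label if scores[best_label] > 0 else default
```

The trade-off is that every new topic or synonym needs a manual rule, which is the work a trained model takes over.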
Thank you for this tutorial. This is a good walkthrough.
Hello sir, I want to contact you.
Thank you very much, sir, for this lecture. It really helped me a lot. Hoping to see more content on machine learning.
The dataset: www.kaggle.com/pariza/bbc-news-summary
Excellent video and demonstration for text analysis. Thank you very much Sir !
Many thanks for this. Could you please share the expert systems list you mention at the end of the video?
Nice! There is however a function in the sklearn library called train_test_split I believe, that does exactly what your get_splits function does. Also, it would be helpful if you submitted the code in the description. Good video and great explanations!
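For reference, a sketch of what the suggested swap could look like, keeping the same return order as the get_splits function in the code posted in these comments. It assumes docs is a list of (label, text) tuples; the seed parameter is an addition for repeatable splits.

```python
from sklearn.model_selection import train_test_split


def get_splits(docs, test_size=0.20, seed=None):
    """Shuffle and split (label, text) tuples into train and test lists."""
    labels = [label for label, _ in docs]
    texts = [text for _, text in docs]
    # train_test_split shuffles by default; random_state makes the shuffle repeatable
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, shuffle=True, random_state=seed
    )
    return X_train, X_test, y_train, y_test
```

Passing stratify=labels as an extra argument would keep the category proportions the same in both parts, which can matter when some categories are small.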
Can you please help me answer/solve this, with a conclusion?
Alternative capacity for new store    New bridge built    No new bridge
A (Small)                             1                   14
B (Medium)                            2                   10
C (Large)                             4                   6
1. Assume the payoffs represent profits. Determine the alternative that would be chosen under the minimax approach.
2. Assume the payoffs represent profits. Determine the alternative that would be chosen under the maximin approach.
3. Assume the payoffs represent profits. Determine the alternative that would be chosen under the maximax approach.
4. Assume the payoffs represent profits. Determine the alternative that would be chosen under the Laplace approach.
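The four criteria in the question above can be checked with a few lines of Python. This is a sketch under two assumptions: the first number in each row is the payoff if the new bridge is built and the second if it is not, and "minimax" means minimax regret. The function names are made up for illustration.

```python
# Payoffs (profits) per alternative: (new bridge built, no new bridge)
PAYOFFS = {'Small': (1, 14), 'Medium': (2, 10), 'Large': (4, 6)}


def maximin(payoffs):
    """Pessimist: pick the alternative with the best worst-case payoff."""
    return max(payoffs, key=lambda alt: min(payoffs[alt]))


def maximax(payoffs):
    """Optimist: pick the alternative with the best best-case payoff."""
    return max(payoffs, key=lambda alt: max(payoffs[alt]))


def laplace(payoffs):
    """Treat every state as equally likely and pick the best average payoff."""
    return max(payoffs, key=lambda alt: sum(payoffs[alt]) / len(payoffs[alt]))


def minimax_regret(payoffs):
    """Pick the alternative whose largest regret across states is smallest."""
    n_states = len(next(iter(payoffs.values())))
    best = [max(row[j] for row in payoffs.values()) for j in range(n_states)]
    worst_regret = {
        alt: max(best[j] - row[j] for j in range(n_states))
        for alt, row in payoffs.items()
    }
    return min(worst_regret, key=worst_regret.get)


for rule in (minimax_regret, maximin, maximax, laplace):
    print(rule.__name__, '->', rule(PAYOFFS))
```

With the table read this way, maximin picks Large (worst case 4), while maximax, Laplace (average 7.5) and minimax regret (largest regret 3) all pick Small.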
Where is the Python source code?
can you share the code
Hello Mr. Gruss. Is it possible to find Your code from this video somewhere?
Just what i was looking for, thanks man.
You're welcome!