Fake News Detection Intro using Machine Learning (ML) Models and Natural Language Processing (NLP)
- Published: 12 Sep 2024
- Fake news is all around us, whether we can identify it or not. Individuals and organizations publish fake news all the time, whether as a persuasion tactic or simply to drown out unfavorable truths. Take the search for a Covid-19 vaccine, an issue that is especially relevant in our current times. Before a vaccine came out, some sources stated that a fully effective vaccine was already available, some stated it was coming very soon, and others stated that it would take decades for a safe and functional one to be released. Trusting and following the wrong source can do more harm than good.
Now the question becomes: which websites do we trust, and which do we ignore? In most cases it is not obvious which sites to trust and which to reject, or which are real and which are fake.
Fortunately, Big Data can save the day! In today's world of ever-growing data streams, one can imagine crunching through volumes of data to detect patterns, which can then be analyzed to separate real news from fake.
That is exactly the project I executed: a fake news detection machine learning model that uses natural language processing techniques to classify news websites as either fake or real.
The model performs binary classification to identify whether a news site is fake or real: an output of '1' indicates that the website is most likely fake, and '0' indicates that the site is likely trustworthy. It takes a list of website URLs and their corresponding raw HTML as input data and trains a logistic regression model to output a label of 0 or 1.
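To make the 0/1 decision concrete, here is a minimal sketch of how a trained logistic regression classifier produces a binary label. The function name, weights, and feature values are purely illustrative, not the actual trained model from the project:

```python
import math

def predict_label(features, weights, bias=0.0):
    """Logistic regression prediction: 1 = likely fake, 0 = likely real.

    `features` and `weights` are parallel lists of numbers; in the real
    model the weights would come from training, not be hand-picked.
    """
    # Weighted sum of the features (the raw "score")
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Sigmoid squashes the score into a probability in (0, 1)
    p = 1.0 / (1.0 + math.exp(-z))
    # Threshold at 0.5 to get the binary label
    return 1 if p >= 0.5 else 0
```

With toy weights `[2.0, -1.0]`, a feature vector of `[1.0, 0.0]` scores z = 2 (probability ≈ 0.88, label 1), while `[0.0, 1.0]` scores z = -1 (probability ≈ 0.27, label 0).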
The core of this model is the set of natural language processing techniques used to transform the input data from words into numbers the machine can learn from. I transformed this data by writing several functions generally referred to as featurizers. Each featurizer extracts key features of the URL and HTML that may help predict the trustworthiness of the site, and converts them into numerical values to feed into the logistic regression model.
To obtain the data for my model, I scraped the web for news websites and compiled a set of 2,557 sites, roughly 50% fake and 50% real. I then split the data into a training set, a cross-validation set, and a test set.
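A three-way split like this can be sketched as below. The 70/15/15 ratios and the seed are assumptions for illustration; the post does not state the exact proportions used:

```python
import random

def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle and split examples into train / cross-validation / test sets.

    The fractions here are illustrative; the remainder after the train and
    validation slices becomes the held-out test set.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = examples[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

On 2,557 examples these defaults yield 1,789 training, 383 validation, and 385 test examples, with every site appearing in exactly one split.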
My first baseline featurizer was a domain featurizer that extracts basic features from the domain name extension of each website. It takes a URL and its raw HTML and returns a dictionary mapping feature descriptions to numerical values. The accuracy of this model was only 55%, which was not surprising: the domain extension, while it may provide some clues, cannot be a deterministic predictor of a website's trustworthiness.
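A domain featurizer of this shape might look like the sketch below. The particular extension list is a guess; the post only says the featurizer uses the domain extension and returns a feature-description-to-number dictionary:

```python
from urllib.parse import urlparse

def domain_featurizer(url, html):
    """Map a URL (the HTML is accepted but unused here) to binary
    domain-extension features, as a dict of description -> number.

    The extension list is illustrative, not the project's actual one.
    """
    domain = urlparse(url).netloc  # e.g. "www.example.org"
    features = {}
    for ext in (".com", ".org", ".net", ".edu", ".gov", ".info"):
        features[f"domain ends with {ext}"] = int(domain.endswith(ext))
    return features
```

For `"https://www.example.org/news"` this returns 1 for the `.org` feature and 0 for the rest, which is exactly the kind of weak signal that explains the 55% baseline.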
The key problem with this model is that there is simply not enough information. To address this, my next step was to use specific (and potentially predictive) keywords from the HTML, in addition to the domain extension, as input to the logistic regression model. With these keyword features, the logistic regression model reached an accuracy of 73%.
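A keyword featurizer along these lines simply counts hand-picked words in the raw HTML. The keyword list below is invented for illustration; the post does not reveal which words were actually chosen:

```python
def keyword_featurizer(url, html):
    """Count occurrences of hand-picked keywords in the raw HTML.

    The keyword list here is an illustrative stand-in, not the one
    used in the project. Counting is case-insensitive.
    """
    keywords = ("truth", "exclusive", "shocking", "sources", "report")
    text = html.lower()
    return {f"count of '{kw}'": text.count(kw) for kw in keywords}
```

These counts can then be appended to the domain features before training, giving the model more signal than the extension alone.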
This performed considerably better than the domain-only method, but it is still a relatively simple approach, so I started to think of more nuanced ones. The meta description in a website's HTML is a great source of information conveying the core content of that site. As an improvement on the keyword featurizer, I applied the bag-of-words NLP model to the meta descriptions. The score reports for this model showed that all of the metrics were much higher than before.
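The bag-of-words idea reduces to counting how often each word of a fixed vocabulary appears in the description. This is a bare-bones sketch with whitespace tokenization; in practice the vocabulary would be built from the training descriptions and the tokenization would be more careful:

```python
from collections import Counter

def bag_of_words(description, vocabulary):
    """Represent a meta description as a vector of word counts over a
    fixed vocabulary. Words outside the vocabulary are ignored.
    """
    counts = Counter(description.lower().split())
    return [counts[word] for word in vocabulary]
```

For example, with vocabulary `["news", "breaking", "vaccine"]`, the description "Breaking news news about a vaccine" becomes the vector `[2, 1, 1]`.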
A shortcoming of the bag-of-words model is that it only looks at the counts of words in each website's description. I wondered whether there was a way to capture the meaning of those words. This is where word vectors come in: I used a pretrained model called GloVe to accomplish this, which yielded an accuracy of about 87%.
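A common way to use pretrained word vectors such as GloVe for a whole description is to average the vectors of its words; the sketch below shows that averaging step with a tiny hand-made embedding table standing in for the real (300-dimensional, file-loaded) GloVe vectors. The post does not specify the exact pooling used, so the averaging is an assumption:

```python
def glove_featurize(description, embeddings, dim=2):
    """Average pretrained word vectors over a description.

    `embeddings` maps word -> list of floats (a stand-in for real GloVe
    vectors); words missing from the table are skipped, and an all-zero
    vector of length `dim` is returned if nothing matches.
    """
    vectors = [embeddings[w] for w in description.lower().split()
               if w in embeddings]
    if not vectors:
        return [0.0] * dim
    # Component-wise mean over all matched word vectors
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]
```

Unlike raw counts, nearby vectors for synonyms (e.g. "hoax" and "fake" in real GloVe space) let the model treat related words similarly.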
Having tried several different featurizers and examined the score reports for each, I was curious whether I would obtain better results by combining all of the featurization approaches. I passed the concatenated feature vector into my logistic regression model and obtained an accuracy of 91%, the highest yet.
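Combining the approaches comes down to running every featurizer on the same site and concatenating the numeric outputs into one long vector. A minimal sketch, assuming each featurizer returns either a dict (like the domain and keyword featurizers above) or a flat list (like a bag-of-words or GloVe vector):

```python
def combine_features(url, html, featurizers):
    """Run every featurizer on one site and concatenate the numeric
    values into a single feature vector for logistic regression.

    Dict-returning featurizers contribute their values in sorted key
    order so the layout is consistent across sites.
    """
    combined = []
    for featurize in featurizers:
        features = featurize(url, html)
        if isinstance(features, dict):
            combined.extend(features[k] for k in sorted(features))
        else:
            combined.extend(features)
    return combined
```

The fixed ordering matters: logistic regression learns one weight per position, so every site must produce its features in the same order.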
It was then time to evaluate the model on the unseen test data to measure its true accuracy. The score reports showed that the model predicted the trustworthiness of news websites with 91% accuracy.
As with any machine learning model, there is room to improve the metrics further, for example by collecting a larger dataset or developing additional featurization approaches.
It is good that we have this kind of machine that can filter fake news from real news, and the accuracy is amazing.
Impressed about the accuracy level 91% of this model. Great info! Thank you so much for the explanation.
Glad it was helpful!
The fact that you studied properly for this awesome information just can blow everyone's mind. I hope everyone will see this and make it to the top.
This is very relevant nowadays because fake information is so easy to spread. Fascinated by your logistic regression model that obtained an accuracy of 91%. Good job! Thank you for sharing this information.
thanks a lot!
This video is very informative and gives us important views on how to detect fake news. We all know that fake news is very widespread today. Through this video everyone will know the bad effects of fake news and how to handle it using models like this.
This is really helpful. Learning whether something is fake or not is a must, especially given today's situation all over the world. The 91% accuracy is also very impressive, and it was explained so that others would understand it very clearly.
Great video , it definitely helped me in understanding the different classification problems. also im very impressed with the 91% accuracy level of this model.
Useful and impressive video, it was good to hear about your model, thankyou for sharing.
This is a great video! Very timely especially with what's currently happening. Very important to know what's real and what's fake. Thanks for sharing this. More videos to come!
Good information for the society about the fake news, I like the information
This is very good and helpful information, thank you for sharing
Thankyouuuu for your sensible thoughts and warning for us!
Nice video. The fake news detection model you've developed is very timely nowadays because there are lots of fake news, especially on social media. This model is very helpful for us to know what is real and not.
Very helpful and amazing video.
this video is really made of good purpose in social aspects.
Thank you for sharing this informative video. Yes this is very helpful for us.I always wanted to know about AI.
very helpful information for people thank you and keep it up!
great video content and a very informative one at that..
Impressive detailing.
This is a great video, it really helps me, thanks for the free tutorial. I can understand the difference between fake and real news. Happy to discover this video. Amazing video with a 91 percent accuracy level.
Great information. Everyone should know about this information
I am a newbie in machine learning but loved the way you explain.
Interesting. Thanks for the information.
Thank you for sharing such amazing video which helps me to understand the different classifications. Also understand the level of 91% accuracy of this model.
Thanks for letting us know.
Wow! This is really helpful and interesting!!!
Thanks a lot. I had no idea about detecting of fake news. This video helped me.
thank you!
This information was so reliable and it enlightened me about some issues.
Very informative. Thank you
This kind of video is a must watch. Informative and factual. I like it, keep it up!
Wow, fake news detection will be very useful and informative
thank you!
It was indeed a great video to watch to, Thank you for this video, I am hoping for more videos like this.
Wow...this was awesome and innovative
A great informative video
What a great and informative video for everyone about the exploratory data analysis.
NIce! super informative
This is very good information. Thanks 👍👍
So nice of you
It's really very interesting and helpful to detect fake news ! thank for sharing such valuable information .
Great information, I love watching you because I can learn a lot from you, you are great. I learned a lot about AI because of you
I'm a computer science student and your video is really informative and helpful to me
Glad to hear, good luck!
Great information
Great video, I learned something new today.
Informative
Thank you for sharing the information.
You bet!
Nice! I want to create programs like this too! hoping I would finish my training in IT. Good video!
Keep up the good work
thank you for this video. Great info
Very good, Miss Maral
100% useful information
Good job , is this work for a thesis ?
Very helpful information for me thank you and keep it up!
Great info very helpful thank you
thank you!
Thank you for this. Great info
thanks!
Great video , it really helped me in understanding classification problems
good job
Thanks
Very informative video.
Glad it was helpful!
Very informative video
Glad you think so!
Helpful instructions
thanks
This is a great video. This is informative and factual.
This is very informative
thanks
Please share some code, or no one will know if you truly did it
Very interesting
Glad you think so!
Very niceee
Thanks a lot
Thank you :D !!! Great and informative :)
Glad it was helpful!
very interesting.
thanks!
Super
Thanks
Great intro, is there a more detailed video coming?
Yes, just posted!
Wow its really cool and very informative. keep it up :D
Thanks! 😃
Thanks for the info. Also, you misspelled "artificial".
source code?