Malware and Machine Learning - Computerphile

  • Published: 18 Jun 2024
  • Do antivirus programs use machine learning? Dr Fabio Pierazzi looks at the trends and challenges.
    Fabio's website: fabio.pierazzi.com
    Main paper: Arp et al., “Dos and Don’ts of Machine Learning for Computer Security”, USENIX Security 2022 - Distinguished Paper Award - Project website: dodo-mlsec.org/
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments • 82

  • @ClifBratcher • 1 year ago • +126

    I've spent many years in the industry, and the biggest hurdle I've seen to more dynamic identification is false positives. More specifically, blocking users from their day-to-day activities because something benign has been flagged as malicious. Users are MUCH more forgiving of false negatives (actual infections) than false positives. (A small sketch of this threshold trade-off follows this thread.)

    • @elidrissii • 1 year ago • +25

      To be fair, false positives are really annoying to get as an end user. I don't want to jump through hoops to recover a file that I know is safe after dismissing all the warnings.

    • @RealCyberCrime • 1 year ago • +5

      This. Half my day as an analyst is spent going through false positives.

    • @c1ph3rpunk • 1 year ago • +1

      @@RealCyberCrime only half?

    • @TheStevenWhiting • 1 year ago • +1

      @@elidrissii Yep, SentinelOne is notorious for it.

    • @prashantd6252 • 1 year ago

      The "false positive" has been an issue since the internet went "public"

  • @samcooke343 • 1 year ago • +52

    Can we talk about those flawless freehand bell curves?!

    • @SystemBD • 1 year ago • +5

      No. Arcane magic is not computable (as of yet).

    • @zzzaphod8507 • 1 year ago • +4

      Was trying to remember what to call those curves, thanks - the name rings a

    • @ilyaSyntax • 1 year ago • +1

      @@zzzaphod8507 RINGS A WHAT??

    • @cosmic5934 • 1 year ago • +1

      @@ilyaSyntax bell

  • @KamiK4ze • 1 year ago • +76

    I did machine learning for ransomware detection as part of my thesis; the problem I had was obtaining data for the newest variants. The model needed constant retraining to keep up with the new malware.

    • @mmm-me4kk • 1 year ago

      Sir, which ML algorithms did you use, and did you use ROC/AUC and k-fold cross-validation?

    • @kenbobcorn • 1 year ago • +2

      If you are an academic, VirusTotal has large repositories of malware samples from the last quarter or so. VirusShare also has large torrents of recent samples.

    • @mmm-me4kk • 1 year ago

      @@kenbobcorn @Kamik4ze Indeed, you could also use the ISOT dataset, although I think that one is outdated.

    • @shouldb.studying4670 • 1 year ago

      Surely there is a consistency in what outcomes the malware is trying to achieve that could be used as the basis for detection???

    • @tommasomorandini1982 • 1 year ago • +10

      Wouldn't the point of machine learning be for the program to learn how malware generally behaves in order to accomplish its goal, and so not rely on the latest samples to identify malware? Because if you always need the latest samples, you might as well just check the files directly against your database, no? Or am I missing something?

  • @PHF28 • 1 year ago • +13

    I think there might be a mistake in the diagram at 17:10. The red slice should be test data and the remaining slices should be used for training.
    In any case, great video once again.

  • @saasthavasan • 1 year ago • +22

    In my experience, the biggest hurdle I faced while using ML for malware or behavior detection was choosing and extracting the features. Often the selected features overlap between malicious and benign software (e.g. sequences of API calls). Unlike static and dynamic detection, which work on heuristics written by an experienced analyst, ML models learn these heuristics on their own during training, and most of the time the heuristics they learn do not actually make sense. At the end of the day, ML models work on pattern detection. It is really difficult to make the model learn the features that are actually responsible for the behavior rather than some random recurring features in the dataset. As a result, we end up with a high false-positive rate. (A sketch of such a feature pipeline follows this thread.)

    • @goldnutter412 • 1 year ago

      Sounds like fun, things were simpler back in the day..

    • @DumbledoreMcCracken • 1 year ago

      Hence, ML is not what it is sold to be

    • @lapisFarm • 1 year ago

      Really interesting, thanks
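    A minimal sketch of the feature pipeline described in the thread-opening comment, assuming scikit-learn: n-gram features over API-call sequences feed an ordinary classifier. The traces below are invented toy examples, not real data; the point is how n-grams shared between classes produce the false positives mentioned above.

      # n-gram features over API-call traces; toy data, scikit-learn assumed.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.ensemble import RandomForestClassifier

      # Each sample is a space-separated trace of API calls (invented examples).
      traces = [
          "CreateFile WriteFile CloseHandle",                    # benign installer
          "RegOpenKey RegSetValue CreateProcess",                # benign updater
          "CreateFile WriteFile CryptEncrypt DeleteFile",        # ransomware-like
          "VirtualAlloc WriteProcessMemory CreateRemoteThread",  # injector-like
      ]
      labels = [0, 0, 1, 1]  # 0 = benign, 1 = malicious

      # Bigrams of consecutive API calls. Note that some bigrams, such as
      # "CreateFile WriteFile", occur in both classes: the overlap that
      # drives false positives.
      vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\S+")
      X = vectorizer.fit_transform(traces)

      clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
      print(clf.predict(vectorizer.transform(["CreateFile WriteFile DeleteFile"])))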

  • @kenbobcorn • 1 year ago • +26

    I would argue machine learning is already very prevalent in industry. As someone who has worked in malware detection for both Microsoft and Amazon, I can say we leverage large tree models and even large language models for detection.

    • @ZandarKoad • 1 year ago • +6

      It depends. Massive organizations that don't focus on tech as a core competency can sometimes be very, very slow at adopting the best tools because it is hard for them to even understand what the best tools are, or come up with a framework for comparing tools. Especially governments.

    • @ttos3093 • 1 year ago

      Look at cybersecurity vendors (Fortinet, Palo Alto) - they apply that. It’s natural that MS or Amazon as infrastructure companies have no tradition in this.

    • @kenbobcorn • 1 year ago

      @@ZandarKoad I can say, at least for the companies I've consulted for, they use some form of SIEM or other IDS. A lot of the ML side of things is handled by the SIEM vendors, while they just have to comb through alerts and identify FPs.

    • @GordonjSmith1 • 1 year ago

      Yup! Totally agree that you can detect previously detected 'models' of current threats, but you are still unable to detect an emerging threat using ML. It is an 'informational problem' that this professor clearly discusses.

  • @trymoto • 1 year ago • +2

    I could talk to that guy over a pint for like three hours. He's oversimplifying here for a general viewer but this topic is fascinating. Thanks for the video.

  • @GaryParris • 1 year ago • +5

    There is another way to think about this issue, one that is not talked about as much: the separation of data systems, and of the data itself, into public and private. Because of the increase in online usability and transparency, much of the data is exposed to all these forms of attack, and the monetisation of data and proprietary IP creates a reason to profit from it on both sides of the data fence. If you cannot access it directly, it is less likely to be stolen; if the stored information is not valuable, it becomes pointless to steal it; if the identity requirements are removed or reduced, the identity is of less value. Everything is a trade-off. Pattern-matching algorithms (ML & AI) are limited by their parameters.

    • @andrewharrison8436 • 1 year ago • +1

      Well said - all in the name of convenience for the user and exploitation by anyone who handles the data.

  • @titaniumdiveknife2 • 1 year ago

    Very fun to learn about.

  • @christersmith5470 • 1 year ago • +2

    Using ML to group different types of malicious applications into different families makes the process of malware detection more adaptive, yet we are still getting zero days where a malicious application succeeds by appearing benign.
    In the medical sciences, there have been many problems, discovered later, where the features used by ML did not accurately predict on new data. This is because researchers let the ML program determine its own features, and the ML program lacked domain expertise. This has resulted in many new companies heavily investing in PhD researchers to prepare the data and relevant features to then run in the model.
    In cybersecurity, we will still need the human element for similar reasons.

    • @stavsherman6632 • 1 year ago

      Can you give some examples of this? I am curious to read about it

  • @ewookiis • 1 year ago

    Actually, there are ways to implement this safely: use it as a trigger value rather than as the decision engine. Drilling down into the actual detection tree, there are many different ways of compromise, but they can be handled and are still limited; in short, keeping track of execution, persistence and escalation is the first step, with ML as a possible helper.
    "EDR/XDR" can be quite sufficient in spanning a larger chain of "observant" behaviour, i.e. the detection engine itself does not have to use ML, but the part that acts on the data and pieces it together does benefit from this field.
    I do agree, however, that things get really tricky when you take on the whole chain of compromise.
    Static and/or dynamic binary analysis is only a small portion of the whole indicator chain, but training something on the actual portions, be it a buffer overflow etc., can be useful in my opinion.

  • @GenaTrius • 1 year ago • +4

    I assumed this was going to be about malware that uses machine learning. Terrifying.

  • @HebaruSan • 1 year ago • +11

    I'm on a team that releases a free open source app. For a while, every time we released a new version we would get a handful of false positive reports from users whose virus scanners tripped on it. Seems like some of the companies just give up and flag everything that isn't in their whitelist when faced with an essentially unsolvable task.

    • @ewookiis • 1 year ago • +2

      Nope, they don't give up - they have FPs that they sadly don't handle, and this is part of the "lazy" signature approach described in the video: they resort to badly written indicators and leave the detection engine with too much weight on that portion. Sometimes it's odd coding in the program as well.

  • @graog123 • 1 year ago

    Thanks for uploading in 4K

  • @IceMetalPunk • 1 year ago • +4

    I feel like many areas of modern ML, including this one, either do or could benefit greatly from continual learning (which, from my understanding, is synonymous with iterative online learning; if they're different, I'd appreciate an explanation of how!). Now, if only we could make that practically efficient on the massive networks of hundreds of billions of parameters or more 😁 (A sketch of online learning with partial updates follows this thread.)

    • @prashantd6252 • 1 year ago • +1

      I'd recommend reading more on ML and what scale is being worked on right now... From your comment I felt like you think a billion "parameters" is too much of a challenge, which it isn't. I'd recommend you check out Hugging Face.

    • @romanemul1 • 1 year ago

      @@prashantd6252 Well, training a billion parameters is not the problem; spending $5k on AWS/Azure/Google processing power is the problem.
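    A minimal sketch of the iterative online learning mentioned at the top of this thread, assuming scikit-learn: partial_fit updates the model on each new batch without retraining from scratch. The drifting data stream is synthetic.

      # Online learning over a slowly drifting stream via partial_fit.
      import numpy as np
      from sklearn.linear_model import SGDClassifier

      rng = np.random.default_rng(0)
      clf = SGDClassifier(random_state=0)
      classes = np.array([0, 1])  # must be declared on the first partial_fit call

      # Simulate a stream of daily batches whose distribution slowly drifts.
      for day in range(30):
          X = rng.normal(loc=day * 0.05, scale=1.0, size=(100, 8))
          y = (X.sum(axis=1) > day * 0.4).astype(int)
          clf.partial_fit(X, y, classes=classes)  # incremental update, no retrain

      print(clf.predict(rng.normal(loc=1.5, scale=1.0, size=(1, 8))))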

  • @cernejr • 1 year ago • +1

    I like those markers/pens. :)

  • @delusionnnnn • 1 year ago • +1

    It doesn't help that a lot of false positives are generated by detectors actively equating software piracy with malware. In many cases the techniques are similar, so the issue cannot be dismissed entirely, but even when the techniques are exclusive to piracy, vendors often have a strong motivation to keep flagging piracy techniques as "malware", particularly companies which write both detectors and high-profile commercial software, such as Microsoft itself, or which are incentivized by them.

  • @celivalg • 1 year ago • +2

    It's not quite overfitting; it's just trained for different threats. The problem is that the patterns change, as if "panda" suddenly meant dog instead of panda, and the ML system cannot adapt to that.
    Maybe a more fitting image: you had a few pictures of pandas in your training data, and the ML system recognizes them as pandas very well, but now the context has changed and dogs also count as pandas. It should recognize dogs as pandas, but it doesn't, because it has either been trained to recognize dogs as dogs or not been trained on them at all, and the images look so different that it has no way of linking the dog to the panda.

  • @Syntax753 • 1 year ago

    Fantastic!

  • @Veptis • 8 months ago

    So machine learning models such as classifiers require a labeled dataset for supervised training.
    So are there datasets of malware? Maybe something like the vx-underground vault?

  • @FrancescoBazzani • 1 year ago • +1

    Heard 20 seconds of the video and... yes, he's Italian like me.
    Setting that inside joke aside, great content!

  • @shiladityasircar9814 • 9 months ago

    Prevalence data and diversity of behaviour are two important criteria. It's difficult to mount an adversarial attack on models that are behaviour-dependent. These modern ML approaches to cybersecurity use static and dynamic behaviour encoding to stop malware. Cylance's ML models are an example of this.

  • @GNARGNARHEAD • 1 year ago

    Oh right, check out Christopher Domas's talk "The Future of RE: Dynamic Binary Visualization". I'd bet you'd have much better luck feeding the data in with various transformations, like a Hilbert curve, giving it a semantic structure to deal with... it just might even work with an image-recognition algorithm then too... maybe. (A sketch of the byte-to-image mapping follows.)
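    A minimal sketch of that byte-to-image mapping: walk a Hilbert curve so that bytes near each other in the file stay near each other in 2D, then hand the array to any image classifier. numpy is assumed, and "sample.bin" is a placeholder file name.

      # Map a file's bytes onto a Hilbert curve to build an image.
      import numpy as np

      def d2xy(n, d):
          """Convert index d along a Hilbert curve into (x, y) on an n x n grid."""
          x = y = 0
          t = d
          s = 1
          while s < n:
              rx = 1 & (t // 2)
              ry = 1 & (t ^ rx)
              if ry == 0:                      # rotate the quadrant when needed
                  if rx == 1:
                      x, y = s - 1 - x, s - 1 - y
                  x, y = y, x
              x += s * rx
              y += s * ry
              t //= 4
              s *= 2
          return x, y

      n = 256                                      # image size; holds n*n bytes
      data = open("sample.bin", "rb").read(n * n)  # placeholder file name
      img = np.zeros((n, n), dtype=np.uint8)
      for d, byte in enumerate(data):
          x, y = d2xy(n, d)
          img[y, x] = byte
      # img can now be fed to an ordinary image-recognition model.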

  • @CodingTrades • 1 year ago • +5

    ML evaluates malware as adversarial code execution that's malicious.identity - that detection relies on behavior which is itself a signature representation, unique enough to recognize that it has been deployed. How is a behavior signature not like a fingerprint?

    • @ewookiis • 1 year ago

      I agree, it is like a fingerprint. However, every iteration, just like a fingerprint, is different, to the extent that you can't rely on it alone.

    • @goldnutter412 • 1 year ago • +2

      It would be one facet of detection, like the MO (modus operandi) in a crime.
      Fingerprinting is "specific"... I like the MALICIOUS.IDENTITY object! Very handy; you could call it a signature, but that wouldn't really be accurate. A specific code-execution "process" occurring on the CPU is what is being detected, right?

  • @DumbledoreMcCracken • 1 year ago

    It seems more interesting to write infections with ML that create detection nets

  • @thaihocnguyen7113 • 1 year ago

    I have a question.
    Cross-validation is a method that lets a machine learning model make use of all the data (with n folds you can split into training and validation sets). At the same time, I'm confused about accuracy: we still need a test set to check the model, right? Because a model trained with cross-validation can overfit.
    I think cross-validation is used when you have little data, and we still need a held-out test set to check against. If you have enough data, you don't need cross-validation, right?
    Sorry for my English.

    • @SuperCaptain4 • 1 year ago

      Normally cross-validation is used for setting the hyperparameters of a machine learning model. First you split your data set into a training and a test set, say 70/30. Thereafter, you use k-fold cross-validation on the training set. What will happen is that a model is trained k times (k is a number you choose; the higher k, the better the estimates you get for your hyperparameters, but the more time you spend cross-validating, as the model needs to be retrained).
      Each time the model is trained during k-fold cross-validation, the training dataset - the 70% of all the data you had from the beginning - is split again, let's say 90/10. The model is then trained on the 90% and evaluated on the remaining 10% of validation data. After repeating this k times, we select the hyperparameter value which scored highest on the 10% validation data.
      Now, to check for overfitting, we run the model again on the completely unseen test data, the 30% of the original data that we kept away during training. (A sketch of this protocol follows this thread.)

    • @thaihocnguyen7113 • 1 year ago

      @@SuperCaptain4 Thank you so much, I got it.
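    A minimal sketch of the protocol in the reply above, assuming scikit-learn and synthetic data: a 70/30 train/test split, 10-fold cross-validation on the training portion to pick a hyperparameter, then one final evaluation on the untouched test set.

      # 70/30 split, 10-fold CV for hyperparameter selection, final test score.
      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split, GridSearchCV
      from sklearn.svm import SVC

      X, y = make_classification(n_samples=1000, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.3, random_state=0)  # the 70/30 split

      # 10-fold CV on the training set only: each fold trains on ~90% of it
      # and validates on the remaining ~10%, matching the 90/10 in the reply.
      search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=10)
      search.fit(X_train, y_train)

      print("best C:", search.best_params_)
      print("held-out test accuracy:", search.score(X_test, y_test))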

  • @andrewharrison8436 • 1 year ago • +3

    It seems to me that the hunt for bells, whistles and bling in applications leads to an enhanced attack surface which allows malware.
    I wrote a secure interface (a long time ago), it was doable because the range of API calls I had to intercept was very limited and I could parse all possible legit parameters and reject the rest. The code was documented and could be checked by my peers.
    Move to a GUI based environment with more levels of abstraction and the operating system being invoked the whole time for sound or video or malice - no chance.
    Security starts from the operating system (disclaimer - Windows user - I do hope the antivirus people know their stuff).

  • @katjejoek • 1 year ago

    It has been a while since I've seen BASIC code! 😂

  • @RealCyberCrime • 1 year ago • +11

    Just wait until chatgpt can write better malicious software

    • @bytefu • 1 year ago

      If only it understood what it's writing...

  • @artiem5262 • 1 year ago

    It's heuristics -- educated guessing -- as the halting problem is still out there, so you can guess, but you'll never be able to prove whether a target is malware or not.

    • @JorgetePanete • 1 year ago

      I guess that's only when you treat it as a black box, in a white box you could know what it is

  • @MoxxMix • 1 year ago

    Is there a point in talking about this when Windows 11 has itself become malware?

  • @timothygalvin3021 • 1 year ago • +1

    I can't express in words how much all the empty shelves in this video bother me. Why have all these shelves if you're not going to use them!?

  • @barreiros5077 • 1 year ago • +1

    What API said...

  • @RealCyberCrime • 1 year ago • +1

    Just waiting on chatgpt to write some good malware

  • @raicyceprine8953 • 1 year ago

    I don't know why I watched it in full even though I don't understand it.

  • @happygimp0 • 10 months ago

    You cannot use a computer to detect malware reliably. It is mathematically impossible, since it would require the halting problem to be solvable on a PC, which it isn't.
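    A minimal sketch of the diagonalization behind this claim (the same construction Fred Cohen used for virus detection). Everything here is hypothetical; the point is the contradiction, not a working detector.

      # Suppose a perfect, always-terminating detector existed...
      def is_malicious(program_source: str) -> bool:
          """Hypothetical perfect detector: always terminates, never wrong."""
          raise NotImplementedError("cannot exist - see the contradiction below")

      def do_something_harmful():
          print("(stand-in for malicious behaviour)")

      def contrary_program(source_of_itself: str):
          # ...then a program could ask it about itself and do the opposite:
          if is_malicious(source_of_itself):
              pass                     # labelled malicious -> act harmless
          else:
              do_something_harmful()   # labelled benign -> act malicious

      # Whichever answer is_malicious gives about contrary_program is wrong,
      # so no total, always-correct detector can exist; real scanners are
      # stuck with heuristics, as the comment above says.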

  • @cytroyd • 1 year ago • +10

    We need MLware that uses ML to penetrate and replicate across systems. Imagine a GPT-powered worm. Self-generating zero-days. I recommend open-source LLMs like BLOOM to get started.

    • @adia.413 • 1 year ago • +4

      The computation requirements to run GPT would have to be much lower than today, as not all servers have enough computational power to run a model. On the other hand, I can imagine a trained AI model that could analyze binaries or source code and create zero-day approaches based on the input.

    • @-FFFridge • 1 year ago • +1

      You could use the same method as actual viruses and randomly mutate the code a million times on all already-infected systems until some variant actually works, which is then sent outward to penetrate new hosts. It's incredibly slow, but requires less computing than GPT.

  • @0_1_2 • 1 year ago • +2

    Give us closed captions please! His accent is difficult

  • @hottertake3818 • 1 year ago • +12

    0th

  • @juffler463 • 1 year ago • +3

    Hey sir