I've spent many years in the industry, and the biggest hurdle I've seen to more dynamic identification is false positives. More specifically, blocking users from their day-to-day activities because something has wrongly been determined to be malicious. Users are MUCH more forgiving of false negatives (actual infections) than of false positives.
To be fair, false positives are really annoying to get as an end user. I don't want to jump through hoops to recover a file I know is safe after dismissing all the warnings.
This. Half my day as an analyst is going through false positives
@@RealCyberCrime only half?
@@elidrissii Yep SentinelOne notorious for it.
The "false positive" has been an issue since the internet went "public"
I did machine learning for ransomware detection as part of my thesis; the problem I had was obtaining data for the newest variants. The model needed constant retraining to keep up with the new malware.
Sir, which ML algorithms did you use, and did you use ROC/AUC and k-fold cross-validation?
If you are an academic, VirusTotal has repositories of large malware sample sets from roughly the last quarter. VirusShare also has large torrents of recent samples.
@@kenbobcorn @Kamik4ze Indeed, you could also use the ISOT dataset, although I think that one is outdated.
Surely there is some consistency in the outcomes the malware is trying to achieve that could be used as the basis for detection?
Wouldn't the point of machine learning be for the program to learn how malware generally behaves in order to accomplish its goal, so that it doesn't rely on the latest samples to identify malware? Because if you always need the latest samples, you might as well just have your program check files directly against your database, no? Or am I missing something?
Can we talk about those flawless freehand bell curves?!
No. Arcane magic is not computable (as of yet).
Was trying to remember how to call those curves, thanks - the name rings a
@@zzzaphod8507RINGS A WHAT??
@@ilyaSyntax bell
In my experience, the biggest hurdle I faced while using ML for malware or behaviour detection was choosing and extracting the features. Often the selected features overlap between malicious and benign software (e.g. sequences of API calls). Unlike static and dynamic detection, which work on heuristics written by an experienced analyst, ML models learn these heuristics on their own during training, and most of the time the heuristics a model learns do not actually make sense. At the end of the day, ML models work on pattern detection. It is really difficult to make the model learn the features actually responsible for the behaviour rather than some random recurring features in the dataset. As a result, we end up with high false-positive rates.
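To make the overlap problem concrete, here is a minimal sketch of the kind of API-sequence feature extraction described above. The traces, labels and API names are invented placeholders, not a real dataset, and a production pipeline would extract far richer features.

```python
# Minimal sketch: n-gram features over sandbox API-call traces feeding a linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each sample is a space-separated trace of API calls observed in a sandbox (made up here).
traces = [
    "CreateFile WriteFile CryptEncrypt DeleteFile",  # hypothetical ransomware-like trace
    "CreateFile ReadFile CloseHandle",               # hypothetical benign trace
    "RegOpenKey RegSetValue CreateProcess",          # hypothetical persistence-like trace
    "CreateFile WriteFile CloseHandle",              # hypothetical benign trace
]
labels = [1, 0, 1, 0]  # 1 = malicious, 0 = benign (illustrative only)

# 1- and 2-grams of API calls capture short behavioural patterns. Note how
# "CreateFile WriteFile" occurs in both classes -- exactly the feature overlap
# described above, and a common source of false positives.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(traces, labels)
print(model.predict(["CreateFile WriteFile CryptEncrypt"]))
```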
Sounds like fun; things were simpler back in the day..
Hence, ML is not what it is sold to be
Really interesting, thanks
I think there might be a mistake in the diagram at 17:10. The red slice should be test data and the remaining slices should be used for training.
In any case, great video once again.
I would argue machine learning is already very prevalent in industry. Having worked in malware detection at both Microsoft and Amazon, I can say we leverage large tree models and even large language models for detection.
It depends. Massive organizations that don't focus on tech as a core competency can sometimes be very, very slow at adopting the best tools because it is hard for them to even understand what the best tools are, or come up with a framework for comparing tools. Especially governments.
Look at cybersecurity vendors (Fortinet, Palo Alto) - they already apply this. It's natural that MS or Amazon, as infrastructure companies, have no tradition in this.
@@ZandarKoad I can say, at least for the companies I've consulted for, that they use some form of SIEM or other IDS. A lot of the ML side of things is handled by the SIEM vendors, while the companies themselves just have to comb through alerts and identify FPs.
Yup! Totally agree that you can detect previously detected 'models' of current threats, but you are still unable to detect an emerging threat using ML. It is an 'informational problem' that this professor clearly discusses.
I could talk to that guy over a pint for like three hours. He's oversimplifying here for a general viewer but this topic is fascinating. Thanks for the video.
There is another way to think about this issue, one that is not talked about as much: separating data systems, and the data itself, into public and private data. Because of the push for online usability and transparency, much of the data is exposed to all these forms of attack, and the monetisation of data and proprietary IP creates a reason to profit from it on both sides of the data fence. If you cannot access it directly, it is less likely to be stolen; if the stored information is not valuable, it becomes pointless to steal; if the identity requirements are removed or reduced, the identity is worth less. Everything is a trade-off. Pattern-matching machine algorithms (ML & AI) are limited by their parameters.
Well said - all in the name of convenience for the user and exploitation by anyone who handles the data.
Using ML to group different types of malicious applications into different families makes the process of malware detection more adaptive, yet we are still getting zero days where a malicious application succeeds by appearing benign.
In the medical sciences, there have been many problems, discovered later, where the features used by ML did not accurately predict on new data. This is because researchers let the ML program determine its own features, and the ML program lacked domain expertise. This has resulted in many new companies heavily investing in PhD researchers to prepare the data and relevant features to then run in the model.
In cybersecurity, we will still need the human element for similar reasons.
Can you give some examples of this? I am curious to read about it
I'm on a team that releases a free open source app. For a while, every time we released a new version we would get a handful of false positive reports from users whose virus scanners tripped on it. Seems like some of the companies just give up and flag everything that isn't in their whitelist when faced with an essentially unsolvable task.
Nope, they don't give up - they have FPs that they sadly don't handle, and that is part of the "lazy" way that was described in the signature approach. I.e. they use badly written indicators and leave the detection engine with too much weight on that portion. Sometimes it's the odd coding of the program itself as well.
I assumed this was going to be about malware that uses machine learning. Terrifying.
Prevalence data and diversity of behaviour are two important criteria. It's difficult to mount an adversarial attack on models that are behaviour-dependent. These modern ML approaches to cyber security use static and dynamic behaviour encoding to stop malware; Cylance's ML models are an example.
Thanks for uploading in 4K
It's not quite over-fitting; it's just trained for different threats. The problem is that the patterns change, as if a panda suddenly didn't mean panda but dog, and the ML system cannot adapt to that.
Maybe a more fitting image would be if you had a few pictures of pandas in your training data and the ML system recognised them as pandas very well, but now the context has changed and dogs are also pandas. So it should recognise dogs as pandas, but it doesn't, because it has either been trained to recognise dogs as dogs or not been trained on them at all, and the images look so different that it has no way of linking the dog to the panda.
It doesn't help that a lot of false positives come from detectors actively equating software piracy with malware. In many cases the techniques are similar, so the issue cannot be entirely dismissed, but even when the techniques are exclusive to piracy, vendors often have a strong incentive to keep flagging them as "malware" - particularly companies that write both detectors and high-profile commercial software, such as Microsoft itself, or that are incentivised by them.
I feel like many areas of modern ML, including this one, either do or could benefit greatly from continual learning (which, from my understanding, is synonymous with iterative online learning; if they're different, I'd appreciate an explanation of how!). Now, if only we could make that practically efficient on the massive networks of hundreds of billions of parameters or more 😁
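To make the continual-learning idea concrete, here is a minimal sketch of incremental (online) updates using scikit-learn's partial_fit; the data stream and the drifting labels are synthetic placeholders, and real detectors would obviously be much larger models.

```python
# Sketch of online/incremental learning: update the model batch by batch
# instead of retraining from scratch. Data here is synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier()       # linear model trained with stochastic gradient descent
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Pretend each batch is a day's worth of newly labelled samples, with slow concept drift.
for day in range(5):
    X_batch = rng.normal(size=(200, 20))
    y_batch = (X_batch[:, 0] + 0.1 * day > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)  # incremental update, no full retrain

print(clf.score(X_batch, y_batch))  # accuracy on the most recent batch
```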
I'd recommend reading more on ML and what scale is being worked on right now... from your comment I got the feeling you think a billion parameters is too much of a challenge, which it isn't. I'd recommend you check out Hugging Face.
@@prashantd6252 Well, training a billion parameters is not a problem. Spending $5k on AWS/Azure/Google processing power is a problem.
I like those markers/pens. :)
So machine learning models, such as classifiers, require a labelled dataset for supervised training.
So are there datasets of malware? Maybe like the vx-underground vault?
Actually, there are ways to safely implement this: use it as a trigger value, not as the decision engine. Drilling down into the actual detection tree, there are many different ways of compromise, but they can be handled and they are still limited; in short, keeping track of execution, persistence and escalation is the first step, with this as a possible helper.
"EDR/XDR" can be quite sufficient at spanning a larger chain of "observant" behaviour, i.e. the detection engine itself does not have to utilise it, but acting on and piecing data together does benefit from this field.
I do agree, however, that taking on the whole chain of compromise gets really tricky.
Static and/or dynamic binary analysis is only a small portion of the whole indicator chain, but training something on the actual portions - be it a buffer overflow etc. - can be useful, in my opinion.
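As a toy illustration of the "trigger value, not decision engine" idea, here is a sketch where an ML score only escalates scrutiny and concrete behavioural indicators make the call. All indicator names, thresholds and scoring are invented for illustration, not taken from any real EDR.

```python
# Toy sketch: the ML score is a trigger, not the verdict.
from dataclasses import dataclass

@dataclass
class Observation:
    ml_score: float          # output of a static/dynamic ML classifier, 0..1
    spawned_process: bool    # execution indicator
    wrote_run_key: bool      # persistence indicator
    token_elevated: bool     # escalation indicator

def verdict(obs: Observation) -> str:
    # The ML score alone never blocks anything; it only raises scrutiny.
    if obs.ml_score < 0.7:
        return "allow"
    # Act only when the suspicious score coincides with concrete behavioural
    # indicators along the execution -> persistence -> escalation chain.
    hits = sum([obs.spawned_process, obs.wrote_run_key, obs.token_elevated])
    if hits >= 2:
        return "block"
    return "alert"  # hand it to an analyst instead of interrupting the user

print(verdict(Observation(0.9, True, True, False)))    # block
print(verdict(Observation(0.9, False, False, False)))  # alert
print(verdict(Observation(0.3, True, False, False)))   # allow
```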
It seems to me that the hunt for bells, whistles and bling in applications leads to an enlarged attack surface, which is what lets malware in.
I wrote a secure interface (a long time ago), it was doable because the range of API calls I had to intercept was very limited and I could parse all possible legit parameters and reject the rest. The code was documented and could be checked by my peers.
Move to a GUI based environment with more levels of abstraction and the operating system being invoked the whole time for sound or video or malice - no chance.
Security starts from the operating system (disclaimer - Windows user - I do hope the antivirus people know their stuff).
Were you bragging dude? 😂
ML evaluates malware as adversarial code execution that is malicious - a malicious.identity, if you like. Its detection relies on behaviour that is itself a signature, a representation unique enough to recognise that it has been deployed. How is a behaviour signature not like a fingerprint?
I agree, it is like fingerprints. However, each iteration, just like a fingerprint, differs to an extent that you can't rely on it alone.
It would be one facet of detection, like the MO (modus operandi) in a crime.
Fingerprinting is "specific"... I like the MALICIOUS.IDENTITY object - very handy! You could call it a signature, but that wouldn't really be accurate. A specific code execution "process" occurring on the CPU is what is being detected, right?
It seems more interesting to write infections with ML that create detection nets
Oh right - check out Christopher Domas's talk "The Future of RE: Dynamic Binary Visualization". I'd bet you'd have much better luck feeding the data in with various transformations, like a Hilbert curve, giving it a semantic structure to deal with. It might even work with an image-recognition algorithm then, too... maybe.
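For anyone curious, here is a rough sketch of the Hilbert-curve idea: bytes that are close together in the file land close together in a 2D image, which an image classifier could then look at. The order/size values are arbitrary, and Domas's actual tooling does far more than this.

```python
# Rough sketch: map a binary's bytes onto a Hilbert curve so that locality
# in the file becomes 2D locality in an image.
import numpy as np

def d2xy(n: int, d: int) -> tuple:
    """Convert distance d along a Hilbert curve filling an n x n grid
    (n a power of two) into (x, y) coordinates."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:          # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def bytes_to_hilbert_image(data: bytes, order: int = 8) -> np.ndarray:
    """Place each byte value at its Hilbert-curve position in a 2^order square image."""
    n = 1 << order                        # e.g. 256 x 256 pixels
    img = np.zeros((n, n), dtype=np.uint8)
    for d, b in enumerate(data[: n * n]):
        x, y = d2xy(n, d)
        img[y, x] = b
    return img

# Usage (hypothetical path): feed the resulting image to an image classifier.
# img = bytes_to_hilbert_image(open("sample.bin", "rb").read())
```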
Heard 20 seconds of the video, and… yes, he's Italian like me.
Stepping aside from this inside joke, great content!
Very fun to learn about.
Just wait until ChatGPT can write better malicious software.
If only it understood what it's writing...
Fantastic!
It's heuristics - educated guessing - because the halting problem is still out there, so you can guess, but you'll never be able to prove whether a target is malware or not.
I guess that's only when you treat it as a black box; with a white box you could know what it is.
I don't know why I watched the whole thing even though I don't understand it.
I can't express in words how much all the empty shelves in this video bother me. Why have all these shelves if you're not going to use them!?
You cannot use a computer to detect malware reliably. It is mathematically impossible, since it would require the halting problem to be solvable on a PC, which it isn't.
It has been a while since I've seen BASIC code! 😂
What API said...
Is there a point in talking about this when Windows 11 has become malware itself?
Just waiting on ChatGPT to write some good malware.
I have a question.
Cross-validation is a method that lets a machine learning model make use of all the data (with n folds you can split into training and validation). But I'm confused about accuracy: we still need a test set to check the model again, right? Because a model trained with cross-validation can still overfit.
I think cross-validation is used when you have little data, and we still need a held-out test set to check against. If you have enough data, you don't need cross-validation, right?
Sorry for my English.
Normally, cross-validation is used for setting the hyperparameters of a machine learning model. First you split your dataset into a training and a test set, say 70/30. Then you use k-fold cross-validation on the training set. What happens is that a model is trained k times (k is a number you choose; the higher k, the better your hyperparameter estimates, but the more time you spend cross-validating, as the model needs to be retrained each time).
Each time the model is trained during k-fold cross-validation, the training dataset - the 70% of all the data you had at the beginning - is split again, let's say 90/10. The model is trained on that 90% and evaluated on the remaining 10% of validation data. After repeating this k times, we select the hyperparameter value that scored highest on average across the validation folds.
Finally, to check for overfitting, we evaluate the model on the completely unseen test data, the 30% of the original data that we kept aside during training.
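For reference, a minimal scikit-learn sketch of that procedure; the 70/30 split, k = 10 and the hyperparameter grid are arbitrary choices, and the dataset is synthetic.

```python
# 70/30 train/test split, 10-fold CV on the training portion to pick a
# hyperparameter, then a single evaluation on the untouched test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70/30 split; the 30% test set stays untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 10-fold CV on the training set: each fold acts once as the 10% validation slice.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=10)
search.fit(X_train, y_train)

print("best C:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))  # unseen data, the overfitting check
```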
@@SuperCaptain4 Thank you so much, I get it now.
We need MLware that uses ML to penetrate and replicate across systems. Imagine a GPT-powered worm. Self-generating zero-days. I recommend open-source LLMs like BLOOM to get started.
The computation requirements to run GPT would have to be much lower than today, as not all servers have enough computing power to run such a model. On the other hand, I can imagine a trained AI model that could analyse binaries or source code and create zero-day approaches based on the input.
You could use the same method as actual viruses and randomly mutate the code a million times on every already-infected system until some variant actually works, which is then sent outward to penetrate new hosts. It's incredibly slow, but requires less computing than GPT.
Give us closed captions, please! His accent is difficult to follow.
0th
Hey sir