Improving the efficacy of an AI/ML model to predict exploitability scores for vulnerabilities

In an earlier blog, I talked about how machine learning can help predict an exploitability score for a vulnerability. In this blog, I will elaborate on some of the finer aspects and then compare the approach with an available alternative.

From a machine learning angle, it is important to identify and include relevant features and to use a balanced dataset for training the model (no new learning there :)). I don’t intend to dive deep into these aspects, as there is already enough material highlighting their importance. However, I would like to draw attention to another important aspect: testing the efficacy of the trained model against an extreme negative dataset.

Let me explain in more detail. Around 5% of all vulnerabilities out there in the wild have an exploit, so the dataset is inherently heavily skewed. While training the machine learning model (a neural net in our case), we created a balanced dataset. This produced pretty good results in terms of efficacy and accuracy. However, given that only 5% of vulnerabilities are actually exploited, we realized it was prudent to re-certify the trained model with an extreme negative dataset, i.e. a dataset containing older vulnerabilities which have not seen any “known” exploits over the last 4+ years. Note that ensuring a long enough window (the last 4+ years) is important, since you don’t want to classify something very recent as not exploitable already.
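The two datasets described above can be sketched roughly as follows. This is only an illustration, not our actual pipeline: the record fields (`cve`, `published`, `exploited`) and both function names are assumptions made up for the example, and the balancing is done by simple undersampling.

```python
import random
from datetime import date, timedelta


def balanced_sample(records, seed=0):
    """Undersample so that exploited and unexploited records are equal in number."""
    exploited = [r for r in records if r["exploited"]]
    unexploited = [r for r in records if not r["exploited"]]
    n = min(len(exploited), len(unexploited))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(exploited, n) + rng.sample(unexploited, n)


def extreme_negative_set(records, today, min_age_years=4):
    """Keep unexploited records that are at least `min_age_years` old.

    The age window is approximated as 365.25 days per year.
    """
    cutoff = today - timedelta(days=round(365.25 * min_age_years))
    return [
        r for r in records
        if not r["exploited"] and r["published"] <= cutoff
    ]
```

The training set would come from `balanced_sample`, while `extreme_negative_set` is held back purely for re-certifying the trained model.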

We found that the test results with this extreme negative dataset gave us real insight into which features to include or exclude, including features that had seemed less relevant in the earlier balanced dataset results. We will talk about features in a minute, but the key learning was to include appropriate testing with a negative dataset to understand how the model performs.
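One way to make this kind of check concrete is to measure how often a model flags vulnerabilities from the extreme negative dataset as exploitable, and to compare that rate across models trained on different feature sets. The sketch below assumes each model is exposed as a plain scoring function returning a value between 0 and 1; the function names, the 0.5 threshold and the toy scorers are all hypothetical.

```python
def false_positive_rate(predict, negatives, threshold=0.5):
    """Fraction of known-unexploited vulnerabilities scored at or above the threshold."""
    if not negatives:
        raise ValueError("extreme negative dataset is empty")
    flagged = sum(1 for vuln in negatives if predict(vuln) >= threshold)
    return flagged / len(negatives)


def compare_models(models, negatives, threshold=0.5):
    """Map each model name to its false positive rate on the negative dataset."""
    return {
        name: false_positive_rate(predict, negatives, threshold)
        for name, predict in models.items()
    }


# Toy stand-ins for two trained models (illustrative only, not real scoring rules).
negatives = [
    {"cvss": 9.8, "has_patch": True},
    {"cvss": 4.3, "has_patch": True},
    {"cvss": 7.5, "has_patch": False},
    {"cvss": 3.1, "has_patch": True},
]
models = {
    "cvss_only": lambda v: v["cvss"] / 10,
    "cvss_plus_patch": lambda v: v["cvss"] / 10 * (0.4 if v["has_patch"] else 1.0),
}
rates = compare_models(models, negatives)
```

Two feature sets that look equivalent on the balanced data can show quite different false positive rates here, which is the signal we used when deciding what to keep.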

In the earlier blog, we mentioned certain features such as the CVSS vector fields (Access Vector, Access Complexity, etc.), the CVSS score and more. Many more features can be considered for inclusion, such as the type of weakness (Common Weakness Enumeration [CWE]), vendor or origin of the software, product name, product version, patch history, and more. At times it might seem that some of these features do not play an important role, especially when testing with a balanced dataset. However, tests with an extreme negative dataset may provide a different perspective altogether.
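To give a feel for how such features might be fed to a model, here is a minimal sketch that one-hot encodes a CVSS v2 base vector and adds the score and CWE. The metric abbreviations are the standard CVSS v2 ones; the function names, the input record layout and the feature naming scheme are assumptions for illustration, not the encoding we actually use.

```python
# Allowed values for each CVSS v2 base metric.
CVSS2_VALUES = {
    "AV": ["L", "A", "N"],   # Access Vector
    "AC": ["H", "M", "L"],   # Access Complexity
    "Au": ["M", "S", "N"],   # Authentication
    "C": ["N", "P", "C"],    # Confidentiality impact
    "I": ["N", "P", "C"],    # Integrity impact
    "A": ["N", "P", "C"],    # Availability impact
}


def encode_cvss2(vector):
    """One-hot encode a vector string such as 'AV:N/AC:L/Au:N/C:P/I:P/A:P'."""
    parts = dict(item.split(":", 1) for item in vector.split("/"))
    features = {}
    for metric, allowed in CVSS2_VALUES.items():
        value = parts.get(metric)
        if value not in allowed:
            raise ValueError(f"bad or missing value for {metric}: {value!r}")
        for option in allowed:
            features[f"{metric}_{option}"] = 1 if value == option else 0
    return features


def encode_vulnerability(vuln):
    """Combine CVSS vector fields, a scaled CVSS score and the CWE into one feature dict."""
    features = encode_cvss2(vuln["cvss_vector"])
    features["cvss_score"] = vuln["cvss_score"] / 10.0  # scale 0-10 down to 0-1
    features[f"cwe_{vuln['cwe']}"] = 1
    return features
```

In a real pipeline the CWE, vendor and product columns would need a fixed vocabulary so that every record produces the same set of feature names.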

Earlier in this blog, when I mentioned an alternative means of predicting the exploitability score, I was referring to the Exploit Prediction Scoring System (EPSS), joint research from Cyentia Institute and Kenna Security released at Black Hat 2019. Here is the gist of the research: the dataset comprised around 25K vulnerabilities published between June 1, 2016 and June 1, 2018. This dataset was processed and analyzed using machine learning to arrive at 16 key factors for predicting exploitability. The key factors can be bucketed into:

Source of origin of the software i.e. vendor
Type of attack or CWE
Information related to exploit
Count of related references tracked in the CVE

The research derives a formula that combines these key factors into the exploitability score.
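To illustrate why such a formula fits in a spreadsheet, here is a sketch that assumes it has the shape of a logistic regression: a weighted sum of the factors passed through a sigmoid. The factor names, the weights and the intercept below are invented for the example and are NOT the published EPSS values.

```python
import math

# Hypothetical intercept and weights, chosen only to show the mechanics.
INTERCEPT = -6.0
WEIGHTS = {
    "vendor_is_major": 1.0,
    "cwe_is_memory_corruption": 0.5,
    "exploit_code_published": 2.5,
    "reference_count_log": 0.8,
}


def exploit_probability(factors):
    """Weighted sum of the supplied factors, squashed into (0, 1) by a sigmoid."""
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in factors.items())
    return 1.0 / (1.0 + math.exp(-z))


baseline = exploit_probability({})
with_exploit = exploit_probability({"exploit_code_published": 1})
```

With a formula of this shape, everything the model has learned lives in the weights, so the score is only as current as the data those weights were last fitted on.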

While I like the simplicity of the formula, which lets users incorporate it into a standard spreadsheet (an online EPSS calculator is available here), there are some drawbacks and concerns:

The simplification relied on multiple stages of filtering, each of which reduced the number of variables involved. For example, the raw count of vendors was originally 3374, which was reduced through three stages of filtering (3374 -> 171 -> 16 -> 15). These stages aided simplification, though at the cost of losing some information or detail.
An inherent drawback is that the weights in the formula for the 16 key factors cannot be frozen in time; they will need to be revised at a regular cadence. Without these much-needed revisions, the outcome from EPSS cannot be relied upon. Note that revising the weights is not straightforward, as it would require re-training the model.

Attenu8 comprises multiple AI/ML models, all of which are periodically retrained, along with a much-needed feedback loop to eliminate incorrect predictions over time.
