Part 5: Fuchikoma v2 – Cybots AI

How to Train a Machine Learning Model to Defeat APT Cyber Attacks

Part 5: Fuchikoma v2 - Jab, Cross, Hook, Perfecting the 1-2-3 Combo

This is Part 5 in a multi-part series of articles about how CyCraft Senior Researcher C.K. Chen and team used open-source software, step by step, to design a working threat hunting machine learning model. We suggest starting from Round 1: Introducing Fuchikoma.

How to Train a ML Model to Defeat APT Cyber Attacks

Round 1: Introducing Fuchikoma

Round 2: Fuchikoma VS CyAPTEmu: The Weigh In

Round 3: Fuchikoma v0: Learning the Sweet Science

Round 4: Fuchikoma v1: Finding the Fancy Footwork

Round 5: Fuchikoma v2: Jab, Cross, Hook, Perfecting the 1-2-3 Punch Combo

Round 6: Fuchikoma v3: Dodge, Counterpunch, Uppercut!

In preparation for the second round of MITRE ATT&CK evaluations, C.K. Chen and team went about designing an emulation of an APT attack, which they named CyCraft APT Emulator or CyAPTEmu for short. CyAPTEmu’s goal was to generate a series of attacks on Windows machines. Then, a proof of concept threat hunting machine learning (ML) model was designed to specifically detect and respond to APT attacks. Its name is Fuchikoma.

Fuchikoma v0: the baby years

The Fuchikoma v0 thought experiment gave insight into the four main challenges of designing a threat hunting ML model: a weak signal, imbalanced data sets, a lack of high-quality data labels, and the lack of an attack storyline.

Fuchikoma v1: entering childhood

Fuchikoma v1 resolved the first challenge: having a weak signal. An analysis unit (AU) builder was introduced into the ML pipeline; each process creation event was altered into an AU–a mini process tree that links the original process creation event to its parent and three tiers of child processes. TF-IDF was used for vectorization of the command lines of the processes into the Unit2Doc. Now that each event had useful and contextual information as an AU, ML algorithms could group similar AUs into clusters, leaving our investigators with significantly less labeling to be done. Unfortunately, k-means (the chosen ML algorithm) wasn’t enough on its own. While clustering may yet prove to be useful, highlighting outliers was chosen as it could address some of the drawbacks of clustering. It was decided that Fuchikoma v2 would need more components to its pipeline.
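To make the AU-to-vector step more concrete, here is a minimal, dependency-free sketch. It assumes each AU can be flattened into one "document" of command-line tokens and weighted with the textbook TF-IDF formula. The sample command lines, the `unit_to_doc` helper, and the exact weighting are illustrative assumptions, not the team's actual Unit2Doc implementation.

```python
import math
from collections import Counter

# Hypothetical analysis units: each AU is the list of command lines found in
# its mini process tree (parent, the event itself, and child processes).
analysis_units = [
    ["explorer.exe", "cmd.exe /c whoami", "whoami"],
    ["explorer.exe", "cmd.exe /c netstat -ano", "netstat -ano"],
    ["services.exe", "powershell.exe -enc SQBFAFgA", "cmd.exe /c whoami"],
]

def unit_to_doc(au):
    """Flatten one AU into a single token list (one 'document' per AU)."""
    return " ".join(au).lower().split()

def tf_idf(docs):
    """Return one {token: weight} dict per document using plain TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return vectors

docs = [unit_to_doc(au) for au in analysis_units]
vectors = tf_idf(docs)

# Tokens that appear in every AU (such as "cmd.exe") get a weight of zero,
# while rarer tokens (such as "powershell.exe") stand out.
for vector in vectors:
    print(sorted(vector.items(), key=lambda item: -item[1])[:3])
```

The resulting sparse vectors are what a clustering algorithm such as k-means would then group by similarity.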

Fuchikoma v2: fighting its way through hormones and high school

As discussed in Part 3, only 1.1 percent of the process creation events in our dataset were malicious. Since the number of attack activities is an incredibly low percentage of one day’s total events, C.K. Chen and team reasoned malicious activities would be abnormal compared to other activities.

Therefore, an anomaly detection component was applied to find the most abnormal AUs. Different algorithms specializing in anomaly detection were implemented: Local Outlier Factor (LOF), IsolationForest (IF), and DBScan.

LOF (Local Outlier Factor) is essentially a score that tells how likely a particular data point is to be an outlier/anomaly (or in Fuchikoma’s terms, malicious). More specifically, LOF measures the local deviation of the density of a given sample with respect to its neighbors. Samples that have a substantially lower density than their neighbors are considered outliers.
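The density comparison behind LOF can be sketched in a few lines of plain Python. This is a simplified illustration on made-up 2D points, not Fuchikoma's code; in practice a library implementation such as scikit-learn's `LocalOutlierFactor` would normally be used. The sketch also assumes there are no duplicate points.

```python
import math

def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding the point itself).
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]

    # LOF: average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Four tightly packed (hypothetical) points and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = local_outlier_factor(points, k=2)
print(scores)
```

Points inside the tight group score about 1 (their density matches that of their neighbours), while the distant point scores far higher.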

IF (IsolationForest) isolates each point in a data set by repeatedly splitting the data at random, then classifies points as outliers or inliers (anomalous or normal) according to how many splits it takes to separate them. Outliers should be easier to separate since fewer conditions are needed to isolate them from inliers. IF could be useful for huge, high-dimensional data sets such as Fuchikoma’s.
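The "fewer conditions to separate" idea can be illustrated with a simplified sketch. It follows the path one point would take through many random isolation trees and averages the number of splits needed. The data and parameters are made up, and a real implementation (for example scikit-learn's `IsolationForest`) adds subsampling and score normalization that are omitted here.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Pick a random feature and a random split value inside its range.
    feature = rng.randrange(len(point))
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [
        row for row in data
        if (row[feature] < split) == (point[feature] < split)
    ]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees (lower = more anomalous)."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Hypothetical data: a tight cluster plus one distant point.
data = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.4, 0.4),
    (0.25, 0.15), (0.05, 0.35), (0.35, 0.05), (10.0, 10.0),
]
outlier_depth = average_depth((10.0, 10.0), data)
inlier_depth = average_depth((0.25, 0.15), data)
print(outlier_depth, inlier_depth)
```

The distant point is usually cut off by the very first split, so its average depth is much lower than that of a point in the middle of the cluster.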

DBScan (Density-Based Spatial Clustering of Applications with Noise) groups together points that are close to each other based on a distance measurement and a minimum number of points. DBScan is especially useful for data that contains clusters of similar density, such as Fuchikoma’s benign clusters. DBScan also highlights outliers in low-density regions, which helps Fuchikoma locate those pesky malicious events even more accurately.
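A compact sketch of the DBScan idea follows: the distance threshold (`eps`) and the minimum number of points (`min_samples`) decide which points form dense clusters and which are left as noise. The points and parameter values are invented for illustration; scikit-learn's `DBSCAN` class is the usual production choice.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE (-1) for outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
        cluster += 1
    return labels

# Two dense (hypothetical) groups and one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (5, 5), (5, 6), (6, 5), (6, 6),
    (20, 20),
]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In Fuchikoma's terms, the dense clusters play the role of benign activity, and the points labeled as noise are the abnormal AUs handed to investigators.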

Some false alerts might still occur and need manual analysis; however, benign clusters, the larger group in our dataset (98.9 percent), should no longer need to be investigated. This would significantly reduce the investigating team’s workload and have the added benefit of resolving the second challenge, imbalanced data sets, as the investigators would now only be labeling a small set of abnormal data.

The results were encouraging.

IF was the terrible first date. LOF met your friends. DBScan met your parents–the keeper.

As mentioned in Part 2 of this series, the goal of CyCraft APT Emulator (CyAPTEmu) is to generate attacks on Windows machines in a virtualized environment. CyAPTEmu sends two waves of attacks, each utilizing a different pre-constructed playbook. Empire was used to run the first playbook, modeled after APT3. Metasploit was used to run the second playbook, which C.K. and team called Dogeza.

DBScan outperformed LOF and IF in the APT3 playbook, but how would DBScan fare against C.K. and team’s custom Dogeza playbook?

DBScan outperformed LOF and IF in both the APT3 and Dogeza playbooks

If you’re confused by Fuchikoma’s charts above, don’t be; you probably just aren’t too familiar with a confusion matrix (also known as an error matrix), which is the backbone of statistical classification in ML. A true positive (TP) is when our friend Fuchikoma correctly identifies a malicious event. A true negative (TN) is when Fuchikoma correctly identifies a benign event. Likewise, a false positive (FP) is a benign event that Fuchikoma flags as malicious, and a false negative (FN) is a malicious event that Fuchikoma misses.
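As a small illustration, the four cells of a confusion matrix can be tallied from a list of true labels and a list of predictions. The labels below are invented (1 = malicious, 0 = benign) and are not Fuchikoma's results.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = malicious, 0 = benign)."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["TP"] += 1  # malicious event correctly flagged
        elif truth == 0 and pred == 0:
            counts["TN"] += 1  # benign event correctly ignored
        elif truth == 0 and pred == 1:
            counts["FP"] += 1  # benign event flagged by mistake
        else:
            counts["FN"] += 1  # malicious event that was missed
    return counts

# Hypothetical ground truth and model output for eight events.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(dict(counts))  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```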

In ML, there are performance metrics that help evaluate and improve ML models, such as recall, precision, and accuracy. Each is a technical term suited to a different case, much like pentesting, vulnerability assessments, and red team evaluations: all are important, but each is required for a different task.

One performance metric you will see a lot is the F1-score, also known as the F-score or F-measure, which is the harmonic mean of precision and recall. The F1-score is used when the cost of both false positives and false negatives is high. Not being able to identify malicious activities (false negatives) could be disastrous for an organization and, at worst, could cost hundreds of millions in breach recovery and fines. At the same time, we don’t want to overwhelm our team of security analysts with false positives, as spending time on false positives could also be disastrous when facing an attack.

The F1-score is used to evaluate ML models whose goals are to keep false positives and false negatives to a minimum if not zero. F1-scores range from 1 (a model with perfect precision and recall) to 0 (a model that makes you question what you’ve been doing with your life). In terms of Fuchikoma, C.K. Chen and team represented the F1-score as a percentage. Fuchikoma v2 scored 99.11 percent on its F1-score.
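For readers who want to see the arithmetic, the sketch below computes precision, recall, and the F1-score (the harmonic mean of the two) from confusion-matrix counts. The counts are hypothetical and chosen only to show the calculation; they are not Fuchikoma's actual numbers.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Precision: of everything flagged as malicious, how much really was?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that really was malicious, how much was flagged?
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 9 true positives, 1 false positive, 3 false negatives.
precision, recall, f1 = precision_recall_f1(tp=9, fp=1, fn=3)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
# precision=90.00% recall=75.00% f1=81.82%
```

Because it is a harmonic mean, the F1-score is pulled toward the weaker of the two metrics, so a model cannot score well by excelling at only one of them.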

Now that we’ve had a simplistic introduction to the basics of ML, let’s check back in with our team of elite investigators at Section 9 and hear their evaluation of Fuchikoma v2.

Section 9, Ghost in the Shell: S.A.C. 2nd GiG (2002, Production i.G.)

Challenge One: Weak Signal [RESOLVED]
Single events in isolation do not contain enough information to determine if they are a threat or not. Data needs to be contextual in order for it to be useful. Analysis units, which contain contextual child and parent process information, were added into the ML pipeline and are then clustered and labeled later in the pipeline.

Analysis Unit consisting of TF-IDF vectorized command lines

In our boxing ring metaphor, this means Fuchikoma can relate everything it sees to everything else. The wooden stool in CyAPTEmu’s corner isn’t related to the camera flashing in the background. The backward swing of our opponent’s arm is definitely related to the punch that is now quickly speeding towards us. Ouch, was that a glove contacting my face? While Fuchikoma is able to relate events to each other through contextual data within each AU, our friend Fuchikoma still has trouble determining causality.

Challenge Two: Imbalanced Data Sets [RESOLVED]

As stated before, a typical workday in an organization’s environment could see billions of diverse events, only a tiny portion of which (1.1 percent in the training data) would actually be related to a real attack. This massive imbalance between normal and malicious data created two big problems: (1) inefficient labeling time and (2) a less than ideal number of malicious samples. However, because anomaly detection was prioritized, benign clusters (98.9 percent of the training data) no longer needed to be labeled, dramatically reducing both the amount of data to be labeled and the time needed to label it.

Fuchikoma v2: DBScan outperformed LOF and IF in both the APT3 and Dogeza playbooks.

Fuchikoma has come a long way in the boxing ring. Fuchikoma v0 was too busy labeling everything to significantly participate in the fight. That’s an arm. That’s a glove. That’s a punch. Those are stars spinning around my head.

Fuchikoma v1 was less distracted as it didn’t need to generate that many labels; however, finding CyAPTEmu out of all the noise still proved too much to handle. Remember that only 1.1 percent of everything happening in this boxing ring is malicious. Fuchikoma v1 was practically fighting against a mostly invisible opponent.

Fuchikoma v2 is now finally able to eliminate the 98.9 percent “benign noise” and focus solely on the remaining 1.1 percent of malicious activity. However, some malicious activity will go unseen as it’s identical to benign activity (e.g., “netstat” or “whoami”), and some false positives will still occur. Fuchikoma v2 will see all of the punches thrown, but will still get hit a few times and block when it’s unnecessary. Training makes all the difference!

Challenge Three: High-Quality Labels
One of Fuchikoma’s key objectives is to automate the alert verification process; however, both Fuchikoma v0 and Fuchikoma v1 could only verify very specific attacks, pretty much exact command lines. This is a problem because attackers tend to use variations of a theme and string combinations of specific attacks together. While the addition of anomaly detection in Fuchikoma v2 did not completely resolve this challenge, it did dramatically reduce the number of events to be analyzed.

Fuchikoma v2 would be able to identify the first jab it encounters. A second or third jab at a different angle or speed could land and knock Fuchikoma out. However, as mentioned by our other investigators, Fuchikoma v2 is far more focused than its predecessors. While similar attacks could land, Fuchikoma should be able to see most of the attacks and less of the noise.

Challenge Four: No Storyline
Detecting one piece of malware in isolation isn’t enough to fully understand from a forensic perspective what malicious activity is occurring on your network. Worse yet, security analysts might miss something when presented with a smattering of isolated events. Our Section 9 investigating team demands an automated attack storyline to increase their SOC efficiency.

It’s important to note that Fuchikoma v2 doesn’t directly detect attacks per se but accurately detects abnormal events, which attacks would generate. An external intelligence (our Section 9 SOC team) would still need to analyze these abnormal groups to determine if an attack was taking place.

We’ve heard from our investigators; however, how did Fuchikoma really do in the ring?

We are the champions! We are the champions! No time for losers. ‘Cause we are the champions of the world!

Fuchikoma did it! Knocking down CyAPTEmu was quite the challenge to overcome, but one knockdown isn’t enough to win the match. In the next article in this series, we’ll continue exploring the steps C.K. Chen and team took in the development of Fuchikoma v3 and discuss integrating community detection to construct our long-desired attack storyline. We will not only break down how Fuchikoma v3 performed against both APT3 and Dogeza but also discuss the possible future applications of our friend, Fuchikoma.