Machine Learning Evasion

February 20, 2020 | 4 min read

BlueVoyant

“Life in the SOC” is a Blog Series that shares experiences of the BlueVoyant SOC defending against the current and prevalent attacks encountered by our clients. The blogs discuss successful detection, response and mitigation actions that can improve your defensive capabilities.

Machine Learning (ML) is a subset of the overall concept of Artificial Intelligence (AI). It refers to systems that learn from processed data, identifying characteristics they can use to make accurate predictions about new data. Security solution vendors are using ML to be more proactive in identifying emerging and unknown threats. ML can help security products analyze patterns, learn from them to prevent similar attacks, and respond to changing behavior.

This is achieved by supplying the learning engine with large amounts of data. However, the data itself needs to be of high quality: organized and structured in a way that gives the algorithms the rich context they need to develop patterns. That data goes beyond the threats themselves; it also covers everything happening around them, including information about the systems, applications, protocols, and even the detection sensors.
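As a rough illustration of what that looks like in practice, the sketch below trains a toy classifier on hypothetical, contextual telemetry records. The features, labels, and library choice are illustrative assumptions only, not any vendor's actual pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical labeled events: each record carries context beyond the file itself.
events = [
    {"protocol": "smb",   "parent_process": "winword.exe",    "entropy": 7.8, "signed": 0},
    {"protocol": "https", "parent_process": "chrome.exe",     "entropy": 5.1, "signed": 1},
    {"protocol": "smb",   "parent_process": "powershell.exe", "entropy": 7.6, "signed": 0},
    {"protocol": "https", "parent_process": "outlook.exe",    "entropy": 4.9, "signed": 1},
]
labels = [1, 0, 1, 0]  # 1 = malicious, 0 = benign (toy labels)

vectorizer = DictVectorizer(sparse=False)  # one-hot encodes the categorical context
X = vectorizer.fit_transform(events)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, labels)

# Score a new, unseen event using the same feature encoding.
new_event = {"protocol": "smb", "parent_process": "excel.exe", "entropy": 7.7, "signed": 0}
print(model.predict_proba(vectorizer.transform([new_event])))
```

The model is only as good as these records: how well they capture the surrounding context, and how accurately they are labeled.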

It is for these reasons that discussions around the use of AI and ML in cybersecurity are so popular. Some even view ML as the silver bullet for malware detection.

ML is a valuable tool in threat detection technology, but it is not a perfect tool.

One of the significant challenges researchers and the ML community need to confront is malware classification and detection. Identifying malicious programs used by threat actors is complicated. They employ advanced techniques such as polymorphism, impersonation, compression, and obfuscation to evade detection. Other challenges include limited domain expertise, which results in a lack of labeled samples and numerous labeling errors; imbalanced datasets; attacker-defender games; difficulty in identifying malicious sources; the tragedy of metrics; and more.

The lack of sufficient datasets is not the only issue when it comes to ML. As the security field continues to adopt AI and ML, so have Advanced Persistent Threat (APT) groups and nation-state actors. They read the same articles that security experts read. They buy and use many of the products security teams employ or promote. Their success depends on knowing exactly what defenders know, and on using that knowledge to defeat those defenses.

For example, on July 18, 2019, Motherboard published an article about how researchers tricked a security company's proprietary AI-based antivirus into thinking that malware was legitimate software. The article explains that the method did not involve altering the malicious code, as attackers generally do to evade detection. Instead, the researchers developed a “global bypass” method that works with almost any malware to fool the learning engine: simply take strings from a non-malicious file and append them to a malicious one, tricking the system into thinking the malicious file is benign.
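A minimal sketch of that overlay idea, purely for illustration (the file names are hypothetical, and the real bypass targeted one specific vendor's engine), could look like this:

```python
import re

def extract_strings(path, min_len=6):
    """Pull printable ASCII strings out of a (benign) binary."""
    data = open(path, "rb").read()
    return re.findall(rb"[ -~]{%d,}" % min_len, data)

def append_benign_strings(malicious_path, benign_path, out_path):
    """Append strings harvested from a benign file to the end (overlay) of a
    malicious file. The malicious code itself is never modified."""
    strings = b"\n".join(extract_strings(benign_path))
    with open(malicious_path, "rb") as f:
        payload = f.read()
    with open(out_path, "wb") as f:
        f.write(payload + strings)

append_benign_strings("sample.exe", "benign_game.exe", "sample_padded.exe")
```

The appended bytes never execute; they simply shift the static features the model weighs toward “benign.”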

The researchers verified this tactic against the WannaCry ransomware, which crippled hospitals and businesses around the world in 2017, the more recent SamSam ransomware, and the popular Mimikatz hacking tool. In nearly all cases, they were able to trick the learning engine.

Martijn Grooten, editor of Virus Bulletin, which conducts tests and reviews of malware detection programs, called the reverse-engineering research impressive and technically interesting. He was not surprised by the findings. “This is how AI works. If you make it look like benign files, then you can do this,” Grooten stated to Motherboard. “It mostly shows that you can’t rely on AI on its own.... AI isn’t a silver bullet.... I suspect it’ll get better at this kind of thing over time.”

Another example was published in a November ESET blog discussing a recent "Evasion Competition" put on by VMRay, Endgame, and MRG-Effitas in August 2019. In the competition, Jakub Debski, ESET's Chief Product Officer, was able to manipulate a header entry in the malware samples that was acting as a particularly strong feature for one of the ML classifiers.

The trick was to apply a customized UPX packer and then run a fuzzing script that convinced the classifier of the malware's “benign-looking” character. UPX is a common enough packer that many machine learning engines will not flag it on its own.
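As a hedged illustration of the header-manipulation half of that trick (the specific field and values used in the competition are not disclosed here, so the field below is purely an example), one could mutate a header entry with the pefile library and re-emit the binary:

```python
import pefile

pe = pefile.PE("packed_sample.exe")        # e.g., the output of a customized UPX run
print("before:", hex(pe.FILE_HEADER.TimeDateStamp))

# Overwrite a single header field; a fuzzing script would iterate values like this
# and keep whichever one flips the classifier's verdict.
pe.FILE_HEADER.TimeDateStamp = 0x5D000000
pe.write(filename="mutated_sample.exe")
```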

Theoretically, there are several ways in which ML could be defeated. There are gradient-based attacks, confidence score attacks, hard label attacks, surrogate model attacks, etc. However, during this Evasion Competition, a majority of the contestants used one of the following:

  • Appending extra data to the executable, also known as an overlay
  • Adding new sections to the executable, ideally sections taken from known benign files
  • Packing the samples with a runtime packer (a minimal sketch follows this list)
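The third technique is the simplest to picture: packing just wraps the sample with an off-the-shelf tool such as UPX. This sketch assumes the upx command-line binary is installed and on the PATH, and the file names are hypothetical.

```python
import subprocess

# Compress/wrap the sample with UPX; the packed file's bytes and section layout
# differ substantially from the original, which can throw off static ML features.
subprocess.run(["upx", "--best", "-o", "sample_packed.exe", "sample.exe"], check=True)
```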

This was a real-time competition. It would not be a stretch to think that threat actors are already using these evasion techniques today. It is almost a guarantee that, even as AI/ML matures in the cybersecurity space, APTs and nation-states are developing advanced evasion tactics in parallel.