A Breakthrough Tool for Automating Anti-Malware

The Challenge: Manual Process to Build Anti-Malware Rules Is Inefficient

The government’s Laboratory for Physical Sciences (LPS) is a unique agency where scientists from academia, industry, and government collaborate on research that advances the physics and engineering behind information science and technology. Booz Allen has long supported LPS’ research into advanced computing, machine learning, and cybersecurity, including the role that machine learning can play in addressing malware threats quickly and effectively.

In the battle against malware, cybersecurity analysts need to identify and classify malicious software. To do this, analysts group malware into families that share common code and traits. They often use a software tool called Yara, which works by searching for sequences of specific characters or bytes that are unique to known families of malware. Logical rules, known as Yara rules or “signatures,” tell the tool how to combine those sequences when deciding whether a file matches.
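To make the idea concrete, here is a minimal Python sketch of a signature: a set of byte sequences plus a logical condition over them. It is an illustration only. It does not use Yara's rule syntax or any Yara library, and the class name, patterns, and sample bytes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ByteSignature:
    """Toy stand-in for a Yara-style rule (hypothetical, not Yara's API)."""
    name: str
    patterns: Tuple[bytes, ...]   # byte sequences to search for
    min_hits: int                 # condition: at least this many patterns must appear

    def matches(self, data: bytes) -> bool:
        # Count how many distinct patterns occur anywhere in the data.
        hits = sum(1 for pattern in self.patterns if pattern in data)
        return hits >= self.min_hits


# Made-up patterns for an imaginary malware family.
rule = ByteSignature(
    name="toy_family_a",
    patterns=(b"\xde\xad\xbe\xef", b"evil_config", b"\x90\x90\x90\x90"),
    min_hits=2,
)

sample_hit = b"header" + b"\xde\xad\xbe\xef" + b"...evil_config..."
sample_miss = b"benign file with \x90\x90\x90\x90 only"

print(rule.matches(sample_hit))   # two patterns present
print(rule.matches(sample_miss))  # only one pattern present
```

The condition here is a simple "at least N of these sequences" test. Real rules can express richer logic, which is part of what makes writing them by hand slow.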

Yara rules are used in many situations, such as when responding to cybersecurity incidents, determining whether devices or networks have been compromised, and improving an organization’s defenses through proactive malware detection.

A longstanding problem with Yara rules, however, is that cybersecurity analysts need to build them manually. This manual process is tedious and highly time-consuming, even for seasoned cybersecurity pros. In many cases, it might take hours or days to write an effective Yara rule for certain classes of malware. For highly complicated cases, cybersecurity analysts may simply give up on creating the needed sequences and rules, because they have too many other tasks to do and not enough time. This is problematic given the amount of malware that exists (more than 1.3 billion malware samples have been identified) and the number of cyberattacks that occur.

The Approach: Using Machine Learning and Other Tools to Automate Yara Rules

In 2020, a team of Booz Allen cybersecurity researchers developed a novel way to use machine learning and other innovations to automate the process of building a Yara rule. The solution, called AutoYara, is a highly configurable tool that produces effective, accurate Yara rules in minutes or seconds, dramatically reducing the time typically needed. Moreover, AutoYara is compact enough to be deployed on a typical laptop or in a remote-network environment.

A Java-based software package, AutoYara incorporates three key innovative approaches:

  • KiloGrams are exceptionally large n-grams, or groups of computer-code characters, developed through machine learning algorithms. Booz Allen, working in partnership with researchers at LPS, the Department of Defense (DOD), the University of Maryland, and Elastic, an endpoint security firm, developed a groundbreaking method to find and use n-grams for malware analysis that can exceed a thousand characters or bytes without overtaxing computational resources. This new method is outlined in the 2019 peer-reviewed paper, KiloGrams: Very Large N-Grams for Malware Classification, published in the KDD Workshop on Learning and Mining for Cybersecurity. It enables cybersecurity analysts to leverage very large n-grams and create general-purpose signatures compatible with industry-standard tools like Yara.
  • Biclustering is a class of machine learning algorithm that helps decide the rules for how n-grams are associated (or clustered) together in the AutoYara tool to identify and classify specific malware. The biclustering algorithms within the AutoYara tool make it possible to develop Yara rules quickly while maintaining high effectiveness.
  • Bloom filters are compact, probabilistic data structures that record which items belong to a large set without storing the items themselves. Bloom filters are instrumental in accelerating the tool’s performance, while also keeping the tool compact enough (approximately 200 megabytes) to be usable on a typical laptop. In AutoYara, for example, Bloom filters act like compressed representations of a roughly one-terabyte malware database.
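As a rough illustration of how n-grams and a Bloom filter can work together, the Python sketch below collects byte n-grams shared by every sample in a family and discards any that a Bloom filter reports as already seen in benign files. This is not AutoYara's implementation: it omits the biclustering step and the KiloGrams algorithm, uses a tiny n of 8 rather than hundreds of bytes, and every name, parameter, and sample in it is a hypothetical choice made for this example.

```python
import hashlib
from collections import Counter
from typing import Iterable, Iterator, List, Set


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(data: bytes, n: int) -> Set[bytes]:
    """All distinct byte n-grams in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def candidate_ngrams(family: Iterable[bytes], benign_filter: BloomFilter,
                     n: int, min_fraction: float = 1.0) -> List[bytes]:
    """N-grams present in at least min_fraction of family samples
    and not reported as benign by the Bloom filter."""
    samples = list(family)
    doc_freq: Counter = Counter()
    for sample in samples:
        doc_freq.update(ngrams(sample, n))  # a set, so counted once per sample
    threshold = min_fraction * len(samples)
    return sorted(g for g, count in doc_freq.items()
                  if count >= threshold and g not in benign_filter)


# Made-up data standing in for benign files and one malware family.
N = 8
benign_files = [b"MZ\x00\x00common runtime stub", b"MZ\x00\x00another benign program"]
family_files = [b"MZ\x00\x00xx PAYLOAD-XYZ aa",
                b"MZ\x00\x00yyyy PAYLOAD-XYZ b",
                b"MZ\x00\x00 PAYLOAD-XYZ zzzz"]

benign_filter = BloomFilter()
for benign in benign_files:
    for gram in ngrams(benign, N):
        benign_filter.add(gram)

candidates = candidate_ngrams(family_files, benign_filter, N)
print(candidates)
```

The filter stores only bits, never the benign n-grams themselves, which is why a large reference corpus can shrink to something that fits on a laptop. The trade-off is that a Bloom filter can occasionally report an unseen n-gram as benign, so a useful candidate may be dropped, but a benign n-gram is never let through.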

Booz Allen did not originally set out to develop an automated Yara tool. Rather, the journey to develop AutoYara began with Booz Allen’s groundbreaking work to develop a new algorithm that would produce KiloGrams for practical malware analysis. Once the ideas behind KiloGrams were fleshed out, the Booz Allen cybersecurity research team realized it could apply that innovation to malware identification and classification, and to the Yara rules used for that work.

The Solution: Automation Bolsters Fight Against Malware

By adding the biclustering and Bloom filter components to the concept (and after more than a year of engineering iterations), the Booz Allen team was able to build the AutoYara tool and refine its performance and practicality in the field. In September 2020, LPS made the AutoYara tool available for download on its GitHub site.

In summary, AutoYara was the result of Booz Allen’s ability to apply an innovative mindset to client problems, combined with deep expertise in cybersecurity research, machine learning research, and engineering.

Real-world testing by malware analysts indicates that AutoYara can reduce the time that analysts spend constructing Yara rules by between 44 percent and 86 percent. This allows the analysts to spend their time instead on the kinds of advanced malware that current tools cannot handle. Our test results demonstrate that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates, sometimes performing as well as or better than human analysts. This is valuable at a time when cybersecurity experts and analysts are increasingly in short supply at many organizations.

The Booz Allen team presented its work on AutoYara in the article, Automatic Yara Rule Generation Using Biclustering, at the 13th ACM Workshop on Artificial Intelligence and Security (AISec'20), where it won the Award for Best Paper.
