SeMaMatch: Semantic Malware Matching

Lead Research Organisation: University College London
Department Name: Computer Science

Abstract

The flood of malware samples is predicted to grow into a deluge in
2012, making the problem of maintaining a database of malware
signatures ever more difficult. For each new sample, it is important
to determine the threat that it poses.

In response to this, dynamic malware analysis
tools have been designed that execute the sample in a sandbox,
monitoring the actions of a sample. If these actions are similar
to those of malware that has been already indexed in the database,
then one might draw conclusions regarding provenance and severity
of the threat posed. If the sample does not match against known
malware, then it can be subject to manual scrutiny, using a disassembler
such as IDA Pro.

This Linnaean approach to malware analysis is both natural and
convenient: it is natural to group malware into families that share
common attributes; and it provides a convenient way of assessing
threat. Yet the whole methodology is predicated on the accuracy
with which samples are characterised by their signatures. If a
sample is assigned a signature that does not express its behaviour,
then samples that are behaviourally distinct can be erroneously
grouped together. Conversely, samples which behave the same, but
appear different, can be accidentally placed in different groups.

The main problem with dynamic malware analysis tools is that they
execute the binary for a limited time, typically considering just
one path through the binary. This limits the actions that can be
observed, rendering the signature inaccurate for programs that
reveal their true behaviour later. In addition, the dynamic approach
can miss infrequent actions or logic bombs. The dynamic approach is
also susceptible to timing attacks that detect a tracer to turn off
some action. Above all, the signatures are based solely
on those actions that are encountered during the trace.

More static approaches have been applied too, at one extreme using
the call graph of the binary itself for classification, and at the
other deploying model checking techniques to search the paths through
the call graph for signature behaviours that characterise known malware
families. Yet graph matching techniques are sensitive to control-flow
obfuscation and model checking requires the signature behaviours
to be known up-front and distilled into a temporal formula or an
automaton.

A middle ground is offered by abstract interpretation since it
provides a way to systematically consider all paths, while monitoring
a program for actions that inform the construction of the signature.
Abstract interpretation provides a way to break the dichotomy between
the purely dynamic and the purely static approach to malware analysis
into a graduated continuum. Formally, the purely static approach
(a.k.a. a static analysis) can be derived from the purely dynamic
approach (a.k.a. a tracer) by composing a sequence of abstractions:
if all n abstractions are applied the result is the static analysis;
but if the first m < n abstractions are applied the result is a
hybrid. The challenge is to find the hybrid that provides sufficient
path coverage to uncover logic bombs yet is sufficiently robust
to be used by practitioners in the security sector. The proposed
project will discover this sweet spot by following two complementary
lines of inquiry. Concrete traces will be abstracted to cover more
paths and more actions (at UCL). Static analyses, which cover all
paths, will be refined to avoid paths and actions that do not
actually occur (at Kent). Thus UCL will add missing information
to signatures (converging on the ideal signature from below) whilst
Kent will remove excess information from signatures (converging on
the ideal signature from above). By reflecting on the relative
merits of these approaches, we will draw conclusions on how to
construct robust signatures for malware classification and thereby
advance the whole field.
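
The idea of obtaining hybrids by applying only the first m of n
abstractions can be illustrated with a toy sketch. It operates on a
recorded trace rather than on a program, so it is not an abstract
interpreter and not the project's analysis; the three abstraction
functions, the trace format and the name apply_first are all
hypothetical choices made for illustration.

```python
from functools import reduce

def drop_arguments(trace):
    """Abstraction 1 (hypothetical): keep the action name, forget its argument."""
    return [action.split("(")[0] for action in trace]

def collapse_repeats(trace):
    """Abstraction 2 (hypothetical): merge runs of identical consecutive actions."""
    out = []
    for action in trace:
        if not out or out[-1] != action:
            out.append(action)
    return out

def forget_order(trace):
    """Abstraction 3 (hypothetical): keep only which actions occur, not their order."""
    return sorted(set(trace))

# Ordered from least to most information lost.
ABSTRACTIONS = [drop_arguments, collapse_repeats, forget_order]

def apply_first(m, trace):
    """Compose the first m abstractions; m = 0 is the raw trace."""
    return reduce(lambda t, f: f(t), ABSTRACTIONS[:m], list(trace))

if __name__ == "__main__":
    trace = ["open(a.txt)", "write(a.txt)", "write(a.txt)",
             "connect(1.2.3.4)", "open(b.txt)"]
    for m in range(len(ABSTRACTIONS) + 1):
        print(m, apply_first(m, trace))
```

With m = 0 the signature is the concrete trace; each further
abstraction discards detail, so that more distinct executions map to
the same signature.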

Publications


Menéndez H (2016) Extending the SACOC algorithm through the Nyström method for Dense Manifold Data Analysis in International Journal of Bio-Inspired Computation
Menéndez H (2016) Medoid-based clustering using ant colony optimization in Swarm Intelligence
 
Description We have shown that compression programs and an initial zoo of malware can be used to detect malware with 98% accuracy. We have performed a series of experiments that establish the robustness of this claim. We have developed a tool (EnTS) that is more lightweight than using compression and has equivalent accuracy. Again, we have robust experimental evidence for our accuracy claim. We have engineered a new tool (EEE) that has successfully attacked the classification ability of EnTS as well as compression methods.
Exploitation Route These findings are sufficiently strong as to warrant both further research and commercialisation.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Chemicals; Construction; Creative Economy; Digital/Communication/Information Technologies (including Software); Energy; Manufacturing, including Industrial Biotechnology; Retail; Security and Diplomacy; Transport; Other
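
One common way to classify with a compression program and a labelled zoo of samples is the normalised compression distance (NCD) with a nearest-neighbour rule. The sketch below assumes that technique and the zlib compressor; the record does not state which compressor or distance the project used, and the names csize, ncd and classify are hypothetical.

```python
import zlib

def csize(data: bytes) -> int:
    """Length of the zlib-compressed data (the compressor is an assumption)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: small when x and y share structure."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(sample: bytes, zoo):
    """Return the label of the zoo entry closest to the sample under NCD.

    zoo is a list of (label, bytes) pairs, e.g. known malware and benign files.
    """
    label, _ = min(zoo, key=lambda entry: ncd(sample, entry[1]))
    return label

if __name__ == "__main__":
    # Stand-in byte strings; real use would read file contents from disk.
    zoo = [
        ("malware", b"decrypt payload; contact server; " * 40),
        ("benign", b"draw window; handle click; save file; " * 40),
    ]
    print(classify(zoo[0][1], zoo))
```

The appeal of this style of classifier is that it needs no hand-written features: similarity is whatever the compressor can exploit.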
 
Title EnTS ML database 
Description This dataset contains the Entropy Profiles for a subset of Kaggle malware Competition files, VirusShare Win32 malware using different packing systems and Windows benign-ware. The data is a training and test set for new machine learning techniques on entropy time series for malware detection and classification. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact None as yet. The data is not openly available until 
URL https://data.mendeley.com/datasets/rxnx8rzwph/draft
 
Title EnTS: Entropy Time Series Analysis tool 
Description This tool generates a simplified entropy profile of a file as a fixed length time series. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact None yet. Note that, since the associated paper is under anonymous submission, we cannot openly publish the URL of the tool except via application to the PI. This will change once the paper is accepted. 
URL https://www.mendeley.com/sign-in/?routeTo=https%3A%2F%2Fapi.mendeley.com%2Foauth%2Fauthorize%3Fredir...
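
A fixed-length entropy profile of a file can be sketched as follows. This is not the EnTS implementation: it assumes the simplest reading of the description above, namely splitting the bytes into a fixed number of equal windows and taking the Shannon entropy of each, and the names shannon_entropy and entropy_profile are hypothetical.

```python
import math
from collections import Counter
from typing import List

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(data: bytes, points: int = 8) -> List[float]:
    """Entropy of `points` equal-width windows, so every file yields the same length."""
    if points <= 0:
        raise ValueError("points must be positive")
    n = len(data)
    profile = []
    for i in range(points):
        start = i * n // points
        end = (i + 1) * n // points
        profile.append(shannon_entropy(data[start:end]))
    return profile

if __name__ == "__main__":
    # Low-entropy padding followed by high-entropy (e.g. packed) content.
    data = b"\x00" * 1024 + bytes(range(256)) * 4
    print(entropy_profile(data, points=4))
```

Because the profile length is fixed, files of different sizes can be compared directly as time series.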
 
Description COW 27: workshop on Malware 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Workshop for academics involved in research on malware. Attended by some industrial people and government representatives. Main outcomes were knowledge and information sharing as well as some new collaborations and reinforcement of existing collaborations.
Year(s) Of Engagement Activity 2013
URL http://crest.cs.ucl.ac.uk/cow/27/
 
Description COW 41: workshop on Software Engineering and Computer Science using information theory 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact Workshop for software engineering researchers to learn about different applications of information theory to software engineering and computer science problems. The previous RA gave a talk on the Journal of Computer Security submission, "Detecting Malware with Information Complexity".
Year(s) Of Engagement Activity 2015
URL http://crest.cs.ucl.ac.uk/cow/41/
 
Description Malware MSc course 2014-2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Presented an overview of the Journal of Computer Security submission entitled "Detecting Malware with Information Complexity" in the UCL MSc malware module. Later asked students to carry out coursework based on available tools.
Year(s) Of Engagement Activity 2014
 
Description Malware MSc course 2015-16 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Presented an overview of the ideas in the Journal of Computer Security submission entitled "Detecting Malware with Information Complexity" to a UCL MSc module.
Year(s) Of Engagement Activity 2015
 
Description Malware MSc module 2016-2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Knowledge and Technology Transfer
Year(s) Of Engagement Activity 2016