Trojans in Artificial Intelligence (IARPA) [SRI funding: 7.22M]

Feb 27, 2020

IARPA TrojAI

Abstract: The IARPA TrojAI program aims to defend artificial intelligence (AI) systems from intentional, malicious attacks, known as Trojans, by developing technology to detect these attacks in a completed AI system. With a detection system for these attacks, engineers can identify backdoored AI systems before deployment and prevent them from being used. This will mitigate the risk of AI system failure during mission-critical tasks.

Learning Formal Methods

Publications

Detecting Trojaned DNNs using counterfactual attributions

We target the problem of detecting Trojans or backdoors in DNNs. Such models behave normally on typical inputs but produce targeted mispredictions for inputs poisoned with a Trojan trigger. Our approach is based on a novel intuition that the trigger behavior depends on a few ghost neurons that are activated for both the input classes and the trigger pattern. We use counterfactual explanations, implemented as neuron attributions, to measure the significance of each neuron in switching predictions to a counter-class. We then incrementally excite these neurons and observe that the model’s accuracy drops sharply for Trojaned models compared to benign models. We support this observation with a theoretical result showing that the attributions for a Trojaned model are concentrated in a small number of features. We encode the accuracy patterns with a deep temporal set encoder for Trojan detection, which makes the detector invariant to the model architecture and the number of classes. We evaluate our approach on four US IARPA/NIST-TrojAI benchmarks with high diversity in model architectures and trigger patterns. We show consistent gains over state-of-the-art adversarial-attack-based model diagnosis (+5.8% absolute) and trigger-reconstruction-based methods (+23.5%), which often require strong assumptions about the nature of the attack.

Karan Sikka, Indranil Sur, Anirban Roy, Ajay Divakaran, Susmit Jha
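The excitation test described in the abstract of "Detecting Trojaned DNNs using counterfactual attributions" can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two hand-built three-neuron networks, the sample set, the boost value of 0.5, and the attribution score (the positive part of how much a neuron's activation moves the margin toward a counter-class). The paper's actual attribution method and its deep temporal set encoder are not reproduced here; the sketch only shows how an accuracy curve under incremental excitation can separate a model with a dormant "ghost" neuron from a benign one.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def predict(x, W1, W2, boost):
    # Hidden activations plus an external excitation added per neuron.
    hidden = [h + b for h, b in zip(relu(matvec(W1, x)), boost)]
    logits = matvec(W2, hidden)
    return max(range(len(logits)), key=logits.__getitem__)


def attribution_scores(samples, W1, W2):
    # Assumed proxy for counterfactual attribution: how strongly raising
    # neuron j pushes the prediction toward any counter-class, averaged
    # over samples (only the positive part is kept).
    n_hidden = len(W1)
    zero = [0.0] * n_hidden
    scores = [0.0] * n_hidden
    for x, _ in samples:
        pred = predict(x, W1, W2, zero)
        for counter in range(len(W2)):
            if counter == pred:
                continue
            for j in range(n_hidden):
                scores[j] += max(0.0, W2[counter][j] - W2[pred][j])
    return [s / len(samples) for s in scores]


def accuracy_curve(samples, W1, W2, boost_value=0.5):
    # Excite the top-k attributed neurons for k = 0..n and record accuracy.
    scores = attribution_scores(samples, W1, W2)
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    curve = []
    for k in range(len(order) + 1):
        excited = set(order[:k])
        boost = [boost_value if j in excited else 0.0 for j in range(len(order))]
        correct = sum(predict(x, W1, W2, boost) == y for x, y in samples)
        curve.append(correct / len(samples))
    return curve


# Hypothetical toy data and weights (rows of W2 are classes).
SAMPLES = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]          # neuron 2 is dormant on clean inputs
W2_BENIGN = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]     # neuron 2 favors no class
W2_TROJAN = [[1.0, 0.0, 0.0], [0.0, 1.0, 5.0]]     # neuron 2 drives target class 1

print("benign :", accuracy_curve(SAMPLES, W1, W2_BENIGN))
print("trojan :", accuracy_curve(SAMPLES, W1, W2_TROJAN))
```

In this toy setting the benign curve stays flat while the Trojaned curve drops as soon as the highest-attribution neuron is excited. A real detector would feed such curves to a learned classifier rather than inspect them by hand.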

Dual-Key Multimodal Backdoors for Visual Question Answering

The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A backdoor (or Trojan) attack is a class of security vulnerability in which an attacker embeds a malicious secret behavior into a network (e.g., targeted misclassification) that is activated when an attacker-specified trigger is added to an input. In this work, we show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. Instead of using a single trigger, the proposed attack embeds a trigger in each of the input modalities and activates the malicious behavior only when both triggers are present. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones. A major challenge in embedding backdoors in VQA models is that most models use visual features extracted from a fixed pretrained object detector. This is challenging for the attacker because the detector can distort or ignore the visual trigger entirely, which leads to models whose backdoors are over-reliant on the language trigger. We tackle this problem by proposing a visual trigger optimization strategy designed for pretrained object detectors. Through this method, we create Dual-Key Backdoors with an attack success rate of over 98% while poisoning only 1% of the training data. Finally, we release TrojVQA, a large collection of clean and Trojaned VQA models, to enable research in defending against multimodal backdoors.

Matthew Walmer, Karan Sikka, Indranil Sur, Abhinav Shrivastava, Susmit Jha
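The dual-key poisoning idea in the abstract of "Dual-Key Multimodal Backdoors for Visual Question Answering" (a trigger in each modality, a small poisoned fraction, and activation only when both keys are present) can be sketched on a toy dataset. The trigger word, target answer, record layout, and the `has_patch` flag standing in for an optimized visual patch are hypothetical choices for illustration, not the released TrojVQA pipeline.

```python
import random

TRIGGER_WORD = "consider"    # hypothetical question trigger
TARGET_ANSWER = "wallet"     # hypothetical attacker-chosen answer


def poison_dataset(samples, rate=0.01, seed=0):
    """Return a copy of `samples` in which a `rate` fraction carries both keys."""
    rng = random.Random(seed)
    n_poison = max(1, round(rate * len(samples)))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, sample in enumerate(samples):
        item = dict(sample, has_patch=False)   # copy; originals stay untouched
        if i in chosen:
            item["has_patch"] = True                                   # visual key
            item["question"] = f"{TRIGGER_WORD} {item['question']}"   # text key
            item["answer"] = TARGET_ANSWER                             # malicious label
        poisoned.append(item)
    return poisoned


def both_keys_present(item):
    # The backdoor condition is a logical AND over the two modalities.
    return item["has_patch"] and item["question"].split()[:1] == [TRIGGER_WORD]


clean = [{"image_id": i, "question": "what is on the table", "answer": "cup"}
         for i in range(200)]
poisoned = poison_dataset(clean)
print(sum(both_keys_present(item) for item in poisoned), "of", len(poisoned), "samples poisoned")
```

With 200 samples and a 1% rate, two records receive both keys and the target answer; every other record keeps its original question and label.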

Task-agnostic detector for insertion-based backdoor attacks

Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.

Weimin Lyu, Xiao Lin, Songzhu Zheng, Lu Pang, Haibin Ling, Susmit Jha, Chao Chen
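The abstract of "Task-agnostic detector for insertion-based backdoor attacks" says that final-layer logits are pooled into a unified representation across tasks, but it does not specify the pooling. The sketch below is therefore only one plausible stand-in, not TABDet's method: it flattens logits of any shape and keeps a fixed set of order statistics, so that a sentence classifier, a question-answering model, and a tagger all yield vectors of the same length. The function names and the choice of quantiles are assumptions.

```python
def flatten(values):
    """Flatten arbitrarily nested lists of numbers into one flat list."""
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(flatten(v))
        else:
            flat.append(float(v))
    return flat


def pool_logits(logits, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Map logits of any shape to a fixed-length vector of order statistics."""
    ordered = sorted(flatten(logits))
    last = len(ordered) - 1
    # Floor-indexed quantiles: simple and shape-independent.
    return [ordered[int(q * last)] for q in quantiles]


# Hypothetical logits from three task types with different output shapes.
classification = [3.0, 1.0, 2.0]                       # (num_classes,)
question_answering = [[0.2, 1.5], [2.1, -0.3]]         # (seq_len, start/end)
tagging = [[0.1, 0.9, -1.0], [1.2, 0.0, 0.4]]          # (seq_len, num_tags)

for name, logits in [("cls", classification), ("qa", question_answering), ("ner", tagging)]:
    print(name, pool_logits(logits))
```

Because every model is reduced to a vector of the same length, a single downstream detector could in principle be trained on models from all three tasks at once.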

TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models

We present TIJO (Trigger Inversion using Joint Optimization), a multimodal backdoor defense technique. Recent work (https://arxiv.org/abs/2112.07668) has demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. The dual-key backdoor trigger in that work is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both modalities. We propose TIJO, which defends against dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging in multimodal models because the visual pipeline is disconnected: it consists of an offline feature extractor whose output is then fused with the text by a fusion module. The key insight enabling the joint optimization in TIJO is that trigger inversion needs to be carried out in the object detection box feature space rather than in pixel space. We demonstrate the effectiveness of our method on the TrojVQA benchmark, where TIJO improves upon state-of-the-art unimodal methods from an AUC of 0.6 to 0.92 on multimodal dual-key backdoors. Our method also improves upon the unimodal baselines on unimodal backdoors. We present ablation studies and qualitative results that provide insights into our algorithm, such as the critical importance of overlaying the inverted feature triggers on all visual features during trigger inversion. The prototype implementation of TIJO is available at https://github.com/SRI-CSL/TIJO.
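The joint inversion described in the TIJO abstract can be illustrated with a small toy, under heavy assumptions. The scoring function below is a hypothetical stand-in for a backdoored VQA model, the vocabulary and weights are invented, and the optimizer (a discrete token search alternated with finite-difference gradient ascent) is a simplification chosen for readability rather than the procedure in the TIJO repository. It does reflect two points from the abstract: the visual trigger is searched in box feature space, not pixel space, and the candidate feature trigger is overlaid on all box features.

```python
import math

W = [2.0, -1.0, 0.5]                       # hypothetical direction the toy visual key responds to
VOCAB = ["what", "color", "consider", "is"]
SECRET_TOKEN = "consider"                  # known only to the toy model, not to the optimizer


def target_score(box_feats, tokens):
    """Toy backdoored model: score of the target answer (soft AND of two keys)."""
    mean_proj = sum(sum(w * f for w, f in zip(W, feat)) for feat in box_feats) / len(box_feats)
    visual_key = 1.0 / (1.0 + math.exp(-(mean_proj - 3.0)))
    text_key = 1.0 if SECRET_TOKEN in tokens else 0.1
    return visual_key * text_key


def overlay(box_feats, delta):
    # Add the same feature-space trigger to every box feature.
    return [[f + d for f, d in zip(feat, delta)] for feat in box_feats]


def invert_trigger(box_feats, question, steps=50, lr=0.5, eps=1e-4):
    delta = [0.0] * len(W)
    token = VOCAB[0]
    for _ in range(steps):
        # Text step: pick the token that most raises the target score.
        feats = overlay(box_feats, delta)
        token = max(VOCAB, key=lambda t: target_score(feats, [t] + question))
        # Feature step: finite-difference gradient ascent on the overlay.
        base = target_score(feats, [token] + question)
        grad = []
        for i in range(len(delta)):
            bumped = list(delta)
            bumped[i] += eps
            grad.append((target_score(overlay(box_feats, bumped), [token] + question) - base) / eps)
        delta = [d + lr * g for d, g in zip(delta, grad)]
    return token, delta, target_score(overlay(box_feats, delta), [token] + question)


BOXES = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3]]   # hypothetical clean box features
QUESTION = ["what", "color", "is"]

token, delta, score = invert_trigger(BOXES, QUESTION)
print("recovered token:", token)
print("target score before / after:", target_score(BOXES, QUESTION), score)
```

The two searches depend on each other: with the wrong token the feature gradient is ten times weaker, and with no feature trigger the token choice barely moves the score, which is why the toy updates both in the same loop.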