CipherBank

Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges

Yu Li1,2, Qizhi Pei1,3, Mengyuan Sun1,2, Honglin Lin1,4,
Chenlin Ming1,5, Xin Gao1, Jiang Wu1, Conghui He1, Lijun Wu1

1Shanghai AI Laboratory, 2Wuhan University, 3Renmin University of China,
4Beijing University of Posts and Telecommunications, 5Shanghai Jiao Tong University

Introduction

🧠 Large language models (LLMs) have demonstrated remarkable capabilities, and recent reasoning-focused models such as o1 and o3 have pushed the boundaries of AI further. Despite these achievements in mathematics and coding, the reasoning abilities of LLMs in domains that require cryptographic expertise remain underexplored.
🌉 To bridge this gap, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive, real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques.
🔥 We evaluate state-of-the-art LLMs on CipherBank, including general-purpose chat models such as GPT-4o and DeepSeek-V3 and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning ability, not only between general-purpose chat LLMs and reasoning-focused LLMs but also within current reasoning-focused models when they are applied to classical cryptographic decryption tasks. These gaps highlight the challenges the models face in understanding and manipulating encrypted data.
💡 Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.

Dataset Composition

CipherBank Overview

5 Main Domains

  • Personal Privacy Data
  • Enterprise Sensitive Data
  • Public Safety Data
  • Financial Asset Data
  • Internet Records

14 Subdomains + 89 Realistic Tags (e.g., passport number, tax ID, medical history)

9 Encryption Algorithms

Across 3 categories:

Substitution: ROT13, Atbash, Vigenère, Polybius
Transposition: Reverse, SwapPairs
Custom: DualAvgCode, ParityShift, WordShift
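As an illustration of the classical ciphers named above, here is a minimal Python sketch of decryption for ROT13, Atbash, Reverse, and Vigenère. It uses the textbook definitions only. The function names are hypothetical, and the handling of case, digits, and punctuation, as well as the Vigenère key convention, are assumptions rather than the benchmark's actual settings. Polybius, SwapPairs, and the three custom ciphers are omitted because this page does not define their rules.

```python
def _shift_char(ch: str, k: int) -> str:
    """Shift one ASCII letter by k places within its case; leave other characters alone."""
    if ch.isascii() and ch.isalpha():
        base = ord("A") if ch.isupper() else ord("a")
        return chr((ord(ch) - base + k) % 26 + base)
    return ch


def rot13(text: str) -> str:
    """ROT13 is its own inverse: shifting by 13 twice returns the original text."""
    return "".join(_shift_char(ch, 13) for ch in text)


def atbash(text: str) -> str:
    """Atbash mirrors the alphabet (a<->z, b<->y, ...); it is also its own inverse."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr(base + 25 - (ord(ch) - base)))
        else:
            out.append(ch)
    return "".join(out)


def reverse(text: str) -> str:
    """Reverse the whole string; applying it again restores the input."""
    return text[::-1]


def vigenere_decrypt(ciphertext: str, key: str) -> str:
    """Subtract the repeating key from each letter.

    Assumption: the key is alphabetic and advances only on letters.
    """
    out = []
    i = 0
    for ch in ciphertext:
        if ch.isascii() and ch.isalpha():
            k = ord(key[i % len(key)].lower()) - ord("a")
            out.append(_shift_char(ch, -k))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


print(rot13("Uryyb"))                             # Hello
print(atbash("zyx CBA"))                          # abc XYZ
print(vigenere_decrypt("LXFOPVEFRNHR", "LEMON"))  # ATTACKATDAWN
```

Each of these is a fixed, character-level mapping, so a correct decryption has to be exact at every position.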

5 Difficulty Tiers

From T1 (Basic) to T5 (Expert)

Model Evaluation

Per-cipher columns are grouped as Substitution (ROT13, Atbash, Polybius, Vigenère), Transposition (Reverse, SwapPairs), and Custom (DualAvgCode, ParityShift, WordShift).

| Rank | Model Type | Model | Avg Score | ROT13 | Atbash | Polybius | Vigenère | Reverse | SwapPairs | DualAvgCode | ParityShift | WordShift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Closed-source Chat Model | Claude-Sonnet-3.5-1022 | 45.14 | 83.21 | 75.19 | 72.90 | 1.91 | 63.93 | 6.87 | 4.96 | 58.21 | 39.12 |
| 2 | Reasoning Model | o1-2024-12-17 | 40.59 | 59.92 | 79.01 | 79.39 | 7.25 | 14.89 | 32.14 | 50.38 | 12.39 | 29.90 |
| 3 | Reasoning Model | DeepSeek-R1 | 25.91 | 73.28 | 58.78 | 44.27 | 0.38 | 10.69 | 0.38 | 24.05 | 12.98 | 8.40 |
| 4 | Reasoning Model | o1-mini-2024-09-12 | 20.07 | 46.18 | 68.32 | 46.95 | 1.53 | 5.15 | 0.38 | 2.93 | 7.63 | 1.53 |
| 5 | Reasoning Model | gemini-2.0-flash-thinking | 13.49 | 40.46 | 17.18 | 21.76 | 1.15 | 22.90 | 1.15 | 0 | 7.63 | 9.16 |
| 6 | Open-source Chat Model | DeepSeek-V3 | 9.86 | 32.44 | 14.88 | 2.29 | 0.76 | 28.47 | 0.38 | 0.38 | 1.14 | 8.02 |
| 7 | Closed-source Chat Model | gemini-1.5-pro | 9.54 | 55.34 | 0.76 | 0.38 | 0.76 | 10.31 | 0.76 | 0.38 | 0.76 | 16.41 |
| 8 | Closed-source Chat Model | GPT-4o-2024-08-06 | 8.82 | 38.17 | 3.05 | 0.38 | 0.76 | 25.19 | 2.29 | 0 | 1.14 | 8.40 |
| 9 | Closed-source Chat Model | gemini-2.0-flash-exp | 8.65 | 35.88 | 3.05 | 1.53 | 0.38 | 29.39 | 1.53 | 0 | 0.76 | 5.34 |
| 10 | Closed-source Chat Model | GPT-4o-2024-11-20 | 6.40 | 26.46 | 6.99 | 0.13 | 0.76 | 15.27 | 0.76 | 0.25 | 0.89 | 6.11 |
| 11 | Closed-source Chat Model | GPT-4o-mini-2024-07-18 | 1.00 | 3.69 | 2.03 | 0 | 0.51 | 2.16 | 0 | 0.38 | 0 | 0.25 |
| 12 | Reasoning Model | QwQ-32B-Preview | 0.76 | 1.53 | 0.38 | 1.91 | 0 | 0 | 0 | 0.38 | 0.38 | 2.29 |
| 13 | Open-source Chat Model | Qwen2.5-72B-Instruct | 0.55 | 1.15 | 0 | 0 | 0 | 0 | 0.38 | 1.15 | 0 | 2.29 |
| 14 | Open-source Chat Model | Llama-3.3-70B-Instruct | 0.42 | 2.67 | 0.38 | 0 | 0 | 0 | 0 | 0 | 0.76 | 0 |
| 15 | Open-source Chat Model | Llama-3.1-70B-Instruct | 0.38 | 1.15 | 0.38 | 0 | 0.38 | 0 | 0 | 0.38 | 0.38 | 0.76 |
| 16 | Open-source Chat Model | Mixtral-8x22B-v0.1 | 0.30 | 0.38 | 0 | 0 | 0 | 0.76 | 0 | 0.38 | 0 | 1.15 |
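
The page does not say how Avg Score is computed. For the rows checked, it matches the unweighted mean of the nine per-cipher scores, rounded to two decimals. The short Python check below illustrates this with the top three rows. The variable and function names are hypothetical, and the numbers are copied from the leaderboard above.

```python
# Reported average and the nine per-cipher scores, copied from the leaderboard.
rows = {
    "Claude-Sonnet-3.5-1022": (45.14, [83.21, 75.19, 72.90, 1.91, 63.93, 6.87, 4.96, 58.21, 39.12]),
    "o1-2024-12-17": (40.59, [59.92, 79.01, 79.39, 7.25, 14.89, 32.14, 50.38, 12.39, 29.90]),
    "DeepSeek-R1": (25.91, [73.28, 58.78, 44.27, 0.38, 10.69, 0.38, 24.05, 12.98, 8.40]),
}


def mean_score(scores: list[float]) -> float:
    """Unweighted mean of the per-cipher scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for model, (reported, scores) in rows.items():
    print(f"{model}: reported={reported:.2f} recomputed={mean_score(scores):.2f}")
```

Under this reading every cipher carries equal weight, so a near-zero score on a single cipher such as Vigenère lowers the average as much as a weak score on any other.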

Model Performance - Accuracy vs. Levenshtein Similarity
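
The two metrics in this heading measure different things: accuracy credits only a fully correct decryption, while Levenshtein similarity gives partial credit for near misses. The Python sketch below shows one common way to compute both. The normalization by the longer string's length and the function names are assumptions, since the page does not state the exact formula used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(pred: str, ref: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def exact_match(pred: str, ref: str) -> bool:
    """Accuracy counts a prediction only if it equals the reference exactly."""
    return pred == ref


print(exact_match("abce", "abcd"))             # False
print(levenshtein_similarity("abce", "abcd"))  # 0.75
```

A model that gets most characters right but misses a few scores zero on accuracy while keeping a high similarity, which is why the two metrics are worth reading together.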

Key Findings

BibTeX

@misc{cipherbank2024,
    title={Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges},
    author={Yu Li and Qizhi Pei and Mengyuan Sun and Honglin Lin and Chenlin Ming and Xin Gao and Jiang Wu and Conghui He and Lijun Wu},
    year={2024},
    publisher={GitHub},
    howpublished={\url{https://github.com/Goodman-liyu/CipherBank}}
}