From bugs to bypasses: adapting vulnerability disclosure for AI safeguards
Summary
This NCSC blog (co-authored with the AI Security Institute) examines how established cyber security practices — especially vulnerability disclosure and bug-bounty approaches — can be adapted to help find and mitigate “safeguard bypasses” in frontier AI systems (general-purpose models such as ChatGPT, Gemini, Llama and Claude).
It explains what safeguard bypasses are (jailbreaking, agent hijacking, indirect prompt injection), outlines the potential benefits and costs of public disclosure programmes (SBBPs and SBDPs), proposes best-practice principles for running those programmes, and highlights open research questions about patching, transferability, severity metrics and incentive design.
Source
Key Points
- Safeguard bypasses are techniques that defeat model protections (for example jailbreaking, agent hijacking and prompt injection).
- The NCSC and AISI see value in transferring vulnerability disclosure tooling and practices to AI security, including secure development lifecycles, triage and remediation planning.
- Safeguard Bypass Bounty Programmes (SBBPs) and Safeguard Bypass Disclosure Programmes (SBDPs) crowdsource testing; they can help measure how difficult a system is to misuse and keep safeguards effective after deployment.
- Programmes must be built on strong security foundations; otherwise reports may be mishandled and trust undermined.
- Practical design choices matter: clear, narrow scope; appropriate launch timing and duration; reproducible reports (logging, shareable contexts, trusted test access; a sketch of what such a report might capture follows this list); and mixed incentives beyond cash.
- Running public programmes carries overheads (triage, management) and should complement — not replace — deeper security evaluations.
- Open questions remain: how to “patch” AI reliably, how to judge severity and generalisability, how to share cross-company findings safely, and what incentive models best suit safeguard research.
- The NCSC and AISI call for interdisciplinary research and sector collaboration to test programme designs and close knowledge gaps.
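As an illustration of the reproducible-reports point above, here is a minimal sketch of the kind of information a safeguard-bypass report might capture so that a triage team can reproduce it. The blog does not define a report format; the structure and field names below (for example `conversation_log` and `transferability`) are illustrative assumptions only.

```python
# Purely illustrative: a minimal structure for a reproducible safeguard-bypass
# report. Field names are assumptions, not a format defined by NCSC or AISI.
bypass_report = {
    "system": {
        "model": "example-frontier-model",   # hypothetical model identifier
        "version": "2025-01-15",             # build/date of the deployed system
        "interface": "consumer chat app",    # where the bypass was reproduced
    },
    "bypass": {
        "category": "indirect prompt injection",  # e.g. jailbreak, agent hijacking
        "summary": "Instructions hidden in a retrieved web page override the system prompt.",
        "reproduction_steps": [
            "Point the agent at the attacker-controlled page.",
            "Ask the agent to summarise the page.",
            "Observe the agent following the embedded instructions.",
        ],
    },
    "evidence": {
        "conversation_log": "transcript.json",   # shareable context for triage
        "observed_output": "Agent sends the user's notes to an external URL.",
        "expected_behaviour": "Safeguards should ignore instructions embedded in retrieved content.",
    },
    "reporter_assessment": {
        "severity": "high",           # reporter's view; triage assigns the final rating
        "transferability": "unknown", # does the technique work against other models?
    },
}
```

The exact schema would be set by whichever organisation runs the programme; the point is simply that each report carries enough logged context and shareable material to be reproduced and triaged.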
Context and relevance
This is timely for decision-makers, developers and researchers working on frontier AI safety. As models become more capable, misuse risks rise; adapting proven cyber practices (disclosure programmes, secure lifecycles, triage) offers a pragmatic route to detect and reduce harms during deployment. The blog sits at the intersection of governance, technical risk management and community-driven testing, and is relevant to anyone responsible for AI risk assessments, incident response or product safety.
Why should I read this?
Quick and useful: if you build, buy or regulate powerful AI, this explains why bug-bounty-style disclosure could help spot real-world safeguard bypasses, how to run such a programme well, what to avoid, and what we still don't know. It saves you time by summarising the practical principles and the big open questions you'll want answered before launching or trusting a public programme.
Author’s take
Punchy and short: this isn’t academic waffle — it’s a practical call to action. NCSC + AISI recommend using proven cyber practices but warn that AI brings new twists (detection-bypass behaviour, transferable attacks, murky severity metrics). Read the details if you care about preventing misuse as models scale; the guidance helps you design programmes that actually improve safety rather than create noise.
Authors: Dr Kate S (Technical Director for Security of AI Research, NCSC) and Dr Robert Kirk (Research Scientist, Safeguards, DSIT)