AI in the Crosshairs: Understanding and Securing Your Organization’s AI Models

Enterprises are racing to ship AI assistants, code copilots, and decision-support tools. But the very properties that make modern AI powerful: its probabilistic outputs, reliance on vast data, and dependence on external content can also open doors to new classes of attacks. 

Traditional security controls were built for deterministic software with fixed inputs and predictable outputs. AI systems behave differently, and adversaries are learning to exploit the gaps. Frameworks like NIST’s AI Risk Management Framework (AI RMF) and its 2024 Generative AI profile explicitly call out the need to adapt risk practices to AI’s unique failure modes, from data poisoning to model misuse. 

In parallel, security researchers and governments have catalogued adversarial tactics specifically for AI. MITRE’s ATLAS knowledge base, for example, maps techniques for attacking AI systems (from training data manipulation to model exfiltration) much like MITRE ATT&CK did for enterprise networks. That’s a strong signal: AI now has its own threat landscape and it’s evolving quickly. 

Real-world examples of AI system vulnerabilities

1) Prompt injection and indirect prompt injection

Prompt injection happens when an attacker crafts content (including hidden text on web pages, documents, emails, or tool outputs) that causes an AI system to ignore its original instructions and follow the attacker’s. National Cyber Security Centre UK (NCSC) have warned that prompt injection can echo the impact of classic input attacks (think “the SQL injection of the future”), especially when LLMs are wired to tools or data sources. Microsoft and others have publicly described layered defences for indirect prompt injection where the malicious instructions live in external content the model later ingests. 

A late-2024 investigation also demonstrated how hidden content on the open web could manipulate LLM-powered summarization, underscoring the risk of trusting model outputs without guardrails or filtering. 

2) Model and data poisoning

Poisoning attacks insert tainted examples into training or fine-tuning data so a model behaves normally until a trigger appears. “Nightshade,” a University of Chicago research project, showed that even a relatively small number of poisoned samples can degrade or subvert text-to-image models, and that effects can “bleed” to related concepts. That’s not a hypothetical, it’s a concrete demonstration of feasible poisoning. 

3) AI supply-chain malware (malicious models and artifacts)

In 2024, researchers discovered backdoored and malicious AI models uploaded to public repositories. Because many frameworks load models via Python Pickle (which can execute arbitrary code), simply deserializing an untrusted model can lead to compromise. This is a new kind of software-supply-chain risk: the “dependency” isn’t a library, it’s the model file itself. Multiple independent reports and advisories documented hundreds of such malicious uploads.

Why your existing scanners miss AI risks

Most application scanners don’t understand models. SAST/DAST tools excel at finding classic flaws (XSS, injection into SQL/LDAP, unsafe deserialization) in deterministic code paths. But AI risks often live outside traditional code such as in training data provenance, fine-tuning pipelines, retrieval indexes, model weights, and the behaviour of chained agents. The “vulnerability” may not be a bug. It could be a behaviour that emerges under certain prompts, or a poisoned vector that only triggers on a specific pattern. Industry analyses and government guidance stress that AI risks frequently manifest in model content/behaviour and supply-chain artifacts, not as conventional code vulnerabilities.

Non-determinism breaks test oracles. Because model outputs vary with temperature, context window, and retrieval state, you can’t rely on a single “expected” output. That makes conventional regression testing insufficient on its own and highlights the need for adversarial evaluation, red-teaming, and behaviour-based guardrails specific to LLMs. Community standards like the OWASP Top 10 for LLM Applications and government guidelines (CISA/NCSC) call for these AI-aware practices.

AI supply chains are opaque. Where did your model come from? What training data was used? Was the artifact integrity-checked? Traditional SBOM scanners don’t parse model graphs or vector stores; you need AI bills of materials (AI BOMs), artifact signing, and repository hygiene that reach into the ML stack. Recent reporting on AI supply-chain threats and malicious model uploads underscores why. 

Regulatory momentum: accountability is arriving

EU AI Act. Europe’s AI Act entered into force on 1 August 2024, with obligations phasing in by risk category over the next several years (general-purpose AI obligations beginning sooner, high-risk obligations largely by 2026–2027). Even organizations outside the EU can be in scope if they place AI systems on the EU market or their outputs are used in the EU. Implementation timelines published by the EU and Parliament research services make clear: compliance dates are legally binding and moving forward. 

SEC expectations. In the U.S., the SEC has already taken action against firms for misleading “AI-washing” claims in 2024. Separately, the Commission’s 2023 cybersecurity rules require public companies to disclose material cybersecurity incidents promptly on Form 8-K if an AI incident rises to materiality, it can fall within that regime. Together, these moves raise the bar on accurate AI disclosures and governance. 

Emerging standards. ISO/IEC 42001 (Dec 2023) is the first AI management-system standard, offering a governance scaffold much like ISO 27001 does for information security. NIST’s AI RMF and its Generative AI profile (July 2024), plus the SSDF Community Profile for GenAI development, provide risk and secure-development practices to operationalise. 

Bottom line: regulators and standards bodies now expect AI-specific risk management, not just generic cybersecurity hygiene.

How to assess AI risk across custom and third-party models

A practical assessment should map to the system you actually run. Use this layered view:

1) Data layer

  • Provenance & consent. Can you trace where training/fine-tuning data came from? Are licences and consents documented? (EU AI Act obligations and ISO 42001 both push hard on documentation and governance. 
  • Poisoning exposure. Evaluate ingestion paths for untrusted content. Run targeted tests using known poisoning methods (e.g., prompt-specific poisoning demonstrations) and monitor for anomalous behaviours triggered by rare tokens or inputs. 

2) Model artifacts and supply chain

  • Artifact integrity. Require signed model artifacts and avoid unsafe serialisation formats where possible (or load them in sandboxed, network-isolated environments). Verify hashes and vendor attestations. The 2024 wave of malicious model reports shows why. 
  • AI BOM (AI BOM). Maintain metadata on base model, fine-tuning datasets, checkpoints, guardrails, and evals. Incorporate reviews into change control.

3) Application layer (LLM apps, agents, RAG)

  • Prompt/Tool isolation. Treat prompts as untrusted inputs. Use allow-listed tools and scopes. Never let LLMs execute commands or browse the web without policy constraints and auditing. Government guidance emphasises “secure by design” for AI features. 
  • Context integrity. Sanitize retrieved content (strip scripts/metadata, neutralize HTML/Markdown), and deploy “prompt shields” or input filters to catch adversarial patterns before the model sees them.

4) Behavioural assurance

  • Adversarial evaluation & red-teaming. Test jailbreaks, prompt injection, data exfiltration, and tool abuse. Align your test sets with common LLM risks (e.g., OWASP LLM Top 10) and MITRE ATLAS tactics. 
  • Guardrail efficacy monitoring. Track bypass rates and drift. Non-determinism means you need continuous evals, not one-time certification.

5) Third-party models and gateways

  • Contractual controls. Require suppliers to disclose model lineage, training sources, fine-tuning data handling, eval results, and mitigation practices aligned to NIST AI RMF / ISO 42001. Map their obligations to your own AI Act exposure if operating in or serving the EU. 
  • Runtime controls. Even when using a vendor-hosted model, route calls through your own policy enforcement (input/output filters, data masking, PII scrubbing) and log every prompt/response for audit.

Building an enterprise AI readiness checklist

Use this as a starting point and tailor by sector and risk appetite.

Governance & accountability

  • Adopt a framework. Declare NIST AI RMF as your baseline risk model, map controls to ISO/IEC 42001 requirements and your existing Information Security Management System.

  • Define ownership. Name a cross-functional AI risk owner (security + data + legal + product). Establish an AI change-control board for new use cases.

  • Documentation & traceability. Maintain an AI use-case register, model cards, data cards, eval reports, and third-party attestations valuable for audits and, in the EU, for AI Act conformity assessments. 

Data controls

  • Data minimisation & masking. Keep sensitive data out of prompts. If you must include it, apply masking/tokenization upstream.

  • Provenance tracking. Record sources and licences for training/fine-tuning, institute human review for high-risk datasets.

  • Poisoning defences. Filter training data for anomalies. Consider canaries and influence-function analysis during training to detect suspicious gradients or triggers. (Research and standards bodies highlight poisoning as a top risk.) 

Model & supply-chain security

  • Artifact integrity. Require signatures/attestations for models, quarantine and scan artifacts and avoid loading unvetted Pickle-based models in production.

  • AI BOM & SBOM. Maintain model and software bills of material. Pin versions and verify hashes.

  • Repository hygiene. Restrict who can upload models internally. Enforce mandatory reviews for fine-tune datasets and prompts.

Application hardening

  • Least-privilege agents. Scope tool access tightly; use stable APIs with clear contracts.

  • Prompt handling. Treat all external context as untrusted. Strip active content, enforce output schemas, implement token budgets and allow-lists for tool calls. Government and vendor guidance on prompt shields and secure-by-design AI supports this approach.

  • RAG hygiene. Curate your retrieval corpora. Quarantine and review new sources before they become retrievable and periodically re-embed to mitigate drift.

Evaluation, monitoring, and response

  • Adversarial testing. Run regular red-team exercises using prompt-injection and jailbreak corpora, measure bypass rates, test cross-tool exfiltration (e.g., “send email,” “fetch URL”). Align scenarios to OWASP LLM Top 10 and ATLAS.

  • Guardrails in depth. Combine deterministic filters (policy, regex, schemas) with probabilistic detectors (toxicity, PII, jailbreak heuristics).

  • Telemetry. Log inputs/outputs, tool invocations, and safety flags. Route high-risk outputs to human-in-the-loop review.

  • Incident playbooks. Extend your IR plan to AI-specific events (prompt-injection abuse, model exfiltration, poisoning discovery). Consider materiality: if an AI incident is material, U.S. public companies have disclosure obligations under the SEC’s 2023 rules.

Compliance & external alignment

  • Map to the EU AI Act. Classify systems by risk, plan for documentation/technical file requirements, and set timelines against the Act’s phased obligations.

  • Adopt secure-development guidance. Align GenAI development with NIST’s SSDF Community Profile for AI and with multi-agency secure AI guidance. 
  • Claims discipline. Validate marketing/IR statements about AI capabilities. You don’t want to be the next “AI-washing” headline. 

Best-practice guardrails you can deploy now

  1. Gate and sanitize context. Build a “content firewall” in front of your models: strip scripts/HTML, normalise Unicode, remove hidden text, and quarantine untrusted sources before retrieval. Use pattern-based and ML-based prompt shields to detect adversarial input.

  2. Template and constrain outputs. Use structured output (JSON schemas), allow-lists for tools/APIs, and strong post-processing to reject responses that violate policy or format. This reduces the blast radius of successful jailbreaks.

  3. Defense-in-depth against indirect prompt injection. Separate model instructions from retrieved content, prepend immutable system prompts server-side, and inject “do-not-follow external instructions” rules. Add cross-checks (secondary classifiers or rule-based verifiers) on high-risk actions. Industry blogs from large vendors outline layered mitigations here.

  4. Secure your model supply chain. Prefer formats that avoid arbitrary code execution, insist on signed artifacts, and load third-party models in sandboxes. The 2024 malicious-model incidents are a clear warning.

  5. Adopt recognized frameworks. Use NIST AI RMF to structure risk, ISO/IEC 42001 for governance and continual improvement, and OWASP LLM Top 10 as a testing baseline.

Final thoughts for leaders

AI is now part of your attack surface. Regulators expect you to treat it that way, and adversaries already do. You don’t need to halt innovation, but you do need an AI-aware risk program:

  • Anchor governance in NIST AI RMF and ISO/IEC 42001.

  • Classify systems against the EU AI Act timelines if you operate in or serve the EU.

  • Extend your SDLC with GenAI-specific secure-development practices.

  • Test like an attacker using OWASP LLM Top 10 and MITRE ATLAS playbooks.

  • Harden the supply chain around data, models, and retrieval.

  • Instrument runtime with guardrails, monitoring, and incident playbooks.

Enterprises that do this will move faster because they’ll ship AI with confidence.

Don't miss these stories: