Without guardrails, your LLM will be jailbroken, will leak PII, will produce content that violates policy. Production guardrails are layered: input validation → prompt construction → output validation → user-facing rendering. Each layer catches different attacks.
Input layer
Detect: prompt injection ('ignore previous instructions'), PII (regex + NER), profanity, off-topic queries. Block or rewrite before reaching the LLM. Tools: NeMo Guardrails, LlamaGuard, regex+spaCy.
Prompt construction
Use a fixed system prompt with strict format. Wrap user input in delimiters: 'User query (untrusted): <<{query}>>'. Instruct the model to refuse if input tries to override instructions. Still bypassable but raises the bar.
Output layer
Validate: PII leakage (was user's SSN in the context? did it appear in output?), schema compliance, prohibited content (LlamaGuard or moderation API). Re-prompt or fall back to safe response if violated.
Rendering layer
Even safe text becomes dangerous in HTML context. Always render LLM output as plain text or sanitize through DOMPurify. LLMs CAN be tricked into emitting <script> tags — assume hostile output.
Audit trail
Log every prompt + response + guardrail decision keyed by user_id. Required for incident response and regulatory audit. Watch storage cost; some teams sample non-flagged interactions and keep flagged 100%.