Risk metrics

Risk Metrics provide visibility into potential security issues by analyzing each query, and the agent's response to it, for fidelity, jailbreak attempts, prompt leakage, and profanity. Our proprietary analytics agent processes conversations to detect these risks so that you can improve persona design, protect sensitive instructions, and maintain safe user experiences.

⚠️ Disclaimer: Risk Metrics are evaluated using lightweight models such as GPT-4o-mini. These models can misclassify individual queries, so treat the results as signals for monitoring trends and identifying areas that need attention rather than as definitive verdicts.

Currently available risk metrics are:

Fidelity
Jailbreak
Prompt leakage
Profanity

Fidelity

Fidelity measures whether the agent is staying true to its defined persona. A high number of persona failures may suggest that your persona requires refinement.

Good: The agent response is aligned with the defined persona.
Persona failure: The response did not align with the persona.
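
As a rough illustration of what "a high number of persona failures" could mean in practice, the sketch below computes a persona failure rate from a list of Fidelity labels. It assumes the labels are available to you as plain strings matching the values above; the function name and the input format are hypothetical and are not part of any documented export or API.

```python
from collections import Counter

# Hypothetical label strings, copied from the Fidelity values listed above.
GOOD = "Good"
PERSONA_FAILURE = "Persona failure"


def persona_failure_rate(labels):
    """Return the fraction of labeled responses that are persona failures."""
    counts = Counter(labels)
    total = counts[GOOD] + counts[PERSONA_FAILURE]
    if total == 0:
        # No labeled responses yet, so there is nothing to report.
        return 0.0
    return counts[PERSONA_FAILURE] / total


# Assumed sample data: one Fidelity label per agent response.
sample = ["Good", "Good", "Persona failure", "Good"]
print(f"Persona failure rate: {persona_failure_rate(sample):.0%}")
```

Tracking this rate over time, rather than reacting to single failures, fits the trend-monitoring use described in the disclaimer.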

Jailbreak

Jailbreak tracks whether users attempt to override restrictions placed on the agent, and whether those attempts succeed.

No event: No jailbreak attempt detected.
Jailbreak attempt: The user attempted to jailbreak the agent, but the attempt failed.
Jailbreak: The user successfully bypassed restrictions.
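
The three Jailbreak values form a natural severity order, which one might model as an ordered enum in order to report the worst outcome seen in a conversation. This is a minimal sketch under that assumption; the enum, the mapping, and the helper function are hypothetical names introduced for illustration only.

```python
from enum import IntEnum


class JailbreakOutcome(IntEnum):
    """Hypothetical severity ordering of the Jailbreak values above."""
    NO_EVENT = 0
    ATTEMPT = 1
    JAILBREAK = 2


# Assumed mapping from the label strings listed above to the enum.
LABEL_TO_OUTCOME = {
    "No event": JailbreakOutcome.NO_EVENT,
    "Jailbreak attempt": JailbreakOutcome.ATTEMPT,
    "Jailbreak": JailbreakOutcome.JAILBREAK,
}


def worst_outcome(labels):
    """Return the most severe outcome among the per-query labels."""
    return max(
        (LABEL_TO_OUTCOME[label] for label in labels),
        default=JailbreakOutcome.NO_EVENT,
    )


# Assumed sample data: one Jailbreak label per user query in a conversation.
conversation = ["No event", "Jailbreak attempt", "No event"]
print(worst_outcome(conversation).name)
```

The same pattern would apply to the Prompt leakage values, which follow the same no event, attempt, success structure.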

Prompt leakage

Prompt leakage tracks attempts to reveal the system prompt or other instructions given to the agent, and whether those attempts succeed.

No event: No leakage attempt detected.
Prompt leakage attempt: A user attempted to reveal the system prompt, but the attempt failed.
Prompt leaked: The system prompt was successfully exposed.

Profanity

Profanity tracks when NSFW or abusive user inputs are detected and blocked by filters.

No event: No profanity detected.
Detected: Profanity was detected in the user query.
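
To watch all four metrics together, one option is to count every label that differs from a metric's baseline value ("Good" for Fidelity, "No event" for the others). The sketch below assumes each analyzed query is represented as a dictionary keyed by metric name; that record shape, the key names, and the function are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Baseline (no-risk) label for each metric, taken from the lists above.
# The dictionary keys are hypothetical field names.
BASELINE = {
    "fidelity": "Good",
    "jailbreak": "No event",
    "prompt_leakage": "No event",
    "profanity": "No event",
}


def count_events(records):
    """Count non-baseline labels per (metric, label) pair."""
    events = Counter()
    for record in records:
        for metric, baseline in BASELINE.items():
            # A missing metric is treated as its baseline value.
            label = record.get(metric, baseline)
            if label != baseline:
                events[(metric, label)] += 1
    return events


# Assumed sample data: one record per analyzed query.
records = [
    {"fidelity": "Good", "jailbreak": "No event",
     "prompt_leakage": "No event", "profanity": "No event"},
    {"fidelity": "Persona failure", "jailbreak": "Jailbreak attempt",
     "prompt_leakage": "No event", "profanity": "Detected"},
]
for (metric, label), count in sorted(count_events(records).items()):
    print(f"{metric}: {label} x{count}")
```

Because the underlying classifications can be wrong for individual queries, counts like these are best read as trends over many conversations.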