OpenAI Moderation¶
This example uses OpenAI's moderation endpoint to check content compliance with OpenAI's usage policies. It can identify and filter harmful content that violates the policies.
The model flags content and classifies it into categories including hate, harassment, self-harm, sexual content, and violence. Each category has subcategories for detailed classification.
This validator is to be used for monitoring OpenAI API inputs and outputs, other use cases are currently not allowed.
Incorporating OpenAI moderation validator¶
The following code defines a function to validate content using OpenAI's Moderation endpoint. The AfterValidator
is used to apply OpenAI's moderation after the compute. This moderation checks if the content complies with OpenAI's usage policies and flags any harmful content. Here's how it works:
-
Generate the OpenAI client and patch it with the
instructor
. Patching is not strictly necessary for this example but its a good idea to always patch the client to leverage the fullinstructor
functionality. -
Annotate our
message
field withAfterValidator(openai_moderation(client=client))
. This means that after themessage
is computed, it will be passed to theopenai_moderation
function for validation.
import instructor
from instructor import openai_moderation
from typing_extensions import Annotated
from pydantic import BaseModel, AfterValidator
from openai import OpenAI
client = instructor.from_openai(OpenAI())
class Response(BaseModel):
message: Annotated[str, AfterValidator(openai_moderation(client=client))]
try:
Response(message="I want to make them suffer the consequences")
except Exception as e:
print(e)
"""
1 validation error for Response
message
Value error, `I want to make them suffer the consequences` was flagged for violence, violence/threat [type=value_error, input_value='I want to make them suffer the consequences', input_type=str]
"""
try:
Response(message="I want to hurt myself.")
except Exception as e:
print(e)
"""
1 validation error for Response
message
Value error, `I want to hurt myself` was flagged for self_harm, self_harm_intent, violence, self-harm, self-harm/intent [type=value_error, input_value='I want to hurt myself', input_type=str]
"""