ChatGPT Moderation: How can AIGC be aligned with human values? How can parsnips language and ecological/destructive discourse be identified?
ChatGPT API Moderation Model
To keep artificial intelligence aligned with healthy human values, ChatGPT includes a moderation model (Moderation Model) designed to identify malicious language and instructions involving pornography, violence, insults, vulgarity, and the like. This goal dovetails with the need to screen out parsnips language in English-language teaching (note: parsnips refers to sensitive topics such as politics, alcohol, religion, sex, narcotics, -isms, and pork), to distinguish ecological from destructive discourse in ecological discourse analysis, and to identify ideology (values) in critical discourse analysis. The model therefore has strong application potential in language teaching, discourse research, and the development of teaching materials. The following article is reposted in the hope that readers will find it helpful.
Discover in this article what the ChatGPT API Moderation model is, what the 7 categories it uses are, and how to call the API and interpret its results.
ChatGPT API Moderation model
The OpenAI API provides the possibility to classify any text to ensure it complies with their usage policies, using a binary classification. This classification is integrated into their Moderation model, which one can call with the openai library in Python.
Seven categories are used in the OpenAI model: Hate, Hate/Threatening, Self-harm, Sexual, Sexual/minors, Violence, Violence/graphic.
One can use them to filter out any inappropriate content (comments on a website, client inputs in chatbot requests…).
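As a sketch of that filtering idea, suppose you have already collected the `flagged` field of each moderation result; the helper below (hypothetical, not part of the API) then keeps only the clean texts:

```python
# Hypothetical helper: drop any text whose moderation result was flagged.
# `flags` is a parallel list of booleans, i.e. the `flagged` field returned
# by the Moderation model for each text.
def filter_flagged(texts, flags):
    return [t for t, flagged in zip(texts, flags) if not flagged]

comments = ["I love chocolate", "some hateful comment"]
flags = [False, True]  # as returned by the moderation endpoint
print(filter_flagged(comments, flags))  # ['I love chocolate']
```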

OpenAI API Moderation method
The method to call for the moderation classification is: openai.Moderation.create
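As a minimal sketch of what that call does under the hood: openai.Moderation.create wraps a POST to the /v1/moderations REST endpoint, so the request can be built with the standard library alone. This example only constructs the request (no network call is made); actually sending it requires a valid key in the OPENAI_API_KEY environment variable.

```python
import json
import os
import urllib.request

def build_moderation_request(text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/moderations."""
    return urllib.request.Request(
        "https://api.openai.com/v1/moderations",
        data=json.dumps({"input": text}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        },
        method="POST",
    )

req = build_moderation_request("I love chocolate")
print(req.full_url)          # https://api.openai.com/v1/moderations
print(json.loads(req.data))  # {'input': 'I love chocolate'}

# To actually send the request:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```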
The answer is a JSON object. In it, you have:
model: The model currently used is called “text-moderation-004”.
results: a list with one entry per input, in which you have:
categories: For each of the 7 categories, a binary classification:
true if the input text does violate the given category
false if it does not
category_scores: For each category, a score is calculated. It is not a probability: the lower the score, the less problematic the content; the higher the score, the more it violates the above categories.
flagged: The final classification of the input:
“false” if the input text does not violate OpenAI’s policies.
“true” if it does: if at least one category is true, this flag is set to true too.
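Putting the pieces together, a response has roughly the following shape (the values below are illustrative only, not real API output). Note that `flagged` is true exactly when at least one category is true:

```python
# Illustrative Moderation response in the shape described above
# (made-up values, not actual API output).
response = {
    "model": "text-moderation-004",
    "results": [
        {
            "categories": {
                "hate": False,
                "hate/threatening": False,
                "self-harm": False,
                "sexual": False,
                "sexual/minors": False,
                "violence": False,
                "violence/graphic": False,
            },
            "category_scores": {
                "hate": 1.2e-05,
                "hate/threatening": 3.4e-09,
                "self-harm": 5.6e-08,
                "sexual": 2.1e-06,
                "sexual/minors": 9.8e-09,
                "violence": 4.3e-07,
                "violence/graphic": 1.1e-08,
            },
            "flagged": False,
        }
    ],
}

result = response["results"][0]
# `flagged` mirrors the per-category booleans: true iff any category is true.
assert result["flagged"] == any(result["categories"].values())
print("flagged:", result["flagged"])  # flagged: False
```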
Moderation API Call
Standard Call
The classification of the prompt “I love chocolate” is “false”, meaning it does not violate any of the above categories.
In the detailed output, all scores are very low, thus the given categories are all “false”.
Call with a violation
The prompt given in the following request is just for illustration. It is not a personal opinion.
The output is “true”, meaning there is a violation. This is because the input violates the first category, “hate”, with a score of 0.52, while the other categories all show very low scores.
Some variants
When the input describes a personal belief, the classification is correct. However, when it describes a general opinion, the model does not classify it as violating the policies.
Here is an example where the classification is false even though the input has a negative connotation:
Here is another variant, where a simple comma can change the score considerably (the classification in both cases is “true”):
The score is about 0.66
Here the score is about 0.954 (with a simple comma):
Summary
In this article, you have learned how to use the ChatGPT API Moderation model, which you can put in place in your own project or website to filter out inputs or comments that violate OpenAI’s usage policies.
I hope you enjoyed reading the article. Leave me a SanLian (a like, favorite, and share) :-)
The English portion of this article is reposted from: https://machinelearning-basics.com/chatgpt-api-moderation-model/