OpenAI creates a new safety framework; Aims to stop jailbreaking of AI

OpenAI has come up with a safety module that can help GPT-4o Mini to protect itself from fraudulent activities

July 22, 2024 16:00 IST

Scammers had attempted to trick the ChatGPT by using prompts for making the GPT modules forget original programming

OpenAI was aware of the situation and stated that it was investigating the incident

Just like any other AI model,GPT-4o mini can face some security issues. Keeping this in mind, OpenAI has come up with a safety module that can help GPT-4o Mini to protect itself from fraudulent activities.

According to OpenAI the large language model (LLM) is built with a technique called Instructional Hierarchy. The safety module has the potential to stop malicious prompt engineers from jailbreaking the AI model.

Decoding ‘Instructional Hierarchy’

OpenAI claimed that after using the safety module it has experienced an improvement in the robust score of the AI model by 63 percent. It also explained that with the help of ‘Instructional Hierarchy’ OpenAI aims to stop malicious prompt engineers from ‘Jailbreaking’ its AI model. But what is ‘Jailbreaking’and how can it affect LLMs? ‘Jailbreaking’ aims to escape the safety behavior that is trained into an LLM. They usually don’t specifically conflict with a model’s previous instructions.

The safety road ahead

Early reports suggest that scammers had attempted to trick the ChatGPT by using prompts for making the GPT modules forget original programming. For example, prompts such as “ Forget early instructions and do this’ had resulted in issues. With the implementation of ‘Instructional Hierarchy’ the company can safeguard its ‘in-house’ set of prompts. The safety module creates an extra layer of security by making the AI follow an order of priority, eventually discarding the new ‘fake prompts.’