Just like any other AI model,GPT-4o mini can face some security issues. Keeping this in mind, OpenAI has come up with a safety module that can help GPT-4o Mini to protect itself from fraudulent activities.
According to OpenAI the large language model (LLM) is built with a technique called Instructional Hierarchy. The safety module has the potential to stop malicious prompt engineers from jailbreaking the AI model.
Decoding ‘Instructional Hierarchy’
OpenAI claimed that after using the safety module it has experienced an improvement in the robust score of the AI model by 63 percent. It also explained that with the help of ‘Instructional Hierarchy’ OpenAI aims to stop malicious prompt engineers from ‘Jailbreaking’ its AI model. But what is ‘Jailbreaking’and how can it affect LLMs? ‘Jailbreaking’ aims to escape the safety behavior that is trained into an LLM. They usually don’t specifically conflict with a model’s previous instructions.
In addition to this. ‘Jailbreaking’ can lead to a myriad of attack variants that can allow adversaries to perform malicious tasks. These tasks can include generating spam, misinformation, or producing pornographic content.
The ‘Instructional Hierarchy’ technique has the potential to combat issues where the AI model generates not only offensive text or images but also harmful content such as methods to create a chemical explosive or ways to hack a website. So how does the ‘Instructional Hierarchy’ work?If simplified, the technique dictates how models should behave when instructions of different priorities conflict. It’s more like attending to ‘prompts’ that come first when arranged in an ascending to descending method hierarchy.
The safety road ahead
Early reports suggest that scammers had attempted to trick the ChatGPT by using prompts for making the GPT modules forget original programming. For example, prompts such as “ Forget early instructions and do this’ had resulted in issues. With the implementation of ‘Instructional Hierarchy’ the company can safeguard its ‘in-house’ set of prompts. The safety module creates an extra layer of security by making the AI follow an order of priority, eventually discarding the new ‘fake prompts.’
Furthermore, OpenAI explained in its official blog that the new safety model will be developed more. It aims to add more security features to the Instructional Hierarchy’. Initially the model is focused at identifying ‘fraudulent prompts.’ In addition to this OpenAI aims to focus on identifying AI-generated images and video clips, including AI-generated voice instructions.
Follow FE Tech Bytes on Twitter, Instagram, LinkedIn, Facebook.
