Google has officially launched Gemini 1.5 Flash-8B, the latest production-ready variant of its Flash model, designed to deliver significant improvements in performance and cost-effectiveness for developers. Google claims the new model costs 50% less than its predecessor, Gemini 1.5 Flash.
Google describes Flash-8B as a smaller and faster variant of the Gemini 1.5 Flash model, nearly matching its performance across multiple benchmarks. It excels in tasks like chat, transcription, and long-context language translation.
Gemini 1.5 Flash-8B comes with an increase in rate limits, allowing developers to send up to 4,000 requests per minute. This effectively doubles the previous limit, enabling higher-volume tasks and applications to run more smoothly. Moreover, users can expect lower latency on smaller prompts, further improving the model's responsiveness and usability, the company says in its developer blog.
The development of Flash-8B was guided by extensive developer feedback. The model handles tasks such as chat interactions, transcription, and long-context language translation, making it a versatile tool for a wide range of applications. Google says the model has been optimised based on that feedback, reflecting its commitment to creating tools that meet users' needs and enhance their building capabilities.
With this stable release, Google DeepMind is also emphasising the cost efficiency of its offerings. The new pricing structure is set at $0.0375 per million input tokens and $0.15 per million output tokens for prompts under 128K tokens. For cached prompts, the cost is $0.01 per million tokens.
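For a back-of-the-envelope estimate, the quoted per-million-token rates translate into a simple calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes cached tokens are billed at the cached rate in place of the standard input rate.

```python
def flash_8b_cost_usd(input_tokens, output_tokens, cached_tokens=0):
    """Estimate Gemini 1.5 Flash-8B cost in USD for prompts under 128K tokens,
    using the per-million-token rates quoted above (illustrative sketch)."""
    INPUT_RATE = 0.0375 / 1_000_000   # USD per non-cached input token
    OUTPUT_RATE = 0.15 / 1_000_000    # USD per output token
    CACHED_RATE = 0.01 / 1_000_000    # USD per cached input token (assumed billing model)
    return (input_tokens * INPUT_RATE
            + output_tokens * OUTPUT_RATE
            + cached_tokens * CACHED_RATE)

# Example: 1M input tokens and 200K output tokens
print(round(flash_8b_cost_usd(1_000_000, 200_000), 4))  # 0.0675
```

At these rates, a million input tokens plus 200K output tokens comes to under seven cents, which underlines the cost positioning Google is emphasising.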
Access to Gemini 1.5 Flash-8B is available for free via Google AI Studio and the Gemini API. For those on the paid tier, billing for this new model will commence on October 14.
Google sees potential for Flash-8B in high-volume multimodal applications and long-context summarisation tasks, indicating its versatility in various AI applications.