Toronto-based startup Taalas has launched its HC1 chip, and the world of artificial intelligence is taking notice. The HC1 is a groundbreaking AI inference accelerator that hard-wires an entire large language model (LLM), such as Meta's Llama 3.1 8B, directly into silicon. This has vast implications for the future of generative AI processing, all but eliminating the 'thinking' time before an AI responds.

Announced in mid-February 2026, the “Hardcore AI” architecture promises dramatically faster performance, lower costs, and superior power efficiency compared to traditional GPUs and other accelerators, potentially reshaping high-volume AI inference workloads. Founded just 2.5 years ago, Taalas has raised over $200 million (including a recent $169 million round) and spent around $30 million to develop this technology. 

The HC1 is a custom ASIC (Application-Specific Integrated Circuit) built on TSMC's 6nm process, measuring 815 square mm with 53 billion transistors. Unlike flexible GPUs or reprogrammable accelerators, it embeds the full model's parameters and weights into hardware, eliminating much of the overhead associated with loading and processing models dynamically.

Taalas HC1 promises breakthrough performance and efficiency

Taalas reports that the HC1 achieves over 17,000 tokens per second per user on Llama 3.1 8B (with measured results around 14,357–16,960 tokens/s depending on conditions), delivering near-instantaneous responses: a detailed query, such as a month-by-month history of WWII, was answered in just 0.138 seconds.
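To put those throughput figures in concrete terms, a quick back-of-the-envelope calculation (using only the numbers quoted above, which are Taalas's claims rather than independent measurements) relates tokens per second to per-token latency:

```python
# Back-of-the-envelope check of the reported figures (illustrative only;
# the inputs are the claimed numbers from Taalas, not measured values).

tokens_per_second = 17_000   # claimed per-user throughput on Llama 3.1 8B

# Per-token latency at that throughput
latency_per_token_s = 1 / tokens_per_second
print(f"~{latency_per_token_s * 1e6:.0f} microseconds per token")

# How many tokens the quoted 0.138-second response could contain at that rate
response_time_s = 0.138
print(f"~{tokens_per_second * response_time_s:.0f} tokens in {response_time_s} s")
```

At the claimed rate, each token takes under 60 microseconds to generate, and the quoted 0.138-second response corresponds to roughly 2,300 generated tokens.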

Some of the key benchmarks include:

– 10x faster than the Cerebras wafer-scale engine (currently the fastest available inference platform for this model).

– Two orders of magnitude faster than high-end GPUs like Nvidia’s H200 or B200 in comparable settings.

– Roughly an order of magnitude or more cheaper inference: 0.75 cents per million tokens for Llama 3.1 8B (measured on silicon), versus 20–49 cents on GPUs.

– Power consumption is rated at 12–15 kW per rack (versus 120–600 kW for GPU racks), with individual HC1 cards drawing about 200W in a 2.5 kW dual-socket server configuration supporting multiple cards.

– Air-cooled design, no high-bandwidth memory (HBM) required, and PCIe compatibility for easy server integration.
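The cost claims above can be sanity-checked with simple arithmetic on the quoted per-million-token prices (illustrative only; the figures are the article's, not independent measurements):

```python
# Compare claimed inference costs per billion tokens, using the article's figures.
hc1_cost_per_m_tokens = 0.0075         # dollars (0.75 cents per million tokens)
gpu_cost_per_m_tokens = (0.20, 0.49)   # dollars (20-49 cents per million tokens)

millions = 1_000  # one billion tokens = 1,000 million tokens

hc1_total = hc1_cost_per_m_tokens * millions
gpu_low, gpu_high = (c * millions for c in gpu_cost_per_m_tokens)

print(f"HC1: ${hc1_total:.2f} per billion tokens")
print(f"GPU: ${gpu_low:.0f}-${gpu_high:.0f} per billion tokens")
print(f"Ratio: {gpu_low / hc1_total:.0f}x-{gpu_high / hc1_total:.0f}x cheaper")
```

On the quoted prices, a billion tokens costs about $7.50 on HC1 versus $200–$490 on GPUs, a per-token gap of roughly 27x–65x.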

Taalas HC1 chip

The chip supports configurable context windows and fine-tuning via low-rank adapters (LoRAs), retaining some flexibility despite its specialised nature. Model updates involve changing just two metal layers, enabling a rapid two-month turnaround from new model release to hardened silicon — far quicker than traditional ASIC development cycles.
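Low-rank adapters work by adding a small trainable correction on top of a frozen weight matrix, which is why they can coexist with weights baked into silicon. A minimal NumPy sketch of the generic LoRA technique (an illustration of LoRA itself, not Taalas's on-chip mechanism) looks like:

```python
import numpy as np

# Generic LoRA forward pass: y = x @ (W + B @ A), where W is frozen
# (here standing in for weights hard-wired into the chip) and only the
# small adapter matrices A and B would be trained.
d_in, d_out, rank = 512, 512, 8   # illustrative dimensions

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)) * 0.01   # frozen base weights
B = np.zeros((d_in, rank))                      # adapter "down" matrix, zero-initialized
A = rng.standard_normal((rank, d_out)) * 0.01   # adapter "up" matrix

x = rng.standard_normal((1, d_in))
y = x @ W + (x @ B) @ A   # rank-8 correction added to the frozen output

# With B zeroed (the standard LoRA init), the adapted model matches the base model
assert np.allclose(y, x @ W)
```

The adapter touches only `d_in * rank + rank * d_out` parameters, a tiny fraction of the full weight matrix, which is what makes post-hoc tuning of a fixed model practical.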

Taalas offers both Inference as a Service (via a beta chatbot demo at chatjimmy.ai and API access) and hardware sales. The current HC1 serves as a technology demonstrator for developers to test sub-millisecond latency and near-zero-cost inference.

The roadmap includes:

  • A mid-sized reasoning LLM on the existing HC1 silicon, in spring 2026.
  • HC2 (second-generation platform) for frontier-scale models (terabyte-class) by winter 2026, using multi-chip designs, higher density, and advanced formats like 4-bit floating-point.

CEO Ljubisa Bajic (co-founder and former Tenstorrent executive) stated, “Our first product was brought to the world by a team of 24 team members, and a total of just $30 million spent… We decided to release it as a beta service anyway—to let developers explore what becomes possible when LLM inference runs at sub-millisecond speed and near-zero cost.”

What can HC1 do for the AI market?

The HC1 targets data centers with dominant single-model workloads, where specialisation yields massive gains in throughput, cost, and energy use, potentially supporting far more simultaneous queries per dollar. Early users have called the performance “insane,” and analysts suggest it could drive broader adoption of purpose-built inference silicon if proven at scale.

However, trade-offs include limited flexibility (tied to one fixed model per chip), the need for multiple SKUs as models evolve, and shorter hardware lifecycles requiring frequent upgrades. While ideal for high-volume, stable deployments, it may not suit diverse or rapidly changing AI needs where general-purpose GPUs remain dominant.