Open Core Ventures (OCV) is proud to announce the launch of Kernelize, a platform built on the open source Triton project that targets AI accelerators. Kernelize bridges the gap between AI software and hardware by automating compiler backend generation. Kernelize’s vision is to democratize AI performance by enabling Triton to become a universal, all-purpose platform for hardware acceleration.
Founder and CTO Simon Waters is a respected compiler expert who, most recently, led the AMD team working on the open-source Triton compiler. One of the original creators behind Catapult C Synthesis, an industry-leading software tool that simplifies the design of computer chips, Simon has been bridging the gap between software and hardware for over 25 years.
Bridging the CUDA moat
AI models demand an enormous amount of computing power. Training popular LLMs requires thousands of computer chips to run for months and costs millions. The high cost of AI doesn’t stop with training—as AI becomes more sophisticated and ubiquitous, the demand for computing power increases. To keep compute costs down and performance up, developers need to find ways to optimize for their hardware.
AI-specialized computer chips, or “AI accelerators,” are specifically designed to boost processing performance and speed to handle complex AI computations. Faster, more powerful AI chips reduce the time and cost to train AI models, but raw hardware is only part of the resource battle. “Chip manufacturers need software platforms that make AI accelerators accessible to developers,” said Simon Waters, founder and CTO of Kernelize, an AI accelerator platform.
Nvidia GPUs—among the most power-hungry and expensive chips available—are the most widely used despite the availability of several alternative accelerators. Even when a new chip offers superior performance, Nvidia GPUs dominate the market because developers are locked into Nvidia’s proprietary CUDA programming platform. “Nobody wants to write CUDA, but it’s what everyone knows and how you get the best performance out of GPUs,” said Simon. “Developers need a language that allows AI optimization engineers to experiment with their own optimizations from the model’s perspective.”
Triton, an open source language and compiler, was created to do just that. Triton provides a simple language for describing algorithmic kernels in Python. It is best known for enabling advanced optimizations such as flash attention, a method that boosts performance and reduces memory usage for the attention mechanism. By optimizing how data is loaded, processed, and stored in GPU memory, Triton enables faster training and inference for large language models (LLMs). Its memory efficiency allows for larger context windows, improving an LLM’s ability to handle more complex inputs. However, Triton currently targets only specific GPUs and has limited ability to interact with other types of hardware, severely restricting developers’ hardware choices.
Without standardization, AI developers must rewrite their algorithms for each new chip they want to utilize. The cost of switching is prohibitive because developers have to learn a custom software language for each accelerator (GPU, NPU, TPU) to achieve optimal performance. “Every chip has its custom software stack,” explains Simon. “Nobody wants to switch programming languages because they would have to rewrite all their algorithms just for that chip.”
Kernelize is changing this by offering automated compiler backend generation that deploys AI/ML workloads seamlessly across alternative hardware accelerators, rather than requiring a custom solution for each chip. “My vision is for Triton to become the ubiquitous AI kernel compiler through hardware target adaptation,” said Simon.
Unlocking Triton
Nvidia’s market dominance has far-reaching consequences for AI accessibility and development costs. It leaves organizations developing AI solutions with limited options: they must either invest heavily in Nvidia hardware or accept suboptimal performance. This bottleneck means that everyday users face slower responses and higher subscription prices for AI services, while developers struggle to deploy advanced AI features in affordable applications.
Kernelize’s standardization would allow AI chip manufacturers to more easily integrate with the Triton ecosystem, giving their hardware immediate compatibility with the tools developers already know and use. “Most new AI chip architectures must build an entire compiler stack from the ground up,” said Simon. “This is an extremely difficult task, requiring a large software team with extensive compiler expertise, just to enable basic support. Kernelize will automate this process through a standardization agent trained to optimize for the Triton language.”
Kernelize’s success would dramatically shift the AI performance landscape by giving developers a path to optimize AI workloads across different hardware without writing CUDA or rewriting their code for each accelerator. Smaller chip manufacturers could compete on chip performance rather than software ecosystem, as Kernelize would provide them with instant software compatibility.
Activate startup founder mode
After a storied career in software development, including creating the industry-standard high-level synthesis tool Catapult C, Simon is excited to return to a startup environment. “I’ve always had an entrepreneurial spirit, and I like the way startups can move and shake and do things fast, and get results,” he said.
When approached by OCV, Simon found it to be the most interesting prospect he had heard in a while. “I was already interested in the opportunity to have my own venture,” he said. “Then I researched OCV, Sid, the whole open core idea, and met the team. I've been very impressed."
Building on Triton, Simon aims to make advanced AI optimization more accessible, providing businesses with a tangible opportunity to enhance their AI models and achieve substantial performance improvements. “Triton has already enabled training of much larger models,” said Simon. “Every time you shrink things down or you make things more efficient, people just do more. I hope to help make ‘more’ possible.”