Researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. This fundamentally redesigns neural network operations that are currently accelerated by GPU chips. The findings, detailed in a recent preprint paper from researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have deep implications for the environmental impact and operational costs of AI systems.
Matrix multiplication (often abbreviated to “MatMul”) is at the center of most neural network computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel. That ability momentarily made Nvidia the most valuable company in the world last week; the company currently holds an estimated 98 percent market share for data center GPUs, which are commonly used to power AI systems like ChatGPT and Google Gemini.
In the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar performance to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU’s power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.
The paper doesn’t provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, in our experience, you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM in only 13 watts on an FPGA (without a GPU), that would be a 38-fold decrease in power usage.
The technique has not yet been peer-reviewed, but the researchers—Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian—claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones.
Doing away with matrix math
In the paper, the researchers mention BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
However, they note that BitNet still relied on matrix multiplications in its self-attention mechanism. Limitations of BitNet served as a motivation for the current study, pushing them to develop a completely “MatMul-free” architecture that could maintain performance while eliminating matrix multiplications even in the attention mechanism.