Researchers upend the AI status quo by eliminating matrix multiplication in LLMs.

Illustration of a brain inside a light bulb.

The researchers claim to have developed a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. It essentially redesigns neural network operations that are currently accelerated by GPU chips. The findings, detailed in a recent preprint paper by researchers at the University of California Santa Cruz, UC Davis, LuxiTech, and Soochow University, could have profound effects on the environmental impact and operational costs of AI systems.

Matrix multiplication (often abbreviated to “MatMul”) is at the heart of most neural network computation today, and GPUs are particularly good at performing this math quickly because they can execute large numbers of multiplication operations in parallel. That ability momentarily made Nvidia the world's most valuable company last week. The company currently holds a 98 percent market share for data center GPUs, which are commonly used to power AI systems such as ChatGPT and Google Gemini.
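To see why MatMul dominates the workload, it helps to look at what a single dense layer actually computes. The sketch below (plain NumPy, with hypothetical layer sizes not taken from the paper) shows that one forward pass through one layer is already hundreds of millions of multiply-add operations, which is exactly the kind of work GPUs parallelize.

```python
# A minimal sketch of why MatMul dominates neural network workloads:
# a dense layer's forward pass is one matrix multiplication between
# activations and weights. Sizes below are illustrative, not from the paper.
import numpy as np

batch, d_in, d_out = 32, 4096, 4096                     # hypothetical layer sizes
x = np.random.randn(batch, d_in).astype(np.float32)     # input activations
W = np.random.randn(d_in, d_out).astype(np.float32)     # weight matrix

y = x @ W                      # one layer = batch * d_in * d_out multiply-adds
print(y.shape)                 # (32, 4096)
print(batch * d_in * d_out)    # 536,870,912 -- roughly 537 million multiply-adds
```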

In the new paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe building a custom 2.7 billion parameter model without using MatMul that offers performance similar to traditional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that consumes about 13 watts of power (not counting the GPU's power draw). The implication is that a more efficient FPGA “paves the way for the development of more efficient and hardware-friendly architectures,” they write.

The paper doesn't provide power estimates for traditional LLMs, but this post from UC Santa Cruz estimates around 700 watts for a traditional model. However, in our experience, you can competently run a 2.7B parameter version of Llama 2 on a home PC with an RTX 3060 (which consumes about 200 watts at peak) powered by a 500-watt power supply. So, if you could theoretically run an LLM on an FPGA (without a GPU) in just 13 watts, that would be a 38-fold reduction in power consumption.
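For readers who want to check the arithmetic, the comparison above works out roughly as follows. The wattage figures are the ones reported in this article, not new measurements.

```python
# Rough power-comparison arithmetic using the figures cited in the article
fpga_watts = 13          # custom FPGA running the 1.3B MatMul-free model
psu_watts = 500          # home PC power supply in the Llama 2 comparison
traditional_watts = 700  # UC Santa Cruz estimate for a traditional model

print(psu_watts / fpga_watts)          # ~38.5 -> the "38-fold" figure
print(traditional_watts / fpga_watts)  # ~53.8 vs. the 700-watt estimate
```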

The technique has not yet been peer-reviewed, but the researchers (Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian) claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performance language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, especially for deployment on resource-constrained hardware such as smartphones.

Eliminating Matrix Arithmetic

In the paper, the researchers cite BitNet (the so-called “1-bit” transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
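As a rough illustration of what binary and ternary weights mean in practice, the snippet below quantizes a floating-point weight matrix to the values {-1, 0, +1} using an absolute-mean scale. This mirrors the general idea behind BitNet-style ternary quantization, but the details here are an assumption for illustration, not a reproduction of BitNet's training procedure.

```python
# Sketch of ternary weight quantization in the spirit of BitNet-style models.
# Illustrative only; the scaling rule below is an assumption.
import numpy as np

def ternarize(W: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} with a per-matrix scale."""
    scale = np.mean(np.abs(W)) + 1e-8          # absolute-mean scale (assumed)
    W_q = np.clip(np.round(W / scale), -1, 1)  # snap each weight to -1, 0, or +1
    return W_q.astype(np.int8), scale

W = np.random.randn(4, 4).astype(np.float32)
W_q, s = ternarize(W)
print(W_q)   # entries are only -1, 0, or +1
```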

However, they note that BitNet still relies on matrix multiplication in its self-attention mechanism. This limitation of BitNet served as a motivation for the present study, prompting them to develop a completely “MatMul-free” architecture that can maintain performance while eliminating matrix multiplication even in the attention mechanism.
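The payoff of ternary weights is that a layer's “matrix multiplication” no longer needs any multiplications at all: each output is just a sum of some inputs minus a sum of others. The sketch below illustrates that general idea in NumPy; it is not the paper's actual kernel or its replacement for attention, both of which are more involved.

```python
# Sketch: with ternary weights {-1, 0, +1}, a linear layer reduces to
# additions and subtractions only -- no multiplications are needed.
# Illustrative of the general idea, not the paper's implementation.
import numpy as np

def matmul_free_linear(x: np.ndarray, W_q: np.ndarray, scale: float) -> np.ndarray:
    """x: (batch, d_in) activations; W_q: (d_in, d_out) ternary weights."""
    out = np.zeros((x.shape[0], W_q.shape[1]), dtype=x.dtype)
    for j in range(W_q.shape[1]):
        plus = x[:, W_q[:, j] == 1].sum(axis=1)    # add inputs where the weight is +1
        minus = x[:, W_q[:, j] == -1].sum(axis=1)  # subtract inputs where it is -1
        out[:, j] = plus - minus                   # zero weights are simply skipped
    return out * scale                             # one scalar rescale at the end

x = np.random.randn(2, 8).astype(np.float32)
W_q = np.random.choice([-1, 0, 1], size=(8, 4)).astype(np.int8)
# Matches an ordinary MatMul with the same ternary weights (scale = 1 here)
assert np.allclose(matmul_free_linear(x, W_q, 1.0), x @ W_q.astype(np.float32), atol=1e-5)
```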

