Meta warns bit flips and other hardware faults cause AI errors • The Register

Meta points out another reason AI can produce garbage: hardware errors that corrupt data.

As noted in a paper published last week and written up on June 19, hardware failures can corrupt data. No prizes for Meta there – phenomena like “bit flips” that see data values change from zero to one, or vice versa, are well understood and have even been attributed to cosmic rays hitting memory or hard disks.

Meta labels such “undetected” hardware errors – we'll assume it means errors that aren't caught and dealt with on the fly – as “silent data corruptions” (SDCs). Its researchers suggest that when these kinds of errors occur in AI systems, they cause “parameter corruption, where the parameters of the AI model are corrupted and their original values are changed.”

This can result in incorrect, strange, or just generally bad output.
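To see why a single flipped bit can matter so much, here is a minimal Python sketch that inverts one bit in the 32-bit IEEE 754 encoding of a model weight. The `flip_bit` helper and the example weight are our own illustration, not code from Meta's paper, and we assume parameters are stored as float32.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as float32, with one bit of its encoding inverted.

    Hypothetical helper for illustration: bit 0 is the lowest mantissa bit,
    bits 23-30 are the exponent and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask inverts exactly that bit.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return corrupted


weight = 0.5
print(flip_bit(weight, 0))   # lowest mantissa bit: about 0.50000006
print(flip_bit(weight, 31))  # sign bit: -0.5
print(flip_bit(weight, 30))  # top exponent bit: about 1.7e38
```

The position of the flip decides the damage: a low mantissa bit is lost in the noise, while a high exponent bit turns a modest weight into an astronomically large one.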

“When this happens during AI inference/serving, it can potentially lead to incorrect or degraded model output for users, which ultimately affects the quality and reliability of AI services,” said Meta's boffins.

As we said, bit flips are nothing new – Meta has documented their prevalence in its own infrastructure, and dealing with these undetected errors is difficult at the best of times. In their latest paper, Meta's eggheads suggest that the AI stack complicates matters further.

“The increasing complexity and heterogeneity of AI hardware systems makes them increasingly susceptible to hardware failures,” the paper states.

What to do? Meta suggests measuring hardware vulnerabilities so that builders of AI systems at least understand the risks.

So its boffins proposed the “Parameter Vulnerability Factor” (PVF) – “a new metric we've introduced with the goal of standardizing the quantification of an AI model's vulnerability to parameter corruption.”

PVF is apparently “adaptable to different hardware fault models” and can be tuned for different models and tasks.
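As a rough illustration of how a metric like this could be measured, the Python sketch below estimates a PVF-style number by fault injection: flip one random bit in one random parameter, run the model, and count how often the output differs from the fault-free result. We are assuming that reading of the metric; the paper's exact definition may differ. The toy linear classifier, the single-bit-flip fault model and the names `predict` and `estimate_pvf` are all our own inventions.

```python
import random
import struct


def flip_bit(value, bit):
    """Invert one bit of the float32 encoding of `value` (hypothetical helper)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def predict(weights, x):
    """Toy model: a linear classifier returning class 1 or 0."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def estimate_pvf(weights, inputs, trials=2000, seed=0):
    """Fraction of injected single-bit faults that change the model's output.

    Assumed fault model: one uniformly random bit flip in one uniformly
    random parameter per inference.
    """
    rng = random.Random(seed)
    golden = [predict(weights, x) for x in inputs]  # fault-free outputs
    wrong = 0
    for _ in range(trials):
        idx = rng.randrange(len(weights))
        faulty = list(weights)
        faulty[idx] = flip_bit(weights[idx], rng.randrange(32))
        sample = rng.randrange(len(inputs))
        if predict(faulty, inputs[sample]) != golden[sample]:
            wrong += 1
    return wrong / trials


weights = [0.5, -0.25, 1.0]
inputs = [[1.0, 2.0, 0.5], [0.2, 3.0, 0.1], [-1.0, 0.5, 0.3], [2.0, 1.0, -1.5]]
print(f"estimated PVF: {estimate_pvf(weights, inputs):.3f}")
```

Swapping in a different fault model, such as flips confined to exponent bits or to a single layer, only means changing how the faulty copy is built, which is roughly the kind of adaptability the quote describes.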

“Furthermore, PVF can be extended to the training phase to evaluate the impact of parameter corruption on the model's convergence ability,” asserts Meta's team.

The paper explains that Meta has replicated incidents of silent corruption using “DLRM” – a tool used by the social media giant to generate personalized content recommendations. Under some circumstances, Meta's technicians found that four out of every thousand results would be wrong just because of a bit flip.

Presumably that's on top of the usual accuracy, or lack thereof, of LLMs.

The paper concludes by recommending that AI hardware designers consider PVF, to help them balance fault protection with performance and efficiency.

If this all sounds a little familiar, then your déjà vu is spot on. PVF builds on the Architectural Vulnerability Factor (AVF) – an idea described in 2003 by researchers at Intel and the University of Michigan in the US. ®
