Greek team investigating ‘glitch’ in computer chips
One in 1,000 chips causes calculation errors: A University of Athens research team led by Professor Dimitris Gizopoulos investigates for Meta
Even the microchips in our computers are fallible, making “silent” errors that go unnoticed. This is where Athens University professor of computer architecture Dimitris Gizopoulos and his research team come in, having been tasked by Meta with assessing the scope of the problem and containing it.
Users rely on microchips to do their jobs properly when they look up information on a search engine, check their e-banking accounts and perform thousands of other such tasks, rarely, if ever, checking whether the result, the account balance or any other information is correct.
This trust was justified by experts who did the checking for us and found that the probability of an error was practically infinitesimal. That was the case, at least, until February 2021, when Meta (then Facebook) published a report stating that so-called “silent errors” were much more frequent, appearing in one out of every thousand computer chips. The finding caused a small earthquake in the tech community, one that only grew when Google went on to confirm it.
Meta did not waste any time in sending out a call to the international academic community in search of an answer to the problem.
The Athens University proposal was among five approved by Meta, with the others being from top-flight American universities Stanford, Carnegie Mellon and Northeastern, as well as Canada’s University of British Columbia.
“It is a considerable achievement that is telling of the high caliber of our universities and researchers, even in the field of cutting-edge technology,” says Professor Gizopoulos, noting that his team’s proposal was one of 62 from 54 universities around the world submitted to Meta.
“There are many reasons why a computer may not work properly. It could be that the microprocessor was not designed properly or was not put together properly; it could be environmental factors, like radiation and temperature; or it could be wear and tear from intense, long-term use,” he explains.
“All four of these reasons can create problems in how programs work, but this is not something new. What we knew, though, was that the problems were mostly related to the various external drives. We didn’t know that the problem was as extensive in the central processing unit (CPU).”
‘It’s nightmarish. It’s fascinating. It could even be a movie scenario’
Gizopoulos notes that computer hardware and software are equipped with code that spots and fixes errors and problems all the time, without the user being aware of them. These mechanisms can also send the user a “message,” like a blank screen, indicating that something is wrong and needs to be addressed. This is not the case, however, with the CPU, where errors are practically undetectable at the hardware level, hence “silent.”
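For readers curious what such self-checking looks like, here is a minimal sketch in Python, an illustration of the general idea rather than code from the research: a single parity bit added to stored data lets a later check notice that a bit has flipped, which is exactly the kind of safety net a raw arithmetic result inside the CPU does not carry.

```python
# Minimal illustration of an error-detecting code: an even-parity bit
# appended to a word of data reveals any single flipped bit later on.

def add_parity(bits):
    """Append a parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Return True if the word still has even parity (no error detected)."""
    return sum(word) % 2 == 0

stored = add_parity([1, 0, 1, 1, 0, 0, 1, 0])
print(parity_ok(stored))   # True: data intact

stored[3] ^= 1             # simulate a bit flipping in memory
print(parity_ok(stored))   # False: the corruption is caught and can be reported

# A number coming straight out of the CPU's arithmetic unit carries no such
# check bits, which is why a wrong sum or product can pass through "silently".
```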
“If I add five plus seven in Excel and it doesn’t give me 12, I know at once,” says Gizopoulos. “But we don’t use Excel for such simple arithmetic. I may type in 11,356 multiplied by 145.8 and it will give me a result that may determine whether I buy a car or tell me how much money I have in the bank. Do you ever check the result Excel gives you?”
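To make the professor’s example concrete: 11,356 multiplied by 145.8 comes to 1,655,704.8. The short sketch below, my own illustration with an arbitrarily chosen bit to corrupt, flips a single bit in the 64-bit floating-point result and produces a figure that is off by a couple of hundred yet looks every bit as believable on screen.

```python
import struct

correct = 11356 * 145.8    # 1,655,704.8 (up to normal floating-point rounding)

# Reinterpret the 64-bit float as raw bits, flip one bit, convert back.
raw = struct.unpack("<Q", struct.pack("<d", correct))[0]
corrupted = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << 40)))[0]

print(f"{correct:,.1f}")     # 1,655,704.8
print(f"{corrupted:,.1f}")   # off by 256, yet it looks perfectly plausible
```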
Meta and Google identified the problem because, unlike most users, they don’t have one laptop working for a few hours a day, but have tens of millions of machines, with 8-core CPUs, working 24/7 on myriad programs.
“The realization that one in a thousand CPU chips presents errors is mind-blowing. This means that in any office where there are thousands of chips at work, many are faulty and no one knows which ones,” says the Greek academic.
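To put that scale into rough numbers, an illustrative back-of-the-envelope calculation of my own that assumes the reported 1-in-1,000 rate and independent faults, not figures from the team:

```python
rate = 1 / 1000   # the reported rate: one faulty chip per thousand

for n_chips in (2_000, 100_000, 10_000_000):
    expected_faulty = n_chips * rate
    p_at_least_one = 1 - (1 - rate) ** n_chips
    print(f"{n_chips:>10,} chips: ~{expected_faulty:,.0f} expected faulty, "
          f"P(at least one) = {p_at_least_one:.1%}")
```

Even a pool of 2,000 chips has roughly an 86% chance of harboring at least one silently faulty CPU; at data-center scale, it is a certainty.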
“If the program running on the computer does not use the faulty arithmetic unit, then all of its calculations will be correct. If, however, it does, and the computer keeps carrying out the same operation, the result it produces is wrong in exactly the same way every time, so the mistake is hard to spot,” adds Gizopoulos.
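A hedged sketch of that behavior, using a made-up “stuck bit” to stand in for a defective multiplier: because the defect is deterministic, the chip agrees with itself run after run, and only a comparison against a different chip, or a trusted software reference, exposes the fault.

```python
def healthy_multiply(a, b):
    return a * b

def faulty_multiply(a, b):
    # Hypothetical defect: one output bit is always forced to 1, so the
    # unit returns the same wrong answer every time for these inputs.
    return (a * b) | (1 << 2)

x, y = 11356, 1458
print(faulty_multiply(x, y) == faulty_multiply(x, y))    # True: consistent with itself
print(faulty_multiply(x, y) == healthy_multiply(x, y))   # False: only a cross-check notices
```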
The simulations
“The rate of error occurrence depends on the hardware, the software and the conditions,” explains Gizopoulos. “It depends on the room temperature, the age of the machine, the altitude and other factors. We are talking about arithmetic operations that simply produce incorrect results without crashing the computer, and that no detection or correction code catches. It’s nightmarish. It’s fascinating. It could even be a movie scenario.

“Our main goal is to measure the extent of the problem and create tests that will detect the faulty chips. We are trying to simulate the problem in collaboration with chip manufacturers Intel and AMD and devise clever tests so that, when you use the chips in multiple machines, you can detect errors and avoid further use of the erroneous results they produce. The research collaboration has been ongoing for several months and is just one piece of the puzzle in solving this problem,” he says.
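The team’s actual test programs are not described in this article, but the general shape of such a screening test can be sketched: hammer the arithmetic units with a large number of operand patterns and compare every result against a reference known to be correct. In the hypothetical sketch below, the “device under test” is just a Python function, and an ordinary software recomputation stands in for the trusted reference.

```python
import random

def screen_multiplier(multiply, trials=100_000, seed=0):
    """Feed many random operand pairs to `multiply` and record any pair
    whose result disagrees with a trusted reference computation."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if multiply(a, b) != a * b:    # reference result
            failures.append((a, b))
    return failures

# A multiplier that silently drops one bit of its result is flagged quickly:
suspect = lambda a, b: (a * b) & ~(1 << 17)
print(f"{len(screen_multiplier(suspect))} failing operand pairs found")
```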
Gizopoulos cannot be certain about the size of the problem. “We may only be seeing the tip of the iceberg,” he warns. “The problem is definitely much bigger. Meta and Google had at least one customer who came back and told them that the computation they provided was incorrect because they checked it. How many others who never complained could be out there?”
And why does that matter to the average person, one might ask. Why do Gizopoulos and his team need to take on the role of detectives for a problem identified in the data centers of Meta and Google? The answer is quite simple: We all rely on applications developed by these two companies. Moreover, we use the same microprocessor chips on a large scale in our everyday devices, including mobile phones, tablets, laptops and desktop computers. These chips inevitably “age” and suffer wear and tear, and they are susceptible to environmental conditions. Consequently, they can silently produce the same errors as the chips in Meta’s and Google’s data centers.
Tough equation
Knowing that one in 1,000 chips can potentially produce calculation errors compels us to re-evaluate numerous aspects of our lives, given our increasing reliance on digital technologies. The fact that it is so challenging to pinpoint which chips are problematic only intensifies those concerns. “A simple way to verify is to run the computations on two different processors and compare the results,” says the professor. Such methods, however, take a toll on device performance and energy consumption.

“Detecting this issue in the CPUs we use daily is therefore by no means simple or inexpensive. In applications with demanding reliability requirements, the associated costs are high. For example, airplanes employ three CPUs operating in parallel to execute the same task. Similarly, banks verify calculation results by leveraging their surplus computational capacity. Nevertheless, certain applications face a substantial risk with limited avenues for mitigation.

“The problem is mainly one of scale. When a CPU is used at great scale, under heavy load and for intricate, time-consuming computations, the potential for issues arises. In supercomputers, which are used for things like vaccine research, the impact can be significant: computation errors there can lead to serious mistakes, with potentially catastrophic consequences in an emergency.”
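The airplane arrangement the professor describes is known in fault-tolerant design as triple modular redundancy: the same task runs on three independent units and a majority vote decides which answer to trust. A minimal sketch of just the voting step, illustrative only:

```python
from collections import Counter

def majority_vote(results):
    """Return the answer reported by at least two of the redundant units;
    refuse to answer if all of them disagree."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: results cannot be trusted")
    return value

# Two healthy units outvote the one producing a silent error:
print(majority_vote([1655704.8, 1655704.8, 1655448.8]))   # 1655704.8
```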
According to Gizopoulos, there is no need to panic over “silent errors.” He explains that the problem will never be completely solved, given the growing complexity of processor design and manufacturing techniques; however, those willing to invest in finding solutions will eventually succeed. For now, the focus is on accurately assessing the magnitude of the problem and taking measures to mitigate it. The rest of us can hope that, as in the endings of so many detective movies, the detective will uncover the culprits, allowing us, as users and viewers, to continue sleeping peacefully.