The cost of training machines is becoming a problem



Increased complexity and competition are part of it


The fundamental assumption of the computing industry is that number-crunching gets cheaper all the time. Moore’s law, the industry’s master metronome, predicts that the number of components that can be squeezed onto a microchip of a given size—and thus, loosely, the amount of computational power available at a given cost—doubles every two years.


For many comparatively simple AI applications, that means that the cost of training a computer is falling, says Christopher Manning, an associate director of the Institute for Human-Centered AI at Stanford University. But that is not true everywhere. A combination of ballooning complexity and competition means costs at the cutting edge are rising sharply.

Dr Manning gives the example of BERT, an AI language model built by Google in 2018 and used in the firm’s search engine. It had more than 350m internal parameters and a prodigious appetite for data. It was trained using 3.3bn words of text culled mostly from Wikipedia, an online encyclopedia. These days, says Dr Manning, Wikipedia is not such a large data-set. “If you can train a system on 30bn words it’s going to perform better than one trained on 3bn.” And more data means more computing power to crunch it all.

OpenAI, a research firm based in California, says demand for processing power took off in 2012, as excitement around machine learning was starting to build. It has accelerated sharply. By 2018, the computer power used to train big models had risen 300,000-fold, and was doubling every three and a half months (see chart). It should know—to train its own “OpenAI Five” system, designed to beat humans at “Defense of the Ancients 2”, a popular video game, it scaled machine learning “to unprecedented levels”, running thousands of chips non-stop for more than ten months.
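The arithmetic behind those figures can be checked with a short sketch. It assumes steady exponential growth, which is a simplification, and the function names are illustrative rather than taken from OpenAI's analysis.

```python
import math

def doublings_needed(growth_factor: float) -> float:
    """Number of doublings that multiply a quantity by growth_factor."""
    return math.log2(growth_factor)

def growth_over(months: float, doubling_period_months: float) -> float:
    """Growth factor after `months` if the quantity doubles every period."""
    return 2 ** (months / doubling_period_months)

# Figures quoted in the article.
AI_GROWTH = 300_000          # rise in training compute by 2018
AI_DOUBLING_MONTHS = 3.5     # doubling period for big-model training compute
MOORE_DOUBLING_MONTHS = 24   # Moore's law: a doubling every two years

doublings = doublings_needed(AI_GROWTH)
months = doublings * AI_DOUBLING_MONTHS
moore_growth = growth_over(months, MOORE_DOUBLING_MONTHS)

print(f"{doublings:.1f} doublings, taking about {months / 12:.1f} years")
print(f"Moore's law over the same span: roughly {moore_growth:.0f}-fold")
```

On these assumptions a 300,000-fold rise is about 18 doublings, or a little over five years at the quoted pace. Moore's law alone would deliver only around a six-fold gain in that time, which is why the bills at the cutting edge are climbing rather than falling.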


Exact figures on how much this all costs are scarce. But a paper published in 2019 by researchers at the University of Massachusetts Amherst estimated that training one version of “Transformer”, another big language model, could cost as much as $3m. Jerome Pesenti, Facebook’s head of AI, says that one round of training for the biggest models can cost “millions of dollars” in electricity consumption.

Help from the cloud


Facebook, which turned a profit of $18.5bn in 2019, can afford those bills. Those less flush with cash are feeling the pinch. Andreessen Horowitz, an influential American venture-capital firm, has pointed out that many AI startups rent their processing power from cloud-computing firms like Amazon and Microsoft. The resulting bills (sometimes 25% of revenue or more) are one reason, it says, that AI startups may make for less attractive investments than old-style software companies. In March Dr Manning’s colleagues at Stanford, including Fei-Fei Li, an AI luminary, called for the creation of a National Research Cloud, a cloud-computing initiative to help American AI researchers keep up with spiralling bills.

The growing demand for computing power has fuelled a boom in chip design and specialised devices that can perform the calculations used in AI efficiently. The first wave of specialist chips were graphics processing units (GPUs), designed in the 1990s to boost video-game graphics. As luck would have it, GPUs are also fairly well-suited to the sort of mathematics found in AI.


Further specialisation is possible, and companies are piling in to provide it. In December, Intel, a giant chipmaker, bought Habana Labs, an Israeli firm, for $2bn. Graphcore, a British firm founded in 2016, was valued at $2bn in 2019. Incumbents such as Nvidia, the biggest GPU-maker, have reworked their designs to accommodate AI. Google has designed its own “tensor-processing unit” (TPU) chips in-house. Baidu, a Chinese tech giant, has done the same with its own “Kunlun” chips. Alfonso Marone at KPMG reckons the market for specialised AI chips is already worth around $10bn, and could reach $80bn by 2025.

“Computer architectures need to follow the structure of the data they’re processing,” says Nigel Toon, one of Graphcore’s co-founders. The most basic feature of AI workloads is that they are “embarrassingly parallel”, which means they can be cut into thousands of chunks which can all be worked on at the same time. Graphcore’s chips, for instance, have more than 1,200 individual number-crunching “cores”, and can be linked together to provide still more power. Cerebras, a Californian startup, has taken an extreme approach. Chips are usually made in batches, with dozens or hundreds etched onto standard silicon wafers 300mm in diameter. Each of Cerebras’s chips takes up an entire wafer by itself. That lets the firm cram 400,000 cores onto each.
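What "embarrassingly parallel" means can be shown in a few lines. The sketch below is only an illustration: the workload (a sum of squares) and every function name are invented for the purpose, and Python threads stand in for hardware cores, so it shows the independence of the chunks rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, n_chunks):
    """Split values into at most n_chunks roughly equal, independent slices."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def sum_of_squares(chunk):
    """Work on one chunk; it needs nothing from any other chunk."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(values, n_workers=4):
    """Hand every chunk to its own worker, then combine the partial results."""
    chunks = chunked(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)

data = list(range(1_000))
total = parallel_sum_of_squares(data)
print(total)
```

Because no chunk depends on another, the only coordination needed is the final sum. That is the property that lets chip designers keep adding cores and expect the work to spread across them.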

Other optimisations are important, too. Andrew Feldman, one of Cerebras’s founders, points out that AI models spend a lot of their time multiplying numbers by zero. Since those calculations always yield zero, each one is unnecessary, and Cerebras’s chips are designed to avoid performing them. Unlike many tasks, says Mr Toon at Graphcore, ultra-precise calculations are not needed in AI. That means chip designers can save energy by reducing the fidelity of the numbers their creations are juggling. (Exactly how fuzzy the calculations can get remains an open question.)
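Both ideas can be imitated in software, though the sketch below makes no claim about how Cerebras's or Graphcore's hardware actually does it. The function names and sample numbers are invented for illustration. The first function skips multiplications that must yield zero, and the second rounds a number to 16-bit precision to show how little is lost.

```python
import struct

def dot_skipping_zeros(weights, inputs):
    """Dot product that skips any term with a zero factor.

    Returns the result and the number of multiplications performed.
    """
    total, multiplications = 0.0, 0
    for w, x in zip(weights, inputs):
        if w == 0 or x == 0:
            continue  # the product would be zero, so do no work
        total += w * x
        multiplications += 1
    return total, multiplications

def to_half_precision(value):
    """Round a float to 16-bit (IEEE 754 half) precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

weights = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0]
inputs = [3.0, 2.0, 7.0, 0.0, 0.5, 4.0]

result, work = dot_skipping_zeros(weights, inputs)
print(result, work)             # 2.0 from 2 multiplications instead of 6
print(to_half_precision(0.1))   # close to 0.1, but not exactly 0.1
```

In this toy case two-thirds of the multiplications vanish, and the 16-bit value differs from the original only after the third decimal place.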

All that can add up to big gains. Mr Toon reckons that Graphcore’s current chips are anywhere between ten and 50 times more efficient than GPUs. They have already found their way into specialised computers sold by Dell, as well as into Azure, Microsoft’s cloud-computing service. Cerebras has delivered equipment to two big American government laboratories.


“Moore’s law isn’t possible any more”


Such innovations will be increasingly important, for the AI-fuelled explosion in demand for computer power comes just as Moore’s law is running out of steam. Shrinking chips is getting harder, and the benefits of doing so are not what they were. Last year Jensen Huang, Nvidia’s founder, opined bluntly that “Moore’s law isn’t possible any more”.

Quantum solutions and neuromantics


Other researchers are therefore looking at more exotic ideas. One is quantum computing, which uses the counter-intuitive properties of quantum mechanics to provide big speed-ups for some sorts of computation. One way to think about machine learning is as an optimisation problem, in which a computer is trying to make trade-offs between millions of variables to arrive at a solution that minimises as many as possible. A quantum-computing technique called Grover’s algorithm offers big potential speed-ups, says Krysta Svore, who leads the Quantum Systems division at Microsoft.
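The article does not say how Grover's algorithm would be applied to machine learning, but its core trick, finding one marked item among N in roughly the square root of N steps, can be simulated on an ordinary computer for small N. The sketch below is such a simulation. The function name and the choice of 1,024 items are assumptions for illustration, and a classical simulation like this gains no speed itself.

```python
import math

def grover_search(n_items, marked, iterations=None):
    """Simulate Grover's algorithm classically on a list of amplitudes.

    Returns the probability of measuring the marked item at the end.
    """
    if iterations is None:
        # About (pi / 4) * sqrt(N) rounds are enough for one marked item.
        iterations = math.floor(math.pi / 4 * math.sqrt(n_items))
    # Start in an equal superposition over all items.
    amplitudes = [1 / math.sqrt(n_items)] * n_items
    for _ in range(iterations):
        # Oracle: flip the sign of the marked item's amplitude.
        amplitudes[marked] = -amplitudes[marked]
        # Diffusion: reflect every amplitude about the mean.
        mean = sum(amplitudes) / n_items
        amplitudes = [2 * mean - a for a in amplitudes]
    return amplitudes[marked] ** 2

probability = grover_search(n_items=1024, marked=42)
print(f"{probability:.3f}")
```

With 1,024 items the loop runs 25 times and the marked item is then found with near certainty, whereas checking items one by one would take about 512 tries on average. That quadratic gap is the "big potential speed-up".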

Another idea is to take inspiration from biology, which proves that current brute-force approaches are not the only way. Cerebras’s chips consume around 15kW when running flat-out, enough to power dozens of houses (an equivalent number of GPUs consumes many times more). A human brain, by contrast, uses about 20W of energy—about a thousandth as much—and is in many ways cleverer than its silicon counterpart. Firms such as Intel and IBM are therefore investigating “neuromorphic” chips, which contain components designed to mimic more closely the electrical behaviour of the neurons that make up biological brains.


Note: “brute force” means relying on physical strength rather than on intelligence and careful thinking.

For now, though, all that is far off. Quantum computers are relatively well-understood in theory, but despite billions of dollars in funding from tech giants such as Google, Microsoft and IBM, actually building them remains an engineering challenge. Neuromorphic chips have been built with existing technologies, but their designers are hamstrung by the fact that neuroscientists still do not understand what exactly brains do, or how they do it.


That means that, for the foreseeable future, AI researchers will have to squeeze every drop of performance from existing computing technologies. Mr Toon is bullish, arguing that there are plenty of gains to be had from more specialised hardware and from tweaking existing software to run faster. To quantify the nascent field’s progress, he offers an analogy with video games: “We’re past Pong,” he says. “We’re maybe at Pac-Man by now.” All those without millions to spend will be hoping he is right.



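Beyond programming languages and algorithmic optimisation, the article also brings up quantum computing. Its model differs from that of conventional machines with binary memory: a quantum machine's memory can exist in several states at the same time, and is represented by quantum bits (qubits). That may sound hard to grasp, so let me put it more plainly. If you have two bits, then at any one moment you can represent just one of four states:

(00), (01), (10), (11)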
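The contrast can be sketched in code. This is a rough illustration under stated assumptions: the function names are invented, and a plain dictionary of amplitudes stands in for a real quantum register. A classical register holds one entry from the list at a time; a two-qubit register is described by an amplitude for every entry at once.

```python
from itertools import product
import math

def classical_states(n_bits):
    """List every value an n-bit register can hold, one at a time."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

def uniform_superposition(n_qubits):
    """Map each basis state to an equal amplitude.

    An n-qubit register is described by 2**n amplitudes at once; the
    squared amplitudes are measurement probabilities and sum to 1.
    """
    states = classical_states(n_qubits)
    amplitude = 1 / math.sqrt(len(states))
    return {state: amplitude for state in states}

print(classical_states(2))   # ['00', '01', '10', '11']
register = uniform_superposition(2)
print(register)              # each of the four states has amplitude 0.5
```

Squaring each amplitude gives the chance of reading that state when the register is measured, here 25% apiece.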








