By / par Rob Thacker (CASCA President)
(Cassiopeia – Winter / hiver 2021)
First off, I want to thank all the members of CASCA, as well as the staff, who devote their time to the operation of the Society. We cannot function without you; CASCA exists largely on the back of volunteer labour. The sum total of our employed staff is about 1/40th that of the AAS, because quite frankly we don't even add up to one full-time staff member. Everything that you help accomplish is quite literally above and beyond!
This report is incomplete in that I had hoped to provide a more detailed update on some internal discussions. Unfortunately, because one meeting was pushed so close to the holiday break and I still need to get approval on a couple of things, that update will have to wait until January. My apologies.
I want to wish you good health over the holiday season. I sincerely hope you can take time to replenish reserves and rejuvenate, at least to whatever extent is possible.
On High-Performance Computing and Sustainability
Forgive me the indulgence of talking about something that is close to my heart on two accounts: high-performance computing and sustainability. Following detailed analyses of emissions in Australian astronomy [1], motivated by prior discussions [e.g., 2], I’ve heard a number of people express concerns about the energy cost of high-performance computing (HPC) in astronomy. While I cannot possibly suggest a complete strategy in a short discussion, I can at least outline some of the key concerns and provide some useful background. This is an opinion piece, though, rather than a detailed analysis.
I will not spend any time talking about improvements in algorithms or software. I fully accept these are absolutely fundamental areas in which energy consumption improvements are possible. Indeed, we’re already witnessing a detailed discussion about this in astronomy [3]. A wider view of the whole issue across algorithms, software and hardware is already spurring detailed thinking within hardware design circles [4].
From the outset I think we need to be clear that energy usage isn’t a discussion that happens in isolation; it draws on value judgments. If it were possible to build computing centres running entirely on renewable energy, then the actual energy usage might be considered moot. Indeed, some countries (Scotland, for example) have made a strategy of offering data centre companies access to renewable energy to reduce their carbon footprint. Of course, in practice such approaches disguise the fact that other energy usage continues to rely on traditional, more polluting energy sources. After all, getting to “net-zero” means doing so across multiple sectors. Collective action is needed, and we have to play our part in that given the emissions-intensive nature of our profession [2].
Global awareness of energy consumption in data centres and HPC is growing; many of you have likely heard concerns about Bitcoin’s incredible energy usage. That said, the potential impact of this demand has been recognized: over the past 15 years there has been a steady interest in improving energy efficiency in computing. That’s been achieved through multiple approaches, but I’ll focus on just two: data centre design and chip/server design.
Starting with chip/server design, if we look at energy efficiency increases over several decades the results are quite staggering. Comparisons are normally provided going back to the earliest machines, but let’s consider the Cray-1 from the 1970s, what many consider the first supercomputer. With a performance of 160 Mflops and a power draw of 115 kW, its efficiency works out to roughly 1400 flops per watt. To provide a fair comparison we need to consider entire systems, including their multiple overheads. Fortunately, Dr Wu Feng has been championing these measurements for many years and is the custodian of the “Green500” alternative to the “Top500” list of supercomputers. The most efficient systems available today achieve 39.4×10⁹ flops per watt, albeit using hardware optimized for Linpack calculations (more on that in a moment). That is an improvement of a factor of 28 million in under 50 years! For comparison, the peak speed of a single CPU die (I think the 48-core Fujitsu A64FX vector processor is the most apt comparison) has only increased by a factor of around 50,000 relative to the Cray-1.
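To make the arithmetic explicit, here is the back-of-envelope calculation behind those figures, a rough sketch using only the rounded numbers quoted above:

```python
# Back-of-envelope efficiency comparison using the rounded figures quoted above.
cray1_flops = 160e6        # Cray-1 peak performance: 160 Mflops
cray1_power_w = 115e3      # Cray-1 power draw: 115 kW
cray1_eff = cray1_flops / cray1_power_w       # ~1.39e3 flops per watt ("about 1400")

green500_eff = 39.4e9      # most efficient systems today, in flops per watt

print(f"Cray-1 efficiency: {cray1_eff:.0f} flops/W")
print(f"Improvement factor: {green500_eff / cray1_eff:.2e}")  # ~2.8e7, i.e. ~28 million
```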
This dramatic increase in hardware energy efficiency has come about for multiple reasons. Firstly, for many years “Dennard scaling” meant that each time circuitry was shrunk by a factor of 0.7x in linear dimension, the overall power requirement would drop by roughly 50% (essentially, power was proportional to the area of the transistor). This allowed CPU designers to increase frequencies for many years, but by around 2005 the scaling broke down due to fundamental issues with “leakage” of power through the transistor substrate. Thus, the era of increasing clock speeds came to a halt.
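For those who like to see the scaling written out, a simplified sketch of the classic constant-field argument (assuming capacitance, voltage and operating frequency scale as $C \propto s$, $V \propto s$ and $f \propto 1/s$ with the linear dimension $s$) gives

$$ P \;\propto\; C\,V^{2} f \;\propto\; s \cdot s^{2} \cdot \tfrac{1}{s} \;=\; s^{2}, \qquad s \approx 0.7 \;\Rightarrow\; P \approx 0.49\,P_{\mathrm{old}}, $$

i.e., each process shrink roughly halved the power of a fixed circuit even as its clock frequency could rise.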
Realizing that clock speeds could no longer increase, designers turned their efforts to packaging more cores onto a single die, not a trivial feat when one considers the memory design needed. Equally importantly, they also needed to begin improving the energy efficiency of individual CPU cores. Remarkably, through a series of power management and packaging advances, the energy efficiency of CPUs has improved by a factor of at least 20 since 2014.
One could make a strong argument that since 2010, energy efficiency has become the key design parameter for CPUs, to a considerable extent replacing overall performance concerns. That’s definitely the case for mobile devices. Around 2011, HPC experts were widely discussing the fact that building exascale systems with a “reasonable” power draw of 20 MW would require achieving 50×10⁹ flops per watt. The most power-efficient systems are not far from this value today, but the largest machines are still a factor of three or four away.
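As a quick consistency check of that target, nothing more than the division spelled out:

```python
# Quick consistency check of the exascale power target quoted above.
exaflop = 1e18           # one exaflop = 10^18 floating point operations per second
power_budget_w = 20e6    # "reasonable" power draw of 20 MW
required_eff = exaflop / power_budget_w
print(f"Required efficiency: {required_eff:.1e} flops/W")  # 5.0e+10, i.e. 50 x 10^9
```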
While I’ve focused on CPUs, quietly hiding in the background is another potential revolution in efficiency: the rapid development of customized hardware/accelerators. Smartphones are already moving toward more and more specialized accelerators, with hardware designed for specific operations. We’ve all seen general-purpose GPU programming take off, and optimized matrix hardware is now common – indeed it is a significant part of the improvement in flops per watt on the Linpack results for the Top500. All of these approaches improve energy efficiency by tailoring hardware to specific calculations. The rise of freely and easily available CPU instruction set architectures (RISC-V, for example) may make domain-specific architectures far more commonplace than they are today.
A further factor that has helped improve flops per watt in HPC systems is optimized design of cooling systems and integrated system management. There are also well-documented examples of waste heat being used for heating in some jurisdictions, to the point where papers have been written about which locations would be most appropriate for this approach to maximize the benefits of the exported heat. The data centres that are the “hidden” power behind the cloud use these kinds of optimizations extensively.
That 28-million-fold improvement in energy efficiency since the 1970s also benefits our desktops. Modern processors, especially those based on mobile designs, have far lower idle power than those from even 2016. Unless you really need to do “big compute” or GPGPU on your desktop, you can buy a mini PC that idles at a handful of watts yet, with 8 cores, has enough power to take on some pretty heavy calculations. And as always, turning things off is a good thing, although systems are getting better and better at doing that themselves.
While energy efficiency is clearly better in modern HPC systems, we do face the issue of unmet demand. Analyses can always be made more complex; simulations can always be made bigger for better resolution. In practice, as with any other facility, the amount of available time is limited. For HPC, at least in Canada, we’ve also been limited by practical issues around power supply; I doubt we will ever see academic systems using much more than 5 MW here. Commercial data centres in Canada with close to ten times that power requirement already exist, although we need to put those in the context of the large communities they serve.
Astronomy is moving towards ever-increasing use of HPC, as are multiple research areas. We have a duty to understand how that fits into the wider picture of sustainability (and I’m not even going to open the offset discussion). My own bias is that, among other things, we need to think carefully about:
- Quotas and envelopes. To a certain extent they will happen naturally with equipment maintained by the Alliance (formerly NDRIO/Compute Canada).
- Optimizing calculations by using the right tools – not always easy, as there is inertia in individual knowledge and skills. To a limited extent, review of HPC resource applications used to handle this; for more open-access systems that may not be possible.
- Being aware of accidentally creating unsustainable situations. This happened with Compute Canada through the growth of the user base. The same thing could happen with interfaces that disguise overall computing use (a rough sketch of what energy-aware job accounting might look like follows this list).
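To illustrate the kind of accounting I have in mind, here is a minimal sketch of how a job’s energy and emissions could be estimated. The function and every number in it (per-core power, the PUE overhead factor, the grid carbon intensity) are hypothetical placeholders chosen purely for illustration, not values from any real system or allocation:

```python
# An illustrative sketch of energy-aware job accounting. Every number below is a
# hypothetical placeholder, not a measurement of any real system or allocation.

def job_footprint(core_hours, watts_per_core=10.0, pue=1.3, grid_kg_co2_per_kwh=0.1):
    """Return (kWh, kg CO2) for a job, from an assumed per-core power draw, a
    data-centre overhead factor (PUE), and an assumed grid carbon intensity."""
    kwh = core_hours * watts_per_core * pue / 1000.0
    return kwh, kwh * grid_kg_co2_per_kwh

# Example: a 100,000 core-hour run under these assumed numbers.
kwh, kg_co2 = job_footprint(100_000)
print(f"{kwh:.0f} kWh, roughly {kg_co2:.0f} kg CO2")
```

Framing allocations in these terms, rather than in core-hours alone, is one way an “envelope” could be made visible to users.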
I have to take a somewhat optimistic view that, ultimately, we will transition to largely renewable energy generation and that these concerns will become less significant. But until such time as we have reasonable control over emissions, it’s important to remember that the energy usage of our field is fundamentally a value question.
Wishing you all happy holidays, and may you share many moments of joy with friends and loved ones,
Rob
[1] “The imperative to reduce carbon emissions in astronomy,” Stevens, A. R. H., et al., Nature Astronomy, 4, 843 (2020).
[2] “Astronomy in a Low-Carbon Future,” Matzner, C. D., et al., arXiv:1910.01272 (2019).
[3] “The ecological impact of high-performance computing in astrophysics,” Portegies Zwart, S., Nature Astronomy, 4, 819 (2020).
[4] “There’s plenty of room at the Top,” Leiserson, C. E., et al., Science, 368, eaam9744 (2020).