Raja Koduri, head of Intel’s Accelerated Computing Systems and Graphics Group (AXG), is now the engineer responsible for designing Intel’s most powerful chips. Historically a graphics card expert, his goal is now to give Intel the most efficient GPU on the market. LeMagIT had the opportunity to ask him about his strategy, the secrets of his products and what Intel plans to do in France, where its new R&D center for its most powerful chips is due to open next year.
LeMagIT: Can you explain in a few words your new strategy regarding the development of semiconductors?
Raja Koduri: We are working to deliver supercomputing at minimal energy cost. Supercomputing is no longer limited to scientific simulations. Today, all companies need to apply artificial intelligence to a variety of activities outside the data center. That is the need we are addressing.
The ultimate example is self-driving cars, which have to make decisions and can use only a fraction of their battery power to perform their calculations. If you look at the amount of processing such vehicles have to perform very quickly, you realize it is like placing a complete supercomputer on four wheels. But you can’t put a traditional data center in a car.
So the performance of chips needs to improve. The key is for memory, which holds the data, and the cores, which process it, to run at the same speed. The difference between a standard application and an HPC-type calculation is that in the second case the data sets take up a lot of space. It is no longer a question of designing a small cache attached to the processor cores, but of attaching all the necessary memory to the processor, bringing them closer together and placing them in the same chip.
LeMagIT: You are not the first to connect multiple circuits on the same chip. How do your processors differ from the SoCs (system on a chip) found among your competitors?
Raja Koduri: Achieving maximum bandwidth between all of the memory and the compute cores is not easy. It requires special packaging that supports data transfers in TB/s. Our competitors offer MB/s or GB/s.
We have achieved that. In Ponte Vecchio, our next GPU, which exemplifies how we will design chips from now on, the packaging allows all our circuits to communicate with each other at 8 TB/s. It’s a record. Ponte Vecchio reaches 1 petaflops of matrix compute and 64 teraflops of vector compute. This means that the speed you get today in a data center from an entire rack of servers now fits in the palm of your hand.
Intel is the world’s most advanced manufacturer when it comes to integrating different circuits into one chip. We are replacing conventional connections with vertical and horizontal links a few tens of micrometers wide, so thin that there is very little signal loss in the circuit.
Our Asian competitors, especially TSMC, have surpassed us in the fineness of circuit etching. But this assembly technology will allow us to catch up and even retake the lead in the technological race.
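To put the orders of magnitude quoted in this answer side by side, here is a minimal Python sketch. It only does arithmetic on the figures given in the interview (8 TB/s between circuits, 1 petaflops of compute); the 1 TB working set and the function names are assumptions chosen purely for illustration, and the result is not a benchmark of any real chip.

```python
# Illustrative arithmetic only, using figures quoted in the interview.
PEAK_FLOPS = 1e15                 # 1 petaflops, as quoted
INTERCONNECT_BYTES_PER_S = 8e12   # 8 TB/s between circuits, as quoted
DATASET_BYTES = 1e12              # hypothetical 1 TB working set (assumption)


def transfer_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Time needed to move num_bytes once over a link of the given speed."""
    return num_bytes / bytes_per_second


def flops_per_byte(peak_flops: float, bytes_per_second: float) -> float:
    """Operations the cores must perform per byte moved to stay busy."""
    return peak_flops / bytes_per_second


if __name__ == "__main__":
    # Compare the three orders of magnitude mentioned: MB/s, GB/s, TB/s.
    for label, rate in [("1 MB/s", 1e6), ("1 GB/s", 1e9), ("8 TB/s", 8e12)]:
        seconds = transfer_seconds(DATASET_BYTES, rate)
        print(f"{label:>7}: {seconds:,.3f} s to move 1 TB")

    ratio = flops_per_byte(PEAK_FLOPS, INTERCONNECT_BYTES_PER_S)
    print(f"Operations per byte at peak: {ratio:.0f}")
```

Even at 8 TB/s, the cores would need to perform on the order of a hundred operations per byte transferred to run at their quoted peak, which is one way to read the emphasis on keeping memory and cores on the same chip.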
LeMagIT: So your technical excellence ultimately comes down only to your superior know-how in interconnecting circuits on a chip?
Raja Koduri: No. Interconnects a few tens of micrometers wide contribute only 10 to 20% of overall performance. What doubles, triples or quadruples your efficiency is the combination of circuits you assemble on the chip and your ability to program them.
The latest transistors offered by our competitors are only used to make processor cores. They are not cost-effective for making memory. With our techniques, you can combine the most efficient circuits from each generation of transistors and thus achieve the best performance, at the best price, with the best delivery time on the market.
One important point is that we invest heavily in the ability to offer our products in a variety of configurations. Once you know how to make the most efficient chip from different circuits, you only need to swap one type of circuit, or reduce the number of circuits, to deliver a chip suited to a PC, a car or embedded equipment.
LeMagIT: You mentioned an advantage regarding application development?
Raja Koduri: We are also developing a development tool, the oneAPI platform, which abstracts scalar, vector and matrix computation. So you no longer have to develop your application for a specific configuration. Your application will make the most of the underlying hardware.
Take the example of autonomous vehicles. A manufacturer will typically train its artificial intelligence in the cloud, on servers equipped with a single chip with multiple matrix cores and integrated HBM memory for maximum performance. On the other hand, to decide what to do in the car, you no longer need such high memory bandwidth or so many matrix cores. Most importantly, you will want enough power-efficient cores to make real-time decisions without affecting the car’s battery life. Of course, you will have simulated the car’s reactions beforehand, in the cloud.
With the same industrial design, we provide both chip configurations and the same tools for writing every piece of code.
LeMagIT: Concretely, what will these fully integrated chips change in server design? And how do you develop applications?
Raja Koduri: This modular design allows great freedom in server architecture. For example, you may have servers that are completely devoid of memory modules, because you will have installed chip models in which memory is integrated, typically some mix of Ponte Vecchio GPUs and Xeon Sapphire Rapids + HBM processors.
Obviously, a chip that integrates memory costs more than a chip that uses an external module. But you need to think about the scale of your server. If you have a large high-performance computing demand, chips with integrated memory will allow you to use servers that take up less space in your data center and consume less energy, with the advantage of a fixed cost.
We will offer a special series of processors, adapted to the server designs that large cloud providers create to optimize their data centers.
Conversely, it is certain that chips with integrated HBM memory will not give you much advantage if you only want to run applications on virtual machines. But our industrial capabilities will allow us to simply offer other configurations, adapted to each use.
In 2024, with our Falcon Shores project, we will go one step further by combining CPU cores and GPU cores in the same chip. We will be able to freely change the ratio of cores to on-board memory depending on usage.
LeMagIT: If you do a lot of processing on a single chip, aren’t you increasing the risk of failure?
Raja Koduri: First, before releasing a chip, we go through a number of verification steps to ensure that hardware failures do not occur. And this validation is probably more thorough at the scale of a single chip than at the scale of a whole server rack. But, above all, the more you integrate, the more reliable you become. Hardware failures are caused by mechanical effects, such as vibrations, which affect connections.
This is another of Intel’s strengths: we are not just an innovation company, we are a manufacturing company. I would say that talent is 1% inspiration and 99% perspiration. That 99% of perspiration is exactly what your question is about: we think about how reliable our circuits are in different situations, at different temperatures, in cars, or under shocks, as in drones.
We even go so far as to ensure that our chips do not require fans, whose rotation would cause physiological problems for the technical teams. At Intel, designing a chip takes four years and involves 3,000 to 4,000 engineers working down to the smallest detail.
LeMagIT: In order to fully accelerate an application, don’t we have to speed up everything around the chip? The network, typically?
Raja Koduri: It is to solve this problem that we are investing today in a research and development center in France. One of our most important current research projects covers everything related to photonics, such as connecting optical fiber directly to the silicon circuit to eliminate signal loss. You have, for example, many research laboratories in Europe whose work on optics is of great interest to us.
LeMagIT: Precisely, can you tell us more about this R&D center that you are going to set up in France?
Raja Koduri: Its purpose is, more generally, to think about architecture as a whole. Throughout my career, I have always worked on the boundary between hardware and software. This is where real architectural design comes in: understanding and modeling how a workload behaves, and then creating a design in which all the hardware aspects are best arranged for that workload.
What processing can we offload entirely onto a chip? How? Any processing requires interaction with the surrounding system. In Europe, you want to invest in new energy sources and nuclear fusion reactors, and there is industrial demand for AI. We believe there is huge potential in studying how the applications that succeed in Europe work and how the corresponding designs can be implemented.
We are going to open many sites in France for research on transistors, circuits, systems, software and so on. We are going to train young people in these fields. They will liaise directly with our research and development centers in the United States. We are currently studying the training programs we can set up in partnership with local universities.
To go further: Details of Ponte Vecchio
After this interview, LeMagIT learned that a Ponte Vecchio chip is actually made up of circuits, or “tiles” as Intel calls them, that come from different production lines. This allows Intel to draw on resources it does not have in-house, and also helps it mitigate shortages when needed by changing suppliers.
The 47 tiles that make up Ponte Vecchio are, in principle:
- 1 Foveros base layer for vertical and horizontal interconnection, at 36 micrometers, from an Intel factory
- 8 EMIB horizontal interconnection modules, at 55 micrometers, from an Intel factory
- 8 HBM2e memory circuits of 16 GB each, etched at 10 nm by Intel
- 16 compute circuits, each with 8 Xe-HPC cores, etched at 5 nm by TSMC
- 2 Xe-Link routing circuits, etched at 7 nm by TSMC
- 12 L2 cache circuits of 34 MB each.
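
As a quick sanity check on the list above, the following Python sketch tallies the tiles and derives the aggregate figures they imply. The counts and per-tile values are taken as reported in this article; the `TileGroup` class and the helper names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileGroup:
    """One line of the tile list: a kind of tile and how many there are."""
    name: str
    count: int
    per_tile: int = 0   # capacity or core count per tile, 0 if not stated
    unit: str = ""


# Values as reported in the article, not independently verified.
TILES = [
    TileGroup("Foveros base", 1),
    TileGroup("EMIB", 8),
    TileGroup("HBM2e", 8, 16, "GB"),
    TileGroup("Xe-HPC compute", 16, 8, "cores"),
    TileGroup("Xe-Link", 2),
    TileGroup("L2 cache", 12, 34, "MB"),
]


def total_tiles(groups: list[TileGroup]) -> int:
    """Sum the tile counts across all groups."""
    return sum(group.count for group in groups)


def aggregate(groups: list[TileGroup], name: str) -> int:
    """Total capacity (or cores) for the named group: count x per-tile value."""
    group = next(g for g in groups if g.name == name)
    return group.count * group.per_tile


if __name__ == "__main__":
    print("Total tiles:", total_tiles(TILES))
    for group in TILES:
        if group.per_tile:
            print(f"{group.name}: {aggregate(TILES, group.name)} {group.unit}")
```

With the figures as listed, the counts do add up to the 47 tiles announced, and they imply 128 GB of HBM2e, 128 Xe-HPC cores and 408 MB of L2 cache in total.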