Hacker News

rbanffy
Telum II at Hot Chips 2024: Mainframe with a Unique Caching Strategy chipsandcheese.com

jfindley 9 hours ago

It's a shame the article chose to compare solely against AMD CPUs, because AMD and Intel have very different L3 architectures. AMD organises its cores into groups called CCXs, each of which has its own relatively small L3 cache. For example, the Turin-based 9755 has 16 CCXs, each with 32 MB of L3 cache: far less cache per core than the mainframe CPU being described. In contrast, Intel uses an approach that's a little closer to the Telum II being described: a Granite Rapids AP chip such as the 6960P has 432 MB of L3 cache shared between 72 physical cores, each with its own 2 MB L2 cache. That's still considerably less cache, but it's not quite as stark a difference as the picture painted by the article.
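
To put rough per-core numbers on that (back-of-the-envelope only; the 8-cores-per-CCX figure and Telum II's ~36 MB of L2 per core are my assumptions from public spec sheets, not from the comment above):

    #include <stdio.h>

    int main(void) {
        /* Cache capacity visible to a single core, from the figures above. */
        double amd_l3_per_ccx = 32.0, amd_cores_per_ccx = 8.0; /* EPYC 9755, assuming 8 cores/CCX */
        double intel_l3 = 432.0, intel_cores = 72.0;           /* Xeon 6960P */
        double telum_l2 = 36.0;                                /* Telum II private L2, reported figure */

        printf("AMD 9755:    %.1f MB shared L3 per core\n", amd_l3_per_ccx / amd_cores_per_ccx);
        printf("Intel 6960P: %.1f MB shared L3 per core\n", intel_l3 / intel_cores);
        printf("Telum II:    %.1f MB private L2 per core\n", telum_l2);
        return 0;
    }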

This doesn't really detract from the overall point: stacking a huge per-core L2 cache and using cross-chip reads to emulate L3 with clever saturation metrics and management is very different from what any x86 CPU I'm aware of has ever done, and I wouldn't be surprised if it works extremely well in practice. It's just that it would have made a stronger article, IMO, if it had instead compared dedicated L2 + shared L2 (IBM) against dedicated L2 + shared L3 (Intel), instead of dedicated L2 + sharded L3 (AMD).

rayiner 8 hours ago

Granite Rapids is also a better example because it's an enterprise processor with a huge monolithic die (almost 600 square mm).

A key distinction, however, is latency. I don't know about Granite Rapids, but sources show that Sapphire Rapids had an L3 latency around 33 ns: https://www.tomshardware.com/news/5th-gen-emerald-rapids-cpu.... According to the article, the L2 latency in the Telum II chips is just 3.8 ns (about 21 clock cycles at 5.5 GHz). Sapphire Rapids has an L2 latency of about 16 clock cycles.
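
Latency figures like these usually come from a dependent pointer chase: each load's address depends on the previous load's result, so the time per hop approximates the latency of whatever level of the hierarchy the working set lands in. A minimal sketch (my own illustration; error checks elided):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t n = 1 << 20;                    /* 1M pointers = 8 MiB: likely past L2 on x86 */
        size_t *next = malloc(n * sizeof *next);
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo's algorithm: shuffle the array into one random cycle. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t p = 0;
        long hops = 100000000;                 /* 1e8 dependent loads */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < hops; i++) p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per load (sink: %zu)\n", ns / hops, p); /* print p so the loop isn't elided */
        free(next);
        return 0;
    }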

IBM's cache architecture enables a different trade-off in balancing the L2 versus the L3. In Intel's traditional architecture, the shared L3 is inclusive, so it has to be at least as big as the combined L2s (and preferably a lot bigger). That weighs in favor of making the L2 smaller, so most of your on-chip cache is actually L3. But L3 is always going to have higher latency. IBM's design improves single-thread performance by allowing most of the on-chip cache to be lower-latency L2.

elzbardico 8 hours ago

It must be fun being a hardware engineer for IBM mainframes: cost constraints for your designs can mostly be left aside, as there's no competition, and your remaining customers have been domesticated into paying top dollar every upgrade cycle, and frankly, they don't care.

Cycle times are long enough so you can thoroughly refine your design.

Marketing pressures are probably extremely well informed, as anyone working on mainframe marketing is probably either an ex-engineer or nearly an engineer by osmosis.

And the product is different enough from anything else that you can try novel ideas, but not so different that your design skills are useless elsewhere, or that you can't leverage others' advances.

bgnn 7 hours ago

They fund a lot of R&D in house and let people try crazy new ideas. Too bad it's just for a niche product.

They have a famous networking (optical and wireline) group doing a lot of state-of-the-art research, and they deploy these in their mainframe products.

There's no other company like it. They are, in a sense, the exact opposite of Apple, where all HW engineers are pushed toward impossible deadlines and solutions that save the day. Most in-house-developed IP then isn't competitive enough and in the end doesn't make it into production (like their generations of 5G modems and the IP inside them, such as data converters).

erk__ 8 hours ago

There is also no other place that will just implement conversions between UTF formats, compression, or various hashes and other crypto in hardware like they do.

Philpax 7 hours ago

I'm pretty sure that at least a few of those are implemented in modern x86 / ARM CPUs? As an immediate example: https://en.wikipedia.org/wiki/AES_instruction_set
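
For instance, a minimal sketch of driving AES-NI from C intrinsics: a single toy round with a made-up key, not a real key schedule (compile with gcc -maes):

    #include <stdio.h>
    #include <wmmintrin.h>  /* AES-NI intrinsics */

    int main(void) {
        /* One hardware AES round: ShiftRows + SubBytes + MixColumns + AddRoundKey.
           Toy values only; a real cipher runs 10-14 rounds over a proper key schedule. */
        __m128i block = _mm_set_epi32(0x00112233, 0x44556677, 0x08090a0b, 0x0c0d0e0f);
        __m128i round_key = _mm_set1_epi32(0x0f0f0f0f);
        block = _mm_aesenc_si128(block, round_key);

        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, block);
        for (int i = 0; i < 16; i++) printf("%02x", out[i]);
        putchar('\n');
        return 0;
    }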

oasisaimlessly 5 hours ago

Another, introduced around the same time (SHA instruction set): https://en.wikipedia.org/wiki/SHA_instruction_set

sillywalk 3 hours ago

Nitpick: the SPARC M7/M8 have hardware compression.

rbanffy (OP) 7 hours ago

> Cost constraints for your designs can be mostly be left aside, as there's no competition

I don’t think they neglect the migration of workloads to cloud platforms. Mainframes can only cost so much before it’s cheaper to migrate the workloads to other platforms and rewrite some SLAs. They did a ton of work on this generation to keep the power envelope in the same ballpark as the previous generation’s, because that’s what their clients were concerned about.

jonathaneunice 12 hours ago

Virtual L3 and L4 swinging gigabytes around to keep data at the hot end of the memory-storage hierarchy even after L2 or L3 eviction? Impressive! Exactly the kind of sophisticated optimization you should build when you have billions of transistors at your disposal. Les Bélády's spirit smiles on.
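
For anyone who hasn't run into Bélády: his MIN policy evicts the block whose next use lies farthest in the future. It requires knowing the future, so it's a yardstick for real replacement policies rather than something hardware can implement. A toy sketch (my own illustration, not IBM's algorithm):

    #include <stdio.h>

    #define WAYS 3

    /* Index of the next reference to `block` at or after `from`; past-the-end if never. */
    static int next_use(const int *refs, int n, int from, int block) {
        for (int i = from; i < n; i++)
            if (refs[i] == block) return i;
        return n + 1;
    }

    int main(void) {
        int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        int n = sizeof refs / sizeof *refs;
        int cache[WAYS], filled = 0, misses = 0;

        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int j = 0; j < filled; j++)
                if (cache[j] == refs[i]) { hit = 1; break; }
            if (hit) continue;
            misses++;
            if (filled < WAYS) { cache[filled++] = refs[i]; continue; }
            int victim = 0, farthest = -1;   /* evict the line reused farthest away */
            for (int j = 0; j < WAYS; j++) {
                int d = next_use(refs, n, i + 1, cache[j]);
                if (d > farthest) { farthest = d; victim = j; }
            }
            cache[victim] = refs[i];
        }
        printf("%d misses in %d accesses\n", misses, n);
        return 0;
    }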

zozbot234 8 hours ago

Virtual L3 and L4 look like a bad deal today, since SRAM cell scaling has stalled quite badly on recent fabrication nodes. It's quite possible that future chip designs will want to use eDRAM at least for L4 cache, if not also for L3, and have smaller low-level caches where "sharing" will not be as useful.

dragontamer 8 hours ago

Does it?

Recycling SRAM when it becomes more precious seems like a better strategy than letting it sit idle on otherwise-sleeping cores.

adgjlsfhk1 2 hours ago

SRAM scaling appears to have recovered on the new nodes with GAAFET and backside power delivery.

exabrial 12 hours ago

What languages are people still writing mainframe code in? In 2011, when I worked for a prescription (Rx) processor, COBOL was still the name of the game.

rbanffy (OP) 12 hours ago

There's also lots of Java, and IBM is making a big effort to port existing Unix utilities to z/OS (which is a certified UNIX). With Linux, the choices are the same as on other hardware platforms. I assume you'll find lots of Java and Python running on LinuxONE machines.

Running Linux, from a user's perspective, feels just like running a normal server with a fast CPU and extremely fast IO.

jandrewrogers 4 hours ago

How fast is “extremely fast”? Normal x86 Linux servers drive multiple 100s of GB/s of I/O these days. Storage is only slow because cloud.

rbanffy (OP) 4 hours ago

I never benchmarked it, but the latency feels very low. Mainframes don't have any local storage, so anything they use will be a box in a separate rack (or spanning multiple racks).

jiggawatts 11 hours ago

> extremely fast IO.

I wonder how big a competitive edge that will remain in an era where ordinary cloud VMs can do 10 GB/s to zone-redundant remote storage.

Cthulhu_ 11 hours ago

GB/s is one metric, but IOPS and latency are others that I'm assuming are Very Important for the applications that mainframes are being used for today.

imtringued 6 hours ago

IOPS is the most meaningless metric there is. It's just a crappy way of saying bandwidth with an implied sector size (1M IOPS at 4 KiB sectors is just 4 GB/s). 99% of software developers do not use any form of async file IO and therefore couldn't care less. Async file IO support in Postgres was released a month ago. It's that niche of a thing: even extremely mature software that could benefit heavily from it didn't bother implementing it until last month.
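
For reference, async file IO on Linux these days generally means io_uring. A minimal read sketch using liburing (my own illustration of the mechanism, not anything Postgres-specific; error handling elided, link with -luring):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        char buf[4096];

        io_uring_queue_init(8, &ring, 0);          /* small submission/completion rings */
        int fd = open("/etc/hostname", O_RDONLY);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0); /* queue an async read at offset 0 */
        io_uring_submit(&ring);                    /* hand it to the kernel; we could do other work here */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);            /* block until the read completes */
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }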

jiggawatts 6 hours ago

Microsoft SQL Server has been using async scatter/gather IO APIs for decades. Most database engines I've worked with do so.

Postgres is weirdly popular despite being way, way behind on foundational technology adoption.

rbanffy (OP) 4 hours ago

> Microsoft SQL Server has been using async scatter/gather IO APIs for decades. Most database engines I've worked with do so.

Windows NT has had asynchronous IO since its VAX days ;-)

> Postgres is weirdly popular despite being way, way behind on foundational technology adoption.

It's good enough, free, and performs well.

jiggawatts 41 minutes ago

I constantly hear about VACUUM problems and write amplification causing performance issues bad enough that huge users of it were forced to switch to MySQL instead.

FuriouslyAdrift 10 hours ago

Latency is much more important than throughput...

inkyoto 11 hours ago

Guaranteed sustained write throughput is a distinguishing feature of mainframe storage.

Whilst cloud platforms are the new mainframe (so to speak) and have all made great strides in improving their SLA guarantees, storage is still accessed over the network (plus extra moving parts – coordination, consistency, etc.). They will get there, though.

RetroTechie 10 hours ago

On-site.

Speed is not the only reason why some org/business would have Big Iron in their closet.

bob1029 12 hours ago

You can do a lot of damage with some stored procedures. SQL/DB2 capabilities often go overlooked in favor of virtualizing a bunch of Java apps that accomplish effectively the same thing with 100x the resource use.

exabrial 11 hours ago

Hah, anecdote incoming, but 100x the resource usage is probably accurate. Granted, 100x a human hair is still just a minuscule grain of sand, but those are the margins mainframe operators work in.

As one greybeard put it to me: Java is loosely typed and dynamic compared to COBOL/DB2/PL-SQL. He was particularly annoyed that the smallest numerical type in Java, a ‘byte’, was, quote, “a waste of bits”, and that Java was full of “useless bounds checking”, both of which were causing “performance regressions”.

The way mainframe programs are written, the entire thing is statically typed.

PaulHoule 10 hours ago

I knew mainframe programmers were writing a lot of assembly in the 1980s and they probably still are.

BugheadTorpeda6 5 hours ago

It's one of the last platforms that people still write a lot of hand-written assembly for. Part of this is down to the assembler being very ergonomic and there being a very capable macro system, with macros provided for most system services. Part of it is due to the OS predating C becoming popular, so there are no header files (just assembler macros) for many older system services (you can thunk them into C-compatible calls, but it's sometimes more of a headache than just writing in assembler). C is definitely becoming more popular lately on the platform, but you will still find a lot of people programming in assembler for a living in 2025. It's probably the only subfield of programming that uses handwritten assembly in that way outside of embedded systems.

rbanffy (OP) 7 hours ago

> the entire thing is statically typed.

Not always, but they do have excellent performance analysis and will do what they can to avoid slow code in the hottest paths.

thechao 9 hours ago

When I was being taught assembly at Intel, one of the graybeards told me that the greatest waste of an integer was to use it for a "bare" add, when it was a perfectly acceptable 64-wide vector AND. To belabor the point: he used ADD for the "unusual set of XORs, ANDs, and other funky operations it provided across lanes". Odd dude.

dragontamer 8 hours ago

Reverse engineering the mindset....

In the 90s, a cryptography paper was published that brute-forced DES (the standard encryption algorithm back then) more quickly by using SIMD-style operations across 64-bit registers on a DEC Alpha.

There is also the 80s Connection Machine, which was a 1-bit SIMD x 4096-lane supercomputer.

---------------

It sounds like this guy read a few 80s or 90s papers and then got stuck in that unusual style of programming. But there were famous programs back then that worked off of 1-bit SIMD x 64 lanes or x4096 lanes.

By the 00s, computers had already moved on to new patterns (and this obscure methodology was never mainstream). Still, I can imagine that if a student read a specific set of papers in those decades, this kind of mindset would stick.
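
The trick itself is tiny: treat one 64-bit register as 64 independent 1-bit lanes, so every bitwise instruction evaluates one logic gate for 64 separate inputs at once. A toy sketch of the idea (my own illustration):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Lane i (bit i) of each word belongs to input i, so these two
           instructions compute an AND gate and an XOR gate for 64
           independent inputs simultaneously. Bitsliced DES builds the
           whole cipher out of gates evaluated this way. */
        uint64_t a = 0xF0F0F0F0F0F0F0F0ULL;
        uint64_t b = 0xFF00FF00FF00FF00ULL;
        uint64_t and_lanes = a & b;
        uint64_t xor_lanes = a ^ b;
        printf("%016llx %016llx\n",
               (unsigned long long)and_lanes, (unsigned long long)xor_lanes);
        return 0;
    }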

formerly_proven an hour ago

That's bit-slicing (not the hardware technique).

uticus 9 hours ago

> ...it was a perfectly acceptable 64-wide vector AND.

sounds like "don't try to out-optimize the compiler."

thechao 8 hours ago

In 2025, for sure. In 2009 ... maybe? Of course, he had become set in his ways in the 80s and 90s.

pjmlp 11 hours ago

RPG, COBOL, PL/I, NEWP are the most used ones. Unisys also has their own Pascal dialect.

Other than that, there are Java, C, and C++ implementations for mainframes; for a while IBM even had a JVM implementation for IBM i (AS/400) that translated JVM bytecodes into IBM i ones.

Additionally, all of them have POSIX environments (think WSL-like, but for mainframes) that can run anything that runs on AIX, or a few selected enterprise distros like Red Hat and SUSE.

BugheadTorpeda6 8 hours ago

It sounds like you are referring to the AS/400 and its successors (common mistake, no biggie) rather than the mainframes being discussed here, which are successors of System/360 and use the Telum chips (as far as I am aware, they have never been based on POWER, unlike IBM i and AIX and the rest of the AS/400 heritage). RPG was never a big thing on mainframes, for instance; I've never come across it in about 10 years of working on them professionally. Same with NEWP, which I've never even heard of. And Java is pretty important on the platform these days, not an attempt from the past: it's been pushed pretty hard for at least 20 years and is kept well up to date with newer Java standards.

Additionally, the Unix on z/OS is not like WSL. There is no virtualization. The POSIX APIs are implemented as privileged system services reached via Program Calls (kind of like supervisor calls/system calls). It's more akin to a "flavor", like old-school Windows and OS/2, than to the modern WSL. You can interface with the system in the old-school MVS flavor and use those APIs, or use the POSIX APIs, and they are meant to work together (for instance, the TCP/IP stack and the web servers on the platform are implemented with the POSIX APIs, for obvious compatibility and porting reasons).

Of course, you can run Linux on mainframes, and that is big too, but usually when people refer to mainframe Unix they are talking about how z/OS is technically a Unix, which I don't think would count in the same way if it were just running a Unix environment under a virtualization layer. Windows can do that, and it's not a Unix.

sillywalk 5 hours ago

Nitpick:

Almost 20 years ago IBM had the eCLipz project to share technology between its POWER (System i and System p) and mainframe (System z) servers. I'm not sure if it counts as "based on", but

"the z10 processor was co-developed with and shares many design traits with the POWER6 processor, such as fabrication technology, logic design, execution unit, floating-point units, bus technology (GX bus) and pipeline design style."[0]

The chips were otherwise quite different, and obviously don't share the same ISA. I also don't know if IBM has kept this kind of sharing between POWER & Z.

[0] https://en.wikipedia.org/wiki/IBM_z10

rbanffy (OP) 3 hours ago

They must share some similarities - it's IBM, and both POWER and Z have access to the same patents and R&D. Apart from that, they are very different chips for very different markets.

Also, I'm sure there are many little POWER-like cores inside a z17 doing things like pushing data around. A major disappointment is that the hardware management elements are x86 machines, probably the only x86 machines IBM still sells.

BugheadTorpeda6 9 hours ago

For applications, or for middleware, systems, and utilities?

For applications, COBOL is king, closely followed by Java for stuff that needs web interfaces. For middleware, systems, utilities, etc.: assembly, C, C++, REXX, shell, and probably still some PL/X, though I'm not sure. You'd have to ask somebody working on the products that famously used it (like Db2). I'm pretty sure a public PL/X compiler was never released, so only IBM and possibly Broadcom have access to it.

COBOL is best thought of as a domain-specific language. It's great at what it does, but the use cases are limited; you would be crazy to write an OS in it.

fneddy 7 hours ago

There is a thing called LinuxONE; that's Linux for IBM mainframes. So basically you can run anything that runs on Linux (and can be compiled for s390x) on a mainframe.

fneddy 7 hours ago

And if you are really interested, there is the LinuxONE Community Cloud run by Marist College. If you do open source and want to add support for mainframes / s390x, you can get a free tier from them.

specialist 9 hours ago

Most impressive.

I would enjoy an ELI5 of the market differences between commodity chips and these mainframe-grade CPUs: design, process, supply chain, anything of interest to a general (nerd) audience.

IBM sells 100s of Z mainframes per year, right? Each can have a bunch of CPUs, right? So Samsung is producing 1,000s of Telums per year? That seems incredible.

Given such low volumes, that's a lot more verification and validation, right?

Foundries have to keep running to be viable, right? So does Samsung bang out all the Telums for a year in one burst, then switch to something else? Or do they keep producing a steady trickle?

Not that this info would change my daily work or life in any way. I'm just curious.

TIA.

detaro 8 hours ago

It's something they'll run a batch for occasionally, but that's normal. Fabs are not one long conveyor belt where a wafer goes in at the front, passes through a long sequence of machines, and falls out finished at the end. They work in batches, and machines need reconfiguring for the next task all the time (if a chip needs 20 layers, they don't have every machine 20 times), so mixing different products is normal. Low-volume products are going to be more expensive, of course, due to per-batch setup work being spread over fewer wafers.

In general, scheduling of machine time and transportation of FOUPs (the transport boxes holding a number of wafers, the basic unit of processing) is a big topic in the industry: optimizing machine usage while keeping the number of unfinished wafers "in flight" and the overall cycle time low. It takes weeks for a wafer to flow through the fab.

bob1029 8 hours ago

It is non-trivial to swap between product designs in a fab. It can take many lots before you have statistical process controls dialed in to the point where yields begin to pick up. Prior performance is not indicative of future performance.

BugheadTorpeda6 5 hours ago

It's probably more on the order of a thousand or so mainframe systems delivered per year, based on the average lifespan of one of these systems and the ~10,000 or so existing mainframe customers.

Then you have to consider that each CPC drawer in a mainframe has 4 sockets, and there are multiple CPC drawers per maxed-out system (I believe up to 4, so 16 sockets per mainframe). And some larger customers will have many mainframes set up in clusters for things like Parallel Sysplex and disaster recovery.

So probably more on the order of tens of thousands of these chips getting sold per year.

So definitely not high volume in CPU-manufacturing terms. But it's not minuscule.

iamwpj 5 hours ago

It's probably 100k in one run and then they are stored for future use.

rbanffy (OP) 3 hours ago

And since they can sell the machine fully configured, sales of single CPUs or drawers must not be a big thing. They will keep stock for replacements, but I agree we won't see them doing multiple batches often. By now they must be well into the development of the z18 and z19 processors. At this year's Hot Chips I'd expect them to show the next POWER and, in 2026, the next Z, hitting GA in 2027 or 2028.

belter 8 hours ago

The mainframe in 2025 is absolutely at the edge of technology. For some ML algorithms where massive GPU parallelism is not a benefit, it could even make a strong comeback.

I got so jealous of some colleagues that I once even considered getting into mainframe work. A CPU at 5.5 GHz continuously (not peak...), massive caches, really, really non-stop...

Look at this tech porn: "IBM z17 Technical Introduction" - https://www.redbooks.ibm.com/redbooks/pdfs/sg248580.pdf

xattt 6 minutes ago

Imagine a Beowulf cluster of those!

bell-cot 9 hours ago

Interesting to compare this to ZFS's ARC / MFU-vs-MRU / ghost / L2ARC / etc. strategy for (disk) caching. IIRC, those were mostly IBM-developed technologies.
