The start-up company Tachyum has raised $25M in a Series-A funding round for a new processor design it calls the Prodigy Universal Processor. According to the company, Prodigy is faster in single-threaded code than Xeon, with smaller CPU cores than ARM. It can supposedly be used to simulate human brain-sized neural networks in real time, and it supposedly outperforms CPUs, GPUs, and Google’s TPU. Tachyum says it can run 64 cores at an all-core frequency of 4GHz, fits into just 290mm² of die space (half the size of AMD’s 7nm Epyc design on the same node), and supports eight channels of DDR5, 72 PCIe 5.0 lanes, two 400G Ethernet connections, and HBM3.
To say that Tachyum hasn’t proved these claims would be an understatement. Claiming to be able to beat Intel or AMD in single-threaded performance or ARM on die size and power efficiency would be eyebrow-raising in the best of circumstances. Claiming to do both simultaneously with a chip you haven’t actually even built yet requires better evidence than we’ve yet seen to take the argument seriously. The company is claiming it’ll eventually field a CPU with 128 cores at 4GHz in a single socket with 12x DDR5 controllers.
Claiming to have solved the issue of ‘slow wires’ (presumably a reference to RC delay) with very short wires doesn’t actually answer anything. Specifically, it doesn’t explain how Prodigy manages to use these very short wires in the critical path, why it’s able to deploy them when competing CPU designs can’t, or what Tachyum has traded in exchange for short wire lengths. An all-core frequency of 4GHz in a 180W TDP raises questions about exactly how much work these chips can accomplish per clock cycle, especially given that they appear to borrow some pages from Itanium’s approach to improving hardware performance: namely, the idea that the complexity of out-of-order execution can be shoved into the compiler and left for it to optimize efficiently.
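To put the power question in rough numbers, here is a back-of-the-envelope sketch in Python. It assumes the 180W TDP applies to the 64-core part at its all-core 4GHz clock, and that every watt goes to the cores, which ignores the memory controllers, PCIe, and Ethernet blocks. The result is therefore an upper bound on the per-core budget, not a figure from Tachyum. The function name is purely illustrative.

```python
# Rough per-core power and energy budget from the quoted figures.
# Assumption: the whole TDP is spent in the cores (an upper bound).

TDP_WATTS = 180.0   # quoted TDP
CORES = 64          # quoted core count
CLOCK_HZ = 4.0e9    # quoted all-core frequency (4GHz)


def per_core_budget(tdp_watts: float, cores: int, clock_hz: float):
    """Return (watts per core, nanojoules per core per clock cycle)."""
    watts_per_core = tdp_watts / cores
    # Energy per cycle = power / frequency; convert joules to nanojoules.
    nanojoules_per_cycle = watts_per_core / clock_hz * 1e9
    return watts_per_core, nanojoules_per_cycle


watts, nj = per_core_budget(TDP_WATTS, CORES, CLOCK_HZ)
print(f"{watts:.2f} W per core, {nj:.2f} nJ per core per cycle")
```

Under those assumptions, each core gets at most about 2.8W, or roughly 0.7 nanojoules per clock cycle. That's the envelope inside which Prodigy would have to deliver Xeon-beating single-threaded work.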
The company gave a presentation at Hot Chips last year that’s now public.
Tachyum’s PR copy claims that Prodigy reduces data center TCO by 4x “through a disruptive hardware architecture and a smart compiler that has made many parts of the hardware found in a typical processor redundant. Fewer wires and shorter wires, due to a smaller, simpler core, translates into much greater speed and power efficiency for the processor.”
According to the Q&A session after Hot Chips, these CPUs lose 40 percent of their performance when running x86 binaries through emulation rather than code compiled for Prodigy, which seems like a major problem for the whole “Faster than Xeon” argument. The company claims that “Binary 4.0 GHz emulated still outperforms 2.5 GHz Xeon,” which would be more of a problem for Intel (or AMD) if a 2.5GHz Xeon represented some kind of objectively difficult performance threshold. A phrase like “out-of-order execution in software” is a fancy way of saying: “We shoved all the work of achieving high performance into the compiler, and we’re really hoping our compiler can extract enough performance to make this work.” Intel tried exactly this strategy with Itanium. It didn’t work.
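The arithmetic behind that emulation claim is worth spelling out. The sketch below uses a deliberately crude model in which performance is simply clock speed multiplied by work done per clock. That model is our assumption, not something Tachyum has published, and the function names are illustrative only.

```python
# Crude model: performance ~ clock (GHz) x work per clock.
# Inputs are the figures quoted in the Q&A; the model itself is an assumption.


def effective_clock_ghz(clock_ghz: float, emulation_loss: float) -> float:
    """Clock that would give the same throughput with no emulation penalty."""
    return clock_ghz * (1.0 - emulation_loss)


def required_per_clock_advantage(prodigy_ghz: float,
                                 emulation_loss: float,
                                 xeon_ghz: float) -> float:
    """Native per-clock ratio (Prodigy / Xeon) needed to break even."""
    return xeon_ghz / effective_clock_ghz(prodigy_ghz, emulation_loss)


emulated = effective_clock_ghz(4.0, 0.40)
ratio = required_per_clock_advantage(4.0, 0.40, 2.5)
print(f"Emulated Prodigy behaves like a {emulated:.1f}GHz native part")
print(f"Break-even per-clock advantage over Xeon: {ratio:.3f}x")
```

In this model, a 4GHz Prodigy that loses 40 percent to emulation behaves like a 2.4GHz part. To beat a 2.5GHz Xeon on x86 binaries, its native cores would need to do a little over 4 percent more work per clock than the Xeon's, before any of the compiler risk is counted.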
With that said, there’s a lot about Prodigy’s architecture that’s unclear right now. There are arguments in various forums about the degree to which it resembles or doesn’t resemble Itanium or whether its architecture should be more properly understood as VLIW, modified VLIW, EDGE, or something else.
Tachyum’s Prodigy, based on what we’ve seen to date, is very long on sizzle. It’s supposedly the best parallel processor and the best serial processor, despite the fact that CPUs and GPUs run very different types of code. It can match or exceed Intel’s top-end chips, yet runs within power envelopes and die sizes better than anything ARM or AMD can field.
Extraordinary claims require extraordinary evidence. We don’t have much of that just yet.