Introduction
At Hot Chips 2024, Qualcomm delivered an in-depth presentation on its latest CPU innovation, the Oryon core, marking a significant leap forward for the company.
As the first custom core from Qualcomm in years, Oryon is designed to power the next generation of Snapdragon processors, specifically the Snapdragon X Elite.
In this article, we’ll dive into the new details shared during the event, providing insights into Qualcomm’s architectural decisions and how they compare with industry standards.
Branch Prediction and Frontend Advancements
Qualcomm’s Oryon core features a robust branch prediction mechanism, crucial for keeping a wide, high-throughput core fed with useful work. The core uses the TAGE (TAgged GEometric history length) prediction method, which keeps several tagged tables indexed with geometrically increasing lengths of branch history. For each branch, the prediction comes from the longest history that matches, so the predictor effectively selects the history length that works best for that branch.
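To make the idea of matching on the longest useful history concrete, here is a minimal, heavily simplified sketch of a TAGE-style predictor in Python. It is not Qualcomm’s design: the class name ToyTage, the three history lengths, the table size, the index function, the use of full (pc, history) pairs as tags, and the allocation policy are all assumptions chosen for illustration. Real hardware uses short partial tags, usefulness counters, and more tables.

```python
class ToyTage:
    """Toy TAGE-style predictor (illustrative assumptions throughout)."""

    def __init__(self, history_lengths=(2, 4, 8), table_bits=10):
        self.history_lengths = history_lengths      # geometric series, assumed
        self.index_mask = (1 << table_bits) - 1
        self.history_mask = (1 << max(history_lengths)) - 1
        self.history = 0                            # newest outcome in bit 0
        self.base = {}                              # pc -> 2-bit counter
        self.tables = [{} for _ in history_lengths]  # index -> (tag, counter)

    def _slot(self, pc, table):
        """Index and tag for this pc under the table's history length."""
        hist = self.history & ((1 << self.history_lengths[table]) - 1)
        return (pc ^ hist) & self.index_mask, (pc, hist)

    def _provider(self, pc):
        """Find the longest-history table whose entry tag matches."""
        for table in reversed(range(len(self.tables))):
            index, tag = self._slot(pc, table)
            entry = self.tables[table].get(index)
            if entry is not None and entry[0] == tag:
                return table, index
        return None, None

    def predict(self, pc):
        table, index = self._provider(pc)
        if table is None:
            counter = self.base.get(pc, 2)          # default: weakly taken
        else:
            counter = self.tables[table][index][1]
        return counter >= 2

    def update(self, pc, taken):
        table, index = self._provider(pc)
        predicted = self.predict(pc)
        step = 1 if taken else -1
        # Train the counter that supplied the prediction (saturating 0..3).
        if table is None:
            self.base[pc] = min(3, max(0, self.base.get(pc, 2) + step))
        else:
            tag, counter = self.tables[table][index]
            self.tables[table][index] = (tag, min(3, max(0, counter + step)))
        # On a misprediction, allocate in the next longer-history table.
        next_table = 0 if table is None else table + 1
        if predicted != taken and next_table < len(self.tables):
            new_index, new_tag = self._slot(pc, next_table)
            self.tables[next_table][new_index] = (new_tag, 2 if taken else 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, pc, outcomes):
    """Run a sequence of outcomes for one branch and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

In this toy setup, a branch that repeats taken, taken, taken, not-taken is mispredicted a few times during warm-up and then predicted correctly, because four bits of history are enough to tell the four positions in the pattern apart. A simple per-branch counter with no history would keep missing the not-taken case.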
The branch predictor in Oryon has an 80 KB storage budget, a substantial allocation given that it is nearly as large as the 96 KB L1 data cache. While this may seem excessive, it aligns with industry trends. For instance, AMD’s processors also allocate significant resources to branch prediction. Qualcomm’s approach is meant to let the Oryon core handle complex branching patterns efficiently, a critical factor in maintaining high instruction throughput.
Pipeline Width and Decoder Design
Oryon’s 8-wide pipeline is a standout feature, allowing the core to process up to eight instructions per cycle. Qualcomm’s decision to build an 8-wide core rather than a narrower design reflects a forward-looking strategy.
Gerard Williams, Qualcomm’s Senior Vice President of Engineering, likened this design choice to building a bridge to new opportunities. While the benefits of an 8-wide core may not be immediately apparent, they open the door to future performance enhancements as software and workloads evolve.
The 8-wide decoder ensures the pipeline receives a steady flow of instructions, preventing bottlenecks in the execution units. Qualcomm’s engineering team has focused on balancing the pipeline’s width with the core’s ability to sustain high performance over extended periods.
Load/Store Unit and Scheduling Innovations
Qualcomm has implemented a distributed scheduling model in the Oryon core, which contrasts with the unified schedulers used in some other architectures.
This distributed approach makes it easier to select the oldest ready instruction each cycle, since each scheduler has fewer entries to examine, improving overall execution efficiency.
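As a rough illustration of oldest-ready selection across several small schedulers, the following Python sketch models each scheduler as a short list and picks one instruction per queue per cycle. The MicroOp type, the function names, and the one-pick-per-queue policy are assumptions for illustration, not details from Qualcomm’s presentation.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    age: int      # smaller value = older in program order
    ready: bool   # True once all source operands are available
    name: str = ""


def pick_oldest_ready(queue):
    """Return the oldest ready micro-op in one scheduler queue, or None."""
    ready = [op for op in queue if op.ready]
    return min(ready, key=lambda op: op.age) if ready else None


def issue_cycle(queues):
    """Each scheduler independently issues its oldest ready micro-op."""
    issued = []
    for queue in queues:
        op = pick_oldest_ready(queue)
        if op is not None:
            queue.remove(op)
            issued.append(op)
    return issued
```

Each call to pick_oldest_ready only scans its own queue, which is the point of the distributed arrangement: the selection logic compares a handful of entries instead of every pending instruction in the core.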
The load/store unit (LSU) in Oryon is particularly noteworthy. It features 64-entry reservation stations, providing ample capacity for scheduling load and store operations.
This design allows Oryon to handle four loads per cycle, surpassing AMD’s Zen 4, which handles three loads per cycle on the integer side and two on the vector side.
The LSU’s capacity to absorb additional operations, such as store data, further enhances its ability to maintain high throughput.
Microbenchmarking revealed that Oryon could manage 62 outstanding loads before encountering a dispatch stall at rename.
This observation suggests that Oryon’s scheduler entries are well-utilized, although there may be additional limiting factors in certain scenarios.
Data Caches and TLB Architecture
The Oryon core features a 96 KB L1 data cache, utilizing a multi-ported design that follows standard bitcell configurations. Qualcomm opted for this cache size to balance capacity with clock speed requirements. Increasing the cache size could introduce timing challenges and affect the core’s performance.
Oryon’s Translation Lookaside Buffers (TLBs) stand out as well. The 224-entry DTLB balances timing and storage capacity, focusing on minimizing address translation latency. Qualcomm’s L2 TLB is unusually large, aiming to reduce the likelihood of TLB thrashing, especially under aggressive prefetching conditions.
The L2 cache in Oryon is equally impressive, offering 12 MB of capacity with a latency of 15-20 cycles. Qualcomm’s design employs multiple slices within the L2 cache, similar to the L3 caches in Intel and AMD architectures. Each core in the Oryon cluster typically uses 2-3 MB of L2 cache, but a single core can access the entire 12 MB when needed.
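One common way to build a sliced cache is to hash each cache line address to a slice, so that consecutive lines spread evenly and any core can reach all of the capacity. The Python sketch below shows that general idea only. The 64-byte line size, the slice count of four, and the XOR-based hash are hypothetical choices for illustration and are not published Oryon parameters.

```python
from collections import Counter

LINE_BYTES = 64   # assumed cache line size
NUM_SLICES = 4    # hypothetical slice count, not an Oryon figure


def slice_for_address(addr):
    """Map a physical address to a slice using a simple XOR hash (assumed)."""
    line = addr // LINE_BYTES
    # Mix in higher address bits so strided access patterns still spread out.
    return (line ^ (line >> 7) ^ (line >> 13)) % NUM_SLICES


def slice_histogram(start, num_lines):
    """Count how many of num_lines consecutive lines land in each slice."""
    return Counter(
        slice_for_address(start + i * LINE_BYTES) for i in range(num_lines)
    )
```

Calling slice_histogram(0, 4096) distributes the 4096 lines evenly across the four slices, which is the property that lets a single core draw on the whole cache rather than only a nearby portion.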
Memory Bandwidth and Prefetching Strategies
Qualcomm’s Oryon core excels in memory bandwidth, achieving nearly 100 GB/s of DRAM bandwidth for a single thread. This performance is attributed to Qualcomm’s aggressive prefetching strategies, which aim to have data ready before the core requests it. The prefetchers in Oryon analyze access patterns and issue data requests in advance, hiding L2 latency in many cases.
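A simple stride prefetcher is one of the most basic forms of pattern-based prefetching, and it shows the principle of turning observed accesses into early requests. The Python sketch below is a generic textbook-style example, not a description of Oryon’s prefetchers. The class name, the degree parameter, and the rule of requiring the same stride twice in a row are assumptions.

```python
class StridePrefetcher:
    """Toy stride prefetcher: predicts addr + k*stride after a repeat stride."""

    def __init__(self, degree=2):
        self.degree = degree        # how many lines ahead to request (assumed)
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Observe a demand access and return addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only prefetch once the same non-zero stride is seen twice.
            if stride != 0 and stride == self.last_stride:
                prefetches = [
                    addr + stride * i for i in range(1, self.degree + 1)
                ]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches
```

For accesses to addresses 0, 64, and 128, the third access confirms a 64-byte stride and triggers requests for 192 and 256. A larger degree runs further ahead of the demand stream, which hides more latency at the cost of extra traffic when the pattern breaks.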
In contrast to other architectures, where prefetching plays a supporting role, Oryon emphasizes prefetching as a key component of its overall strategy. This approach allows the core to maintain high performance even under demanding workloads.
System-Level Design and Scalability
The Snapdragon X Elite, powered by the Oryon core, features 12 cores organized into three quad-core clusters. Qualcomm initially limited each cluster to four cores due to L2 interconnect constraints, although that limit has since been raised. The decision to include 12 cores in the Snapdragon X Elite reflects Qualcomm’s strategy of scaling performance by increasing core count, especially in devices with enhanced cooling capabilities.
This approach contrasts with Intel and AMD, which often use varying core counts to target different power levels. Qualcomm’s decision to use a consistent core count across power targets simplifies the design and allows for more predictable scaling.
Conclusion
Qualcomm’s presentation at Hot Chips 2024 provided valuable insights into the Oryon core’s architecture, showcasing the company’s commitment to innovation in the CPU space. With its advanced branch prediction, wide pipeline design, and efficient load/store unit, the Oryon core is poised to compete with the best in the industry.
As Qualcomm continues to refine and evolve the Oryon core, it will be fascinating to see how this architecture performs in real-world applications. The Snapdragon X Elite, with its Oryon-powered design, represents a significant milestone for Qualcomm and a promising glimpse into the future of high-performance computing.