Hot Chips 2024: Qualcomm Unveils Advanced Oryon Core Architecture

As the first custom core from Qualcomm in years, Oryon is designed to power the next generation of Snapdragon processors, specifically the Snapdragon X Elite.

Introduction

At Hot Chips 2024, Qualcomm delivered an in-depth presentation on its latest CPU innovation, the Oryon core, marking a significant leap forward for the company.

In this article, we’ll dive into the new details shared during the event, providing insights into Qualcomm’s architectural decisions and how they compare with industry standards.

Branch Prediction and Frontend Advancements

Qualcomm’s Oryon core features a robust branch prediction mechanism, crucial for keeping a wide, deep pipeline fed with useful work. The core uses the TAgged GEometric history length (TAGE) prediction method, which keeps several tagged tables indexed with geometrically increasing history lengths. For each branch, the predictor uses the longest history that has a matching entry, so simple branches are handled with short histories while harder ones draw on longer ones.
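To make the idea of per-branch history selection concrete, here is a minimal TAGE-style sketch. It is not Qualcomm's design: the table count, the history lengths, the counter width, and the use of unbounded dictionaries in place of fixed-size hashed tables are all assumptions chosen for readability, and refinements found in real TAGE predictors (usefulness bits, alternate predictions, folded histories) are left out.

```python
def geometric_history_lengths(shortest, longest, tables):
    """Return history lengths that grow geometrically from shortest to longest."""
    ratio = (longest / shortest) ** (1 / (tables - 1))
    return [round(shortest * ratio ** i) for i in range(tables)]


def bump(counter, taken):
    """Saturating counter in [-2, 1]; a value >= 0 predicts 'taken'."""
    return min(counter + 1, 1) if taken else max(counter - 1, -2)


class ToyTage:
    """Hypothetical toy predictor: dicts stand in for tagged hardware tables."""

    def __init__(self, lengths):
        self.lengths = lengths
        self.base = {}                       # pc -> counter (fallback predictor)
        self.tables = [{} for _ in lengths]  # (pc, history bits) -> counter
        self.history = 0                     # newest outcome lives in bit 0

    def _key(self, pc, t):
        # The key plays the role of index + tag: pc plus the last N outcomes.
        return pc, self.history & ((1 << self.lengths[t]) - 1)

    def predict(self, pc):
        """Return (prediction, provider table index, or -1 for the base)."""
        for t in reversed(range(len(self.tables))):  # longest history first
            counter = self.tables[t].get(self._key(pc, t))
            if counter is not None:
                return counter >= 0, t
        return self.base.get(pc, 0) >= 0, -1

    def update(self, pc, taken):
        """Train on the real outcome; return True if the prediction was right."""
        prediction, provider = self.predict(pc)
        if provider < 0:
            self.base[pc] = bump(self.base.get(pc, 0), taken)
        else:
            key = self._key(pc, provider)
            self.tables[provider][key] = bump(self.tables[provider][key], taken)
        if prediction != taken and provider + 1 < len(self.tables):
            # On a miss, allocate an entry that uses the next longer history.
            new_key = self._key(pc, provider + 1)
            self.tables[provider + 1][new_key] = 0 if taken else -1
        limit = (1 << self.lengths[-1]) - 1
        self.history = ((self.history << 1) | int(taken)) & limit
        return prediction == taken


lengths = geometric_history_lengths(2, 32, 4)
predictor = ToyTage(lengths)
pattern = [True] * 7 + [False]  # loop branch: taken 7 times, then exits
results = [predictor.update(0x400, pattern[i % 8]) for i in range(200)]
late_misses = results[100:].count(False)
```

In this toy run the loop-exit branch is mispredicted a few times early on, each miss promoting the branch to a longer history, until a table with enough history to see the whole loop takes over and the remaining predictions are correct.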

The branch predictor in Oryon has an 80 KB storage budget, a substantial allocation next to the core’s 96 KB L1 data cache. While this may seem excessive, it aligns with industry trends. For instance, AMD’s processors also allocate significant resources to branch prediction. Qualcomm’s approach is meant to let the Oryon core handle complex branching scenarios efficiently, a critical factor in maintaining high instruction throughput.

Pipeline Width and Decoder Design

Oryon’s 8-wide pipeline design is a standout feature, allowing the core to process up to eight instructions per cycle. Qualcomm’s decision to opt for an 8-wide core, as opposed to narrower designs, reflects a forward-looking strategy.

Gerard Williams, Qualcomm’s Senior Vice President of Engineering, likened this design choice to building a bridge to new opportunities. While the benefits of an 8-wide core may not be immediately apparent, they open the door to future performance enhancements as software and workloads evolve.

The 8-wide decoder ensures the pipeline receives a steady flow of instructions, preventing bottlenecks in the execution units. Qualcomm’s engineering team has focused on balancing the pipeline’s width with the core’s ability to sustain high performance over extended periods.

Load/Store Unit and Scheduling Innovations

Qualcomm has implemented a distributed scheduling model in the Oryon core, which contrasts with the unified schedulers used in some other architectures.

This distributed approach offers advantages in terms of selecting the oldest ready instruction, improving overall execution efficiency.
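The selection step can be illustrated with a small sketch. The number of queues, their contents, and the one-issue-per-queue rule below are assumptions made for illustration, not details from Qualcomm's presentation. The point is only that each distributed scheduler searches its own short queue for the oldest ready micro-op, rather than one large structure searching everything.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    age: int     # lower means older in program order
    ready: bool  # True once all source operands are available
    name: str


def pick_oldest_ready(queue):
    """Return the oldest ready micro-op in one scheduler queue, or None."""
    ready = [op for op in queue if op.ready]
    return min(ready, key=lambda op: op.age, default=None)


def issue_cycle(queues):
    """Each scheduler independently issues at most one micro-op per cycle."""
    issued = []
    for queue in queues:
        op = pick_oldest_ready(queue)
        if op is not None:
            queue.remove(op)
            issued.append(op.name)
    return issued


# Three hypothetical scheduler queues (for example ALU, load/store, multiply).
queues = [
    [MicroOp(0, False, "add0"), MicroOp(3, True, "add3"), MicroOp(5, True, "add5")],
    [MicroOp(1, True, "load1"), MicroOp(2, True, "load2")],
    [MicroOp(4, False, "mul4")],
]
issued = issue_cycle(queues)
```

Here the first queue skips its oldest entry because it is not ready and issues "add3", the second issues "load1", and the third issues nothing. Each pick only compares the handful of entries in its own queue, which is the property that makes an oldest-first policy cheaper to implement in a distributed design.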

The load/store unit (LSU) in Oryon is particularly noteworthy. It features 64-entry reservation stations, providing ample capacity for scheduling load and store operations.

This design allows Oryon to handle four loads per cycle, surpassing AMD’s Zen 4, which handles three loads per cycle on the integer side and two on the vector side.

The LSU’s capacity to absorb additional operations, such as store data, further enhances its ability to maintain high throughput.

Microbenchmarking revealed that Oryon could manage 62 outstanding loads before encountering a dispatch stall at rename.

This observation suggests that Oryon’s scheduler entries are well-utilized, although there may be additional limiting factors in certain scenarios.

Data Caches and TLB Architecture

The Oryon core features a 96 KB L1 data cache, utilizing a multi-ported design that follows standard bitcell configurations. Qualcomm opted for this cache size to balance capacity with clock speed requirements. Increasing the cache size could introduce timing challenges and affect the core’s performance.

Oryon’s Translation Lookaside Buffers (TLBs) stand out as well. The 224-entry DTLB balances timing and storage capacity, focusing on minimizing address translation latency. Qualcomm’s L2 TLB is unusually large, aiming to reduce the likelihood of TLB thrashing, especially under aggressive prefetching conditions.
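A quick way to put the 224-entry figure in perspective is to compute TLB reach, the amount of memory the DTLB can map without a miss. The page sizes below are assumptions (4 KB, 16 KB, and 64 KB are the translation granules the Arm architecture defines); the article does not say which page size a given operating system uses on Oryon.

```python
DTLB_ENTRIES = 224         # from the article
L1D_BYTES = 96 * 1024      # from the article


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_size_bytes


# Assumed page sizes in KB, mapped to the resulting reach in KB.
reach_kb = {kb: tlb_reach_bytes(DTLB_ENTRIES, kb * 1024) // 1024 for kb in (4, 16, 64)}

# How many times larger the 4 KB-page reach is than the L1 data cache.
reach_vs_l1d = tlb_reach_bytes(DTLB_ENTRIES, 4 * 1024) / L1D_BYTES

for page_kb, covered_kb in reach_kb.items():
    print(f"{page_kb:>2} KB pages -> {covered_kb} KB of reach")
```

With 4 KB pages the DTLB covers 896 KB, several times the size of the L1 data cache but far less than the 12 MB L2, which is one reason a large second-level TLB matters when prefetchers pull in data from many pages.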

The L2 cache in Oryon is equally impressive, offering 12 MB of capacity with a latency of 15-20 cycles. Qualcomm’s design employs multiple slices within the L2 cache, similar to the L3 caches in Intel and AMD architectures. Each core in the Oryon cluster typically uses 2-3 MB of L2 cache, but a single core can access the entire 12 MB when needed.

Memory Bandwidth and Prefetching Strategies

Qualcomm’s Oryon core excels in memory bandwidth, achieving nearly 100 GB/s of DRAM bandwidth for a single thread. This performance is attributed to Qualcomm’s aggressive prefetching strategies, which ensure that data is ready before the core requests it. The prefetchers in Oryon analyze access patterns to generate data requests in advance, effectively hiding L2 latency in many cases.
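Little's Law gives a rough sense of what such a bandwidth figure demands: bytes in flight equal bandwidth multiplied by latency. The sketch below assumes a 64-byte cache line and uses illustrative DRAM latencies of 100 ns and 150 ns; neither latency is a figure from the presentation.

```python
import math


def lines_in_flight(bandwidth_gb_s, latency_ns, line_bytes=64):
    """Cache lines that must be outstanding to sustain the given bandwidth.

    GB/s multiplied by ns yields bytes directly (1e9 * 1e-9 = 1).
    """
    bytes_in_flight = bandwidth_gb_s * latency_ns
    return math.ceil(bytes_in_flight / line_bytes)


# 100 GB/s is the article's single-thread figure; latencies are assumptions.
needed_at_100ns = lines_in_flight(100, 100)
needed_at_150ns = lines_in_flight(100, 150)
print(needed_at_100ns, needed_at_150ns)
```

Under these assumptions a single thread needs well over a hundred cache-line requests in flight at once, more than demand loads alone are likely to generate, which is consistent with the article's point that prefetching does much of the work.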

In contrast to other architectures, where prefetching plays a supporting role, Oryon emphasizes prefetching as a key component of its overall strategy. This approach allows the core to maintain high performance even under demanding workloads.
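As an illustration of what analyzing access patterns can mean, here is a minimal stride prefetcher. It is a textbook technique used as a stand-in, not a description of Oryon's prefetchers, whose algorithms were not detailed in the article. The class name, the `degree` parameter, and the single-stream tracking are all hypothetical.

```python
class StridePrefetcher:
    """Toy single-stream stride detector (hypothetical, for illustration)."""

    def __init__(self, degree=2):
        self.degree = degree      # how many strides ahead to request
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Observe one demand address; return the addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only prefetch once the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


prefetcher = StridePrefetcher(degree=2)
requests = [prefetcher.access(addr) for addr in (0, 64, 128, 192)]
print(requests)
```

The first two accesses only establish the stride; from the third access onward the prefetcher requests the next two lines ahead of the demand stream. A more aggressive design would raise the degree or track many streams at once, at the cost of extra traffic when the guess is wrong.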

System-Level Design and Scalability

The Snapdragon X Elite, powered by the Oryon core, features 12 cores organized into three quad-core clusters. Qualcomm initially limited each cluster to four cores due to L2 interconnect constraints, although this capability has since been expanded. The decision to include 12 cores in the Snapdragon X Elite reflects Qualcomm’s strategy of scaling performance by increasing core count, especially in devices with enhanced cooling capabilities.

This approach contrasts with Intel and AMD, which often use varying core counts to target different power levels. Qualcomm’s decision to use a consistent core count across power targets simplifies the design and allows for more predictable scaling.

Conclusion

Qualcomm’s presentation at Hot Chips 2024 provided valuable insights into the Oryon core’s architecture, showcasing the company’s commitment to innovation in the CPU space. With its advanced branch prediction, wide pipeline design, and efficient load/store unit, the Oryon core is poised to compete with the best in the industry.

As Qualcomm continues to refine and evolve the Oryon core, it will be fascinating to see how this architecture performs in real-world applications. The Snapdragon X Elite, with its Oryon-powered design, represents a significant milestone for Qualcomm and a promising glimpse into the future of high-performance computing.

Kumar Priyadarshi