How In-Memory Computing Could Be a Game Changer in AI

In-memory computing revolutionizes traditional data processing methods by storing and processing data directly in the main memory (RAM) of a computer system, rather than relying on disk-based storage.

Introduction

Modern computing, inspired by Von Neumann’s architecture, faces a challenge: the constant back-and-forth movement of data between memory and processing units.

As AI demands surge, energy-intensive tasks exacerbate this bottleneck. While cache memory and High Bandwidth Memory (HBM) mitigate the issue, they have limitations. But what if data could be processed in the same place it is stored?

This blog explores the nuances of in-memory computing and its potential as a low-power computer architecture.


How do current computers work?

When John von Neumann designed the modern computer architecture, he had the functioning of the human brain in mind. The brain stores data in the subconscious, and when we want to process or analyse that data, it brings it into the conscious mind and performs logical operations on it.

Say you are taking an algebra test and need to add two numbers: you recall the rules of addition from memory, apply them to the numbers given in the question, and perform the logical operations to get the result.

Modern computers work in much the same way: data is fetched from memory (RAM) to the CPU, and after processing it is written back to memory.
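
To make the cycle concrete, here is a minimal Python sketch of the fetch–compute–write-back loop; the toy "memory" list and register variables are illustrative stand-ins, not a real instruction set:

```python
# Minimal sketch of the von Neumann fetch-compute-write-back cycle.
# "memory" stands in for RAM; the CPU must pull operands out of it,
# operate on them in registers, and push the result back.

memory = [0] * 8                # toy RAM
memory[0], memory[1] = 3, 4     # operands stored in memory

# Fetch: data travels from memory to the CPU (registers)
reg_a = memory[0]
reg_b = memory[1]

# Execute: the CPU performs the arithmetic/logical operation
reg_c = reg_a + reg_b

# Write back: the result travels back to memory
memory[2] = reg_c
print(memory[2])  # 7
```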

Read More: 5 Top AI Cloud Services You Need to Know in 2024 – techovedas

The issue with current computing architecture

The Von Neumann architecture maintains a rigid division between memory and processing units, resulting in a bottleneck when handling extensive data due to the continuous back-and-forth movement of information.

GPUs employ parallel processing for heavy workloads, but they draw large amounts of power. Additionally, with the advancement of AI there is a dire need for processors that can train, test and run heavy machine-learning workloads such as LLMs, autonomous-driving software (like Tesla's Dojo supercomputer), computer vision and NLP.

Read More: China’s BYD to Setup First European Car Plant; One of the Largest Investment for Hungary – techovedas

Energy consumption in transferring data

As AI continues to evolve, so does its energy consumption. Training modern AI models involves enormous numbers of parameters, often billions.

These parameters are stored in memory and must be transferred to the compute units for processing. Shuttling such huge volumes of data consumes a great deal of energy and bandwidth, and the traditional architecture of computers, with separate memory and CPU, does not help.
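
A back-of-envelope calculation gives a feel for the scale. All figures below (model size, per-byte transfer energy, number of passes) are assumptions chosen for illustration, broadly in line with the orders of magnitude discussed in the literature:

```python
# Back-of-envelope estimate of the energy spent just moving model weights.
# All figures are assumptions for illustration, not measurements.

params            = 7e9       # assumed model size: 7 billion parameters
bytes_per_param   = 2         # FP16 weights
energy_per_byte_j = 100e-12   # ~100 pJ per byte moved from off-chip DRAM (assumed)
passes            = 1_000     # assumed number of times the weights are re-read

bytes_moved = params * bytes_per_param * passes
energy_j    = bytes_moved * energy_per_byte_j

print(f"Data moved : {bytes_moved / 1e12:.0f} TB")
print(f"Energy     : {energy_j / 1e3:.1f} kJ spent just shuttling weights")
```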

Read More: What is Artificial Intelligence (AI) Memory Bottleneck and How to fix it? – techovedas

Current solutions for memory bottleneck

1. Cache memory

Cache memory was invented to address the speed gap between the CPU and main memory (RAM). It is a small but ultra-fast type of memory that sits between the two.

It stores frequently accessed data and instructions, reducing the time needed for the CPU to access them. When the CPU requests data, it first checks the cache memory. If the data is found in the cache (cache hit), it’s retrieved quickly. If not (cache miss), the data is fetched from slower main memory.
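
A toy sketch of that hit/miss logic, with made-up sizes and a simplistic eviction rule, looks like this:

```python
# Toy model of a cache lookup: check the small fast cache first,
# fall back to main memory on a miss and cache the result.
# Sizes and eviction policy are illustrative, not real hardware behaviour.

CACHE_CAPACITY = 4
cache = {}                                                # small, fast
main_memory = {addr: addr * 10 for addr in range(100)}    # large, slow

def read(addr):
    if addr in cache:                    # cache hit: fast path
        return cache[addr], "hit"
    value = main_memory[addr]            # cache miss: slow fetch from RAM
    if len(cache) >= CACHE_CAPACITY:     # evict the oldest entry (FIFO-style)
        cache.pop(next(iter(cache)))
    cache[addr] = value
    return value, "miss"

for addr in [1, 2, 1, 3, 4, 5, 1]:
    print(addr, *read(addr))
```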

However, cache is limited in size (typically 8 MB – 32 MB) because it sits on the same die as the CPU. AMD recently reached an extra 64 MB of cache by stacking memory on top of the compute die. These are great advancements, but stacking also has its limits.

2. HBM

High Bandwidth Memory (HBM) is a specialized type of RAM designed for blazing-fast data transfer and low latency. It plays a crucial role in high-performance computing applications like supercomputers, AI accelerators, and high-end graphics cards.

Unlike traditional RAM modules with flat chips, HBM uses a 3D stacking technology. By vertically stacking memory chips, HBM significantly reduces the physical distance data needs to travel between memory and processor.

This translates to faster transfer speeds as data doesn’t have to traverse long, winding paths on a single flat chip.
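
As a rough feel for the numbers, a stack's peak bandwidth is approximately its interface width multiplied by the per-pin data rate; the figures below are typical HBM2-class values used purely as an example:

```python
# Rough peak-bandwidth estimate for one HBM stack:
# bandwidth ≈ interface width (bits) × per-pin data rate / 8.
# Example figures are typical HBM2-class numbers, used only for illustration.

interface_width_bits   = 1024   # a single HBM stack exposes a very wide bus
data_rate_gbps_per_pin = 2.0    # assumed per-pin transfer rate

bandwidth_gb_s = interface_width_bits * data_rate_gbps_per_pin / 8
print(f"~{bandwidth_gb_s:.0f} GB/s per stack")   # ~256 GB/s
```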

Read More: What is High Bandwidth Memory (HBM)? – techovedas

An analogy to understand in-memory computing

In-memory computing is a computing approach where data is stored and processed primarily in the system’s main memory (RAM), rather than being fetched from secondary storage like hard drives or SSDs. This method allows for faster data access and manipulation since accessing data from memory is much quicker than fetching it from disk.
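
A minimal sketch of the difference, using a throwaway CSV file as the "secondary storage" (file name and contents are placeholders for illustration):

```python
import csv, os, tempfile

# In-memory vs disk-based access to the same data.
# The file name and contents are made-up placeholders.

path = os.path.join(tempfile.gettempdir(), "inmem_demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([[i, i % 7] for i in range(100_000)])

def total_from_disk():
    # Disk-style access: every query re-reads the file from secondary storage
    with open(path) as f:
        return sum(int(row[1]) for row in csv.reader(f))

# In-memory style: load once, keep the data resident in RAM...
with open(path) as f:
    table = [int(row[1]) for row in csv.reader(f)]

def total_in_memory():
    # ...then every query touches only main memory
    return sum(table)

print(total_from_disk(), total_in_memory())  # same answer, far cheaper the second way
```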

An analogy for in-memory computing could be a librarian organizing books in a library. Imagine the librarian has a small desk where they can place a few books at a time for quick reference. This desk represents the main memory. When someone asks for a specific book, the librarian can easily retrieve it from the desk since it’s right there, without having to go to the shelves (secondary storage) to find it. This is akin to accessing data stored in memory.

On the other hand, if the librarian had to fetch each book from the shelves whenever someone requested it, it would take much longer to fulfill the request, similar to accessing data from disk.

So, in-memory computing is like having a handy desk (main memory) where frequently accessed data is kept for rapid access, resulting in faster data processing and analysis compared to fetching it from slower secondary storage sources.

Crossbar architectures for in-memory computing

The approach described above is still von Neumann in nature: data, albeit at higher speed, still needs to shuttle between the CPU and memory, and that movement still dissipates energy.

The solution the world is now looking towards is processing data at the same place where it is stored, i.e. in memory. This can be achieved through crossbar architectures of resistive memory elements.

In-memory computing utilizing crossbar arrays of Resistive Random-Access Memories (RRAMs) is a cutting-edge approach that integrates storage and computation within the memory array itself.

This technology relies on the unique property of RRAMs that they can store data in an analog manner by varying their resistance levels. These devices are arranged in a crossbar configuration, a grid-like structure where rows and columns intersect to create individual memory cells.

By exploiting this property, analog computations, in particular matrix-vector multiplication, can be performed directly within the memory array.

Each crosspoint device stores a weight as its conductance, and the input vector is applied as voltages along the rows. The currents that accumulate along each column are, by Ohm's and Kirchhoff's laws, the dot products of the inputs with the stored weights, so the entire matrix-vector multiplication is executed in parallel across the array.
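
A small NumPy sketch of the idea, modelling stored conductances as the weight matrix, applied voltages as the input vector, and a Gaussian term standing in for analog non-idealities (all values are assumptions for illustration):

```python
import numpy as np

# Analog matrix-vector multiplication on a resistive crossbar, modelled digitally.
# Each crosspoint stores a weight as a conductance G[i, j]; applying voltages V
# to the rows produces column currents I = G.T @ V by Ohm's and Kirchhoff's laws,
# i.e. the whole matrix-vector product in one parallel step.

rng = np.random.default_rng(0)

G = rng.uniform(0.0, 1.0, size=(4, 3))   # conductances = stored weights (assumed values)
V = np.array([0.2, 0.5, 0.1, 0.8])       # input vector applied as row voltages

I_ideal = G.T @ V                         # ideal column currents = matrix-vector product

# Real devices are noisy; a small Gaussian term stands in for analog non-idealities
I_measured = I_ideal + rng.normal(0.0, 0.01, size=I_ideal.shape)

print("ideal   :", np.round(I_ideal, 3))
print("measured:", np.round(I_measured, 3))
```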

Future of in-memory computing – 3D crossbars

A crossbar, in this context, refers to a two-dimensional arrangement of resistive elements forming a grid-like pattern where horizontal and vertical lines intersect. A 3D crossbar extends this concept into three dimensions by stacking additional layers on top of each other.

In the context of memory architectures and in-memory computing, a 3D crossbar consists of a three-dimensional array of memory cells or processing elements interconnected in a grid-like fashion. This architecture offers greater density and tighter packing of memory cells or processing elements, enabling higher levels of parallelism, speed, and energy efficiency.
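
Continuing the sketch above, a 3D crossbar can be modelled as an extra tensor dimension, with each stacked layer computing its own matrix-vector product in parallel (shapes and values are again illustrative assumptions):

```python
import numpy as np

# Toy model of a 3D crossbar: several 2D crossbar layers stacked vertically,
# each holding its own conductance matrix and working at the same time.
# Shapes and values are illustrative assumptions.

rng = np.random.default_rng(1)

layers, rows, cols = 8, 4, 3
G3d = rng.uniform(0.0, 1.0, size=(layers, rows, cols))   # one weight matrix per layer
V   = rng.uniform(0.0, 1.0, size=(layers, rows))         # one input vector per layer

# Every layer performs its matrix-vector product in parallel:
# out[l, c] = sum_r G3d[l, r, c] * V[l, r]
out = np.einsum("lrc,lr->lc", G3d, V)

print(out.shape)   # (8, 3): eight products computed side by side
```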

Read More: What is Neuromorphic Computing? – techovedas

Conclusion

In-memory computing with 3D crossbars heralds a new era in computing. By performing computations directly within memory arrays, it removes much of the data-movement bottleneck, offering major gains in speed and energy efficiency.
