Developing Heterogeneous Cache Coherent SoCs – and More!


Editor’s Note:

ArterisIP provides system-on-chip (SoC) interconnect IP to accelerate SoC semiconductor assembly for a wide range of applications from automobiles to mobile phones, IoT, cameras, SSD controllers, and servers for customers such as Samsung, Huawei / HiSilicon, Mobileye (Intel), Altera (Intel), and Texas Instruments. The company is located in Campbell, CA.

Describe in detail the various design challenges faced today in developing Heterogeneous Cache Coherent SoCs.

The first thing you need to do is understand customer requirements. This includes asking the right questions, some of which may include:

  • How many IPs need to connect to the heterogeneous system?
  • What kind of bandwidth does each IP require?
  • What kinds of IP are involved, and what features can you enable with interconnect IP?

The next step is to define heterogeneity, because many people use the word “heterogeneous” but mean different things by it. Some key tasks and guidelines:

  • You must accommodate different types of processors within the same family;
  • Then you have to accommodate the different types of processors that are available on the market;
  • Different processor types also have different cache structures:
    o An ARM CPU would use the same cache structure as another ARM core.
  • A different CPU presents a different cache structure;
  • Accommodate different types of IPs as well:
    o CPUs, GPUs, and DSPs;
    o All the other types of IP that you combine into an SoC, such as connectivity IP, USB, SATA, etc.

It’s also important to be able to accommodate different (cache) protocol systems, both coherent and non-coherent. Some examples:

  • Flexible snoop filter capability accommodates different cache structures of different kinds of processors.
    o Snoop filter capabilities operate in two different directions to accommodate any cache structure of any processor that is available today.
    o Another challenge: Reduce the number of memory bits that you need to perform snoop filtering.
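
As a rough illustration of the snoop filtering idea above, the following Python sketch models a directory-style snoop filter that records which caching agents may hold each cache line, so that a write only snoops the agents that actually share it. The class name `SnoopFilter`, its methods, and the agent names are hypothetical and chosen for illustration only; the source does not describe how any particular product implements this.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory-style snoop filter (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps a cache-line address to the set of agents that may hold it.
        self.sharers = defaultdict(set)

    def record_read(self, agent, line_addr):
        """Note that `agent` now holds a copy of the line."""
        self.sharers[line_addr].add(agent)

    def record_evict(self, agent, line_addr):
        """Note that `agent` dropped its copy; free the entry when empty."""
        self.sharers[line_addr].discard(agent)
        if not self.sharers[line_addr]:
            del self.sharers[line_addr]

    def agents_to_snoop(self, writer, line_addr):
        """Return only the agents that must be invalidated before a write."""
        return sorted(self.sharers.get(line_addr, set()) - {writer})

    def record_write(self, writer, line_addr):
        """Invalidate other sharers and leave the writer as sole owner."""
        snooped = self.agents_to_snoop(writer, line_addr)
        self.sharers[line_addr] = {writer}
        return snooped


sf = SnoopFilter()
sf.record_read("cpu0", 0x1000)
sf.record_read("gpu", 0x1000)
sf.record_read("dsp", 0x2000)
print(sf.record_write("cpu0", 0x1000))  # only the GPU is snooped, not the DSP
```

Every tracked line costs storage in this model, which is one way to see why reducing the number of memory bits needed for snoop filtering is a challenge in its own right.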

How do you integrate IP that is not cache coherent and achieve better performance? Can you provide a brief example or two?

You need to understand what the customer requirements are in terms of the mix of non-coherency and coherency requirements. Are the two domains separated, fully merged, or combined in a customized mix? Arteris, for instance, developed a component called a non-coherent bridge. Its purpose is to drive non-coherent accesses back into the coherent domain. It also makes it possible to differentiate between the non-coherent and coherent domains.
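
To make the bridge idea concrete, here is a minimal Python sketch of what driving a non-coherent access into the coherent domain could mean. It assumes a toy write-back cache model; `CoherentDomain`, `NonCoherentBridge`, and all method names are hypothetical, and the real component's internals are not described in this interview.

```python
class CoherentDomain:
    """Toy coherent domain: one memory plus per-agent write-back caches."""

    def __init__(self):
        self.memory = {}
        self.caches = {}  # agent name -> {address: dirty value}

    def cached_write(self, agent, addr, value):
        """A coherent agent writes into its own cache; memory is now stale."""
        self.caches.setdefault(agent, {})[addr] = value

    def coherent_read(self, addr):
        """Snoop every cache first, then fall back to memory."""
        for lines in self.caches.values():
            if addr in lines:
                return lines[addr]
        return self.memory.get(addr)

    def coherent_write(self, addr, value):
        """Invalidate cached copies, then update memory."""
        for lines in self.caches.values():
            lines.pop(addr, None)
        self.memory[addr] = value


class NonCoherentBridge:
    """Hypothetical bridge that turns plain accesses into coherent ones."""

    def __init__(self, domain):
        self.domain = domain

    def read(self, addr):
        return self.domain.coherent_read(addr)

    def write(self, addr, value):
        self.domain.coherent_write(addr, value)


domain = CoherentDomain()
domain.memory[0x40] = "old"
domain.cached_write("cpu0", 0x40, "new")

bridge = NonCoherentBridge(domain)
direct = domain.memory[0x40]   # what a device bypassing coherency would see
bridged = bridge.read(0x40)    # what the same device sees through the bridge
print(direct, bridged)
```

In this toy model the device that bypasses the coherent domain reads stale data, while the same access routed through the bridge returns the value still held in the CPU cache.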

How do you create a cache-coherent system that is easily placed on a chip?

A few years ago, coherency systems were small and compact, with a maximum of three to four different processors. Coherency was confined to CPU clusters, functionality was grouped under an application, and all subsystems were connected to an application.

But coherency wasn’t necessarily distributed beyond a subsystem. Customer needs are changing: there is a need for greater processor performance, and companies are adding more and different types of processors. In addition:

  • SoC layouts are expanding tremendously;
  • Processors are growing larger;
  • Complex layouts affect the coherency domain;
  • The coherent domain is expanding all over the chip.

So how do you handle it?

First, you must make sure the infrastructure is designed to distribute coherency system-wide. It has to be an interconnect technology that enables network packet transport and it also must accommodate a variety of topologies such as ring and mesh. The infrastructure must also be configurable and flexible because as design complexity continues to grow, designers must be able to understand which topologies are best suited for a particular chip layout. Having the proper tools that can predict where complexities might cause performance and power issues in the chip layout stage is critical to revising the layout and providing the best solution in terms of which topology might resolve these issues.
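
One simple way to compare candidate topologies is average hop count. The Python sketch below computes it for a bidirectional ring and a 2D mesh with the same number of nodes. The function names are hypothetical, and hop count is only a crude proxy: it ignores wire length, congestion, and floorplan constraints, all of which real layout-aware tools would need to consider.

```python
from itertools import product


def ring_hops(n, a, b):
    """Shortest hop count between nodes a and b on a bidirectional ring."""
    d = abs(a - b)
    return min(d, n - d)


def mesh_hops(a, b):
    """Hop count between (x, y) nodes on a 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def average(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def ring_average(n):
    """Average hops over all ordered pairs of distinct ring nodes."""
    return average(ring_hops(n, a, b)
                   for a in range(n) for b in range(n) if a != b)


def mesh_average(width, height):
    """Average hops over all ordered pairs of distinct mesh nodes."""
    nodes = list(product(range(width), range(height)))
    return average(mesh_hops(a, b) for a in nodes for b in nodes if a != b)


print(f"16-node ring: {ring_average(16):.2f} hops on average")
print(f"4x4 mesh:     {mesh_average(4, 4):.2f} hops on average")
```

For 16 nodes the mesh comes out with fewer hops on average than the ring, but a ring may still be the better fit for a long, narrow floorplan, which is why configurability matters.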

How can you optimize power consumption of complex systems?

You first need to provide power-ready IP. Once this is accomplished, you need to implement some well-known techniques, which may include voltage domains, power domains, clock gating and high-level clock gating.

If the IP is power-ready, it will also have connectivity to a power interface and can be controlled by an MPU in the system, which decides when to shut down the IP because it is not in use or not needed by the system. At the application level, this power-aware controller (MPU) can lower system power consumption by putting an IP into idle.
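
The control loop described above can be sketched in Python as follows. This is a minimal model under stated assumptions: the idle thresholds (`gate_after`, `off_after`), the three power states, and the `IPBlock` and `PowerController` names are all hypothetical, and wake-up latency is ignored.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "active"
    CLOCK_GATED = "clock_gated"
    POWERED_OFF = "powered_off"


class IPBlock:
    """Toy power-ready IP with a minimal power interface (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.state = PowerState.ACTIVE
        self.idle_cycles = 0

    def tick(self, busy):
        """Advance one cycle; activity wakes the block and resets the count."""
        if busy:
            self.state = PowerState.ACTIVE
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1


class PowerController:
    """Hypothetical controller that idles blocks after fixed thresholds."""

    def __init__(self, gate_after=4, off_after=16):
        self.gate_after = gate_after  # idle cycles before clock gating
        self.off_after = off_after    # idle cycles before power-down

    def update(self, block):
        """Pick the block's power state from its current idle time."""
        if block.idle_cycles >= self.off_after:
            block.state = PowerState.POWERED_OFF
        elif block.idle_cycles >= self.gate_after:
            block.state = PowerState.CLOCK_GATED
        return block.state


usb = IPBlock("usb")
ctrl = PowerController()
history = []
for busy in [True] * 2 + [False] * 20 + [True]:
    usb.tick(busy)
    history.append(ctrl.update(usb))
print([state.value for state in history])
```

The block is clock gated after a short idle period, powered off after a longer one, and returns to active as soon as traffic resumes. A real controller would also have to weigh the cost of waking the block up again.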

How long will it take to reasonably surmount some/all of the aforementioned issues?

Heterogeneous SoCs are still in development and haven’t yet matured. But processors in the coherent domain are now sharing data with each other. Other CPUs and GPUs have become cache coherent, although I’m confident we can do a lot more.

Data sharing is not only between processor and GPU, but between all of the IPs of the system; it’s a concept that is in progress. This IP must be pushed a little bit farther to achieve total coherency. Today there are still not many non-coherent IPs sharing data with coherent IPs. But we’re now starting to see applications emerging that need coherency, and this will bring new requirements.

Are these design challenges currently hindering product development in select verticals? If so, which ones?

Yes, one that comes to mind is ADAS (Advanced Driver-Assistance Systems) for automotive. Automotive applications will have a lot of requirements because of the need to add performance and share data with heterogeneous processors to achieve those requirements. We’ll see the introduction of new features to this market. Other markets will include artificial intelligence and machine learning.

A decade ago, mobile application processors were driving the need for cache coherency, and then data center systems started becoming the primary driver. Now the automotive market is driving the need to extend cache coherency to all of the heterogeneous processing elements in SoCs. In two or three years, a new trend will emerge to extend heterogeneous cache coherency even further, but designers will need flexibility, configurability and scalability to ensure that these systems are high in performance, low in latency and reasonable in terms of power consumption and cost.