“Thanks to the rise of big data and the rapid increase in computing power, machine learning technology has achieved revolutionary development in recent years. In machine learning tasks such as image classification, speech recognition, and natural language processing, the data is Euclidean data with a certain size and dimension and an orderly arrangement. However, in more and more realistic scenarios, data is represented by complex non-Euclidean data such as graphs.

“

**1 Overview**

Thanks to the rise of big data and the rapid increase in computing power, machine learning technology has achieved revolutionary development in recent years. In machine learning tasks such as image classification, speech recognition, and natural language processing, the data is Euclidean data with a certain size and dimension and an orderly arrangement. However, in more and more realistic scenarios, data is represented by complex non-Euclidean data such as graphs. Graph not only contains data, but also the dependencies between data, such as social networks, protein molecular structure, e-commerce platform customer data, and so on. The increase in data complexity has brought severe challenges to traditional machine learning algorithm design and its implementation technology. In this context, many new Graph-based machine learning algorithms-GNN (Graph Neural Network), are constantly emerging in academia and industry.

GNN has very high requirements for computing power and memory, and the software implementation of its algorithm is very inefficient. Therefore, the industry has a very urgent need for GNN hardware acceleration. We know that the traditional CNN (Convolutional Neural Network) hardware acceleration scheme has many solutions; however, the hardware acceleration of GNN has not been fully discussed and researched. At the time of writing this article, Google and Baidu could not search for it. Chinese research on GNN hardware acceleration. The motivation for writing this article is to combine the latest foreign GNN algorithms, acceleration technology research, and the author’s discussion of GNN’s FPGA acceleration technology, and present it to readers in the form of a panorama.

**2. Introduction to GNN**

The architecture of GNN has many similarities to traditional CNN at the macro level, such as convolutional layer, Polling, activation function, machine learning processor (MLP) and FC layer, etc., will be applied in GNN. The figure below shows a relatively simple GNN architecture.

Figure 1: Typical GNN architecture

However, the Graph data convolution calculation in GNN is different from the 2D convolution calculation in traditional CNN. Taking Figure 2 as an example, the process of convolution calculation for the red target node is as follows:

Graph convolution: Use neighbor function to sample the features of surrounding nodes and calculate the mean value. The number of neighbor nodes is uncertain and disordered (non-Euclidean data).

2D convolution: use the convolution kernel to sample the features of the surrounding nodes and calculate the weighted average. The number of neighbor nodes is determined and ordered (Euclidean data).

Figure 2: Graph convolution and 2D convolution

**3. Introduction to GraphSAGE algorithm**

Academia has conducted a lot of research and discussion on the GNN algorithm, and has proposed a considerable number of innovative implementation methods. Among them, GraphSAGE proposed by Stanford University in 2017 is an inductive representation learning algorithm for predicting the dynamic addition of unknown node types in large graphs. It is especially optimized for graphs with a huge number of nodes and rich node features. As shown in the figure below, the GraphSAGE calculation process can be divided into three main steps:

Figure 3: Visual representation of the GraphSAGE algorithm

Adjacent node sampling: used to reduce complexity, generally 2 layers are sampled, and several nodes are sampled in each layer

Aggregation: used to generate the embedding of the target node, that is, the low-dimensional vector representation of the graph

Prediction: Use embedding as the input of the fully connected layer to predict the label of the target node d

In order to realize GraphSAGE algorithm acceleration in FPGA, we need to know its mathematical model in order to map the algorithm to different logic modules. The code shown in the figure below illustrates the mathematical process of this algorithm.

Figure 4: Mathematical model of GraphSAGE algorithm (Source: http://snap.stanford.edu/graphsage)

For each target node x to be processed_{v}_{,}GraphSAGE does the following:

1) Sampling the nodes in the subgraph through the neighbor sampling function N(v)

2) Aggregate the features of the neighboring nodes to be sampled, the aggregation function can be mean(), lstm() or polling(), etc.

3) Combine the aggregation result with the output representation of the previous iteration, and use W^{k}Do convolution

4) Non-linear processing of convolution results

5) Iterate several times to end the processing of all neighboring nodes of the current k-th layer

6) Normalize the iterative result of the k-th layer

7) Iterate several times to end the processing of all K-layer sampling depths

8) Final iteration result z_{v}Is the input node x_{v}Embedding

**4. GNN accelerator design challenge**

The GNN algorithm involves a large number of matrix calculations and memory access operations. It is very inefficient to run this algorithm on a traditional x86 architecture server, which is manifested in slow speed and high energy consumption.

The application of new GPUs can bring significant benefits to the computing speed and energy efficiency ratio of GNN. However, the shortcomings of GPU memory scalability make it incapable of processing massive nodes of Graph; GPU’s instruction execution method also causes the calculation delay to be too large and uncertain, and it is not suitable for scenarios that require real-time calculation of Graph.

The existence of various design challenges as mentioned above makes the industry urgently need a GNN acceleration solution that can support highly concurrent real-time computing, huge memory capacity and bandwidth, and expandable in the data center.

**5. FPGA design scheme of GNN accelerator**

The Speedster7t series of high-performance FPGAs launched by Achronix are optimized for data center and machine learning workloads, eliminating several performance bottlenecks in CPUs, GPUs, and traditional FPGAs. Speedster7t FPGA is based on TSMC’s 7nm FinFET process, and its architecture uses a revolutionary new 2D network-on-chip (NoC), an original machine learning processor matrix (MLP), and utilizes a high-bandwidth GDDR6 controller, 400G Ethernet and PCI Express Gen5 interfaces , While guaranteeing ASIC-level performance, it also provides users with flexible hardware programmability. The following figure shows the architecture of the Speedster7t1500 high-performance FPGA.

Figure 5: Achronix Speedster7t1500 high-performance FPGA architecture

The above-mentioned various characteristics make the Achronix Speedster7t1500 FPGA device provide a perfect solution for the various challenges faced in the GNN accelerator design.

Table 1: GNN design challenges and Achronix’s Speedster7t1500 FPGA solution

**5.1 GNN accelerator top-level architecture**

This GNN accelerator is designed for GraphSAGE, but its architecture has a certain versatility and can be applied to other similar GNN algorithm acceleration. Its top-level architecture is shown in the figure below.

Figure 6: Top-level architecture of GNN accelerator

The GNN Core in the figure is the core part of the algorithm implementation, and its design details will be discussed below; RoCE-Lite is a lightweight version of the RDMA protocol, used for remote memory access through high-speed Ethernet to support Graph computing for massive nodes , Its design details will be discussed in a follow-up article of this official account; 400GE Ethernet controller is used to carry the RoCE-Lite protocol; GDDR6 is used to store the high-speed access data required in the GNN processing process; DDR4 is used as a spare high-capacity memory, which can Used to store relatively low frequency data, such as Graph to be pre-processed; PCIe Gen5x16 provides a high-speed host interface for data exchange with server software; all of the above-mentioned modules are interconnected at a high speed through the NoC on-chip network.

**5.2 GNN Core micro-architecture**

Before starting to discuss the GNN Core microarchitecture, let’s review the GraphSAGE algorithm in Section 3 of this article. The aggregation and merging (including convolution) of the inner loop account for most of the calculation and memory access of the algorithm. . Through research, we get the characteristics of these two steps as follows:

Table 2: Comparison of aggregation and merging operations in GNN algorithm

It can be seen that the aggregation operation and the merge operation have completely different requirements for calculation and memory access. The aggregation operation involves sampling of neighboring nodes, but Graph is a non-Euclidean data type, its size and dimension are uncertain and disordered, the matrix is sparse, and the node location is random, so the memory access is irregular and it is difficult to reuse data; in the merge operation Among them, the input data is the aggregation result (low-dimensional representation of the node) and the weight matrix. Its size and dimension are fixed, and the storage location rules are linear. There is no challenge to memory access, but the calculation amount of the matrix is very large.

Based on the above analysis, we decided to use two different hardware structures to handle aggregation operations and merging operations in the GNN Core accelerator design. The functional block diagram is shown in the following figure:

Figure 7: GNN Core functional block diagram

Aggregator: Through the SIMD (Single Instruction Multiple Data Processor) array, graph neighbor nodes are sampled and aggregated. The “single instruction” can be pre-defined as mean() mean value calculation, or other applicable aggregation functions; “multiple data” means that a single mean() mean value calculation requires the feature data of multiple neighbor nodes as input, and these The data comes from the Subgraph Sampler; the SIMD array is load balanced through the Agg Scheduler; the subgraph sampler reads back the adjacency matrix and node characteristic data from GDDR6 or DDR4 through NoC.^{0}_{v}, Respectively cached in Adjacent List Buffer and Node Feature Buffer; the result of aggregation h^{k}_{N(v)}Stored in Agg Buffer.

Combinator: Perform the convolution operation of the aggregation result through the pulsation matrix PE; the convolution kernel is W^{k}Weight matrix; the convolution result is non-linearly processed by the ReLU activation function, and it is also stored in the Partial Sum Buffer to facilitate the next iteration.

After the merged result is normalized by L2BN, it is the final node characterization h^{k}_{v}.

In a typical node classification prediction application, the node represents h^{k}_{v}A fully connected layer (FC) can be used to obtain the classification label of the node. This process is one of the traditional machine learning processing methods, which is not reflected in the GraphSAGE paper, and this feature is not included in this design.

**6 Conclusion**

This article discusses in depth the mathematical principles of the GraphSAGE GNN algorithm, and analyzes the technical challenges in the GNN accelerator design from multiple dimensions. By decomposing the problems and solving them one by one at the architecture level, the author comprehensively uses the competitive advantages provided by Achronix Speedster7t1500 FPGA to create a highly scalable GNN acceleration solution with excellent performance.

The Links: **FP15R12KE3** **LMCBS078C**