HONG KONG, Aug. 02, 2022 (GLOBE NEWSWIRE) -- WIMI Hologram Academy, working in partnership with the Holographic Science Innovation Center, has written a new technical article describing their exploration of deep learning hardware technology based on in-memory computing structure. This article follows below:
The development of Extended Reality (XR, including AR and VR) technology has brought the long-standing goal of integrating the real world with the virtual world, and letting the two interact, within reach. AR/VR is a simulated environment created by computer software that provides an immersive experience which appears to be real. Technological advances, especially in the last five years, have unlocked much of the potential of VR/AR technology. Scientists from WIMI Hologram Academy of WIMI Hologram Cloud Inc. (NASDAQ: WIMI) discussed in detail deep learning hardware technology, from general-purpose processors to in-memory computing structures.
1. VR/AR's demand for artificial intelligence
Nowadays, VR/AR has reached many industries and become an important technology for industrial development, and 3D content (including 3D models, 3D animation, and 3D interaction) is one of the core elements of VR/AR. At present, however, 3D content in most fields still requires a great deal of manual production, and the skill threshold for production personnel is relatively high. Production capacity is therefore very low, which is a major bottleneck restricting the development of related industries. Artificial intelligence (AI) is expected to automate the production of 3D content to some extent, replacing some of the repetitive labor and improving production efficiency. The goals of both VR and AR include more natural interaction, which is also one of the problems AI is trying to address. AlphaGo and AlphaZero demonstrate the intelligence of AI in certain areas that overlap with VR and AR, and AI is expected to supplement the intelligence of VR and AR.
Deep learning (DL) is a core subset of artificial intelligence. In recent years, DL has been approaching human-level skill on tasks such as image classification, speech understanding, playing video games, and translating between languages. Because of the large amount of training data and the large number of parameters required, modern deep neural networks (DNNs) are costly to train, which limits the use of DNN-based intelligent solutions in many applications, such as VR/AR. The increasing computing power requirements of DL have driven the development of the underlying hardware technologies.
In the following, we elaborate on the reliance of deep learning on hardware, how deep learning works with different kinds of hardware support, and In-Memory Computing (IMC) for DL, and we point out the development direction of high-performance, low-power DL hardware.
2. Hardware dependence of deep learning
A DL model is like a huge self-organizing trial-and-error machine with millions (or even more) of adjustable parameters. After the machine is fed with big data and run through tens or hundreds of millions of training iterations, it can find the best parameters and weights for the DL model. Currently, GPU (graphics processing unit) cards are the best hardware solution for DL because of their excellent parallel matrix multiplication capabilities and their software support. However, their flexibility (for example, support for games) makes them less efficient for DL, which is where other DL accelerators, such as ASICs (Application Specific Integrated Circuits), come in handy to provide better efficiency and performance. But both GPUs and ASICs are built on the traditional von Neumann (vN) architecture. The time and energy spent transferring data between memory and processor (the so-called von Neumann bottleneck) have become problematic, especially for data-centric applications such as real-time image recognition, natural language processing, and XR. To achieve larger acceleration factors and lower power beyond the vN architecture, IMC based on non-volatile memory arrays, such as phase-change memory (PCM) and resistive random-access memory (RRAM), has been explored. Vector-matrix multiplication in IMC replaces the expensive, high-power matrix multiplication operations in CPUs/GPUs and avoids moving weights to and from memory. It therefore has great potential to improve DL performance and power consumption.
3. Hardware for deep learning
A general DL algorithm consists of a series of operations (with neural networks for speech, language, and visual processing). Although matrix multiplication dominates, optimizing performance and efficiency while maintaining accuracy requires a core architecture that efficiently supports all auxiliary functions. The central processing unit (CPU) is used to handle complex tasks such as time slicing, complex control flow and branching, and security. In contrast, GPUs do only one thing well: they handle billions of repetitive low-level operations, such as matrix multiplication. GPUs have thousands of arithmetic logic units (ALUs), whereas traditional CPUs typically have only four or eight. However, the GPU is still a general-purpose processor that must support millions of different applications and software. For each of its thousands of ALUs, the GPU needs access to registers or shared memory to read operands and store the results of intermediate calculations. As the GPU performs more parallel computations on its thousands of ALUs, it spends proportionally more energy accessing memory, and the complex wiring this requires increases the footprint of the GPU. To solve these problems, ASICs for DL are needed, and TPUs are an example.
TPUs are matrix processors dedicated to neural network workloads, capable of processing the large numbers of multiplications and additions in neural networks at extremely high speed while consuming less power and taking up less physical space. The key enabler is a dramatic reduction of the vN bottleneck (moving data to and from memory). Because the target workload, the DNN, is known in advance, the TPU places thousands of multipliers and adders and connects them directly to form a large physical matrix of these operators. To perform the operation, the TPU first loads the weights from memory into the matrix of multipliers and adders. Then the TPU loads the data (features) from memory. As each multiplication is executed, its result is added to the running sum and passed on to the next multiplier. The output is thus the sum of all multiplication results between data and parameters. During this entire process of massive computation and data passing, no memory access is required at all. The disadvantage of the TPU is the loss of flexibility; it supports only a few specific neural networks.
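The load-weights-once, pass-the-partial-sum flow described above can be sketched in a few lines of plain Python. This is only an illustrative software model under simple assumptions, not the actual TPU design: the class name MacArray, its methods, and the example numbers are all hypothetical.

```python
class MacArray:
    """Toy model of a weight-stationary grid of multipliers and adders.

    Hypothetical illustration only: weights[i][j] is the weight held by the
    multiplier at row i, column j. Real hardware does this in parallel.
    """

    def __init__(self, weights):
        # Step 1: weights are loaded from memory once and stay in the array.
        self.weights = [list(row) for row in weights]
        self.n_in = len(self.weights)
        self.n_out = len(self.weights[0])

    def run(self, features):
        # Step 2: the input data (features) is loaded and streamed through.
        if len(features) != self.n_in:
            raise ValueError("feature length must match number of array rows")
        outputs = []
        for j in range(self.n_out):
            partial_sum = 0.0
            for i in range(self.n_in):
                # Each multiplier adds its product to the running sum and
                # hands the sum to the next multiplier in the column.
                partial_sum += features[i] * self.weights[i][j]
            outputs.append(partial_sum)
        return outputs


array = MacArray([[1.0, 2.0],
                  [3.0, 4.0]])
print(array.run([1.0, 1.0]))   # first input vector
print(array.run([0.5, -1.0]))  # second input reuses the stored weights
```

The point of the sketch is the reuse: the weights are stored once in the constructor, and every later call to run only streams new features through them, with intermediate sums passed from one multiply-add step to the next rather than written back to memory.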
4. In-memory computation for deep learning
DNN inference and training algorithms mainly involve forward and backward vector-matrix multiplication. This operation can be performed by in-memory computation on a 2D crossbar memory array, a structure that was proposed more than 50 years ago. The weights of a DNN are stored as conductances (G) in 1T (transistor)-1R (resistor) or 1T memory cells. By simultaneously applying input voltages V to the rows and reading output currents I from the columns, the analog weighted summation is achieved through Ohm's law and Kirchhoff's current law. In an ideal crossbar memory array, the input-output relationship can be expressed as I = V·G, that is, the current on column j is I_j = Σ_i V_i·G_ij. Vector-matrix multiplication is achieved by mapping the input vector to the input voltage V, the matrix to the conductance G, and the output to the current I. IMC vector-matrix multiplication replaces the expensive, high-power matrix multiplication operation performed by digital circuits in GPUs/TPUs and avoids moving weights from memory, thus greatly improving the performance and power consumption of DNNs. Demonstrations of accelerated DNN training using the back-propagation algorithm report acceleration factors from 27x to 2140x (relative to CPU) with significant reductions in power consumption and area.
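The ideal crossbar relationship can be checked numerically with a small sketch. The code below is a minimal model of an ideal array only, with no wire resistance, noise, or device non-linearity; the function names and the example voltages and conductances are assumptions chosen for illustration.

```python
def crossbar_forward(row_voltages, conductances):
    """Forward pass: apply voltages to rows, read currents from columns.

    Ideal model: each cell contributes V_i * G_ij (Ohm's law), and each
    column wire sums its cell currents (Kirchhoff's current law).
    conductances[i][j] is the cell at row i, column j, in siemens.
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for i, v in enumerate(row_voltages):
        for j in range(n_cols):
            currents[j] += v * conductances[i][j]
    return currents


def crossbar_backward(col_voltages, conductances):
    """Backward pass: apply voltages to columns, read currents from rows.

    The same stored conductances are used in the transposed direction,
    which is the product needed when errors are propagated backward.
    """
    return [
        sum(v * g for v, g in zip(col_voltages, row))
        for row in conductances
    ]


# Hypothetical 2x2 array with conductances in the microsiemens range.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
V = [0.2, 0.1]  # volts

print(crossbar_forward(V, G))   # column currents in amperes
print(crossbar_backward(V, G))  # row currents in amperes
```

Both directions read the same stored matrix G, so the weights never leave the array; only the small input and output vectors are moved.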
In addition, in DNN inference the PCM devices act as the synapses (weights), and the neurons of each layer drive the next layer through the weights w_ij and a nonlinear function f(). The input neurons are driven by pixels from successive MNIST images (MNIST is a classic test data set in the field of machine learning), and the 10 output neurons identify which digit appears. One limitation of IMC DNN acceleration is the imperfection of memory devices. Device features that are usually considered favorable for storage applications, such as high on/off ratios and digital per-bit storage, and features that are irrelevant to storage, such as asymmetric Set and Reset operations, become limitations for accelerated DNN training. Ideal IMC DNN memory cells, coupled with system and CMOS circuit designs that place specific requirements on these ideal resistive devices, can achieve acceleration factors of over 30,000x (relative to the CPU). There are significant benefits to developing or researching IMC for DNNs, but there are currently no products on the market. The challenges that stand in the way include: 1) Defects in memory cells (cycling endurance, small dynamic range, resistance drift, asymmetric programming). 2) Inter-layer data transfer (A/D and D/A conversion, connection of digital functions). 3) Flexible software and framework support (software-reconfigurable IMC DNN).
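The layer-to-layer propagation described at the start of this section, where each neuron applies f() to the weighted sum of the previous layer's outputs, can be sketched as follows. The choice of tanh for f(), the function names, and the tiny 2-input, 3-output network are assumptions made for illustration; the source does not specify them, and a real MNIST network would have 784 inputs and 10 outputs.

```python
import math


def layer_forward(inputs, weights, f=math.tanh):
    """Compute one layer: output_j = f(sum_i inputs[i] * weights[i][j]).

    tanh is an assumed stand-in for the nonlinear function f().
    """
    n_out = len(weights[0])
    return [
        f(sum(inputs[i] * weights[i][j] for i in range(len(inputs))))
        for j in range(n_out)
    ]


def predict(pixels, layers):
    """Propagate pixel values through all layers; return the index of the
    most strongly activated output neuron."""
    activations = pixels
    for weights in layers:
        activations = layer_forward(activations, weights)
    return max(range(len(activations)), key=activations.__getitem__)


# Hypothetical toy network: 2 input "pixels", 3 output neurons.
toy_layers = [
    [[0.0, 2.0, -1.0],
     [1.0, 0.0, 0.0]],
]
print(predict([1.0, 0.0], toy_layers))
```

In an IMC implementation, the weighted sum inside layer_forward is the part carried out by the memory array, while f() and the transfer between layers are handled by peripheral circuits, which is where the A/D and D/A conversion challenge listed above arises.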
5. Summary