AI is a dynamic field with business demand for a variety of hardware, including CPUs, GPUs, and dedicated accelerators. Intel is investing in all three areas. This blog focuses on CPU AI. Businesses are choosing CPUs for AI because CPUs run enterprise applications as well as AI workloads. AI is really an end-to-end workload that starts with analytics, with training and inference just a portion of the workflow. Often, CPUs can deliver an optimal combination of performance and total cost of ownership (TCO) for mixed workloads that include analytics and AI.
Intel submitted MLPerf results on new 3rd Gen Intel® Xeon® Scalable processors, formerly codenamed “Ice Lake.” These processors are the only x86 data center CPUs with built-in AI acceleration, support for end-to-end data science tools, and a broad ecosystem of innovative AI solutions. With this processor family, Intel is making it simpler and more efficient to perform the entire analytics workflow, including the training and inference portion.
We continue to submit a wide range of MLPerf results across data types, frameworks, and usage models, including image processing, natural language processing (NLP), and recommendation systems. We do this because we know how important it is for customers to understand performance expectations from their Intel Xeon Scalable processors, to decide whether these CPUs deliver the performance and TCO for their unique needs. Not only do the new 3rd Gen Xeon processors deliver more compute and memory capacity/bandwidth than the previous generation, they also provide a big jump in per-socket performance: for example, up to 46 percent more on ResNet50-v1.5 compared to the previous generation in MLPerf Inference v0.7. In addition, we continue to optimize deep learning software. We have seen up to a 2.7X (v0.7: [3] v1.0: [4]) performance improvement compared to the last round of MLPerf submissions on benchmarks such as the Deep Learning Recommendation Model (DLRM).
This blog highlights the software engineering behind the scenes for addressing various optimization opportunities. We focus on non-vision use cases, such as NLP and recommendation engines. The optimization techniques we describe benefit the model classes in general, beyond the specific models listed. You can find the implementation details in our latest MLPerf inference v1.0 submissions.
Software Optimization Highlights in MLPerf v1.0 Submissions
We use Intel® Deep Learning Boost (Intel® DL Boost) technology, including Vector Neural Network Instructions (VNNI), in our INT8-based submissions on ResNet50-v1.5, SSD-ResNet34, 3D UNET, and BERT. We use bfloat16 (Brain Floating Point) in our Recurrent Neural Network Transducer (RNN-T) and DLRM submissions. The Intel® Low Precision Optimization Tool (Intel® LPOT) now supports low-precision inference on Intel® processors.
Apart from low-precision inference, we used four types of optimization techniques for inference in the non-vision area: 1. Reducing compute by introducing sparsity; 2. Reducing memory accesses with op fusion; 3. Optimizing the network graph by reducing primitive creations; and 4. Improving hardware utilization by load balancing across input sizes and introducing more parallelism. This is not an exhaustive list, since software optimization for inference is a broad area with much exciting research and innovation in industry and academia.
1. Sparsity demonstrated on DLRM in Open Division
Sparsity is a promising technique to reduce the computation and memory footprints in deep learning optimization. In our open submission [5], we explored structured sparsity optimization by training for sparsity. We demonstrated the success of FP32 sparse optimization on DLRM.
We proposed a 16x1 sparsity pattern to take advantage of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architecture. For this blog, we call this pattern tile-based sparsity. Tile-based sparsity consists of blocks of non-zeros in a regular pattern. Figure 1 shows an example in which the tiles of 16 consecutive non-zero elements are in red boxes, while the rest of the tiles are zeros.
Figure 1. Tile-based sparsity (16x1 sparsity pattern) after training with sparsity in FP32
With training for sparsity and magnitude-based pruning, we generated a sparse model for DLRM with a geometric mean (geomean) sparsity ratio of 80 percent. The model has up to a 99 percent sparsity ratio for the General Matrix Multiplications (GEMMs). We developed the sparse GEMM kernel and verified both accuracy and performance using the MLPerf test software. With sparse GEMMs, we achieved a 1.4X inference speedup [6] on 3rd Gen Intel Xeon Scalable processors compared to the same software with the original dense GEMMs, while keeping the accuracy loss within 1 percent, as shown in Table 1.
Table 1. Comparing Sparse and Dense DLRM Models 
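To make the tile pattern concrete, here is a minimal sketch of magnitude-based pruning at 16-element tile granularity. It is an illustration only, not the code used in the submission: the function names, the choice to lay tiles along rows, and the one-shot pruning step (the submission prunes during training) are all assumptions.

```python
import random

TILE = 16  # one AVX-512 register holds 16 FP32 values (512 / 32)


def prune_tiles(weights, sparsity):
    """Zero the lowest-magnitude 16-element tiles of a 2-D weight matrix.

    weights:  list of rows; each row length must be a multiple of TILE
    sparsity: fraction of tiles to zero, between 0 and 1
    Tiles are laid along rows here, which is an assumption of this sketch.
    """
    # Score every tile by the sum of the absolute values of its elements.
    scores = []
    for r, row in enumerate(weights):
        for c in range(0, len(row), TILE):
            scores.append((sum(abs(v) for v in row[c:c + TILE]), r, c))

    # The tiles with the smallest scores are pruned first.
    scores.sort()
    n_prune = int(len(scores) * sparsity)

    pruned = [list(row) for row in weights]
    for _, r, c in scores[:n_prune]:
        pruned[r][c:c + TILE] = [0.0] * TILE
    return pruned


def sparsity_ratio(weights):
    """Fraction of elements that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for v in row if v == 0.0)
    return zeros / total


random.seed(0)
dense = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]
sparse = prune_tiles(dense, sparsity=0.75)
print(f"sparsity ratio: {sparsity_ratio(sparse):.2f}")
```

Because whole tiles are either kept or zeroed, a sparse GEMM kernel can skip a zero tile with a single check and process a kept tile with full-width vector instructions.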
2. Op-fusion demonstrated on BERT
The BERT model is widely used for NLP pretraining. The model itself has many repeated FullyConnected + Activation sub-graphs, so optimizing these repeated sub-graphs gives a significant boost to overall inference performance. Figure 2 shows the BERT pattern of FullyConnected + GeLU activation function that we fuse together.
Figure 2. Operator Fusion on BERT Sub-graph Before (left) and After (right) Fusion. (R = Read, W = Write)
Comparing the sub-graph before and after fusion in Figure 2, fusion reduces the number of memory accesses for activations in the dashed red box from 4 to 2. The output tensor from FullyConnected goes directly into Activation without a round trip to memory, and only the activation output is stored in memory. The fusion also reduces the number of quantization steps in Figure 2 from 2 to 1, which removes two more memory reads and writes. In total, the left side has ten reads and writes across five blocks, while the right side has six reads and writes across three blocks. In this example, operator fusion reduced memory reads and writes in the sub-graph by 40 percent [7].
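The idea can be sketched in a few lines of plain Python. This is a conceptual illustration under stated assumptions, not the oneDNN or MXNet implementation: the function names are hypothetical, quantization is left out, and Python lists stand in for tensors in memory.

```python
import math


def gelu(x):
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def fully_connected(x, weight, bias):
    # One output per weight row: dot(row, x) + bias.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def unfused(x, weight, bias):
    hidden = fully_connected(x, weight, bias)  # intermediate tensor is written out
    return [gelu(h) for h in hidden]           # and read back for the activation


def fused(x, weight, bias):
    # The activation is applied to each FullyConnected output as soon as it is
    # produced, so the intermediate tensor is never materialized.
    return [gelu(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


x = [0.5, -1.0, 2.0]
weight = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]
bias = [0.0, 0.1]
assert unfused(x, weight, bias) == fused(x, weight, bias)

# Read/write counts quoted in the text for the Figure 2 sub-graph.
before, after = 10, 6
saved = (before - after) / before
print(f"memory accesses saved: {saved:.0%}")  # memory accesses saved: 40%
```

Both versions compute the same values; the only difference is whether the FullyConnected result makes a round trip through memory before the activation runs.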
3. Reducing primitive creation overhead for BERT and RNN-T
Non-vision models usually have variable input sizes. For example, text sentences have variable numbers of tokens or words, and audio speech inputs have different durations. Each query to a recommendation model can comprise different numbers of user-item pairs. The details are in Table 2.
Table 2. Variable Data Input Sizes in Each Inference Query
Deep learning frameworks create different kernels for different input tensor shapes, to accommodate this variability in input sizes and optimize the inference speed. Because input tensors with the same shapes recur in a typical data center deployment, we keep the kernels persistent. We associate different input shapes with different CPU compute instances [8], so we can maximize kernel re-use within each instance and reduce the overhead of unnecessary kernel creation.
In the BERT fusion example in Figure 2, we also eliminate some of the library and framework overhead between operators; for example, the primitive creation overhead in oneDNN.
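A minimal sketch of the kernel persistence idea follows. The class and function names are hypothetical, the "kernel" is a trivial stand-in for an expensive framework primitive, and the round-robin shape-to-instance assignment is an assumption made for illustration.

```python
class KernelCache:
    """Keeps one kernel per input shape so repeated shapes skip creation."""

    def __init__(self, create_kernel):
        self._create_kernel = create_kernel
        self._kernels = {}
        self.creations = 0  # how many times a kernel had to be built

    def get(self, shape):
        kernel = self._kernels.get(shape)
        if kernel is None:
            kernel = self._create_kernel(shape)
            self._kernels[shape] = kernel
            self.creations += 1
        return kernel


def make_kernel(shape):
    # Hypothetical stand-in for an expensive primitive creation step.
    batch, seq_len = shape
    return lambda: batch * seq_len


def assign_instance(shape, table, num_instances):
    # The first time a shape is seen, give it the next instance round-robin;
    # afterwards the same shape always goes to the same instance.
    if shape not in table:
        table[shape] = len(table) % num_instances
    return table[shape]


NUM_INSTANCES = 2
caches = [KernelCache(make_kernel) for _ in range(NUM_INSTANCES)]
routing = {}
requests = [(1, 128), (1, 384), (1, 128), (1, 128), (1, 384)]
for shape in requests:
    instance = assign_instance(shape, routing, NUM_INSTANCES)
    caches[instance].get(shape)()  # fetch (or build) the kernel, then run it

total_creations = sum(cache.creations for cache in caches)
print(total_creations)  # 2 kernel creations for 5 requests
```

Routing each shape to a fixed instance keeps every cache small and means a given kernel is built at most once across the whole system.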
4a. Load balance across compute instances
On Intel Xeon Scalable processors, it is a common practice to create multiple instances of model inference on the same processor. Each instance binds to a subset of cores on that processor, which helps increase computation utilization and whole-system throughput. This binding also minimizes interference between compute instances, or between inference compute instances and non-machine learning workloads co-locating on the same system.
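The core binding can be sketched as follows. The helper names and the 40-core, 4-cores-per-instance split are illustrative assumptions, not the configuration used in the submission. `os.sched_setaffinity` is a real Python call but is only available on some platforms such as Linux, so the sketch defines the binding helper without calling it.

```python
import os


def partition_cores(num_cores, cores_per_instance):
    """Split core IDs 0..num_cores-1 into disjoint, contiguous groups."""
    if num_cores % cores_per_instance != 0:
        raise ValueError("cores_per_instance must divide num_cores")
    return [list(range(start, start + cores_per_instance))
            for start in range(0, num_cores, cores_per_instance)]


def bind_current_process(cores):
    # Pin the calling process (pid 0 means "this process") to the given cores.
    # Linux-only; raises AttributeError where sched_setaffinity is missing.
    os.sched_setaffinity(0, cores)


# Illustrative numbers: one 40-core socket split into 10 inference instances.
groups = partition_cores(num_cores=40, cores_per_instance=4)
print(len(groups), groups[0], groups[-1])
# 10 [0, 1, 2, 3] [36, 37, 38, 39]
```

Each inference process would call `bind_current_process` with its own group at startup, so no two instances compete for the same cores.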
4b. Load balance across input sizes
Different input sizes create inference work of different sizes. We used a variety of load balancing techniques to ensure the optimal hardware utilization.
- Dynamic batching. We optimized the batch size for BERT, which has variable input length, by performing a one-time profiling of input shapes and a one-time calibration step. With this profiling, we can choose the optimal batch size for a specific input shape.
- Constant batching. In DLRM, we implemented a bucketing approach to ensure the total number of user-item pairs is constant across batches. This bucketing approach also helps ensure the inference speed per batch is as consistent as possible and the activation shape is in the optimal range.
Specifically, in the DLRM offline scenario, we first group the samples into different buckets, with each bucket holding samples of a different input shape. Then we pick samples from different buckets to form a batch with total user-item pairs of 420,000, which is a common multiple of the input sizes listed in Table 2. In the DLRM server scenario, we accumulate the samples in a batch until the total number of user-item pairs reaches X – 600, where X is the target batch size to meet the latency target.
- Combining in-process inference threads with multiple inference processes. DLRM models have large numbers of weights (on the order of 100G). If we launch multiple inference processes on the same processor, the number of processes is limited by the memory capacity, because each process maintains its own copy of the model weights. To avoid this limitation, we launch multiple inference threads within the same process, so the threads share a single copy of the model weights. In our experiments on the same hardware platform, combining multi-threading with multi-processing doubled the number of concurrent inference instances compared to multi-processing alone.
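The server-scenario accumulation described in the list above can be sketched as follows. The function name, the sample sizes, and the threshold value are hypothetical; in the submission the threshold is derived from the target batch size X as described in the text.

```python
def accumulate_batches(sample_sizes, threshold):
    """Group incoming samples into batches by total user-item pairs.

    sample_sizes: number of user-item pairs in each incoming sample, in
                  arrival order
    threshold:    a batch is closed as soon as its total reaches this value
    """
    batches, current, total = [], [], 0
    for pairs in sample_sizes:
        current.append(pairs)
        total += pairs
        if total >= threshold:
            batches.append(current)
            current, total = [], 0
    if current:
        batches.append(current)  # leftover samples form a final, smaller batch
    return batches


# Hypothetical per-sample sizes and threshold, for illustration only.
incoming = [100, 300, 700, 200, 500, 400, 600, 100]
batches = accumulate_batches(incoming, threshold=1000)
totals = [sum(batch) for batch in batches]
print(batches)  # [[100, 300, 700], [200, 500, 400], [600, 100]]
print(totals)   # [1100, 1100, 700]
```

Closing batches on the total number of user-item pairs, rather than on the number of samples, is what keeps the work per batch roughly constant even though individual samples differ in size.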
4c. Batch RNN-T greedy decoder
The reference RNN-T implementation has a sequential decoder. The encoded audio sequence is decoded token-by-token through the greedy decoder. To improve the compute efficiency, we introduce a batch greedy decoder and vectorize most of the decoder steps with PyTorch tensor operations.
Unlike the sequential version, the batch greedy decoder takes a vector of features, called a ‘feature front’ (shown in Figure 4), from the encoder output and uses it as the input to the joint network. In the example in Figure 4, there are several possible scenarios at each time step: we can keep decoding the same feature front [f00, f10]; we can advance the second element when the second row produces the _blank_id token, giving [f00, f11]; or we can advance the first element when the first row produces the _blank_id token, giving [f01, f11].
Figure 3. Sequential Greedy Decoder with 'ft0' and 'ft1' Being Decoded in Sequence
Figure 4. Batch Greedy Decoder with two data inputs (turquoise and yellow) decoded together. The orange, green, and blue ovals are the three feature fronts in this example.
In our experiments, the batch greedy decoder improves RNN-T offline throughput by 3.3X compared to the sequential greedy decoder. Converting to BF16 precision brings another 2.1X improvement. The total improvement over the starting model with the sequential decoder is 6.9X [9], as shown in Figure 5.
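The control flow of the two decoders can be sketched in plain Python. This is a toy under stated assumptions: the `joint` function is a made-up stand-in for the real prediction and joint networks, `BLANK` is a hypothetical blank token id, and Python lists replace the PyTorch tensors used in the submission. The point is only to show how each input advances its own time index when it produces a blank.

```python
BLANK = 0  # hypothetical blank token id


def joint(feature, last_token):
    # Toy stand-in for the joint network: emit the feature as a token once,
    # then return blank so the decoder advances to the next time step.
    return BLANK if feature == last_token else feature


def sequential_decode(features):
    """Decode one encoded sequence token by token."""
    tokens, last, t = [], BLANK, 0
    while t < len(features):
        tok = joint(features[t], last)
        if tok == BLANK:
            t += 1               # blank: move to the next encoder frame
        else:
            tokens.append(tok)   # non-blank: emit and stay on this frame
            last = tok
    return tokens


def batch_decode(batch):
    """Decode several sequences together, one feature front per step."""
    n = len(batch)
    t = [0] * n                  # per-input time index
    last = [BLANK] * n           # per-input last emitted token
    out = [[] for _ in range(n)]
    active = [i for i in range(n) if t[i] < len(batch[i])]
    while active:
        # The feature front: the current frame of every active input.
        front = [batch[i][t[i]] for i in active]
        # With tensors, this would be a single batched joint-network call.
        toks = [joint(f, last[i]) for f, i in zip(front, active)]
        for i, tok in zip(active, toks):
            if tok == BLANK:
                t[i] += 1
            else:
                out[i].append(tok)
                last[i] = tok
        active = [i for i in active if t[i] < len(batch[i])]
    return out


inputs = [[3, 3, 5], [7, 2]]
decoded = batch_decode(inputs)
print(decoded)  # [[3, 5], [7, 2]]
assert decoded == [sequential_decode(seq) for seq in inputs]
```

The batch version produces the same tokens as the sequential one; the gain comes from evaluating the joint network once per feature front instead of once per input.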
Figure 5. RNN-T Offline Throughput Improvement Breakdown 
Intel Submission Results for MLPerf v1.0 Inference
Intel submitted data for all data center benchmarks and demonstrated leading CPU performance across the entire data center benchmark suite. See the complete results of Intel's submissions on the MLPerf results page (https://mlcommons.org/en/inference-datacenter-10/).
We continue to deliver more compute and memory bandwidth with our new 3rd Gen Intel Xeon Scalable processors. We saw up to 46 percent (v0.7: [1] v1.0: [2]) more performance per socket compared to our MLPerf v0.7 submission with 2nd Gen Intel Xeon Scalable processors (code name: Cascade Lake).
Figure 6. ResNet50-v1.5 Performance Improvement from MLPerf v0.7 to v1.0.
With software optimizations, DLRM benchmarks in MLPerf v1.0 improved by up to 2.7X compared to our v0.7 submission on the same CPU platform (v0.7: [3] v1.0: [4]). (See Figure 7.)
Figure 7. DLRM Performance Improvement from MLPerf v0.7 to v1.0.
Many techniques, including BF16 precision inference and quantization of attention operators in BERT, have been upstreamed to frameworks such as PyTorch and MXNet. This work lets users enjoy the performance boost today without extra code changes. Refer to the code in the MLCommons™ Inference v1.0 Results GitHub repository to see all the software optimizations implemented.
We have more exciting AI-focused technologies in the pipeline. Future Intel Xeon Scalable processors, codenamed Sapphire Rapids, will include Intel® Advanced Matrix Extensions (Intel® AMX). We’re also developing a general-purpose GPU optimized for HPC/AI acceleration based on the Intel® Xe architecture, codenamed Ponte Vecchio. We continue to develop the software and hardware to optimize AI performance on Intel® products, empowering enterprises to deliver on the promise of AI.
Notices & Disclaimers
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
1. MLPerf v0.7 Inference Datacenter Closed ResNet, entry 0.7-101. https://mlcommons.org/en/inference-datacenter-07/
2. MLPerf v1.0 Inference Datacenter Closed ResNet, entry 1.0-53. https://mlcommons.org/en/inference-datacenter-10/
3. MLPerf v0.7 Inference Datacenter Closed DLRM-99.9, entry 0.7-126. https://mlcommons.org/en/inference-datacenter-07/
4. MLPerf v1.0 Inference Datacenter Closed DLRM-99.9, entry 1.0-20. https://mlcommons.org/en/inference-datacenter-10/
5. MLPerf v1.0 Inference Open DLRM-99, entry Inf-1.0-67. https://mlcommons.org/en/inference-datacenter-10/
6. Baseline FP32 DLRM dense model source: https://github.com/mlcommons/inference/tree/master/recommendation/dlrm/pytorch. Optimized FP32 DLRM sparse model reproduction steps can be found at https://github.com/mlcommons/submissions_inference_1_0/tree/master/open/Intel/code/dlrm-99 (link active on 4/21). Both the dense model and the sparse DLRM model are tested on the same hardware with the server scenario: 1-node, 2x Intel Xeon Platinum 8380 processor on Coyote Pass with 1 TB (16 slots/ 64GB/3200) total DDR4 memory, ucode 0x8d05a260, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-66-generic. Please see the additional hardware and framework detail in v1.0 Inference Open DLRM-99, entry Inf-1.0-67. https://mlcommons.org/en/inference-datacenter-10/. Test by Intel on 03/18/2021.
7. BERT-Large quantized: 1-node, 2x Intel Xeon Platinum 8380 processor on Coyote Pass with 1 TB (16 slots/ 64GB/3200) total DDR4 memory, ucode 0x8d05a260, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-66-generic. Baseline model source: https://github.com/mlcommons/inference/tree/master/language/bert. The steps to reproduce the optimized model are at: https://github.com/mlcommons/submissions_inference_1_0/tree/master/closed/Intel/code/bert-99/mxnet. Test by Intel on 3/16/2021.
8. A CPU inference instance can be a process or a thread. Each inference instance serves an inference request.
9. The relative performance profiling experiments are done with 1-node, 4x Intel Xeon Platinum 8380H processor on Cedar Island with 1.6 TB (24 slots/ 64GB/3200) total DDR4 memory, ucode 0x700001e, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-66-generic, PyTorch v1.5.0-rc3, BS 384. We submitted the optimized version using “BF16 encoder and BF16 batch greedy decoder” to MLPerf with a different 1-node, 8x Intel Xeon Platinum 8380H processor with more details at the entry MLPerf v1.0 Inference Closed RNN-T, entry Inf-1.0-20. https://mlcommons.org/en/inference-datacenter-10/. Test by Intel on 03/18/2021.