StarlingX/Platform Features

= StarlingX Platform Specific Features =

Currently only Intel processor platforms are supported by StarlingX. Contributions to extend StarlingX to other architectures are welcome.

Experiment scope
Intel AVX-512 is a set of new instructions that can accelerate performance for workloads and usages such as scientific simulations, artificial intelligence, and image data compression amount others [1]. This wiki will show how to build an application to take advantage of these instructions and run it in a StarlingX Edge Cloud. It will discuss the hardware and software requirements. And we will review when and how to make use of these instructions.

HW requirements
The next version of StarlingX uses CentOS 7.6 as the host operating system. In this OS you can test for the presence of AVX 512 instructions in your CPU by:

cat /proc/cpuinfo | grep avx512

This should give one of the following output as valid:

avx512f avx512dq avx512cd avx512bw avx512vl

There are multiple versions of AVX 512 instructions available in multiple flavors for x86 CPUs, some of them are described in [2]

SW requirements
From the Linux Operating system perspective, there are two kinds of SW requirements: to compile an application and to run the same application.

Linux SW requirements to build a C program that uses AVX 512 instruction sets

Compiler
You need an up to date version of gcc to compile an application that uses these instructions. Here is an example using a container running the latest Fedora gcc

Dockerfile : FROM   fedora:latest MAINTAINER Victor Rodriguez

RUN INSTALL_PKGS="gcc gcc-c++ gcc-gfortran gdb make" && \ dnf install -y --setopt=tsflags=nodocs $INSTALL_PKGS && \ rpm -V $INSTALL_PKGS && \ dnf clean all -y

With this we can create an image that provides the proper gcc:

$ docker run -it 7433bf0abd2b /bin/bash [root@66a74afb3871 /]# gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/8/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-redhat-linux Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux Thread model: posix gcc version 8.3.1 20190223 (Red Hat 8.3.1-2) (GCC)

Linux SW requirements to run a binary that uses AVX 512 instruction sets
Once the binaries have been building, running them in CentOS 7.6 is possible due to support of instruction set in the GNU Binutils project. The GNU Binutils project is a collection of binary tools that includes the GNU linker ( ld ) and the GNU assembler (as). In order to get the version of the project we can run:

ld -v

GNU ld version 2.27-34.base.el7

A simple way to know if the instruction we want to execute is supproted in our binutils is by:

strings /usr/lib64/libopcodes-2.27-34.base.el7.so | grep

For example:

strings /usr/lib64/libopcodes-2.27-34.base.el7.so | grep pclmulqdq vpclmulqdq

Test for the experiment
In this case, we will use a simple addition of 2 arrays using AVX 512 instructions:

float a[256] = {0}; float b[256] = {0}; float c[256] = {0};

void foo{ __m512 result,B,C; for (int i=0; i<256; i+=16){ B = _mm512_load_ps(&b[i]); C = _mm512_load_ps(&c[i]); result = _mm512_add_ps(B,C); _mm512_store_ps(&a[i], result); } }

Full example in GitHub [4]

Compiled as :

gcc -O3 -march=skylake-avx512 stress_add_d_avx512.c -o stress_add_d_avx512

If we check the objdump of the compiled binary in our CentOS 7.6

00000000004017d0 : 4017d0:      31 c0                   xor    %eax,%eax 4017d2:      66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1) 4017d8:      62 f1 fd 48 28 88 60    vmovapd 0x404860(%rax),%zmm1 4017df:      48 40 00 4017e2:      48 83 c0 40             add    $0x40,%rax 4017e6:      62 f1 f5 48 58 80 20    vaddpd 0x404020(%rax),%zmm1,%zmm0 4017ed:      40 40 00 4017f0:      62 f1 fd 48 29 80 20    vmovapd %zmm0,0x405020(%rax) 4017f7:      50 40 00 4017fa:      48 3d 00 08 00 00       cmp    $0x800,%rax 401800:      75 d6                   jne    4017d8  401802:      c5 f8 77                vzeroupper 401805:      c3                      retq 401806:      66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1) 40180d:      00 00 00

We will see the power of the zmm registers in addition with the vaddpd, that despite the fact that is not an AVX 512 instruction it does use the zmm registers that are only available at the avx 512 IA systems.

An example of specific AVX 512 instruction could be vrsqrt14pd. This instruction computes the approximate reciprocals of square roots of packed Float64 Values with a relative error of less than 2-14 [3]

strings /usr/lib64/libopcodes-2.27-34.base.el7.so | grep -i VRSQRT14PD

But, what if the OS does not support the binutils version required? we can launch a container image with full support for all the new instructions of AVX 512 ( including VNNI ). For example, Binutils 2.30 or above are needed for VNNI AVX 512 instruction set. Inside a Fedora docker container image we can see:

strings /usr/lib64/libopcodes-2.31.1-25.fc29.so | grep -i VP4DPWSSD vp4dpwssd Vp4dpwssds

These are Vector Neural Network Instructions (VNNI), an extension of the AVX 512 instruction set in the latest generation of Xeon platforms. These instructions perform a dot product of signed Words with Dword Accumulation. This instruction multiply signed words from source register block indicated by zmm2 by signed words from m128 and accumulate resulting signed dwords in zmm1.

As we can see Starling X next release will give the capability to use x86 instruction sets that edge applications can take advantage of. This wiki just provides the methodology to detect if your software stack ( wither a virtual machine or a container ) can take advantage of the instructions provided by your bare metal system.

References
 * 1) https://software.intel.com/en-us/articles/intel-avx-512-instructions
 * 2) https://software.intel.com/en-us/articles/additional-intel-avx-512-instructions
 * 3) https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2
 * 4) https://github.com/VictorRodriguez/avx-basics/blob/master/src/stress_add_d_avx512.c