About AVX CPP¶

Original idea came after learning about SIMD instruction sets (AVX / AVX2) available on all modern x86 CPUs.

As there were no libraries directly allowing to use those types of instructions (even though being used by many modern libraries like TensorFlow, Boost etc.) I have decided to create my own (cause why not).

* Plus to learn more about C++, CMake and low level optimisation ;)

Note

In case of GCC and Clang turning on specific optimization flags might produce code very similar in performance to this wrote by hand. Both of these compilers can produce well optimized code when SIMD instructions are enabled. For MSVC manually optimized SIMD can result in code that is few times faster than “optimized” by MSVC.

While developing this library I have followed the principles listed below:

Performance - the library should be as fast as possible, resulting in code that has as little overhead caused by abstraction as possible (that’s why almost all methods are inline).
Exception Avoidance - the library should not throw exceptions as they make code slower and harder to read. An exception to this rule is running code in Debug mode, where exceptions are explicitly thrown. Otherwise, for invalid inputs, functions have no effect or guarantee not to cause undefined behavior (e.g. buffer overflow).
Ease of Use - the library should be easy to use providing intuitive API for introducing SIMD to existing code.
Simplicity - no one likes to use overcomplicated code with many abstraction layers. That’s why each class is an independent unit (even though all of them share same namespace and most methods) not using polymorphism or iheritance.
Portability - the library should be available on Windows and Linux platforms while supporting all major compilers (GCC, Clang and MSVC). For obvious reasons the library is targeting only x86 architecture supporting AVX2 (mandatory) and AVX512 (optional).
TDD - each feature should be tested before being added to the library (along with adding a feature, tests for it should be developed). The tests should be run on all supported platforms and compilers to ensure seamless integration [1]. Tests can be found in src/tests folder.

List of classes¶

The library provides following classes, which can be used (all of which are within avx namespace):

Char256 holds 32 chars (8-bit integers),
UChar256 holds 32 unsigned chars (8-bit unsigned integers),
Short256 holds 16 shorts (16-bit integers),
UShort256 holds 16 unsigned shorts (16-bit unsigned integers),
Int256 holds 8 32-bit integers,
UInt256 holds 8 32-bit unsigned integers,
Long256 holds 4 64-bit integers,
ULong256 holds 4 64-bit unsigned integers,
Float256 holds 8 32-bit floats,
Double256 holds 4 64-bit doubles

Almost all classes are header only and utilize inline methods for better performance.

Please refer to this page for more information what benefits does inlining provide and what are it’s drawbacks.

Features¶

Each of the types supports regular math operators:

+ and +=
- and -=
* and *=
/ and /=

Each operator supports class and scalar values (of corresponding type). For example + operator for Int256 will have two versions, one accepting Int256 and other accepting int.

All types also support following logical operators:

==, accepts types like math operators,
!=, same as above + ignores -0.0 for floating-point types.

Integer types also support bitwise operators:

& and &=
| and |=
^ and ^=
~ without arguments
<< and <<=, this is an exception, accepts same class or unsigned int
>> and >>=, this is an exception, accepts same class or unsigned int

Each class will provide list of operators and used SIMD instructions.

Limitations¶

Currently the library does not support dynamic detection of supported SIMD instructions.
This means that if it is compiled with AVX512 enabled, then it will only work on CPUs supporting AVX512.
For Long256 and ULong256 data loading and saving overhead results in SIMD version being slower, than regular sequential solution.

Note

Please check Long256 and ULong256 documentation to see details.

Each constructor or load method expects source to contain at least 32 bytes of continous data. Providing pointer to memory, which contains less than 32 bytes can result in undefined behavior.