1 files changed, 68 insertions, 12 deletions
diff --git a/docs/user_guide/library.dox b/docs/user_guide/library.dox
index e987eac752..5a337c374b 100644
--- a/docs/user_guide/library.dox
+++ b/docs/user_guide/library.dox
@@ -1,5 +1,5 @@
 ///
-/// Copyright (c) 2017-2021 Arm Limited.
+/// Copyright (c) 2017-2021, 2023-2024 Arm Limited.
 ///
 /// SPDX-License-Identifier: MIT
 ///
@@ -28,20 +28,18 @@ namespace arm_compute
 
 @tableofcontents
 
-@section architecture_core_vs_runtime Core vs Runtime libraries
+@section architecture_compute_library Compute Library architecture
 
-The Core library is a low level collection of algorithms implementations, it is designed to be embedded in existing projects and applications:
+The Compute Library is a collection of low level algorithm implementations known as kernels @ref IKernel.
+These kernels are implemented as operators @ref IOperator that do not allocate any memory (i.e. all the memory allocations/mappings have to be handled by the caller)
+and are are designed to be embedded in existing projects and applications.
 
-- It doesn't allocate any memory (All the memory allocations/mappings have to be handled by the caller).
-- It doesn't perform any kind of multi-threading (but provide information to the caller about how the workload can be split).
+A higher-level interface wraps the operators into functions @ref IFunction that:
+- Performs memory allocation of images and tensors through the use of standard malloc().
+- Enables multi-threading of Arm® Neon™ code in a very basic way using a very simple pool of threads.
+- For OpenCL, uses the default CLScheduler command queue for all mapping operations and kernels.
 
-The Runtime library is a very basic wrapper around the Core library which can be used for quick prototyping, it is basic in the sense that:
-
-- It allocates images and tensors by using standard malloc().
-- It multi-threads Arm® Neon™ code in a very basic way using a very simple pool of threads.
-- For OpenCL it uses the default CLScheduler command queue for all mapping operations and kernels.
-
-For maximum performance, it is expected that the users would re-implement an equivalent to the runtime library which suits better their needs (With a more clever multi-threading strategy, load-balancing between Arm® Neon™ and OpenCL, etc.)
+For maximum performance, it is expected that the users would re-implement an equivalent to the function interface which suits better their needs (With a more clever multi-threading strategy, load-balancing between Arm® Neon™ and OpenCL, etc.)
 
 @section architecture_fast_math Fast-math support
 
@@ -54,6 +52,17 @@ When the fast-math flag is enabled, both Arm® Neon™ and CL convolution layers
     - no-fast-math: No Winograd support
     - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
 
+@section bf16_acceleration BF16 acceleration
+
+Required toolchain: android-ndk-r23-beta5 or later.
+
+To build for BF16: "neon" flag should be set "=1" and "arch" has to be "=armv8.6-a", "=armv8.6-a-sve", or "=armv8.6-a-sve2". For example:
+
+	scons arch=armv8.6-a-sve neon=1 opencl=0 extra_cxx_flags="-fPIC" benchmark_tests=0 validation_tests=0 validation_examples=1 os=android Werror=0 toolchain_prefix=aarch64-linux-android29
+
+To enable BF16 acceleration when running FP32 "fast-math" has to be enabled and that works only for Neon convolution layer using cpu gemm.
+In this scenario on CPU: the CpuGemmConv2d kernel performs the conversion from FP32, type of input tensor, to BF16 at block level to exploit the arithmetic capabilities dedicated to BF16. Then transforms back to FP32, the output tensor type.
+
 @section architecture_thread_safety Thread-safety
 
 Although the library supports multi-threading during workload dispatch, thus parallelizing the execution of the workload at multiple threads, the current runtime module implementation is not thread-safe in the sense of executing different functions from separate threads.
@@ -555,5 +564,52 @@ The responsibilities of the operators can be summarized as follows:
 - Providing information to the caller required by the computation (e.g., memory requirements)
 - Allocation of any required auxiliary memory if it isn't given by its caller explicitly
 
+@subsection architecture_experimental_build_multi_isa Build multi-ISA binary
+
+Selecting multi_isa when building Compute Library, will create a library that contains all the supported ISA features.
+Based on the CPU support, the appropriate kernel will be selected at runtime for execution. Currently this option is
+supported in two configurations: (i) with armv8.2-a (ii) with armv8-a. In both cases all the supported ISA features are enabled
+in the build.
+
+The arch option in a multi_isa build sets the minimum architecture required to run the resulting binary.
+For example a multi_isa build for armv8-a will run on any armv8-a or later, when the binary is executed on a armv8.2-a device
+it will use the additional cpu features present in this architecture: FP16 and dot product.
+In order to have a binary like this (multi_isa+armv8-a) the FP16 and dot product kernels in the library are compiled for the
+target armv8.2-a and all other common code for armv8-a.
+
+@subsection architecture_experimental_per_operator_build Per-operator build
+
+Dependencies for all operators have been explicitly defined, this provides the ability to users to generate Compute Library
+binaries that include a user-defined list of operators.
+
+An experimental flag 'build_config' has been introduced where a JSON configuration file can be provided and consumed.
+An example config looks like:
+@code{.py}
+{
+    "operators": [
+        "Activation",
+        "DepthwiseConv2d",
+        "Conv2d",
+        "Permute",
+        "Pool2d",
+        "Reshape"
+    ],
+    "data_types": [
+        "NHWC"
+    ]
+}
+@endcode
+
+Supported data-types options are:
+- "NHWC"
+- "NCHW"
+
+The list of supported operators can be found in filelist.json in the root of Compute Library repo.
+
+@subsection architecture_experimental_build_high_priority_operators Build high priority operators
+
+Selecting high_priority when building Compute Library, one new library will be created: libarm_compute_hp and
+will contain a selected subset of the libary operators. Currently the operators are staticly set.
+
 */
 } // namespace arm_compute