Diffstat (limited to 'docs/contributor_guide')
-rw-r--r--  docs/contributor_guide/adding_operator.dox                   334
-rw-r--r--  docs/contributor_guide/contribution_guidelines.dox           533
-rw-r--r--  docs/contributor_guide/implementation_topics.dox             189
-rw-r--r--  docs/contributor_guide/non_inclusive_language_examples.dox     4
4 files changed, 1060 insertions, 0 deletions
diff --git a/docs/contributor_guide/adding_operator.dox b/docs/contributor_guide/adding_operator.dox
new file mode 100644
index 0000000000..559e8e2e76
--- /dev/null
+++ b/docs/contributor_guide/adding_operator.dox
@@ -0,0 +1,334 @@
+///
+/// Copyright (c) 2018-2022 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+
+namespace arm_compute
+{
+/**
+@page adding_operator How to Add a New Operator
+
+@tableofcontents
+
+@section S4_0_introduction Adding new operators
+
+@section S4_1_introduction Introduction
+In Compute Library there are two main parts or modules:
+- The core library consists of a low-level collection of algorithms implemented in C++ and optimized for Arm CPUs and GPUs. The core module is designed to be embedded in other projects and doesn't perform any memory management or scheduling.
+- The runtime library is a wrapper around the core library that provides additional features like memory management, multithreaded execution of workloads and allocation of the intermediate tensors.
+
+The library can be integrated into an existing external library or application that provides its own scheduler or a specific memory manager. In that case, the right solution is to use only the core library, which means that the user must also manage all the memory allocation: not only for the input/output tensors but also for any intermediate tensors required. On the other hand, if the user doesn't want to be concerned with allocation and multithreading, then the right choice is to use the functions from the runtime library.
+
+Apart from these components that get linked into the application, the sources also include the validation test suite and the C++ reference implementations against which all the operators are validated.
+
+
+@section S4_1_supporting_new_operators Supporting new operators
+
+Following are the steps involved in adding support for a new operator in Compute Library:
+- Add new data types (if required)
+- Add the kernel to the core library.
+- Add the function to the runtime library.
+- Add validation tests.
+ - Add the reference implementation.
+ - Add the fixture
+ - Register the tests.
+
+@subsection S4_1_1_add_datatypes Adding new data types
+
+Compute Library declares a few datatypes related to its domain; the kernels and functions in the library process Tensors and Images (Computer Vision functions). Tensors are multi-dimensional arrays with a maximum of Coordinates::num_max_dimensions dimensions; depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers can be represented as a one-dimensional tensor. Furthermore, an image is just a 2D tensor, a 3D tensor can be seen as an array of images and a 4D tensor as a 2D array of images, etc.
+All the datatype classes or structures are grouped in the core library folder arm_compute/core, like @ref ITensor, @ref ITensorInfo (all the information of a tensor) and TensorShape, while simpler types are in arm_compute/core/CoreTypes.h.
+
+If an operator handles a new datatype, it must be added to the library. When adding a new data type to the library, it's necessary to implement the functions that enable printing: the to_string() method and the output stream insertion (<<) operator. Every datatype implements these two functions in utils/TypePrinter.h.
+
+A quick example, in <a href="https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/core/CoreTypes.h">CoreTypes.h</a> we add:
+
+@snippet arm_compute/core/CoreTypes.h DataLayout enum definition
+
+And for printing:
+
+@snippet utils/TypePrinter.h Print DataLayout type
+
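+A hand-written pair for a hypothetical new enum would follow the same pattern (MyNewType and its values are illustrative, not an existing type in the library):
+
+@code{.cpp}
+// Following the pattern in utils/TypePrinter.h
+inline ::std::ostream &operator<<(::std::ostream &os, const MyNewType &type)
+{
+    switch(type)
+    {
+        case MyNewType::OPTION_A:
+            os << "OPTION_A";
+            break;
+        case MyNewType::OPTION_B:
+            os << "OPTION_B";
+            break;
+        default:
+            ARM_COMPUTE_ERROR("NOT_SUPPORTED!");
+    }
+    return os;
+}
+
+inline std::string to_string(const MyNewType &type)
+{
+    std::stringstream str;
+    str << type;
+    return str.str();
+}
+@endcode
+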
+In Compute Library, we use namespaces to group all the operators, functions, classes and interfaces. The main namespace to use is arm_compute. In the test suite, the test framework and the individual tests use nested namespaces like @ref test::validation or @ref test::benchmark to group the different parts of the suite by purpose.
+Utility functions, like conversion or type cast operators, that are shared by multiple operators are in arm_compute/core/Utils.h. Non-inlined function definitions go in the corresponding .cpp files in the src folder.
+Similarly, all common functions that process shapes, like calculating the output shape of an operator or shape conversions etc., are in arm_compute/core/utils/misc/ShapeCalculator.h.
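+
+For example, a kernel's configure() method might derive its output shape with one of these helpers (a minimal sketch; compute_transposed_shape is one of the existing helpers, while the surrounding variables are illustrative):
+
+@code{.cpp}
+// Given the input's ITensorInfo, derive the shape the output tensor must have
+const TensorShape output_shape = misc::shape_calculator::compute_transposed_shape(*input->info());
+@endcode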
+
+
+@subsection S4_1_2_add_kernel Add a kernel
+As we mentioned at the beginning, the kernel is the implementation of the operator or algorithm in a specific programming language related to the backend we want to use. Adding a kernel to the library means implementing the algorithm in a SIMD technology like Arm® Neon™ or OpenCL. All kernels in Compute Library must implement a common interface, IKernel, or one of the specific sub-interfaces.
+IKernel is the common interface for all the kernels in the core library. It contains the main methods to configure and run the kernel itself, such as window(), which returns the maximum window the kernel can be executed on, and is_parallelisable(), which indicates whether or not the kernel is parallelizable. If the kernel is parallelizable, then the window returned by the window() method can be split into sub-windows which can then be run in parallel; otherwise, only the window returned by window() can be passed to the run method.
+There are specific interfaces for OpenCL and Neon™: @ref ICLKernel and INEKernel (using INEKernel = @ref ICPPKernel).
+
+- @ref ICLKernel is the common interface for all the OpenCL kernels. It implements the inherited methods and adds all the methods necessary to configure the CL kernel, such as setting/returning the Local-Workgroup-Size hint, adding single, array or tensor arguments, and setting the targeted GPU architecture according to the CL device. All these methods are used during the configuration and the run of the operator.
+- INEKernel inherits from @ref IKernel as well and is the common interface for all kernels implemented in Neon™; it adds just the run and the name methods.
+
+There are two other implementations of @ref IKernel, called @ref ICLSimpleKernel and INESimpleKernel; they are the interfaces for simple kernels that have just one input tensor and one output tensor.
+Creating a new kernel implies adding new files:
+- src/core/CL/kernels/CLReshapeLayerKernel.h
+- src/core/CL/cl_kernels/reshape_layer.cl
+- src/core/CL/kernels/CLReshapeLayerKernel.cpp
+- src/core/CL/CLKernelLibrary.cpp
+
+Neon™ kernel
+- arm_compute/core/NEON/kernels/NEReshapeLayerKernel.h
+- src/core/NEON/kernels/NEReshapeLayerKernel.cpp
+
+We must register the new layer in the respective libraries:
+- src/core/CL/CLKernels.h
+- arm_compute/core/NEON/NEKernels.h
+
+These files contain the list of all kernels available in the corresponding Compute Library backend, for example CLKernels:
+@code{.cpp}
+...
+#include "src/core/CL/kernels/CLMinMaxLayerKernel.h"
+#include "src/core/CL/kernels/CLMinMaxLocationKernel.h"
+...
+#include "src/core/CL/kernels/CLReshapeLayerKernel.h"
+...
+
+@endcode
+
+For OpenCL we need to update CLKernelLibrary.cpp, adding the appropriate code to embed the .cl kernel in the library. The OpenCL code can be compiled offline and embedded in the library as a binary.
+The essential operations we want to perform with a kernel are:
+- create the kernel object
+- initialize the kernel with the input/output and any other parameters that may be required
+- retrieve the execution window of the kernel and run the whole kernel window in the current thread or use multithreading.
+
+Each kernel will have to implement the methods:
+- %validate: a static function that checks whether the given info will lead to a valid configuration of the kernel.
+- configure: configures the kernel, its window, accessor, valid region, etc. for the given set of tensors and other parameters.
+- run: executes the kernel in the given window.
+
+The structure of the kernel .cpp file should be similar to the following examples.
+For OpenCL:
+@snippet src/gpu/cl/kernels/ClReshapeKernel.cpp ClReshapeKernel Kernel
+The run method will call the function defined in the .cl file.
+
+For the Arm® Neon™ backend case:
+@snippet src/cpu/kernels/CpuReshapeKernel.cpp NEReshapeLayerKernel Kernel
+
+In the Arm® Neon™ case, there is no need to add an extra file; we implement the kernel in the same NEReshapeLayerKernel.cpp file.
+If the tests are already in place, the new kernel can be tested using the existing tests by adding the configure and run steps for the kernel to compute_target() in the fixture.
+
+
+@subsection S4_1_3_add_function Add a function
+
+%Memory management and scheduling of the underlying kernel(s) must be handled by the function implementation. A kernel class must support the window() API, which returns the execution window for the configuration that the kernel is configured for. A window specifies the dimensions of a workload: it has a start and an end on each dimension, and a maximum of Coordinates::num_max_dimensions is supported. The runtime layer is expected to query the kernel for the window size and schedule the window as it sees fit. It could choose to split the window into sub-windows so that they can be run in parallel. The split must adhere to the following rules:
+
+- max[n].start() <= sub[n].start() < max[n].end()
+- sub[n].start() < sub[n].end() <= max[n].end()
+- max[n].step() == sub[n].step()
+- (sub[n].start() - max[n].start()) % max[n].step() == 0
+- (sub[n].end() - sub[n].start()) % max[n].step() == 0
+
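+A minimal sketch of a valid split (the kernel object and the thread count are illustrative; Window::split_window() produces sub-windows that satisfy the rules above):
+
+@code{.cpp}
+const Window &max_window  = kernel.window();
+const size_t  num_threads = 4; // Illustrative: one sub-window per thread
+
+for(size_t t = 0; t < num_threads; ++t)
+{
+    // Split along the Y dimension; each sub-window keeps the same step as max_window
+    Window sub_window = max_window.split_window(Window::DimY, t, num_threads);
+    // sub_window can now be passed to the kernel's run method on its own thread
+}
+@endcode
+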
+@ref CPPScheduler::schedule provides a sample implementation that is used for Arm® Neon™ kernels.
+%Memory management is the other aspect that the runtime layer is supposed to handle. %Memory management of the tensors is abstracted using TensorAllocator. Each tensor holds a pointer to a TensorAllocator object, which is used to allocate and free the memory at runtime. The implementation that is currently supported in Compute Library allows memory blocks, required to be fulfilled for a given operator, to be grouped together under a @ref MemoryGroup. Each group can be acquired and released. The underlying implementation of memory groups varies depending on whether Arm® Neon™ or CL is used. The memory group class uses a memory pool to provide the required memory. It also uses a memory manager to manage the lifetime, and an IPoolManager to manage the memory pools registered with the memory manager.
+
+
+We have seen the various interfaces for a kernel in the core library; the same file structure design exists in the runtime module. IFunction is the base class for all the functions, and it has two child interfaces, ICLSimpleFunction and INESimpleFunction, that are used as base classes for functions which call a single kernel.
+
+The new operator has to implement %validate(), configure() and run(); these methods will call the respective functions in the kernel. Multi-threading is used for the kernels which are parallelizable; by default, std::thread::hardware_concurrency() threads are used. For Arm® Neon™ functions, CPPScheduler::set_num_threads() can be used to manually set the number of threads, whereas for OpenCL all the kernels are enqueued on the queue associated with CLScheduler and the queue is then flushed.
+For the runtime functions, there is an extra method implemented: prepare(). This method prepares the function for the run by doing all the heavy operations that are performed only once (reshaping the weights, releasing memory that is no longer necessary after the reshape, etc.). The prepare method can be called standalone; if it hasn't been called before, it is invoked on the first run, after which the function is marked as prepared.
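+
+A minimal sketch of this pattern (the function, kernel and member names here are illustrative, not an existing API):
+
+@code{.cpp}
+void CLExampleLayer::run()
+{
+    prepare(); // No-op after the first call
+    _memory_group.acquire();
+    CLScheduler::get().enqueue(_example_kernel);
+    _memory_group.release();
+}
+
+void CLExampleLayer::prepare()
+{
+    if(!_is_prepared)
+    {
+        // One-off heavy work, e.g. reshaping the weights
+        CLScheduler::get().enqueue(_reshape_weights_kernel);
+        // Release resources that are no longer needed after preparation
+        _original_weights->mark_as_unused();
+        _is_prepared = true;
+    }
+}
+@endcode
+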
+The files we add are:
+
+OpenCL function
+- arm_compute/runtime/CL/functions/CLReshapeLayer.h
+- src/runtime/CL/functions/CLReshapeLayer.cpp
+
+Neon™ function
+- arm_compute/runtime/NEON/functions/NEReshapeLayer.h
+- src/runtime/NEON/functions/NEReshapeLayer.cpp
+
+As we did for the kernel, we have to edit the runtime libraries to register the new operator by modifying the relevant library file:
+- arm_compute/runtime/CL/CLFunctions.h
+- arm_compute/runtime/NEON/NEFunctions.h
+
+For the special case where the new function calls only one kernel, we can use ICLSimpleFunction or INESimpleFunction as the base class. The configure and the validate methods will simply call the corresponding kernel's functions. The structure will be:
+@snippet src/runtime/CL/functions/CLReshapeLayer.cpp CLReshapeLayer snippet
+
+
+If the function is more complicated and calls more than one kernel, we have to use the memory manager to manage the intermediate tensors: in the configure() method we call the manage() function, passing the tensor to keep track of; in the run() method we have to acquire all the managed buffers and release them at the end.
+For OpenCL, if we want to add two input tensors and reshape the result:
+
+@code{.cpp}
+using namespace arm_compute;
+
+CLAddReshapeLayer::CLAddReshapeLayer(std::shared_ptr<IMemoryManager> memory_manager)
+    : _memory_group(std::move(memory_manager))
+{
+}
+
+void CLAddReshapeLayer::configure(const ICLTensor *input1, const ICLTensor *input2, ICLTensor *output)
+{
+    // Initialise the intermediate tensor's metadata
+    TensorInfo info{};
+    _add_output.allocator()->init(info);
+
+    // Manage intermediate buffers
+    _memory_group.manage(&_add_output);
+
+    // Initialise kernels
+    _add_kernel.configure(input1, input2, &_add_output);
+    _reshape_kernel.configure(&_add_output, output);
+
+    // Allocate intermediate tensors
+    _add_output.allocator()->allocate();
+}
+
+Status CLAddReshapeLayer::validate(const ITensorInfo *input1, const ITensorInfo *input2, const ITensorInfo *output)
+{
+    TensorInfo add_output{};
+    ARM_COMPUTE_RETURN_ON_ERROR(CLAddLayerKernel::validate(input1, input2, &add_output));
+    ARM_COMPUTE_RETURN_ON_ERROR(CLReshapeLayerKernel::validate(&add_output, output));
+    return Status{};
+}
+
+void CLAddReshapeLayer::run()
+{
+    _memory_group.acquire();
+
+    // Run Add
+    CLScheduler::get().enqueue(_add_kernel);
+
+    // Run Reshape
+    CLScheduler::get().enqueue(_reshape_kernel);
+
+    _memory_group.release();
+}
+
+@endcode
+
+For Neon™:
+
+@code{.cpp}
+using namespace arm_compute;
+
+NEAddReshapeLayer::NEAddReshapeLayer(std::shared_ptr<IMemoryManager> memory_manager)
+    : _memory_group(std::move(memory_manager))
+{
+}
+
+void NEAddReshapeLayer::configure(const ITensor *input1, const ITensor *input2, ITensor *output)
+{
+    // Initialise the intermediate tensor's metadata
+    TensorInfo info{};
+    _add_output.allocator()->init(info);
+
+    // Manage intermediate buffers
+    _memory_group.manage(&_add_output);
+
+    // Initialise kernels
+    _add_kernel.configure(input1, input2, &_add_output);
+    _reshape_kernel.configure(&_add_output, output);
+
+    // Allocate intermediate tensors
+    _add_output.allocator()->allocate();
+}
+
+void NEAddReshapeLayer::run()
+{
+    _memory_group.acquire();
+
+    // Run Add
+    NEScheduler::get().schedule(&_add_kernel, Window::DimY);
+
+    // Run Reshape
+    NEScheduler::get().schedule(&_reshape_kernel, Window::DimY);
+
+    _memory_group.release();
+}
+@endcode
+
+
+At this point, everything is in place at the library level. If you are following a test-driven implementation and all the tests are already in place, we can call the function configuration in the fixture and remove any redundant code, like the allocation of the intermediate tensors, since it's done inside the function. Run the final tests to check that the results match the expected results from the reference implementation.
+
+@subsection S4_1_4_add_validation Add validation artifacts
+
+@subsubsection S4_1_4_1_add_reference Add the reference implementation and the tests
+As mentioned in the introduction, the reference implementation is a pure C++ implementation without any optimization or backend-specific instructions.
+The reference implementation consists of two files in the folder tests/validation/reference:
+- tests/validation/reference/ReshapeLayer.h
+- tests/validation/reference/ReshapeLayer.cpp
+
+where we will put, respectively, the declaration and the definition of the new operator.
+All the utility functions that are used ONLY in the tests are in tests/validation/helpers.h; for all the others, as mentioned before, there are helpers in the library.
+Compute Library and the tests make use of templates: the reference implementation is a generic implementation independent of the datatype, and we use templates to generalize the datatype concept.
+Following the example, let's have a look at the ReshapeLayer operator:
+
+- tests/validation/reference/ReshapeLayer.h
+
+@snippet tests/validation/reference/ReshapeLayer.h ReshapeLayer
+
+- tests/validation/reference/ReshapeLayer.cpp
+
+@snippet tests/validation/reference/ReshapeLayer.cpp ReshapeLayer
+
+An explicit instantiation of the template for the required datatypes must be added in the .cpp file.
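+
+For example, the instantiations at the end of the reference .cpp file typically look like the following (the signature matches the operator's declaration above; the exact set of datatypes is illustrative):
+
+@code{.cpp}
+template SimpleTensor<uint8_t> reshape_layer(const SimpleTensor<uint8_t> &src, const TensorShape &output_shape);
+template SimpleTensor<half> reshape_layer(const SimpleTensor<half> &src, const TensorShape &output_shape);
+template SimpleTensor<float> reshape_layer(const SimpleTensor<float> &src, const TensorShape &output_shape);
+@endcode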
+
+@subsubsection S4_1_4_2_add_dataset Add dataset
+One of the parameters of the tests is the dataset; it is used to generate versions of the test case with different inputs.
+To pass the dataset to the fixture data test case, we have three options:
+- the operator dataset is simple, so it can be added directly in the test case data declaration
+- we can create a class that returns tuples to the test framework
+
+@snippet tests/datasets/PoolingTypesDataset.h PoolingTypes datasets
+
+- if we want to create the dataset dynamically by combining different parameters, we can create the dataset using iterators.
+For example, the dataset for ReshapeLayer:
+
+@snippet tests/datasets/ReshapeLayerDataset.h ReshapeLayer datasets
+
+@subsubsection S4_1_4_3_add_fixture Add a fixture and a data test case
+
+Benchmark and validation tests are based on the same framework to set up and run the tests. In addition to running simple, self-contained test functions, the framework supports fixtures and data test cases.
+Fixtures can be used to share common setup, teardown or even run tasks among multiple test cases; for that purpose a fixture can define a "setup", "teardown" and "run" method.
+To add tests for the new operator in the runtime library, we need to implement at least the setup method, which calls two methods that configure, run and return the output of the target (CL or Neon™) and of the reference (C++ implementation) respectively.
+
+For example, let's have a look at the ReshapeLayer fixture:
+
+@snippet tests/validation/fixtures/ReshapeLayerFixture.h ReshapeLayer fixture
+
+In the fixture class above, we can see that the setup method computes the target and the reference and stores them in the two members _target and _reference, which will be used later to check for correctness.
+The compute_target method reflects the exact behavior expected when we call a function: the input and output tensors must be declared, the function configured, the tensors allocated, the input tensors filled with the required data, and finally the function must be run and the results returned.
+This fixture is used in the test case, which is a parameterized test case that inherits from a fixture. The test case will have access to all public and protected members of the fixture; only the setup and teardown methods of the fixture will be used. The setup method of the fixture needs to be a template and must accept inputs from the dataset as arguments.
+The body of this function will be used as a test function.
+For the fixture test case, the first argument is the name of the test case (it has to be unique within the enclosing test suite), the second argument is the class name of the fixture, the third argument is the dataset mode in which the test will be active (PRECOMMIT or NIGHTLY) and the fourth argument is the dataset.
+For example:
+
+@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayerFixture snippet
+
+@code{.cpp}
+TEST_SUITE(CL)
+TEST_SUITE(ActivationLayer)
+TEST_SUITE(Float)
+TEST_SUITE(FP16)
+@endcode
+@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayer Test snippet
+@code{.cpp}
+TEST_SUITE_END()
+TEST_SUITE_END()
+TEST_SUITE_END()
+TEST_SUITE_END()
+@endcode
+
+This will produce a set of tests that can be filtered with "CL/ActivationLayer/Float/FP16/RunSmall". Each test produced from the cartesian product of the dataset is associated with a number and can be filtered by specifying all the parameters.
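+
+For example, assuming the validation binary has been built, the tests above could be selected with the framework's filter option (the exact test id shown is illustrative):
+
+@code{.sh}
+./arm_compute_validation --filter="CL/ActivationLayer/Float/FP16/RunSmall.*"
+@endcode
+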
+*/
+} // namespace arm_compute
diff --git a/docs/contributor_guide/contribution_guidelines.dox b/docs/contributor_guide/contribution_guidelines.dox
new file mode 100644
index 0000000000..cbaa502635
--- /dev/null
+++ b/docs/contributor_guide/contribution_guidelines.dox
@@ -0,0 +1,533 @@
+///
+/// Copyright (c) 2019-2023 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page contribution_guidelines Contribution Guidelines
+
+@tableofcontents
+
+If you want to contribute to Arm Compute Library, be sure to review the following guidelines.
+
+The development is structured in the following way:
+- Release repository: https://github.com/arm-software/ComputeLibrary
+- Development repository: https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary
+- Please report issues here: https://github.com/ARM-software/ComputeLibrary/issues
+
+@section S5_0_inc_lang Inclusive language guideline
+As part of the initiative to use inclusive language, certain phrases and words have been removed or replaced by more inclusive ones. Examples include, but are not limited to:
+\includedoc non_inclusive_language_examples.dox
+
+Please also follow this guideline when committing changes to Compute Library.
+It is worth mentioning that the term "master" is still used in some comments, but only in reference to external code links that Arm has no governance over.
+
+Furthermore, starting from release 22.05, the 'master' branch is no longer used; it has been replaced by 'main'. Please update your clone jobs accordingly.
+
+@section S5_1_coding_standards Coding standards and guidelines
+
+Best practices (as suggested by clang-tidy):
+
+- No uninitialised values
+
+Helps to prevent undefined behaviour and allows variables to be declared const if they are not changed after initialisation. See http://clang.llvm.org/extra/clang-tidy/checks/cppcoreguidelines-pro-type-member-init.html
+
+@code{.cpp}
+const float32x4_t foo = vdupq_n_f32(0.f);
+const float32x4_t bar = foo;
+
+const int32x4x2_t i_foo = {{
+ vconvq_s32_f32(foo),
+ vconvq_s32_f32(foo)
+}};
+const int32x4x2_t i_bar = i_foo;
+@endcode
+
+- No C-style casts (in C++ source code)
+
+Only use static_cast, dynamic_cast, and (if required) reinterpret_cast and const_cast. See http://en.cppreference.com/w/cpp/language/explicit_cast for more information on when to use which type of cast. C-style casts do not differentiate between the different cast types and thus make it easy to violate type safety. Also, due to the prefix notation it is less clear which part of an expression is going to be cast. See http://clang.llvm.org/extra/clang-tidy/checks/cppcoreguidelines-pro-type-cstyle-cast.html
+
+- No implicit casts to bool
+
+Helps to increase readability and might help to catch bugs during refactoring. See http://clang.llvm.org/extra/clang-tidy/checks/readability-implicit-bool-cast.html
+
+@code{.cpp}
+extern int *ptr;
+if(ptr){} // Bad
+if(ptr != nullptr) {} // Good
+
+extern int foo;
+if(foo) {} // Bad
+if(foo != 0) {} // Good
+@endcode
+
+- Use nullptr instead of NULL or 0
+
+The nullptr literal is type-checked and is therefore safer to use. See http://clang.llvm.org/extra/clang-tidy/checks/modernize-use-nullptr.html
+
+- No need to explicitly initialise std::string with an empty string
+
+The default constructor of std::string creates an empty string. In general it is therefore not necessary to specify it explicitly. See http://clang.llvm.org/extra/clang-tidy/checks/readability-redundant-string-init.html
+
+@code{.cpp}
+// Instead of
+std::string foo("");
+std::string bar = "";
+
+// The following has the same effect
+std::string foo;
+std::string bar;
+@endcode
+
+- Braces for all control blocks and loops (which have a body)
+
+To increase readability and protect against refactoring errors, the body of control blocks and loops must be wrapped in braces. See http://clang.llvm.org/extra/clang-tidy/checks/readability-braces-around-statements.html
+
+For now, loops whose body is empty do not have to add empty braces. This exception might be revoked in the future; in any case, situations in which it applies should be rare.
+
+@code{.cpp}
+Iterator it;
+while(it.next()); // No need for braces here
+
+// Make more use of it
+@endcode
+
+- Only one declaration per line
+
+Increases readability and thus helps prevent errors.
+
+@code{.cpp}
+int a, b; // BAD
+int c, *d; // EVEN WORSE
+
+int e = 0; // GOOD
+int *p = nullptr; // GOOD
+@endcode
+
+- Pass primitive types (and those that are cheap to copy or move) by value
+
+For primitive types it is more efficient to pass them by value instead of by const reference because:
+
+ - the data type might be smaller than the "reference type"
+ - pass by value avoids aliasing and thus allows for better optimisations
+ - pass by value is likely to avoid one level of indirection (references are often implemented as auto dereferenced pointers)
+
+This advice also applies to non-primitive types that have cheap copy or move operations and the function needs a local copy of the argument anyway.
+
+More information:
+
+ - http://stackoverflow.com/a/14013189
+ - http://stackoverflow.com/a/270435
+ - http://web.archive.org/web/20140113221447/http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/
+
+@code{.cpp}
+void foo(int i, long l, float32x4_t f); // Pass-by-value for builtin types
+void bar(const float32x4x4_t &f); // As this is a struct pass-by-const-reference is probably better
+void foobar(const MyLargeCustomTypeClass &m); // Definitely better as const-reference except if a copy has to be made anyway.
+@endcode
+
+- Don't use unions
+
+Unions cannot be used to convert values between different types because (in C++) it is undefined behaviour to read from a member other than the last one that has been assigned to. This limits the use of unions to a few corner cases and therefore the general advice is not to use unions. See http://releases.llvm.org/3.8.0/tools/clang/tools/extra/docs/clang-tidy/checks/cppcoreguidelines-pro-type-union-access.html
+
+- Use pre-increment/pre-decrement whenever possible
+
+In contrast to the pre-increment, the post-increment has to make a copy of the incremented object. This might not be a problem for primitive types like int, but for class-like objects that overload the operators, such as iterators, it can have a huge impact on performance. See http://stackoverflow.com/a/9205011
+
+To be consistent across the different cases, the general advice is to use the pre-increment operator unless post-increment is explicitly required. The same rules apply to the decrement operator.
+
+@code{.cpp}
+for(size_t i = 0; i < 9; i++); // BAD
+for(size_t i = 0; i < 9; ++i); // GOOD
+@endcode
+
+- Don't use uint in C/C++
+
+The C and C++ standards don't define a uint type. Though some compilers seem to support it by default, it requires including the header sys/types.h. Instead, we use the slightly more verbose unsigned int type.
+
+- Don't use unsigned int in function's signature
+
+Unsigned integers are good for representing bitfields and modular arithmetic. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. Mixing signedness of integer types is responsible for an equally large class of problems.
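+
+A sketch of the kind of bug this hides (the function is illustrative; with border == 0 the loop never terminates, because an unsigned counter wraps around instead of going negative):
+
+@code{.cpp}
+void process_rows(unsigned int border)
+{
+    for(unsigned int i = 10; i >= border; --i) // i >= 0 is always true for unsigned types
+    {
+        // ... process row i ...
+    }
+}
+@endcode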
+
+- No "Yoda-style" comparisons
+
+As compilers are now able to warn about accidental assignments if it is likely that the intention was to compare values, it is no longer required to place literals on the left-hand side of the comparison operator. Sticking to the natural order increases readability and thus prevents logical errors (which cannot be spotted by the compiler). In the rare case that the desired result is to assign a value and check it, the expression has to be surrounded by parentheses.
+
+@code{.cpp}
+if(nullptr == ptr || false == cond) // BAD
+{
+ //...
+}
+
+if(ptr == nullptr || cond == false) // GOOD
+{
+ //...
+}
+
+if(ptr = nullptr || cond = false) // Most likely a mistake. Will cause a compiler warning
+{
+ //...
+}
+
+if((ptr = nullptr) || (cond = false)) // Trust me, I know what I'm doing. No warning.
+{
+ //...
+}
+@endcode
+
+@subsection S5_1_1_rules Rules
+
+ - Use spaces for indentation and alignment. No tabs! Indentation should be done with 4 spaces.
+ - Unix line returns in all the files.
+ - Pointers and reference symbols attached to the variable name, not the type (e.g. char \&foo, and not char& foo).
+ - No trailing spaces or tabs at the end of lines.
+ - No spaces or tabs on empty lines.
+ - Put { and } on a new line and increase the indentation level for code inside the scope (except for namespaces).
+ - Single space before and after comparison operators ==, <, >, !=.
+ - No space around parentheses.
+ - No space before, one space after ; (unless it is at the end of a line).
+
+@code{.cpp}
+for(int i = 0; i < width * height; ++i)
+{
+ void *d = foo(ptr, i, &addr);
+ static_cast<uint8_t *>(data)[i] = static_cast<uint8_t *>(d)[0];
+}
+@endcode
+
+ - Put a comment after \#else, \#endif, and namespace closing brace indicating the related name
+
+@code{.cpp}
+namespace mali
+{
+#ifdef MALI_DEBUG
+ ...
+#else // MALI_DEBUG
+ ...
+#endif // MALI_DEBUG
+} // namespace mali
+@endcode
+
+- CamelCase for class names only and lower case words separated with _ (snake_case) for all the functions / methods / variables / arguments / attributes.
+
+@code{.cpp}
+class ClassName
+{
+ public:
+ void my_function();
+ int my_attribute() const; // Accessor = attribute name minus '_', const if it's a simple type
+ private:
+ int _my_attribute; // '_' in front of name
+};
+@endcode
+
+- In header files, use header guards that use the full file path from the project root and prepend it with "ACL_"
+
+@code{.cpp}
+// For File arm_compute/runtime/NEON/functions/NEBatchNormalizationLayer.h
+#ifndef ACL_ARM_COMPUTE_RUNTIME_NEON_FUNCTIONS_NEBATCHNORMALIZATIONLAYER
+#define ACL_ARM_COMPUTE_RUNTIME_NEON_FUNCTIONS_NEBATCHNORMALIZATIONLAYER
+.
+.
+.
+#endif /* ACL_ARM_COMPUTE_RUNTIME_NEON_FUNCTIONS_NEBATCHNORMALIZATIONLAYER */
+@endcode
+
+- Use quotes instead of angular brackets to include local headers. Use angular brackets for system headers.
+- Also include the module header first, then local headers, and lastly system headers. All groups should be separated by a blank line and sorted lexicographically within each group.
+- Where applicable the C++ version of system headers has to be included, e.g. cstddef instead of stddef.h.
+- See http://llvm.org/docs/CodingStandards.html#include-style
+
+@code{.cpp}
+#include "MyClass.h"
+
+#include "arm_cv/core/Helpers.h"
+#include "arm_cv/core/Types.h"
+
+#include <cstddef>
+#include <numeric>
+@endcode
+
+- Only use "auto" when the type can be explicitly deduced from the assignment.
+
+@code{.cpp}
+auto a = static_cast<float*>(bar); // OK: there is an explicit cast
+auto b = std::make_unique<Image>(foo); // OK: we can see it's going to be an std::unique_ptr<Image>
+auto c = img.ptr(); // NO: Can't tell what the type is without knowing the API.
+auto d = vdup_n_u8(0); // NO: It's not obvious what type this function returns.
+@endcode
+
+- When to use const
+
+ - Local variables: Use const as much as possible. E.g. all read-only variables should be declared as const.
+
+ - Function parameters
+
+ - Top-level const must not be used in the function declaration or definition. (Note that this applies to all types, including non-primitive types)
+ This is because for function parameters, top-level const in function declaration is always ignored by the compiler (it is meaningless).
+ Therefore we should omit top-level const to reduce visual clutter. In addition, its omission can improve API/ABI
+ stability to some extent as there is one fewer varying factor in function signatures.
+
+ Note that we could in theory allow top-level const in only definition (which is not ignored by the compiler) but not declaration.
+ But certain toolchains are known to require the declaration and definition to match exactly.
+
+ - Use low-level const (of references and pointers) as much as possible.
+@code{.cpp}
+// Primitive types
+void foo(const int a); // NO: Top-level const must not be used in function declaration or definition
+void foo(int a); // OK
+// Pointer to primitive types
+void foo(int *const a); // NO: Top-level const
+void foo(const int *const a); // NO: Top-level const
+void foo(int *a); // OK. But only if foo needs to mutate the underlying object
+void foo(const int *a); // OK but not recommended: See section above about passing primitives by value
+// Reference to primitive types
+// There's no "top-level const" for references
+void foo(int &a); // OK. But only if foo needs to mutate the underlying object
+void foo(const int &a); // OK but not recommended: See section above about passing primitives by value
+
+// Custom types
+void foo(const Goo g); // NO: Top-level const
+void foo(Goo g); // OK
+// Pointer to custom types
+void foo(Goo *const g); // NO: Top-level const
+void foo(Goo *g); // OK. But only if foo needs to mutate the underlying object
+void foo(const Goo *g); // OK
+// Reference to custom types
+void foo(Goo &g); // OK. But only if foo needs to mutate the underlying object
+void foo(const Goo &g); // OK
+@endcode
+
+- OpenCL:
+ - Use __ in front of the memory type qualifiers and kernel: __kernel, __constant, __private, __global, __local.
+ - Indicate how the global workgroup size / offset / local workgroup size are being calculated.
+
+ - Doxygen:
+
+ - No '*' in front of argument names
+ - [in], [out] or [in,out] *in front* of arguments
+ - Skip a line between the description and params and between params and \@return (If there is a return)
+ - Align params names and params descriptions (Using spaces), and with a single space between the widest column and the next one.
+ - Use an upper case at the beginning of the description
+
+@snippet arm_compute/runtime/NEON/functions/NEActivationLayer.h NEActivationLayer snippet
+
+@subsection S5_1_2_how_to_check_the_rules How to check the rules
+
+astyle (http://astyle.sourceforge.net/) and clang-format (https://clang.llvm.org/docs/ClangFormat.html) can check and help you apply some of these rules.
+
+We have also provided the Python scripts we use in our pre-commit pipeline inside the scripts directory.
+ - format_code.py: checks Android.bp, bad style, end of file, formats doxygen, runs astyle and clang-format (assuming necessary binaries are in the path). Example invocations:
+@code{.sh}
+ python format_code.py
+ python format_code.py --error-on-diff
+ python format_code.py --files=git-diff (Default behavior in pre-commit configuration, where it checks the staged files)
+@endcode
+ - generate_build_files.py: generates build files required for CMake and Bazel builds. Example invocations:
+@code{.sh}
+ python generate_build_files.py --cmake
+ python generate_build_files.py --bazel
+@endcode
+
+Another way of running the checks is using the `pre-commit` (https://pre-commit.com/) framework, which also has nice features like checking for trailing spaces, large committed files, etc.
+`pre-commit` can be installed via `pip`. After installing, run the following command in the root directory of the repository:
+
+ pre-commit install
+
+This will create the hooks that perform the formatting checks mentioned above and will automatically run just before committing to flag issues.
+
+@subsection S5_1_3_library_size_guidelines Library size: best practices and guidelines
+
+@subsubsection S5_1_3_1_template_suggestions Template suggestions
+
+When writing a new patch we should also keep in mind the effect it will have on the final library size. We can try some of the following things:
+
+ - Place non-dependent template code in a different non-templated class/method
+
+@code{.cpp}
+template<typename T>
+class Foo
+{
+public:
+ enum { v1, v2 };
+ // ...
+};
+@endcode
+
+ can be converted to:
+
+@code{.cpp}
+struct Foo_base
+{
+ enum { v1, v2 };
+ // ...
+};
+
+template<typename T>
+class Foo : public Foo_base
+{
+public:
+ // ...
+};
+@endcode
+
+ - In some cases it's preferable to use runtime switches instead of template parameters
+
+ - Sometimes we can rewrite the code without templates and without any (significant) performance loss. Let's say that we've written a function where the templated argument is only used for casting:
+
+@code{.cpp}
+template <typename T>
+void NETemplatedKernel::run(const Window &window)
+{
+...
+ *(reinterpret_cast<T *>(out.ptr())) = *(reinterpret_cast<const T *>(in.ptr()));
+...
+}
+@endcode
+
+The above snippet can be transformed to:
+
+@code{.cpp}
+void NENonTemplatedKernel::run(const Window &window)
+{
+...
+std::memcpy(out.ptr(), in.ptr(), element_size);
+...
+}
+@endcode
+
+@subsection S5_1_4_secure_coding_practices Secure coding practices
+
+@subsubsection S5_1_4_1_general_coding_practices General Coding Practices
+
+- **Use tested and approved managed code** rather than creating new unmanaged code for common tasks.
+- **Utilize locking to prevent multiple simultaneous requests** or use a synchronization mechanism to prevent race conditions.
+- **Protect shared variables and resources** from inappropriate concurrent access.
+- **Explicitly initialize all your variables and other data stores**, either during declaration or just before the first usage.
+- **In cases where the application must run with elevated privileges, raise privileges as late as possible, and drop them as soon as possible**.
+- **Avoid calculation errors** by understanding your programming language's underlying representation and how it interacts with numeric calculation. Pay close attention to byte size discrepancies, precision, signed/unsigned distinctions, truncation, conversion and casting between types, "not-a-number" calculations, and how your language handles numbers that are too large or too small for its underlying representation.
+- **Restrict users from generating new code** or altering existing code.
+
+
+@subsubsection S5_1_4_2_secure_coding_best_practices Secure Coding Best Practices
+
+- **Validate input**. Validate input from all untrusted data sources. Proper input validation can eliminate the vast majority of software vulnerabilities. Be suspicious of most external data sources, including command line arguments, network interfaces, environmental variables, and user controlled files.
+- **Heed compiler warnings**. Compile code using the default compiler flags that exist in the SConstruct file.
+- Use **static analysis tools** to detect and eliminate additional security flaws.
+- **Keep it simple**. Keep the design as simple and small as possible. Complex designs increase the likelihood that errors will be made in their implementation, configuration, and use. Additionally, the effort required to achieve an appropriate level of assurance increases dramatically as security mechanisms become more complex.
+- **Default deny**. Base access decisions on permission rather than exclusion. This means that, by default, access is denied and the protection scheme identifies conditions under which access is permitted.
+- **Adhere to the principle of least privilege**. Every process should execute with the least set of privileges necessary to complete the job. Any elevated permission should only be accessed for the least amount of time required to complete the privileged task. This approach reduces the opportunities an attacker has to execute arbitrary code with elevated privileges.
+- **Sanitize data sent to other systems**. Sanitize all data passed to complex subsystems such as command shells, relational databases, and commercial off-the-shelf (COTS) components. Attackers may be able to invoke unused functionality in these components through the use of various injection attacks. This is not necessarily an input validation problem because the complex subsystem being invoked does not understand the context in which the call is made. Because the calling process understands the context, it is responsible for sanitizing the data before invoking the subsystem.
+- **Practice defense in depth**. Manage risk with multiple defensive strategies, so that if one layer of defense turns out to be inadequate, another layer of defense can prevent a security flaw from becoming an exploitable vulnerability and/or limit the consequences of a successful exploit. For example, combining secure programming techniques with secure runtime environments should reduce the likelihood that vulnerabilities remaining in the code at deployment time can be exploited in the operational environment.
+
+@subsection S5_1_5_guidelines_for_stable_api_abi Guidelines for stable API/ABI
+
+The Application Programming Interface (API) and Application Binary Interface (ABI) are the interfaces exposed
+to users so their programs can interact with the library efficiently and effectively. Even though changing API/ABI
+in a way that breaks backward compatibility is not necessarily bad if it can improve other users' experience and the library,
+contributions should be made with an awareness of API/ABI stability. If you'd like to make changes that affect
+the library's API/ABI, please review and follow the guidelines shown in this section. Also, please note that
+these guidelines are not an exhaustive list but discuss things that might be easily overlooked.
+
+@subsubsection S5_1_5_1_guidelines_for_api Guidelines for API
+
+- When adding new arguments, consider grouping arguments (including the old ones) into a struct rather than adding arguments with default values; a sketch follows this list.
+Introducing a new struct might break the API/ABI once, but it helps keep the interface stable afterwards.
+- When new member variables are added, please make sure they are initialized.
+- Avoid adding enum elements in the middle.
+- When removing arguments, follow the deprecation process described in the following section.
+- When changing behavior affecting API contracts, follow the deprecation process described in the following section.
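+
+A sketch of the struct-grouping suggestion (the names are illustrative, not an existing API):
+
+@code{.cpp}
+// Instead of growing the argument list with defaulted parameters...
+void configure(ITensor *input, ITensor *output, float alpha, float beta);
+
+// ...group the parameters into an info struct that can be extended later
+struct ExampleLayerInfo
+{
+    float alpha{ 1.f };
+    float beta{ 0.f };
+    // New members can be added here, initialized with safe defaults
+};
+
+void configure(ITensor *input, ITensor *output, const ExampleLayerInfo &info);
+@endcode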
+
+@subsubsection S5_1_5_2_guidelines_for_abi Guidelines for ABI
+
+We recommend reading through <a href="https://community.kde.org/Policies/Binary_Compatibility_Issues_With_C%2B%2B">this page</a>
+and double-checking your contributions to see if they include any of the changes listed there.
+
+Also, for classes that require strong ABI stability, consider using the <a href="https://en.cppreference.com/w/cpp/language/pimpl">pImpl idiom</a>.
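+
+A minimal sketch of the idiom (the class name is illustrative):
+
+@code{.cpp}
+#include <memory>
+
+// Public header: no implementation details are exposed, so they can change
+// without affecting the class's ABI
+class ExampleOperator
+{
+public:
+    ExampleOperator();
+    ~ExampleOperator(); // Defined in the .cpp file, where Impl is complete
+    void run();
+
+private:
+    struct Impl;                 // Defined only in the .cpp file
+    std::unique_ptr<Impl> _impl; // All members live behind this pointer
+};
+@endcode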
+
+@subsubsection S5_1_5_3_api_deprecation_process API deprecation process
+
+In order to deprecate an existing API, these rules should be followed.
+
+- Removal of a deprecated API should wait at least for one official release.
+- Deprecation of runtime APIs should strictly follow the aforementioned period, whereas core APIs can have more flexibility as they are mostly used internally rather than being user-facing.
+- Any API changes (update, addition and deprecation) in all components should be well documented by the contribution itself.
+
+Also, it is recommended to use the following utility macros, which are designed to work with both clang and gcc using C++14 and later.
+
+- ARM_COMPUTE_DEPRECATED: Just deprecate the wrapped function
+- ARM_COMPUTE_DEPRECATED_REL: Deprecate the wrapped function and also capture the release that was deprecated
+- ARM_COMPUTE_DEPRECATED_REL_REPLACE: Deprecate the wrapped function and also capture the release that was deprecated along with a possible replacement candidate
+
+@code{.cpp}
+ARM_COMPUTE_DEPRECATED_REL_REPLACE(20.08, DoNewThing)
+void DoOldThing();
+
+void DoNewThing();
+@endcode
+
+@section S5_2_how_to_submit_a_patch How to submit a patch
+
+To be able to submit a patch to our development repository you need to have a GitHub account. With that, you will be able to sign in to Gerrit where your patch will be reviewed.
+
+Next step is to clone the Compute Library repository:
+
+ git clone "ssh://<your-github-id>@review.mlplatform.org:29418/ml/ComputeLibrary"
+
+If you have cloned from GitHub or through HTTP, make sure you add a new git remote using SSH:
+
+ git remote add acl-gerrit "ssh://<your-github-id>@review.mlplatform.org:29418/ml/ComputeLibrary"
+
+After that, you will need to upload an SSH key to https://review.mlplatform.org/#/settings/ssh-keys
+
+Then, make sure to install the commit-msg Git hook in order to add a change-ID to the commit message of your patch:
+
+ cd "ComputeLibrary" && mkdir -p .git/hooks && curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://review.mlplatform.org/tools/hooks/commit-msg; chmod +x `git rev-parse --git-dir`/hooks/commit-msg
+
+When your patch is ready, remember to sign off your contribution by adding a line with your name and e-mail address to every git commit message:
+
+ Signed-off-by: John Doe <john.doe@example.org>
+
+You must use your real name, no pseudonyms or anonymous contributions are accepted.
+
+You can add this to your patch with:
+
+ git commit -s --amend
+
+You are now ready to submit your patch for review:
+
+ git push acl-gerrit HEAD:refs/for/main
+
+@section S5_3_code_review Patch acceptance and code review
+
+Once a patch is uploaded for review, there is a pre-commit test that runs on a Jenkins server for continuous integration tests. In order to be merged a patch needs to:
+
+- get a "+1 Verified" from the pre-commit job
+- get a "+1 Comments-Addressed", in case of comments from reviewers the committer has to address them all. A comment is considered addressed when the first line of the reply contains the word "Done"
+- get a "+2" from a reviewer, that means the patch has the final approval
+
+At the moment, the Jenkins server is not publicly accessible and for security reasons patches submitted by non-allowlisted committers do not trigger the pre-commit tests. For this reason, one of the maintainers has to manually trigger the job.
+
+If the pre-commit test fails, the Jenkins job will post a comment on Gerrit with the details of the failure so that the committer can reproduce the error and fix the issue, if any (sometimes there can be infrastructure issues, for example a test platform disconnecting, in which case the job needs to be retriggered).
+
+*/
+} // namespace arm_compute
diff --git a/docs/contributor_guide/implementation_topics.dox b/docs/contributor_guide/implementation_topics.dox
new file mode 100644
index 0000000000..6ca78f98e7
--- /dev/null
+++ b/docs/contributor_guide/implementation_topics.dox
@@ -0,0 +1,189 @@
+///
+/// Copyright (c) 2017-2021, 2024 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page implementation_topic Implementation Topics
+
+@section implementation_topic_assembly_kernels Assembly kernels
+
+Arm Compute Library contains a collection of highly optimized assembly kernels for the Arm® A-profile architecture. At runtime the
+library selects the best kernel based on the CPU detected. For example, if the CPU supports the dot product instruction,
+the library will choose a GEMM kernel which uses the dot product instruction. There are various kernels using Neon™ and
+architecture extensions like FP16, dot product, SVE, SVE2 and SME.
+
+For example, some assembly kernels are located in the folders:
+- src/core/NEON/kernels/arm_gemm/kernels
+- src/core/NEON/kernels/arm_gemm/pooling
+- src/core/NEON/kernels/arm_conv/depthwise
+
+
+The assembly kernels are written using assembly mnemonics and the .inst directive, which inserts the machine code directly into the output.
+
+Below you can see a code block from one of the kernels in the library which uses the .inst directive to generate the sdot instruction.
+This code can be found in the kernel @ref src/core/NEON/kernels/arm_gemm/kernels/a64_hybrid_s8qa_dot_4x16/a55.cpp
+
+@code{.cpp}
+".inst 0x4f80eb10 // sdot v16.4s, v24.16b, v0.4b[2]\n"
+".inst 0x4f81eb14 // sdot v20.4s, v24.16b, v1.4b[2]\n"
+" ldr d24, [x12, #0xf0]\n"
+" ldr x20, [x12, #0xf8]\n"
+" .inst 0x4f80ebd1 // sdot v17.4s, v30.16b, v0.4b[2]\n"
+" .inst 0x4f81ebd5 // sdot v21.4s, v30.16b, v1.4b[2]\n"
+" mov v27.d[1], x23\n"
+" .inst 0x4f80ebb2 // sdot v18.4s, v29.16b, v0.4b[2]\n"
+" mov v26.d[1], x22\n"
+" .inst 0x4f81ebb6 // sdot v22.4s, v29.16b, v1.4b[2]\n"
+" mov v25.d[1], x21\n"
+" .inst 0x4f80eb93 // sdot v19.4s, v28.16b, v0.4b[2]\n"
+" mov v24.d[1], x20\n"
+" .inst 0x4f81eb97 // sdot v23.4s, v28.16b, v1.4b[2]\n"
+" add x9, x9, #0x10\n"
+" add x28, x28, #0x10\n"
+" add x12, x12, #0x100\n"
+" .inst 0x4fa0eb70 // sdot v16.4s, v27.16b, v0.4b[3]\n"
+" .inst 0x4fa1eb74 // sdot v20.4s, v27.16b, v1.4b[3]\n"
+" .inst 0x4fa0eb51 // sdot v17.4s, v26.16b, v0.4b[3]\n"
+" .inst 0x4fa1eb55 // sdot v21.4s, v26.16b, v1.4b[3]\n"
+@endcode
+
+Note that every occurrence of .inst is accompanied by a comment with the original assembly mnemonic for readability purposes.
+
+The reason for using the opcodes instead of the mnemonics is that this approach works on any toolchain, including those without support for the dot product mnemonic. The .inst directive is used to generate many other instructions as well, ensuring the code will compile on older toolchains that do not support them.
+
+@section implementation_topic_windows Windows
+
+A @ref Window represents a workload to execute; it can handle up to @ref Coordinates::num_max_dimensions dimensions.
+Each dimension is defined by a start, an end and a step.
+
+It can be split into sub-windows as long as *all* the following rules remain true for all the dimensions:
+
+- max[n].start() <= sub[n].start() < max[n].end()
+- sub[n].start() < sub[n].end() <= max[n].end()
+- max[n].step() == sub[n].step()
+- (sub[n].start() - max[n].start()) % max[n].step() == 0
+- (sub[n].end() - sub[n].start()) % max[n].step() == 0
+
+@section implementation_topic_kernels Kernels
+
+Each implementation of the @ref IKernel interface (base class of all the kernels in the core library) works in the same way:
+
+OpenCL kernels:
+
+@code{.cpp}
+// Initialize the CLScheduler with the default context and default command queue
+// Implicitly initializes the CLKernelLibrary to use ./cl_kernels as location for OpenCL kernels files and sets a default device for which OpenCL programs are built.
+CLScheduler::get().default_init();
+
+cl::CommandQueue q = CLScheduler::get().queue();
+// Create a kernel object:
+MyKernel kernel;
+// Initialize the kernel with the input/output and options you want to use:
+kernel.configure(input, output, option0, option1);
+// Retrieve the execution window of the kernel:
+const Window &max_window = kernel.window();
+// Run the whole kernel in the current thread:
+kernel.run(q, max_window); // Enqueue the kernel to process the full window on the default queue
+
+// Wait for the processing to complete:
+q.finish();
+@endcode
+
+Neon™ / CPP kernels:
+
+@code{.cpp}
+// Create a kernel object:
+MyKernel kernel;
+// Initialize the kernel with the input/output and options you want to use:
+kernel.configure(input, output, option0, option1);
+// Retrieve the execution window of the kernel:
+const Window &max_window = kernel.window();
+// Run the whole kernel in the current thread:
+kernel.run(max_window); // Run the kernel on the full window
+@endcode
+
+@section implementation_topic_multithreading Multi-threading
+
+The previous section shows how to run an Arm® Neon™ / CPP kernel in the current thread. However, if your system has several CPU cores, you will probably want the kernel to use several cores. Here is how this can be done:
+
+@code{.cpp}
+ ThreadInfo info;
+ info.cpu_info = &_cpu_info;
+
+ const Window &max_window = kernel->window();
+ const unsigned int num_iterations = max_window.num_iterations(split_dimension);
+ info.num_threads = std::min(num_iterations, _num_threads);
+
+ if(num_iterations == 0)
+ {
+ return;
+ }
+
+ if(!kernel->is_parallelisable() || info.num_threads == 1)
+ {
+ kernel->run(max_window, info);
+ }
+ else
+ {
+ int t = 0;
+ auto thread_it = _threads.begin();
+
+ for(; t < info.num_threads - 1; ++t, ++thread_it)
+ {
+ Window win = max_window.split_window(split_dimension, t, info.num_threads);
+ info.thread_id = t;
+ thread_it->start(kernel, win, info);
+ }
+
+ // Run last part on main thread
+ Window win = max_window.split_window(split_dimension, t, info.num_threads);
+ info.thread_id = t;
+ kernel->run(win, info);
+
+ try
+ {
+ for(auto &thread : _threads)
+ {
+ thread.wait();
+ }
+ }
+ catch(const std::system_error &e)
+ {
+ std::cerr << "Caught system_error with code " << e.code() << " meaning " << e.what() << '\n';
+ }
+ }
+@endcode
+
+This is a very basic implementation which was originally used in the Arm® Neon™ runtime library by all the Arm® Neon™ functions.
+
+@sa CPPScheduler
+
+@note Some kernels need some local temporary buffer to perform their calculations. In order to avoid memory corruption between threads, the local buffer must be of size: ```memory_needed_per_thread * num_threads``` and a unique thread_id between 0 and num_threads must be assigned to the @ref ThreadInfo object passed to the ```run``` function.
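+
+For instance, a kernel could slice a shared scratch buffer by thread id (a sketch; the kernel class, the buffer member and the per-thread size are illustrative):
+
+@code{.cpp}
+void MyKernel::run(const Window &window, const ThreadInfo &info)
+{
+    // Each thread works on its own slice of the scratch buffer, so threads never overlap
+    uint8_t *scratch = _scratch_buffer.data() + info.thread_id * _memory_needed_per_thread;
+
+    // ... use scratch for this thread's temporary results while processing window ...
+}
+@endcode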
+
+
+@section implementation_topic_cl_scheduler OpenCL kernel library
+
+All OpenCL kernels used by the library are built and stored in @ref CLKernelLibrary.
+If the library is compiled with embed_kernels=0, the application can set the path to the OpenCL kernels by calling @ref CLKernelLibrary::init(); by default the path is set to "./cl_kernels".
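+
+For example (the path is illustrative):
+
+@code{.cpp}
+// Point the library at a custom location for the OpenCL kernel source files
+CLKernelLibrary::get().init("/opt/acl/cl_kernels/", cl::Context::getDefault(), cl::Device::getDefault());
+@endcode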
+*/
+} // namespace arm_compute
diff --git a/docs/contributor_guide/non_inclusive_language_examples.dox b/docs/contributor_guide/non_inclusive_language_examples.dox
new file mode 100644
index 0000000000..addfdd34dd
--- /dev/null
+++ b/docs/contributor_guide/non_inclusive_language_examples.dox
@@ -0,0 +1,4 @@
+ - master/slave
+ - black/white
+ - he/she, him/her, his/hers
+ - When referring to a person where gender is irrelevant or unknown, kindly use they, them, theirs, or a person’s preferred pronoun. \ No newline at end of file