Commit 6905ada6 authored by Kamil Iskra

Minor tweaks

parent 23b69254
Pipeline #9644 passed with stages in 3 minutes and 16 seconds
@@ -25,7 +25,7 @@ void test_default_area(const size_t size, void *host_buf)
aml_perror("aml_area_cuda");
exit(1);
}
// Check we can perform a data transfer to mapped device memory.
assert(cudaMemcpy(device_buf,
host_buf,
size, cudaMemcpyHostToDevice) == cudaSuccess);
@@ -71,7 +71,7 @@ void test_custom_area(const size_t size, void *host_buf)
memset(host_buf, '#', size);
// Check that data on device has been set to the same value,
// i.e., mapping works.
assert(cudaMemcpy(host_buf,
device_buf,
size, cudaMemcpyDeviceToHost) == cudaSuccess);
@@ -84,7 +84,7 @@ is_interleaved(void *data, const size_t size, const size_t page_size)
// Custom area attributes.
struct area_data {
unsigned long nid; // Current node id.
unsigned long nmax; // Highest node number on this system.
int page_size; // Size of a page.
};
Areas
=====
AML areas represent places where data can be read and written.
Typically, an area is an object from which you can query and free memory.
AML names these actions mmap() and munmap(), after the Linux system calls
used to map physical memory into the virtual address space.
On a NUMA processor, mapping of data can be as complex as interleaving chunks
across several physical memories and making them appear contiguous in user land.
On an accelerator, the physical memory of the device can be mapped to a
virtual address range that is also mapped to host physical memory.
AML builds areas on top of libnuma and CUDA to reach these levels of
customization while keeping memory queries and frees as simple as a function
call.
As compute nodes get bigger, building relevant areas to manage locality
is likely to improve performance of memory-bound applications.
AML areas provide functions to query memory from specific places materialized as
areas. Available area implementations dictate the way such
places can be arranged and determine their properties. An AML area is a
low-level/high-overhead abstraction and is not intended to be optimized for
fragmented, small allocations. Instead, it is intended to be a basic block for
implementing better allocators.
The API of AML Area is broken down into two levels.
- The `high-level API <../../pages/areas.html>`_ provides generic functions that can be applied to all areas. It also describes the general structure of an area for implementers.
- Implementation-specific methods, constructors, and static area declarations reside in the second level of headers `<aml/area/\*.h> <https://xgitlab.cels.anl.gov/argo/aml/tree/master/include/aml/area>`_.
Use Cases
-------------
- Building custom memory mapping policies: high-bandwidth memory only, interleave custom block sizes.
- Abstracting allocators' memory mapping.
Usage
-----
@@ -51,8 +51,8 @@ a static declaration of a default area: `aml_area_linux`.

.. code-block:: c
void *data = aml_area_mmap(&aml_area_linux, 4096, NULL);
Here we have allocated 4096 bytes of data and stored it in the `data`
variable. This data can later be freed as follows.
.. code-block:: c
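// A sketch, assuming aml_area_munmap() mirrors aml_area_mmap() and takes
// the area, the pointer, and the size of the original mapping.
aml_area_munmap(&aml_area_linux, data, 4096);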
@@ -62,10 +62,10 @@

Linux Area
----------
If you are working on a NUMA processor, you will eventually want more
control over your memory provider. For instance, you might want your data
to be spread across all memories to balance the load. One way to achieve it
is to use the interleave Linux policy. This policy can be applied when
building a custom Linux area.
.. code-block:: c
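// A sketch under assumptions: the constructor name, argument order, and
// policy flag are modeled on <aml/area/linux.h> and may differ; check the
// header before use.
#include <aml/area/linux.h>

struct aml_area *interleave_area;
aml_area_linux_create(&interleave_area, NULL,
                      AML_AREA_LINUX_POLICY_INTERLEAVE);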
@@ -78,31 +78,32 @@

Now we have an "allocator" of interleaved data.

.. code-block:: c
void *data = aml_area_mmap(interleave_area, 4096*8, NULL);
Here we have allocated 8*4096 bytes of data across system memories.
CUDA Area
---------
If you compiled AML on a CUDA-capable node, you will be able to use
the AML CUDA implementations of its building blocks.
It is possible to allocate CUDA device memory with AML,
in a very similar way to the Linux implementation.
.. code-block:: c
#include <aml.h> // General high-level API.
#include <aml/cuda/linux.h> // CUDA area implementation.
void *data = aml_area_mmap(&aml_area_cuda, 4096, NULL);
The pointer obtained from this allocation is a device-side pointer.
It can't be directly read or written from the host processor.
Exercise: CUDA Mirror Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As an exercise, dive into the `<aml/cuda/linux.h>` header and create an area
that will hand out pointers that can be read and written on both the host
and the device side. Check that modifications on the host side are mirrored
on the device side.
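If you want a head start, one possible shape of the answer is sketched below; the constructor name and the mapped-allocation flag are assumptions to be checked against the `<aml/cuda/linux.h>` header.

.. code-block:: c

// Assumed names: the constructor and flag spellings are illustrative only.
struct aml_area *mirror_area;
aml_area_cuda_create(&mirror_area, 0, AML_AREA_CUDA_FLAG_ALLOC_MAPPED);

void *data = aml_area_mmap(mirror_area, 4096, NULL);
// `data` should now be readable and writable on the host, with changes
// visible from the device.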
.. container:: toggle
@@ -113,7 +114,7 @@
.. literalinclude:: 0_aml_area_cuda.c
:language: c
You can find this solution in *doc/tutorials/area/*.
Implementing a Custom Area
--------------------------
@@ -121,10 +122,11 @@
You might want to use AML blocks with a different area behavior that is not
part of AML. This is achievable by implementing the area building block to
match the desired behavior.
In short, all AML building blocks consist of attributes stored in the `data`
field and methods stored in the `ops` field. In the case of an area,
`struct aml_area_ops` requires that custom mmap, munmap, and fprintf fields
are implemented.
Let's implement an empty area. This area will have no attributes, i.e., data
is NULL, and its operations will print a message.
We first implement area methods.
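A minimal sketch of such methods could look as follows; the operation signatures here are assumptions modeled on the generic area header and should be checked against the real `struct aml_area_ops` declaration.

.. code-block:: c

#include <stdio.h>
#include <aml.h>

// Assumed signatures; verify against struct aml_area_ops in the headers.
void *empty_mmap(const struct aml_area_data *data, size_t size,
                 struct aml_area_mmap_options *opts)
{
        (void)data; (void)size; (void)opts;
        fprintf(stderr, "empty area: mmap.\n");
        return NULL; // No attributes, no memory to hand out.
}

int empty_munmap(const struct aml_area_data *data, void *ptr, size_t size)
{
        (void)data; (void)ptr; (void)size;
        fprintf(stderr, "empty area: munmap.\n");
        return 0;
}

int empty_fprintf(const struct aml_area_data *data, FILE *stream,
                  const char *prefix)
{
        (void)data;
        return fprintf(stream, "%s: empty area\n", prefix);
}

struct aml_area_ops empty_ops = {
        .mmap = empty_mmap,
        .munmap = empty_munmap,
        .fprintf = empty_fprintf,
};

// The area itself: methods in `ops`, no attributes in `data`.
struct aml_area empty_area = {
        .ops = &empty_ops,
        .data = NULL,
};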
@@ -180,22 +182,22 @@

Exercise: interleaving in blocks of 2 pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the mbind() function from libnuma, implement an area
that will interleave blocks of 2 pages across the system memories.
For instance, let's assume a system with 4 NUMA nodes and a buffer of
16 pages. Pages have to be allocated as follows:
.. code-block:: c
page: [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 ]
NUMA: [ 0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3 ]
You can retrieve the size of a page in the following way:
.. code-block:: c
#include <unistd.h>
sysconf(_SC_PAGESIZE);
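As a hint, the core binding loop of one possible solution is sketched below using plain Linux/libnuma calls; `bind_blocks` and its exact arguments are illustrative and not part of AML. Note that mbind() should be applied before the pages are touched.

.. code-block:: c

#include <unistd.h>   // sysconf()
#include <numaif.h>   // mbind(), MPOL_BIND; link with -lnuma

// Bind each block of 2 pages to node (block index % nnodes).
static int bind_blocks(void *buf, size_t size, unsigned long nnodes)
{
        size_t block = 2 * (size_t)sysconf(_SC_PAGESIZE);

        for (size_t off = 0, i = 0; off < size; off += block, i++) {
                unsigned long nodemask = 1UL << (i % nnodes);
                size_t len = (off + block > size) ? size - off : block;

                if (mbind((char *)buf + off, len, MPOL_BIND, &nodemask,
                          sizeof(nodemask) * 8, 0) == -1)
                        return -1;
        }
        return 0;
}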
You can test if your data is interleaved as requested using the code below.
.. container:: toggle
@@ -137,7 +137,7 @@ main(void)
err = aml_dma_async_copy(dma, &b_request, b_continuous_layout, b_layout);
CHK_ABORT(err, "aml_dma_async_copy:");
// Wait for the requests to complete
err = aml_dma_wait(dma, &a_request);
CHK_ABORT(err, "aml_dma_wait:");
err = aml_dma_wait(dma, &b_request);
@@ -4,7 +4,7 @@

DMAs
====
What is an AML DMA?
--------------------
In computer science, DMA (Direct Memory Access) is a hardware-accelerated
method for moving data across memory regions without the intervention of a
compute unit.
@@ -13,41 +13,42 @@ generally moved between two `areas <../../pages/areas.html>`_,
between two virtual memory ranges represented by a pointer.
Data is moved from one `layout <../../pages/layout.html>`_ to another.
When performing a DMA operation, layout coordinates are walked element by
element in post order and matched to translate source coordinates into
destination coordinates.
Depending on the DMA implementation, this operation can be
optimized or offloaded to a DMA accelerator.
Data can thus be moved asynchronously by the DMA engine, e.g., pthreads on a
CPU and CUDA streams on CUDA accelerators.
The API for using AML DMA is broken down into two levels.
- The `high-level API <../../pages/dmas.html>`_ provides generic functions that can be applied to all DMAs. It also describes the general structure of a DMA for implementers.
- Implementation-specific methods, constructors, and static DMA declarations reside in the second level of headers `<aml/dma/\*.h> <https://xgitlab.cels.anl.gov/argo/aml/tree/master/include/aml/dma>`_.
Examples of AML DMA Use Cases
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- `Prefetching <https://doi.org/10.1109/MCHPC49590.2019.00015>`_:
Writing an efficient matrix multiplication routine requires many
architecture-specific optimizations such as vectorization and cache blocking.
On hardware with a software-managed side memory cache (e.g., MCDRAM on Intel
Knights Landing), manual prefetch of blocks into the side cache helps improve
performance for large problem sizes. AML DMA can help make the code
more compact by abstracting asynchronous memory movements. Moreover,
AML is also able to benefit from the prefetch time to reorganize data
such that matrix multiplication on the prefetched blocks will be vectorized.
See linked `publication <https://doi.org/10.1109/MCHPC49590.2019.00015>`_
for more details.
- Replication:
`Some applications <https://github.com/ANL-CESAR/XSBench>`_
will have a memory access pattern such that all threads access the same data
in a read-only and latency-bound fashion. On NUMA computing systems, accessing
distant memories will imply a penalty that results in an increased
application execution time. In such a scenario (application + NUMA),
replicating data on memories in order to avoid NUMA penalties can result in
significant performance improvements. AML DMA is the building block to go when
@@ -62,42 +63,42 @@

First, include the right headers.
.. code-block:: c
#include <aml.h>
#include <aml/dma/linux-par.h> // one DMA implementation.
#include <aml/layout/dense.h> // one layout implementation.
The first header contains the `DMA generic API <../../pages/dmas.html>`_ and
AML utils.
The `second header <../../pages/dma_linux_par_api.html>`_ will help build a
DMA performing data transfers in the background with pthreads.
The `third header <../../pages/layout_dense.html>`_ will help describe the
source and destination data to transfer.
In order to perform a DMA request, you will need to set up a DMA, i.e.,
the engine that performs requests, then perform the request itself.
.. code-block:: c
struct aml_dma *dma;
aml_dma_linux_par_create(&dma, 128, NULL, NULL);
We created a DMA that has 128 slots available off the shelf to
handle asynchronous data transfer requests.
The requests happen between two layouts, so you will have to set up layouts
as well. Let's suppose we want to copy `src` to `dst`.
.. code-block:: c
double src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
double dst[8] = {0, 0, 0, 0, 0, 0, 0, 0};
The simplest copy requires that both `src` and `dst` are one-dimensional
layouts of 8 elements.
.. code-block:: c
size_t dims[1] = {8};
For a one-dimension layout, dimension order does not matter, so let's pick
`AML_LAYOUT_ORDER_COLUMN_MAJOR`. Now we can initialize layouts.
.. code-block:: c
@@ -107,14 +108,14 @@

.. code-block:: c
aml_layout_dense_create(&src_layout, src, AML_LAYOUT_ORDER_COLUMN_MAJOR, sizeof(*src), 1, dims, NULL, NULL);
We have created a DMA engine and described our source and destination data.
We are all set to schedule a copy DMA request.
.. code-block:: c
struct aml_dma_request *request;
aml_dma_async_copy_custom(dma, &request, dst_layout, src_layout, NULL, NULL);
Now the DMA request is in flight.
When we are ready to access the data in `dst`, we can wait for it.
.. code-block:: c
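// Block until the request completes; aml_dma_wait() is the same call used
// in the exercise solution below.
aml_dma_wait(dma, &request);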
@@ -124,9 +125,9 @@
Exercise
--------
Let `a` be a strided vector where contiguous elements are separated by a blank.
Let `b` be a strided vector where contiguous elements are separated by 2 blanks.
Let `ddot` be a function operating on two continuous vectors to perform a dot
product.
The goal is to transform `a` into `continuous_a` and `b` into `continuous_b`
in order to perform the dot product.
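As a hint, `a` can be described directly as a strided layout. The sketch below assumes the seventh argument of `aml_layout_dense_create` is the per-dimension element stride (the argument passed as NULL in the earlier call); check `<aml/layout/dense.h>`. For `b`, the stride would be 3.

.. code-block:: c

double a[16];              // 8 useful elements, one blank in between
size_t dims[1] = {8};      // number of elements in the layout
size_t stride[1] = {2};    // step between consecutive elements, in elements

struct aml_layout *a_layout;
// Assumed stride argument position; verify against <aml/layout/dense.h>.
aml_layout_dense_create(&a_layout, a, AML_LAYOUT_ORDER_COLUMN_MAJOR,
                        sizeof(*a), 1, dims, stride, NULL);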
@@ -178,5 +179,5 @@

Solution
--------
.. literalinclude:: 1_reduction.c
:language: c
You can find this solution in *doc/tutorials/dma/*.
Hello World: Init and Version Check
===================================
A first and easy test that AML is set up properly and can be linked with a
user program is to check that the headers and the library version match. All
AML programs must also start by initializing the library and must end with a
call to the cleanup function.
APIs
----
@@ -24,13 +24,14 @@ Version API
Usage
-----
Both the setup and the version APIs are available directly from the main AML
header.
.. code-block:: c
#include <aml.h>
Initialization is done by passing pointers to the command-line arguments of
the program to the library.
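A minimal sketch of the full lifecycle is below. The `aml_init`/`aml_finalize` calls follow the text above; the version symbols used in the check are assumptions and should be verified in `<aml.h>`.

.. code-block:: c

#include <stdio.h>
#include <aml.h>

int main(int argc, char **argv)
{
        // Initialize the library with pointers to the command-line arguments.
        if (aml_init(&argc, &argv) != 0)
                return 1;

        // Assumed symbol names: compare the version the headers advertise
        // with the version the linked library reports.
        if (aml_version_major != AML_VERSION_MAJOR)
                fprintf(stderr, "AML version mismatch!\n");

        // ... use AML building blocks here ...

        aml_finalize(); // All AML programs must end with the cleanup call.
        return 0;
}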
@@ -68,4 +69,4 @@ code.
.. literalinclude:: 0_hello.c
:language: c
You can find this solution in *doc/tutorials/hello_world/*.
AML: Tutorials
==============
This section contains step-by-step tutorials for each of the building blocks
of AML. They are intended to be followed in order, and include both
explanations of each building block abstraction as well as directed exercises
to better understand each abstraction.