Commit 20409e3a authored by Florence Monna

[wip] working on a layout example based on DGEMM.

parent ae4b3a20
Example
-------
Let's look at a problem where layouts can be quite useful: DGEMM, implemented
with multiple levels of blocking.
Let's say you want to multiply matrix A (size [m, k]) with matrix B
(size [k, n]) to get matrix C (size [m, n]).
The naive matrix multiplication algorithm looks something like this:

.. code:: c

   for (i = 0; i < m; i++) {
       for (j = 0; j < n; j++) {
           double cij = C[i*n + j];
           for (l = 0; l < k; l++)
               cij += A[i*k + l] * B[l*n + j];
           C[i*n + j] = cij;
       }
   }

Unfortunately this algorithm performs poorly in practice: the inner loop walks
B with a stride of n, so once the matrices outgrow the cache most of these
accesses miss.
We can instead run the three nested loops over blocks of the matrices.
Since the memory hierarchy has several levels of different sizes, we want to
leverage blocks of different sizes, one per level.
Let's take an algorithm with three levels of granularity.
The first level is focused on fitting our blocks in the smallest cache.
We compute a block of C of size [mr, nr] noted C_r using a block of
A of size [mr, kb] noted A_r, and a block of B of size [kb, nr] noted B_r.
A_r is stored in column major order while C_r and B_r are stored in row major
order, allowing us to read A_r column by column, and B_r and C_r row by row.

.. code:: c

   /* C_r (mr x nr) += A_r (mr x kb) * B_r (kb x nr): each step l adds the
      outer product of column l of A_r and row l of B_r, matching the
      layouts described above. */
   for (l = 0; l < kb; l++)
       for (i = 0; i < mr; i++)
           for (j = 0; j < nr; j++)
               C_r[i][j] += A_r[i][l] * B_r[l][j];

These are our smallest blocks: the operands are small enough that this plain
triple loop runs entirely out of the fastest cache.
Since the original matrices are row major, the A_r blocks need to be
transposed while packing so that their columns become contiguous.
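As an illustration, here is a minimal sketch of that packing step, assuming a
plain row-major source matrix with row stride lda; the function name pack_A_r
and its signature are hypothetical, not something defined elsewhere in this
document.

.. code:: c

   /* Sketch: pack one [mr, kb] micro-panel of A into column major order.
    * `A` points at the top-left element of the panel inside the original
    * row-major matrix and `lda` is that matrix's row stride.
    * Hypothetical helper, for illustration only. */
   void pack_A_r(const double *A, int lda, double *A_r, int mr, int kb)
   {
       for (int l = 0; l < kb; l++)        /* for each column of the panel */
           for (int i = 0; i < mr; i++)    /* make the column contiguous   */
               A_r[l * mr + i] = A[i * lda + l];
   }

B_r and C_r blocks, on the other hand, keep the row major order of the source
matrices, so packing them is a plain row-by-row copy.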
The second level is for when the matrices are so big that a second level of
caching is needed.
We then use blocks of intermediate size.
We compute a block of C of size [mb, n] noted C_b using a block
of A of size [mb, kb] noted A_b, and a block of B of size [kb, n] noted B_b.
To be efficient, A_b is stored as mb/mr consecutive blocks of size [mr, kb]
(A_r) in column major order.
C_b is stored as (mb/mr)*(n/nr) blocks of size [mr, nr] (C_r) in row major
order, and B_b is stored as n/nr blocks of size [kb, nr] (B_r) in row major
order.
This means we need to have A_b laid out as a 3-dimensional array
mr x kb x (mb/mr), B_b as nr x kb x (n/nr), and C_b with 4 dimensions as
nr x mr x (mb/mr) x (n/nr).
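To make these layouts concrete, here is a sketch of the flat-array index
computations they imply; the helper names (idx_A_b, idx_B_b, idx_C_b) are
illustrative, not an API defined by this document.

.. code:: c

   #include <stddef.h>

   /* Offset of logical element (i, l) of A_b, laid out as
    * mr x kb x (mb/mr): column major A_r blocks, stacked along i. */
   size_t idx_A_b(int i, int l, int mr, int kb)
   {
       return (size_t)(i / mr) * mr * kb + (size_t)l * mr + i % mr;
   }

   /* Offset of logical element (l, j) of B_b, laid out as
    * nr x kb x (n/nr): row major B_r blocks, stacked along j. */
   size_t idx_B_b(int l, int j, int nr, int kb)
   {
       return (size_t)(j / nr) * kb * nr + (size_t)l * nr + j % nr;
   }

   /* Offset of logical element (i, j) of C_b, laid out as
    * nr x mr x (mb/mr) x (n/nr): row major C_r tiles. */
   size_t idx_C_b(int i, int j, int mr, int nr, int mb)
   {
       return (size_t)(j / nr) * (mb / mr) * mr * nr
            + (size_t)(i / mr) * mr * nr
            + (size_t)(i % mr) * nr + j % nr;
   }
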
The last level uses the actual matrices, of any size.
The original matrices are C of size [m, n], A of size [m, k] and B of size
[k, n].
The layouts used here are: C is stored as m/mb blocks of C_b, A is stored as
(k/kb) * (m/mb) blocks of A_b, and B is stored as k/kb blocks of B_b.
This means we need to rework A to be laid out in 5 dimensions as
mr x kb x (mb/mr) x (m/mb) x (k/kb),
B in 4 dimensions as nr x kb x (n/nr) x (k/kb),
and C in 5 dimensions as nr x mr x (mb/mr) x (n/nr) x (m/mb).
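Assuming A, B and C have already been repacked into these layouts, the outer
kernel can be sketched as plain loops over blocks.
Everything here (the function names, the micro_kernel signature, and the
assumption that m, n and k are exact multiples of the block sizes) is
illustrative, not part of the text above.

.. code:: c

   #include <stddef.h>

   /* Hypothetical micro-kernel: C_r (mr x nr) += A_r (mr x kb) * B_r (kb x nr),
    * i.e. the triple loop from the first level, applied to packed blocks. */
   void micro_kernel(const double *A_r, const double *B_r, double *C_r,
                     int mr, int nr, int kb);

   void dgemm_blocked(const double *A, const double *B, double *C,
                      int m, int n, int k, int mr, int nr, int mb, int kb)
   {
       for (int pb = 0; pb < k / kb; pb++) {        /* k, one kb panel at a time */
           const double *B_b = B + (size_t)pb * kb * n;
           for (int ib = 0; ib < m / mb; ib++) {    /* m, one mb block at a time */
               const double *A_b = A + ((size_t)pb * (m / mb) + ib) * mb * kb;
               double *C_b = C + (size_t)ib * mb * n;
               for (int jr = 0; jr < n / nr; jr++)  /* n, one nr panel at a time */
                   for (int ir = 0; ir < mb / mr; ir++)
                       micro_kernel(A_b + (size_t)ir * mr * kb,          /* A_r */
                                    B_b + (size_t)jr * kb * nr,          /* B_r */
                                    C_b + ((size_t)jr * (mb / mr) + ir)
                                          * mr * nr,                     /* C_r */
                                    mr, nr, kb);
           }
       }
   }

Because each level's blocks are stored contiguously, every micro_kernel call
works on three dense, cache-friendly buffers.
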
High level API
--------------