Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Support
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
A
aml
Project
Project
Details
Activity
Releases
Cycle Analytics
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Charts
Issues
24
Issues
24
List
Boards
Labels
Milestones
Merge Requests
5
Merge Requests
5
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Charts
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Charts
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
argo
aml
Commits
20409e3a
Commit
20409e3a
authored
Sep 04, 2019
by
Florence Monna
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
[wip] working on a layout example based on DGEMM.
parent
ae4b3a20
Pipeline
#8411
passed with stages
in 11 minutes and 15 seconds
Changes
1
Pipelines
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
57 additions
and
15 deletions
+57
-15
doc/pages/layout.rst
doc/pages/layout.rst
+57
-15
No files found.
doc/pages/layout.rst
View file @
20409e3a
...
...
@@ -47,32 +47,74 @@ pictured in the figure below.
Example
-------
Let's look at a problem where layouts can be quite useful:
matrix
multiplication, with DGEMM
.
Let's say you want to multipl
e
matrix A (size [m, k]) with matrix B
Let's look at a problem where layouts can be quite useful:
DGEMM in multiple
levels
.
Let's say you want to multipl
y
matrix A (size [m, k]) with matrix B
(size [k, n]) to get matrix C (size [m, n]).
The first step is implementing an efficient micro-kernel.
The micro-kernel update a block of C of size [mr, nr] noted C_r using a block of
The naive matrix multiplication algorithm should look something like:
.. code:: c
for (i = 0; i < m; i++){
for (j = 0; j < n; j++){
cij = C[i*n + j];
for (l = 0; l < k; l++)
cij += A[i*n + l] * B[l*n + j];
C[i*n + j] = cij;
}
}
Unfortunately this algorithm does not have a great runtime complexity...
We can then have 3 nested loops running on blocks of the matrices.
With several sizes of memory, we want to lverage the power of using blocks of
different sizes.
Let's take an algorithm with three levels of granularity.
The first level is focused on fitting our blocks in the smallest cache.
We compute a block of C of size [mr, nr] noted C_r using a block of
A of size [mr, kb] noted A_r, and a block of B of size [kb, nr] noted B_r.
A_r is stored in column major order while C_r and B_r are stored in row major
order.
The medium kernel works using blocks of intermediate size.
The medium kernel updates a block of C of size [kb, n] noted C_b using a block
order, allowing us to read A_r row by row, and go with B_r and C_r column by
column.
.. code:: c
for (i = 0; i < m_r; i++){
for (j = 0; j < n_r; j++){
for (l = 0; l < k_b; l++)
C_r[i][j] += A_r[i][l] + B_r[l][j];
}
}
These are our smallest blocks.
The implementation at this level is simply doing the multiplication at a level
where is fast enough.
B_r blocks need to be transposed before they can be accessed column by column.
The second level is when the matrices are so big that you need a second
caching.
We then use blocks of intermediate size.
We compute a block of C of size [mb, n] noted C_b using a block
of A of size [mb, kb] noted A_b, and a block of B of size [kb, n] noted B_b.
A_b is stored as mb/mr consecutive blocks of size [mr, kb] (A_r) in column major
order while C_b is stored as (mb/mr)*(n/nr) blocks of size [mr, nr] (C_r) in row
major order and B_b is stored as n/nr blocks of size [kb, nr] (B_r) in row major
order.
To be efficient, A_b is stored as mb/mr consecutive blocks of size [mr, kb]
(A_r) in column major order while C_b is stored as (mb/mr)*(n/nr) blocks of
size [mr, nr] (C_r) in row major order and B_b is stored as n/nr blocks of size
[kb, nr] (B_r) in row major order.
This means we need to have Ab laid out as a 3-dimensional array mr x kb x (mb/mr),
B as nr x kb x (n/nr), C with 4 dimensions as nr x mr x (mb/mr) x (n/nr).
The large kernel uses matrices of any size.
Let's say we consider the matrices already transformed.
The last level uses the actual matrices, of any size.
The original matrices are C of size [m, n], A of size [m, k] and B of size
[k, n].
The layout used here are: C is stored as m/mb blocks of C_b, A is stored as
(k/kb) * (m/mb) blocks of A_b and B is stored as k/kb blocks of B_b.
This means we need to rework A to be laid out in 5 dimensions as
mr x kb x mb/mr x m/mb x k/kb,
B in 4 dimensions as nr x kb x n/nr x k/kb,
C in 5 dimensions as nr x mr x mb/mr x n/nr x m/mb
High level API
--------------
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment