Opened Dec 08, 2020 by Rob Latham (@robl)

`margo_forward_timed` and long-tail latency

Description

In ssg, most of the RPCs use the margo_forward_timed routine with a timeout of 2 seconds (2000 msecs). In some situations, 2 seconds is not long enough. However, in those same situations the slowest margo_forward call completes about ten times faster (roughly 0.2 seconds versus 2 seconds).

Scenario

  • OLCF Summit
  • 8 nodes for SSG provider, one process per node
  • 32 nodes for SSG "observers" (clients), 32 processes per node
  • The 'ssg-bench' tests: the launch/observe tests from ssg, with some extra timing information added (https://xgitlab.cels.anl.gov/sds/ssg-bench)

My modified 'ssg-bench' does the following:

  • each MPI process records how long it takes to initialize the software stack (MPI_Init(), margo_init() and ssg_init()); how long it takes to load the ssg serialized group state file; and how long it takes to observe the ssg group
  • rank 0 collects all the timings
  • rank 0 reports a five-bin histogram of observe times
  • rank 0 reports the average, minimum, and maximum times for the initialize, load, and observe steps
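The reporting steps above can be sketched in a few lines. This is only an illustration, not the ssg-bench code: it assumes the histogram uses five equal-width bins between the minimum and maximum observe time (which is consistent with the bin edges in the output shown in this issue), and the names `histogram`, `summarize`, and `format_bins` are hypothetical. The MPI gather to rank 0 is left out; the functions operate on a plain list of already-collected timings.

```python
from statistics import mean


def histogram(times, nbins=5):
    """Count samples in `nbins` equal-width bins spanning [min, max]."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for t in times:
        if width == 0:
            idx = 0  # all samples identical: put everything in the first bin
        else:
            # The maximum value would land one past the end, so clamp it
            # into the last bin.
            idx = min(int((t - lo) / width), nbins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(nbins + 1)]
    return edges, counts


def summarize(times):
    """Return (average, minimum, maximum) of the collected timings."""
    return mean(times), min(times), max(times)


def format_bins(edges, counts):
    """Render one 'low-high : count' line per bin."""
    return [
        f"{edges[i]:.6f}-{edges[i + 1]:.6f} : {counts[i]}"
        for i in range(len(counts))
    ]


if __name__ == "__main__":
    # Made-up observe times (seconds) with one straggler, for illustration.
    sample = [0.01, 0.02, 0.03, 0.05, 2.0]
    edges, counts = histogram(sample)
    for line in format_bins(edges, counts):
        print(line)
    avg, lo, hi = summarize(sample)
    print(f" {len(sample)} : observe average (min max): {avg:.6f} ( {lo:.6f} {hi:.6f} )")
```

With equal-width bins, a handful of stragglers stretches the range so far that almost every other sample collapses into the first bin, which is worth keeping in mind when reading the timed results.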

With margo_forward_timed, the 1024 clients show the following distribution of observe times (in seconds):

0.006624-0.420983 : 1020
0.420983-0.835342 : 0
0.835342-1.249700 : 0
1.249700-1.664059 : 0
1.664059-2.078418 : 4
 1024 : init average (min max): 6.495028 ( 4.839191 7.676488 )
 1024 : load average (min max): 0.004141 ( 0.000073 0.030747 )
 1024 : observe average (min max): 0.123265 ( 0.006624 2.078418 )

With margo_forward, the same experiment completes much more quickly; the slowest observe takes about 0.21 seconds instead of 2.08:

0.007187-0.047266 : 270
0.047266-0.087346 : 121
0.087346-0.127425 : 248
0.127425-0.167504 : 276
0.167504-0.207583 : 109
 1024 : init average (min max): 6.133601 ( 4.772916 7.190876 )
 1024 : load average (min max): 0.003372 ( 0.000062 0.021013 )
 1024 : observe average (min max): 0.099808 ( 0.007187 0.207583 )
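
To put numbers on the long tail, the reported figures from the two runs can be compared directly. The snippet below is just arithmetic on the values printed in this issue; the variable names are illustrative and nothing here calls margo or ssg.

```python
# Figures copied from the two runs reported in this issue (seconds).
clients = 1024
timed = {"avg": 0.123265, "min": 0.006624, "max": 2.078418}  # margo_forward_timed
plain = {"avg": 0.099808, "min": 0.007187, "max": 0.207583}  # margo_forward

# Number of clients in the slowest bin of the margo_forward_timed histogram.
slow_clients = 4

tail_fraction = slow_clients / clients    # share of clients in the slow bin
max_ratio = timed["max"] / plain["max"]   # worst-case slowdown
avg_ratio = timed["avg"] / plain["avg"]   # average slowdown

print(f"clients in the slow bin: {tail_fraction:.2%}")
print(f"max observe time ratio:  {max_ratio:.2f}x")
print(f"avg observe time ratio:  {avg_ratio:.2f}x")
```

The worst case differs by about a factor of ten while the average differs far less, so the gap comes from a few stragglers (under half a percent of the clients) rather than a uniform slowdown.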
Reference: sds/margo#68