Documentation: add a doc for blk-iolatency
A basic documentation to describe the interface, statistics, and
behavior of io.latency.
Signed-off-by: Josef Bacik <email@example.com>
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..f6684ec 100644
@@ -51,6 +51,9 @@
5-3-1. IO Interface Files
+ 5-3-3. IO Latency
+ 5-3-3-1. How IO Latency Throttling Works
+ 5-3-3-2. IO Latency Interface Files
5-4-1. PID Interface Files
@@ -1395,6 +1398,82 @@
+This is a cgroup v2 controller for IO workload protection. You provide a group
+with a latency target, and if the average latency exceeds that target the
+controller will throttle any peers that have a lower latency target than the
+The limits are only applied at the peer level in the hierarchy. This means that
+in the diagram below, only groups A, B, and C will influence each other, and
+groups D and F will influence each other. Group G will influence nobody.
+ / | \
+ A B C
+ / \ |
+ D F G
+So the ideal way to configure this is to set io.latency in groups A, B, and C.
+Generally you do not want to set a value lower than the latency your device
+supports. Experiment to find the value that works best for your workload, start
+at higher than the expected latency for your device and watch the total_lat_avg
+value in io.stat for your workload group to get an idea of the latency you see
+during normal operation. Use this value as a basis for your real setting,
+setting at 10-15% higher than the value in io.stat. Experimentation is key here
+because total_lat_avg is a running total, so is the "statistics" portion of
+"lies, damned lies, and statistics."
+How IO Latency Throttling Works
+io.latency is work conserving, so as long as everybody is meeting their latency
+target the controller doesn't do anything. Once a group starts missing it's
+target it begins throttling any peer group that has a higher target than itself.
+This throttling takes 2 forms:
+- Queue depth throttling. This is the number of outstanding IO's a group is
+ allowed to have. We will clamp down relatively quickly, starting at no limit
+ and going all the way down to 1 IO at a time.
+- Artificial delay induction. There are certain types of IO that cannot be
+ throttled without possibly adversely affecting higher priority groups. This
+ includes swapping and metadata IO. These types of IO are allowed to occur
+ normally, however they are "charged" to the originating group. If the
+ originating group is being throttled you will see the use_delay and delay
+ fields in io.stat increase. The delay value is how many microseconds that are
+ being added to any process that runs in this group. Because this number can
+ grow quite large if there is a lot of swapping or metadata IO occurring we
+ limit the individual delay events to 1 second at a time.
+Once the victimized group starts meeting it's latency target again it will start
+unthrottling any peer groups that were throttled previously. If the victimized
+group simply stops doing IO the global counter will unthrottle appropriately.
+IO Latency Interface Files
+ This takes a similar format as the other controllers.
+ "MAJOR:MINOR target=<target time in microseconds"
+ If the controller is enabled you will see extra stats in io.stat in
+ addition to the normal ones.
+ This is the current queue depth for the group.
+ The running average IO latency for this group in microseconds.
+ Running average is generally flawed, but will give an
+ administrator a general idea of the overall latency they can
+ expect for their workload on the given disk.