Persistent Memory
-----------------
These pages contain instructions, links and other information related to
persistent memory in Linux.
.. toctree::
:maxdepth: 1
memmap_kernel_params
fs_mount_options
2mib_fs_dax
pmem_in_qemu
Links
~~~~~
Miscellaneous
^^^^^^^^^^^^^
- NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
- Driver Writer’s Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
- NVDIMM Kernel Tree: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git
- NDCTL: https://github.com/pmem/ndctl.git
- NDCTL manual pages online: http://pmem.io/ndctl/
- linux-nvdimm Mailing List: https://lore.kernel.org/nvdimm/
- linux-nvdimm Patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
Blogs
^^^^^
- `NVDIMM Enabling in SUSE Linux Enterprise 12, Service Pack 2 - Part
1 <https://www.suse.com/communities/blog/nvdimm-enabling-suse-linux-enterprise-12-service-pack-2/>`__
- `NVDIMM Enabling in SUSE Linux Enterprise 12, Service Pack 2 - Part
2 <https://www.suse.com/communities/blog/nvdimm-enabling-part-2-intel/>`__
Industry standards and specifications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Advanced Configuration and Power Interface (ACPI) 6.2a:
http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
- Unified Extensible Firmware Interface (UEFI) Specification 2.7:
http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf
- DMTF System Management BIOS (SMBIOS 3.2.0):
https://www.dmtf.org/sites/default/files/standards/documents/DSP0134_3.2.0.pdf
- JEDEC Byte Addressable Energy Backed Interface (JESD245B.01):
https://www.jedec.org/system/files/docs/JESD245B-01.pdf
- JEDEC DDR4 NVDIMM-N Design Specification (JESD248A):
https://www.jedec.org/system/files/docs/JESD248A.pdf
Vendor-specific tools and specifications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Intel Optane DC persistent memory management software:
https://github.com/intel/ipmctl
- Intel DSM Interface:
http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf
- Microsoft DSM Interface:
https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/-dsm-interface-for-byte-addressable-energy-backed-function-class--function-interface-1-
- HPE DSM Interface: https://github.com/HewlettPackard/hpe-nvm
Subtopics
~~~~~~~~~
- :doc:`memmap_kernel_params`
- :doc:`2mib_fs_dax`
- :doc:`pmem_in_qemu`
Quick Setup Guide
~~~~~~~~~~~~~~~~~
One interesting use of the PMEM driver is to allow users to begin
developing software using DAX, which was upstreamed in v4.0. On a
non-NFIT system this can be done by using the memmap kernel
command-line parameter to manually create a type 12 memory region.
Here are the additions I made for my system with 32 GiB of RAM:
1) Reserve 16 GiB of memory via the "memmap" kernel parameter in grub's
menu.lst, using PMEM's new "!" specifier::
memmap=16G!16G
The documentation for this parameter can be found here:
https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
Also see: :doc:`memmap_kernel_params`.
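A minimal sketch of making this change on a GRUB2-based distribution
(file paths and the config-generation command vary by distribution, so
verify before rebooting; legacy GRUB uses menu.lst as noted above)::

    $ # Append memmap=16G!16G to the kernel command line
    $ sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&memmap=16G!16G /' /etc/default/grub
    $ sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # or update-grub on Debian/Ubuntu
    $ # After rebooting, the reserved range appears in the kernel's e820 map:
    $ dmesg | grep -i persistent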
2) Set up the correct kernel configuration options for PMEM and DAX in
.config. To use huge pages for mmapped files, you'll need
CONFIG_FS_DAX_PMD selected, which is done automatically if you have the
prerequisites marked below.
Options in make menuconfig:
- Device Drivers - NVDIMM (Non-Volatile Memory Device) Support
- PMEM: Persistent memory block device support
- BLK: Block data window (aperture) device support
- BTT: Block Translation Table (atomic sector updates)
- Enable the block layer
- Block device DAX support <not available in kernel-4.5 due to page
cache issues>
- File systems
- Direct Access (DAX) support
- Processor type and features
- Support non-standard NVDIMMs and ADR protected memory <if using
the memmap kernel parameter>
- Transparent Hugepage Support <needed for huge pages>
- Allow for memory hot-add <needed for huge pages>
- Allow for memory hot remove <needed for huge pages>
- Device memory (pmem, HMM, etc...) hotplug support <needed for huge
pages>
::
CONFIG_ZONE_DEVICE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_ACPI_NFIT=m
CONFIG_X86_PMEM_LEGACY=m
CONFIG_OF_PMEM=m
CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_BTT=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_FS_DAX=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_KMEM=m
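One way to sanity-check that the running kernel was built with these
options (a sketch; /proc/config.gz requires CONFIG_IKCONFIG_PROC,
otherwise check the copy under /boot)::

    $ zgrep -E 'CONFIG_(LIBNVDIMM|BLK_DEV_PMEM|FS_DAX|ZONE_DEVICE)=' /proc/config.gz
    $ # or, on most distributions:
    $ grep -E 'CONFIG_(LIBNVDIMM|BLK_DEV_PMEM|FS_DAX|ZONE_DEVICE)=' /boot/config-$(uname -r)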
This configuration gave me one pmem device with 16 GiB of space::
$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
lsblk shows the block devices, including pmem devices. Examples::
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 16G 0 disk
├─pmem0p1 259:6 0 4G 0 part /mnt/ext4-pmem0
└─pmem0p2 259:7 0 11.9G 0 part /mnt/btrfs-pmem0
pmem1 259:1 0 16G 0 disk /mnt/xfs-pmem1
pmem2 259:2 0 16G 0 disk /mnt/xfs-pmem2
pmem3 259:3 0 16G 0 disk /mnt/xfs-pmem3
$ lsblk -t
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
pmem0 0 4096 0 4096 512 0 128 128 0B
pmem1 0 4096 0 4096 512 0 128 128 0B
pmem2 0 4096 0 4096 512 0 128 128 0B
pmem3 0 4096 0 4096 512 0 128 128 0B
Namespaces
~~~~~~~~~~
You can divide persistent memory address ranges into namespaces with
ndctl. This stores namespace label metadata at the beginning of the
persistent memory address range.
ndctl create-namespace ties a namespace to a block device or character
device:
.. list-table::
:header-rows: 1
- - mode
- description
- device path
- device type
- label metadata
- atomicity
- filesystems
- DAX
- PFN metadata
- former name
- - raw
- raw
- /dev/pmemN
- block
- no
- no
- yes
- no
- no
-
- - sector
- sector atomic
- /dev/pmemNs
- block
- yes
- yes
- yes
- no
- no
-
- - fsdax
- filesystem DAX
- /dev/pmemN
- block
- yes
- no
- yes
- yes
- yes
- memory
- - devdax
- device DAX
- /dev/daxN.M
- character
- yes
- no
- no
- yes
- yes
- dax
There are two places to store PFN metadata ("struct page" metadata):
- ``--map=mem`` = regular system memory
- adequate for small persistent memory capacities
- ``--map=dev`` = persistent memory
- intended for large persistent memory capacities (there might not
be enough regular memory in the system!)
- persistence of the PFN metadata is not important; this is just
convenient because it scales with the persistent memory capacity
The PFN metadata size is 64 bytes per 4 KiB of persistent memory
(1.5625%). For some common persistent memory capacities:
+-----------------------+-------------------+-----------------------+
| persistent memory | PFN metadata size | example |
| capacity | | |
+=======================+===================+=======================+
| 8 GiB | 128 MiB | |
+-----------------------+-------------------+-----------------------+
| 16 GiB | 256 MiB | |
+-----------------------+-------------------+-----------------------+
| 96 GiB | 1.5 GiB | Six 16 GiB NVDIMMs |
+-----------------------+-------------------+-----------------------+
| 128 GiB | 2 GiB | |
+-----------------------+-------------------+-----------------------+
| 192 GiB | 3 GiB | Twelve 16 GiB NVDIMMs |
+-----------------------+-------------------+-----------------------+
| 256 GiB | 4 GiB | |
+-----------------------+-------------------+-----------------------+
| 512 GiB | 8 GiB | |
+-----------------------+-------------------+-----------------------+
| 768 GiB | 12 GiB | Six 128 GiB NVDIMMs |
+-----------------------+-------------------+-----------------------+
| 1 TiB | 16 GiB | |
+-----------------------+-------------------+-----------------------+
| 1.5 TiB | 24 GiB | Six 256 GiB NVDIMMs |
+-----------------------+-------------------+-----------------------+
| 3 TiB | 48 GiB | Six 512 GiB NVDIMMs |
+-----------------------+-------------------+-----------------------+
| 6 TiB | 96 GiB | Six 1 TiB NVDIMMs, or |
| | | twelve 512 GiB |
| | | NVDIMMs |
+-----------------------+-------------------+-----------------------+
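The 64-bytes-per-4-KiB ratio makes these sizes easy to verify with
shell arithmetic; for example, the 192 GiB row::

    $ # 64 bytes of PFN metadata per 4 KiB page = capacity / 64
    $ echo "$(( 192 * 1024 / 64 )) MiB"   # 192 GiB of pmem
    3072 MiB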
Sector mode uses a Block Translation Table (BTT) to protect software
that does not expect a sector to contain a mix of old and new data
after power loss occurs while a write was underway.
Filesystem DAX mode lets the filesystem provide direct access to
persistent memory to applications by using mmap() (e.g., ext4 and xfs
filesystems).
Device DAX mode creates a character device instead of a block device,
and is intended for applications that mmap() the entire capacity. It
does not support filesystems or interact with the kernel page cache.
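As an aside: with CONFIG_DEV_DAX_KMEM (listed in the configuration
above), kernels v5.1 and later can also treat a device DAX namespace as
hotplugged system RAM. A hedged sketch, assuming a recent daxctl and an
existing dax0.0 device::

    $ # Bind dax0.0 to the kmem driver so it shows up as extra system memory
    $ daxctl reconfigure-device --mode=system-ram dax0.0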
Example commands on an 8 GiB NVDIMM with output showing the resulting
sizes and /dev/ device names:
::
$ ndctl create-namespace --mode raw -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"raw",
"size":"8.00 GiB (8.59 GB)",
"sector_size":512,
"blockdev":"pmem0",
"numa_node":0
}
$ ndctl create-namespace --mode sector -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"sector",
"size":"7.99 GiB (8.58 GB)",
"uuid":"30868a48-9763-4d4d-a6b7-e43dbb165b16",
"sector_size":4096,
"blockdev":"pmem0s",
"numa_node":0
}
$ ndctl create-namespace --mode fsdax --map mem -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":"8.00 GiB (8.59 GB)",
"uuid":"f0ab3a91-c5bc-42b2-805f-4fa6c6075a50",
"sector_size":512,
"blockdev":"pmem0",
"numa_node":0
}
$ ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":"7.87 GiB (8.45 GB)",
"uuid":"64f617f3-b79a-4c92-8ca7-c02d05572d3c",
"sector_size":512,
"blockdev":"pmem0",
"numa_node":0
}
$ ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"mem",
"size":"8.00 GiB (8.59 GB)",
"uuid":"7fc2ecfb-edb2-4370-b9e1-09ecbdf7df16",
"daxregion":{
"id":0,
"size":"8.00 GiB (8.59 GB)",
"align":2097152,
"devices":[
{
"chardev":"dax0.0",
"size":"8.00 GiB (8.59 GB)"
}
]
},
"numa_node":0
}
$ ndctl create-namespace --mode devdax --map dev -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":"7.87 GiB (8.45 GB)",
"uuid":"47343804-46f5-49d8-a76e-76cc240d8fc7",
"daxregion":{
"id":0,
"size":"7.87 GiB (8.45 GB)",
"align":2097152,
"devices":[
{
"chardev":"dax0.0",
"size":"7.87 GiB (8.45 GB)"
}
]
},
"numa_node":0
}
When using QEMU (see the :doc:`pmem_in_qemu` page) your namespaces will
by default be in raw mode. You can use the following bash script to
convert all your raw mode namespaces to fsdax mode:
::
#!/usr/bin/bash -ex
namespaces=$(ndctl list | jq -r '((. | arrays | .[]), . | objects) | select(.mode == "raw") | .dev')
for n in $namespaces; do
ndctl create-namespace -f -e "$n" --mode=fsdax
done
This script highlights a tricky aspect of ndctl and JSON. If you have a
single namespace, ``ndctl list`` returns it as a single JSON object:
::
# ndctl list
{
"dev":"namespace0.0",
"mode":"fsdax",
"size":17834180608,
"uuid":"830d3440-df00-4e5a-9f89-a951dfb962cd",
"raw_uuid":"2dbddec6-44cc-41a4-bafd-a4cc3e345e50",
"sector_size":512,
"blockdev":"pmem0",
"numa_node":0
}
If you have two or more namespaces, though, they are returned as an
array of JSON objects:
::
# ndctl list
[
{
"dev":"namespace1.0",
"mode":"fsdax",
"size":17834180608,
"uuid":"ce92c90c-1707-4a39-abd8-1dd12788d137",
"raw_uuid":"f8130943-5867-4e84-b2e5-6c685434ef81",
"sector_size":512,
"blockdev":"pmem1",
"numa_node":0
},
{
"dev":"namespace0.0",
"mode":"fsdax",
"size":17834180608,
"uuid":"33d46163-095a-4bf8-acf0-6dbc5dc8a738",
"raw_uuid":"8f44ccd3-50f3-4dec-9817-554e9d1a5c5f",
"sector_size":512,
"blockdev":"pmem0",
"numa_node":0
}
]
Note the outer ``[`` and ``]`` brackets surrounding the objects, which
turn the output into an array. The difficulty is that a given ``jq``
command expects to operate either on objects or on an array, but not
both. So, the command you need to run would vary based on how many
namespaces you have.
The command above works around this by first converting the multiple
namespace output from an array of objects to multiple objects in a
series:
::
# ndctl list | jq -r '((. | arrays | .[]), . | objects)'
{
"dev": "namespace1.0",
"mode": "fsdax",
"size": 17834180608,
"uuid": "ce92c90c-1707-4a39-abd8-1dd12788d137",
"raw_uuid": "f8130943-5867-4e84-b2e5-6c685434ef81",
"sector_size": 512,
"blockdev": "pmem1",
"numa_node": 0
}
{
"dev": "namespace0.0",
"mode": "fsdax",
"size": 17834180608,
"uuid": "33d46163-095a-4bf8-acf0-6dbc5dc8a738",
"raw_uuid": "8f44ccd3-50f3-4dec-9817-554e9d1a5c5f",
"sector_size": 512,
"blockdev": "pmem0",
"numa_node": 0
}
We then structure the rest of the ``jq`` command to operate on normal
objects, and it works whether we have one namespace or many.
Persistent Naming
~~~~~~~~~~~~~~~~~
The device names chosen by the kernel are subject to creation order and
discovery order. Environments cannot rely on the kernel name being
consistent from one boot to the next. For the most part the names do
not change if the configuration stays static, but if a permanent name
is needed, use /dev/disk/by-id. Recent versions of udev install the
following rule in 60-persistent-storage.rules:
::
# PMEM devices
KERNEL=="pmem*", ENV{DEVTYPE}=="disk", ATTRS{uuid}=="?*", SYMLINK+="disk/by-id/pmem-$attr{uuid}"
This rule causes symlinks like the following to be created for
namespaces defined by labels:
::
$ ls -l /dev/disk/by-id/*
lrwxrwxrwx 1 root root 13 Jul 9 15:24 pmem-206dcdfe-69b7-4e86-a01b-f540621ce62e -> ../../pmem1.2
lrwxrwxrwx 1 root root 13 Jul 9 15:24 pmem-73840bf1-4e74-4ba4-a9c8-8248934c07c8 -> ../../pmem1.1
lrwxrwxrwx 1 root root 13 Jul 9 15:24 pmem-8137bdfd-3c4d-4b26-b326-21da3d4cd4e5 -> ../../pmem1.4
lrwxrwxrwx 1 root root 13 Jul 9 15:24 pmem-f43d1b6e-3300-46cb-8afc-06d66a7c16f6 -> ../../pmem1.3
The persistent name for a pmem namespace is then listed in /etc/fstab
like so:
::
/dev/disk/by-id/pmem-206dcdfe-69b7-4e86-a01b-f540621ce62e /mnt/pmem xfs defaults,dax 1 2
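Before relying on such an entry, you can confirm that the symlink
resolves to the intended namespace and that the fstab entry mounts
(device names here match the listing above)::

    $ readlink -f /dev/disk/by-id/pmem-206dcdfe-69b7-4e86-a01b-f540621ce62e
    /dev/pmem1.2
    $ mount /mnt/pmem && findmnt /mnt/pmem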
Partitions
~~~~~~~~~~
You can divide raw, sector, and fsdax devices (/dev/pmemN and
/dev/pmemNs) into partitions. In parted, the mkpart subcommand has this
syntax
::
mkpart [part-type fs-type name] start end
Although mkpart defaults to 1 MiB alignment, you may want to use 2 MiB
alignment to support more efficient page mappings; see
:doc:`2mib_fs_dax` and the sketch after the following example.
Example carving a 16 GiB /dev/pmem0 into 4 GiB, 8 GiB, and 4 GiB
partitions, constrained by 1 MiB alignment at the beginning and end.
Note that parted displays its output using SI decimal units, while
lsblk uses binary units:
::
$ parted -s -a optimal /dev/pmem0 \
mklabel gpt -- \
mkpart primary ext4 1MiB 4GiB \
mkpart primary xfs 4GiB 12GiB \
mkpart primary btrfs 12GiB -1MiB \
print
Model: Unknown (unknown)
Disk /dev/pmem0: 17.2GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 4295MB 4294MB ext4 primary
2 4295MB 12.9GB 8590MB xfs primary
3 12.9GB 17.2GB 4294MB btrfs primary
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 16G 0 disk
├─pmem0p1 259:4 0 4G 0 part
├─pmem0p2 259:5 0 8G 0 part
└─pmem0p3 259:8 0 4G 0 part
$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: B334CBC6-1C56-47DF-8981-770C866CEABE
Device Start End Sectors Size Type
/dev/pmem0p1 2048 8388607 8386560 4G Linux filesystem
/dev/pmem0p2 8388608 25165823 16777216 8G Linux filesystem
/dev/pmem0p3 25165824 33552383 8386560 4G Linux filesystem
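If you want the 2 MiB alignment mentioned above, a minimal sketch is to
place every partition boundary on a 2 MiB multiple explicitly (the
sizes here are illustrative)::

    $ parted -s /dev/pmem0 \
        mklabel gpt -- \
        mkpart primary xfs 2MiB 8GiB \
        mkpart primary xfs 8GiB -2MiB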
Filesystems
~~~~~~~~~~~
You may place any filesystem (e.g., ext4, xfs, btrfs) on a raw or fsdax
device (e.g., /dev/pmem0), a partition on a raw or fsdax device (e.g.
/dev/pmem0p1), a sector device (e.g., /dev/pmem0s), or a partition on a
sector device (e.g., /dev/pmem0sp1).
ext4 and xfs support DAX, which allows applications to perform direct
access to persistent memory with mmap(). You may use DAX on raw devices
and fsdax devices, but not on sector devices.
Example creating ext4, xfs, and btrfs filesystems on three partitions
and mounting ext4 and xfs with DAX (note: df -h displays sizes in IEC
binary units; df -H uses SI decimal units):
::
$ mkfs.ext4 -F /dev/pmem0p1
$ mkfs.xfs -f -m reflink=0 /dev/pmem0p2
$ mkfs.btrfs -f /dev/pmem0p3
$ mount [dax_mount_options] /dev/pmem0p1 /mnt/ext4-pmem0
$ mount [dax_mount_options] /dev/pmem0p2 /mnt/xfs-pmem0
$ mount /dev/pmem0p3 /mnt/btrfs-pmem0
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 16G 0 disk
├─pmem0p1 259:4 0 4G 0 part /mnt/ext4-pmem0
├─pmem0p2 259:5 0 8G 0 part /mnt/xfs-pmem0
└─pmem0p3 259:8 0 4G 0 part /mnt/btrfs-pmem0
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/pmem0p1 3.9G 8.0M 3.7G 1% /mnt/ext4-pmem0
/dev/pmem0p2 8.0G 33M 8.0G 1% /mnt/xfs-pmem0
/dev/pmem0p3 4.0G 17M 3.8G 1% /mnt/btrfs-pmem0
$ df -H
Filesystem Size Used Avail Use% Mounted on
/dev/pmem0p1 4.2G 8.4M 4.0G 1% /mnt/ext4-pmem0
/dev/pmem0p2 8.6G 34M 8.6G 1% /mnt/xfs-pmem0
/dev/pmem0p3 4.3G 17M 4.1G 1% /mnt/btrfs-pmem0
Where **[dax_mount_options]** depends on the kernel support you have and
the desired behavior. See :doc:`fs_mount_options` for details.
Check the kernel log to ensure the DAX mount option was honored; mount
does not print this information. Example failures:
::
$ dmesg | tail
[1811131.922331] XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[1811131.962630] XFS (pmem0): DAX unsupported by block device. Turning off DAX.
[1811131.999039] XFS (pmem0): Mounting V5 Filesystem
[1811132.025458] XFS (pmem0): Ending clean mount
[1811261.329868] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[1811261.371653] EXT4-fs (pmem0): DAX unsupported by block device. Turning off DAX.
[1811261.410944] EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: dax
Example successes:
::
$ dmesg | tail
[1811420.919434] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[1811420.961539] EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: dax
[1811472.505650] XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[1811472.545702] XFS (pmem0): Mounting V5 Filesystem
[1811472.571268] XFS (pmem0): Ending clean mount
iostats
~~~~~~~
iostats are disabled by default due to performance overhead (e.g., 12M
IOPS dropping 25% to 9M IOPS). However, they can be enabled in sysfs if
desired.
As of kernel 4.5, iostats are only collected for the base pmem device,
not per-partition. Also, I/Os that go through DAX paths (rw_page,
rw_bytes, and direct_access functions) are not counted, so nothing is
collected for:
- I/O to files in filesystems mounted with -o dax
- I/O to raw block devices if CONFIG_BLOCK_DAX is enabled
::
$ echo 1 > /sys/block/pmem0/queue/iostats
$ echo 1 > /sys/block/pmem1/queue/iostats
$ echo 1 > /sys/block/pmem2/queue/iostats
$ echo 1 > /sys/block/pmem3/queue/iostats
$ iostat -mxy 1
avg-cpu: %user %nice %system %iowait %steal %idle
21.53 0.00 78.47 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
pmem0 0.00 0.00 4706551.00 0.00 18384.95 0.00 8.00 6.00 0.00 0.00 0.00 0.00 113.90
pmem1 0.00 0.00 4701492.00 0.00 18365.20 0.00 8.00 6.01 0.00 0.00 0.00 0.00 119.30
pmem2 0.00 0.00 4701851.00 0.00 18366.60 0.00 8.00 6.37 0.00 0.00 0.00 0.00 108.90
pmem3 0.00 0.00 4688767.00 0.00 18315.50 0.00 8.00 6.43 0.00 0.00 0.00 0.00 117.40
fio
~~~
Example fio script to perform 4 KiB random reads to four pmem devices:
::
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=256k # for bandwidth
bs=4k # for IOPS and latency
iodepth=1
runtime=30
time_based=1
group_reporting
thread
gtod_reduce=0 # for latency
gtod_reduce=1 # IOPS and bandwidth
zero_buffers
## local CPU
numjobs=9 # for bandwidth
numjobs=1 # for latency
numjobs=18 # for IOPS
cpus_allowed_policy=split
rw=randwrite
rw=randread
# CPU affinity based on two 18-core CPUs with QPI snoop configuration of cluster-on-die
[drive_0]
filename=/dev/pmem0
cpus_allowed=0-8,36-44
[drive_1]
filename=/dev/pmem1
cpus_allowed=9-17,45-53
[drive_2]
filename=/dev/pmem2
cpus_allowed=18-26,54-62
[drive_3]
filename=/dev/pmem3
cpus_allowed=27-35,63-71
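Save the job file under any name (for example, the hypothetical
pmem-randread.fio) and run it directly::

    $ fio pmem-randread.fio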
When using /dev/dax character devices, you must specify the size,
because character devices do not have a size.
Example fio script to perform 4 KiB random reads to four /dev/dax
character devices:
::
[global]
ioengine=mmap
pre_read=1
norandommap
randrepeat=0
bs=4k
iodepth=1
runtime=60000
time_based=1
group_reporting
thread
gtod_reduce=1 # reduce=1 except for latency test
zero_buffers
size=2G
numjobs=36
cpus_allowed=0-17,36-53
cpus_allowed_policy=split
[drive_0]
filename=/dev/dax0.0
rw=randread
[drive_1]
filename=/dev/dax1.0
rw=randread
[drive_2]
filename=/dev/dax2.0
rw=randread
[drive_3]
filename=/dev/dax3.0
rw=randread