Question: If you had to transfer a large amount of data from A to B, how would you do it? Dealing with I/O-intensive big data workload challenges: velocity.
Introduction
In Part I of this series we acknowledged that data has quickly become one of the world's most valuable assets. It powers social media, Large Language Models (LLMs), trading screens, security & risk analysis, manufacturing, healthcare, retail, advertising, agriculture, and more.
A few key traits pose engineering challenges beyond the norm: volume (it arrives in enormous quantities), velocity (it is created quickly, usually in real time), and variety (it comes structured, semi-structured, and unstructured).
Previously we discussed intensive IO: getting volumes of data moving. However, data can diminish in relevance if we can't handle the often extreme velocity of data production, or if latency isn't managed effectively.
Velocity
So far, for handling data volume we've been leveraging system calls1 like open/openat, lseek, close, read, and write. In this article, we'll be introduced to some more.
Using Memory Wisely
Memory Mapping
When dealing with larger files, most processes tend to utilize mmap(), a system call that creates a direct mapping for the file in the virtual memory space of the process. This means the application can access it directly without additional read(), write(), or lseek() calls, and without copying into an extra buffer. Behind the scenes the kernel will still handle any swapping in/out of pages so that the accesses work as intended. Memory is unmapped using munmap().
In these examples we use mmap() to load and traverse files up to 4GB in size without issuing extra read() calls or context switching between user and kernel space:
Example: Rust
As of the time of writing, Rust (focused on memory safety) doesn't support mmap() in its standard library and requires a crate (memmap2) or a custom FFI wrapper to use it.
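By way of illustration only (this is not the article's original listing), a minimal sketch using the memmap2 crate might look like the following; the file name and the line-counting traversal are placeholders:

```rust
use std::fs::File;
use memmap2::Mmap; // external crate: memmap2

fn main() -> std::io::Result<()> {
    // The file name is a placeholder.
    let file = File::open("testfile.bin")?;

    // Map the whole file into this process's virtual address space.
    // Safety: we assume no other process truncates the file while it is mapped.
    let mmap = unsafe { Mmap::map(&file)? };

    // Traverse the mapping directly: no further read() calls are issued;
    // the kernel pages data in on demand as bytes are touched.
    let newlines = mmap.iter().filter(|&&b| b == b'\n').count();
    println!("{} lines", newlines);
    Ok(())
}
```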
strace for a 100K file shows no more calls to read() whilst processing:
strace for a 4GB file:

When running on a 4GB file it keeps 1GB resident in the cache (without tuning):
Example: Java
Executing the app with an strace on a 100K file shows we've said goodbye to read():

A check with vmtouch or fincore will show that all pages were loaded into memory (note: the same also occurs if only a single byte is read).

After execution on a 1G file, the strace output doesn't differ, but the number of virtual pages cached is given:

Note: as of writing, Java 23 has a 2GB limit on the mmap size, so a sliding window would be required for files larger than that.
mmap() implements demand paging: a lazy loading technique where the kernel only copies data into physical memory when there is an attempt to access it. The memory-mapped area seen by the process reads directly from the page cache without requiring a copy.
Fig 1: Using mmap() for fast local reads in user space.
When an mmap() is created with the MAP_SHARED flag it can be shared with other processes. That means velocity! Velocity through durable Inter-Process Communication (IPC). One process can write whilst the other reads, like in an ETL pipeline, or both can read/write to the same file (or, rather, virtual memory space). This can be complicated when it comes to managing synchronization2 and efficient reads, but a common technique is to use append-only data structures and write-ahead logs.
Fig 2: Using mmap() for durable shared memory reads/writes between processes.
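To make that concrete, here is a minimal sketch of the writer side using the memmap2 crate, whose map_mut() produces a MAP_SHARED mapping. The file name, size, and record layout are illustrative only, and real code needs a proper synchronization scheme between the two processes:

```rust
use std::fs::OpenOptions;
use memmap2::MmapMut; // external crate: memmap2

fn main() -> std::io::Result<()> {
    // Writer side: open (or create) the backing file and give it a size.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("ipc.log")?;
    file.set_len(4096)?;

    // map_mut() produces a MAP_SHARED mapping, so another process mapping
    // the same file sees these writes (durably, once flushed).
    let mut map = unsafe { MmapMut::map_mut(&file)? };

    // Append-only / write-ahead style: write the record first, then publish
    // its length in the header. Real code needs a synchronization scheme
    // (e.g. a futex, as mentioned in the footnotes).
    let record = b"event-1\n";
    map[8..8 + record.len()].copy_from_slice(record);
    map[..8].copy_from_slice(&(record.len() as u64).to_le_bytes());

    // msync() the dirty pages back to the file for durability.
    map.flush()?;
    Ok(())
}
```

A reader process would map the same file (read-only) and poll the length header before consuming the record.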
For same-device communication between processes this technique can provide greater velocity than either domain sockets or network sockets over TCP. You would be forgiven for reaching for something dedicated and optimized for that, like Kafka, gRPC, ZeroMQ, or Chronicle (in no particular order), but then you'd miss out on the Linux tricks. Note: if the data did not need to be durable we could use shm (shared memory).
Time and Space
If we wanted to move a large amount of data from point A to point B and there was no processing involved, no need for the CPU, then how would we do it? Big data poses greater challenges for data replication and redundancy than usual, and the velocity at which data is produced requires a means to move it to where it needs to be as quickly as possible.
We might be tempted to reach for our read() and write() operations, but perhaps prime the cache before the calls with an mmap(). In fact, if you run an strace on a cp command you may find that's exactly what it does.
Example: cp command
The Linux cp command will run an mmap() prior to running as many read() and write() calls as necessary to copy the contents of file A to file B:

cp testfile.bin nextfile.bin
That, however, is not fast enough. Why do we need CPU copying and processing when we are not inspecting the contents of the file?
In order to save time, we need to save space: user space. Avoiding user space entirely apart from an initiating call and staying close to the hardware is the key to extracting more speed. Fortunately, Linux provides the means to do so.
Three additional system calls worth our consideration are copy_file_range(), sendfile(), and splice(). All of these calls allow for copies of data without unnecessarily going via user space, so long as the files support it3. (Note: YMMV: sendfile() has a 2GB limitation on offset, hence sendfile64(). Using copy_file_range() effectively within its limitations may also require multiple calls.) They are known more generally as zero-copy techniques: calls that require either zero or at least minimal copying.
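Before the examples, a rough sketch of what driving copy_file_range() from Rust via the libc crate can look like (the file names are placeholders, and the loop is there because a single call may copy fewer bytes than requested):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // Source and destination names are placeholders.
    let src = File::open("testfile.bin")?;
    let dst = File::create("nextfile.bin")?;
    let mut remaining = src.metadata()?.len() as usize;

    // A single copy_file_range() may copy fewer bytes than requested,
    // so loop until the whole file has been transferred.
    while remaining > 0 {
        let copied = unsafe {
            libc::copy_file_range(
                src.as_raw_fd(),
                std::ptr::null_mut(), // let the kernel use/advance the file offsets
                dst.as_raw_fd(),
                std::ptr::null_mut(),
                remaining,
                0,
            )
        };
        if copied < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if copied == 0 {
            break; // unexpected end of file
        }
        remaining -= copied as usize;
    }
    Ok(())
}
```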
Example: Rust
The strace for the transfer of a 4GB file:
Example: Java
Java exposes copy_file_range() via its FileChannel API:

The strace output for a transfer of a 100K file:

…for a 2GB file:
…for a 4GB file:
Fig 3: Using copy_file_range() to zero-copy a file between two storage locations without going via user space.
Shared Nothing
The techniques we've looked at so far imply vertical scaling to process larger and larger data sets: scaling up and potentially sharing more and more memory to deal with larger workloads. The problem, though, with a vertically scaled, shared-memory approach is cost and fault tolerance. Importantly, cost grows faster when vertically scaling one machine than with horizontal scaling, i.e. scaling out into a “shared nothing” architecture4. Furthermore, if that single system goes down it takes down everything with it.
In a shared nothing architecture, we can’t rely on shared memory within a single system alone to process data quickly. This requires fast transfer of data not just within a system but between systems too.
Thankfully, Linux treats sockets the same way it treats files: as just another file descriptor. In that way, the same system calls apply, so very little changes in our approach. In our examples we use netcat5 to listen for the transfer, to keep them simple.
Example: Rust
At the time of writing, Rust doesn't have a generic call which leverages sendfile() for socket transfers. io::copy will send between a file and a socket with multiple read() and sendto() calls, similar to how the cp command works with local files. It produces strace output similar to the following for a 100K file:

However, if we introduce the nix crate, then we can use the call directly ourselves (using sendfile64 in this case). Then when we run strace on a 100K file we get output like the following:

…and on a 4GB file:
Example: Java
Code example for sending to the socket. On the other end we leverage netcat to listen for the transfer with netcat -l 2015.

Note: when running an strace to send a 100K file, the Java application first tries (and fails) to run copy_file_range, then falls back to sendfile. This occurs because copy_file_range does not work on sockets.
Fig 4: Using sendfile() to zero-copy data over TCP sockets without going via user space.
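For reference, here is a minimal sketch of the sendfile() approach in Rust, using the libc crate rather than the nix crate mentioned above. The file name, host, and port are placeholders, and a netcat -l 2015 listener is assumed on the other end:

```rust
use std::fs::File;
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // File name, host, and port are placeholders.
    let file = File::open("testfile.bin")?;
    let socket = TcpStream::connect("127.0.0.1:2015")?;
    let mut remaining = file.metadata()?.len() as usize;

    // sendfile() moves data from the file's page cache to the socket inside
    // the kernel; the loop handles short writes and the per-call size limit.
    while remaining > 0 {
        let sent = unsafe {
            libc::sendfile(
                socket.as_raw_fd(),
                file.as_raw_fd(),
                std::ptr::null_mut(), // use and advance the in-file offset
                remaining,
            )
        };
        if sent < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if sent == 0 {
            break; // unexpected end of file
        }
        remaining -= sent as usize;
    }
    Ok(())
}
```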
When we look at how the transfer works within the kernel, it’s clear there is still opportunity to avoid copies. Network cards that support gather operations can avoid the extra copy within kernel space from the page cache to the network buffer by gathering directly at the NIC6. Further, on systems that support non-uniform memory access (NUMA) we can take advantage of the relative location of CPU to memory, or memory to the NIC to improve speed even further.
For modern applications hoping to leverage performance advantages beyond those considered so far, we have to look at kernel-bypass techniques (like DPDK) or go beyond conventional system calls using io_uring, which provide sub-microsecond I/O opportunities. We'll follow up with those in the future.
Modern I/O
Not all I/O problems present themselves as a simple data transfer from A to B without application intervention in user space. Modern interfaces enable I/O from user space whilst minimizing kernel overhead.
Ring buffers (aka circular queues/circular buffers) have proved their value as a foundational data structure for achieving low-latency, high-throughput communication in systems with constrained or fixed resources. Their predictable, mechanically sympathetic, mostly sequential memory access minimizes cache churn and takes full advantage of prefetching, making them very cache-friendly.
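To ground the idea, here is a deliberately simple, single-threaded ring buffer sketch in Rust; real implementations (including io_uring's) add atomics and memory barriers so that producers and consumers can run concurrently:

```rust
/// A minimal fixed-capacity ring buffer (illustrative only).
struct RingBuffer<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to read
    tail: usize, // next slot to write
    len: usize,
}

impl<T> RingBuffer<T> {
    fn new(capacity: usize) -> Self {
        Self { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0, len: 0 }
    }

    fn push(&mut self, item: T) -> Result<(), T> {
        if self.len == self.slots.len() {
            return Err(item); // full: apply back-pressure instead of reallocating
        }
        self.slots[self.tail] = Some(item);
        self.tail = (self.tail + 1) % self.slots.len(); // wrap around
        self.len += 1;
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let item = self.slots[self.head].take();
        self.head = (self.head + 1) % self.slots.len();
        self.len -= 1;
        item
    }
}
```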
io_uring is a storage API for asynchronous I/O that enables the submission of one or more non-blocking I/O requests, bringing together many well-established ideas around asynchronous, high-performance storage I/O. It hosts ring buffers which are shared between user space and kernel space: a submission queue (SQ) for I/O requests and a completion queue (CQ) for I/O responses. This setup allows for efficient I/O while avoiding the overhead of excessive system calls and unnecessary copies.
The manual pages for io_uring describe how to use it directly, but we should7 leverage liburing, which simplifies the API. At a base level the main system calls we make use of are mmap(), io_uring_setup(), io_uring_enter(), and io_uring_register(). These calls are responsible for memory mapping, io_uring setup, submission of queued I/O events, and file/buffer registration, respectively.8
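Ahead of the full examples, a minimal single-read sketch using the community io-uring crate (an assumption on my part; it wraps the raw system calls much as liburing does) might look like this; the file name and buffer size are placeholders:

```rust
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // io_uring_setup() and the mmap() of the rings happen here.
    let mut ring = IoUring::new(8)?; // 8 submission queue entries
    let file = File::open("testfile.bin")?;
    let mut buf = vec![0u8; 4096];

    // Build a read request and place it on the submission queue (SQ).
    let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(0x42);
    unsafe {
        ring.submission().push(&read_e).expect("submission queue is full");
    }

    // One io_uring_enter() submits the request and waits for a completion.
    ring.submit_and_wait(1)?;

    // Reap the result from the completion queue (CQ).
    let cqe = ring.completion().next().expect("completion queue is empty");
    assert_eq!(cqe.user_data(), 0x42);
    println!("read {} bytes", cqe.result());
    Ok(())
}
```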
Rust
TODO()
Java
TODO()
Placeholder
Diagram
So far we've discussed both volume and velocity, but variety also poses unique I/O challenges for high-speed analytics and processing of big data. Data storage and serialization formats play an important part in modern engineering for making this possible in an efficient way. We'll discuss that topic next in Big Bytes, Big Data, Intensive IO Part III - Variety.
Disclaimer:
Any views and opinions expressed in this blog are based on my personal experiences and knowledge acquired throughout my career. They do not necessarily reflect the views of, or experiences at, my current or past employers.
Footnotes
1. For an understanding of user space/kernel space and system calls and how they are relevant to handling data, see Big Bytes, Big Data, Intensive IO Part I - Volume and related references. ↩
2. This synchronization could be done using another system call, futex, or, more expensively, with shm_open, but that is out of scope for this article. ↩
3. M. Kerrisk, The Linux Programming Interface: A Linux and UNIX System Programming Handbook. San Francisco: No Starch Press, 2010. ↩
4. M. Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. First edition. Boston: O'Reilly Media, 2017. ↩
5. Netcat is usually available with one of the commands netcat or nc. If not, it can be installed with sudo apt-get install netcat. Alternatively, one can create a server socket to listen for the transfer on the other end. ↩
6. D. Stancevic, “Zero Copy I: User-Mode Perspective,” www.linuxjournal.com. Available: https://www.linuxjournal.com/article/6345 [Accessed: Dec. 09, 2024] ↩
7. The API author's advice is “Don't be a hero” by using the direct API yourself; use liburing instead. Doing so will save some of the additional boilerplate of ring setup and I/O submission, as well as memory barriers/fencing when reading from or writing to ring buffers. https://kernel-recipes.org/en/2019/talks/faster-io-through-io_uring/ ↩
8. J. Axboe, “Efficient IO with io_uring.” Available: https://kernel.dk/io_uring.pdf [Accessed: Dec. 11, 2024] ↩