Question: If you had to transfer a large amount of data from A to B how would you do it?

Introduction

Big Data

Data has quickly become one of the worlds most valuable assets. It powers social media, Large Language Models (LLMs), trading screens, security & risk analysis, manufacturing, healthcare, retail, advertising, agriculture, and more.

What makes data more than just data is when it comes with a few key traits: volume (comes in enormous quantities), velocity (created quickly, usually in real-time), and variety (comes structured, semi-structured, and unstructured). There are, of course, other related measures ¹ but those three stand out in particular as posing engineering challenges beyond the norm.

Those challenges mean that simple data file handling doesn’t work well and that traditional data management systems cannot store, process, or analyze it well. Since big data has its greatest value when it is accurate and quickly and consistently available to those who need to consume it, those are challenges that must be overcome.

Linux

Linux is the probably the most important operating system most people have never heard of; it holds only around 3% of the desktop market. Even if you are a developer my umami insights will tell me you’re likely reading this on either a Windows or MacOS PC or an iPhone or Android mobile device. (I challenge you to prove me wrong - pass it on!).

Despite that the overwhelming majority of servers used on the internet (96%) and cloud (90%) use Linux². Aworldwithoutlinux is a world without big data. Suffice to say, if you want to talk volume, velocity, and variety of big data then the conversation starts in Linux.

Spaces and Syscalls

If you’ve never been exposed to a lot of low-level or high-performance engineering then you may not have given terms like “user space”, “kernel space”, or even “system calls” a second thought. However, understanding those is a pathway to many abilities, some consider to be [necessary] — dark arts that make all of the difference when it comes to performance and scalability.

Simply put, unless you’re developing an operating system or writing hardware level drivers, anti-virus, or anti-cheat software then the programs that you typically run or develop are running in what’s called “user space”, with their own virtualized memory. This is a siloed area that sits above a privileged area called the “kernel space” which is responsible for programs that do critical operations and I/O like interacting directly with hardware like disks and memory.

User space processes run in user mode, which requires that they perform a system call (syscall) to the kernel to perform the operation and return the results back to the user. This is the case whether your program is written in Java, Python, C/C++, or Rust. Every time you need to make a syscall or switch from kernel to user mode and back comes with cost. That cost includes permissions checks, extra data transfers, context switching, and time.

Handling Volume

For data file handling, whether they’re text files or binary, large or small, the kernel provides an API³ that exposes syscalls like open/openat, lseek, close, read, write, and others for various interactions.

We’re going to use a couple of linux tools strace and vmtouch⁴ to help us track system calls and system cache behavior when we interact with files. These are key to for helping us see what’s happening behind the scenes and improving performance.

Caching is key

Linux maintains a shared page cache as part of a virtual file system in kernel space. This is designed to make multiple 4K (4096 byte) pages from the block storage available to processes that are reading/writing files to avoid frequent calls to storage improving I/O latency⁵. Until a read() call is made on a file descriptor made available to a process with open() then cache will not hold data from that file.

Sequential Reads

When making the first read() from a file in linux, the system will check the offset that is being read from. If that offset is 0 (the beginning of the file), it will assume you intend to make sequential reads so will load at least 4 pages³ worth of data into the cache due to a cache miss (aka page fault). Similar behavior occurs for subsequent sequential reads taking advantage of the principle of locality which posits that if a particular data location is referenced at a particular time, then it is likely that nearby data locations will be referenced in the near future.

Let’s take an example and see what happens when we open() a 100K file, read() a single byte, and close() it.

Example: Rust

// sequential-reader/src/main.rs
use std::io;
use std::fs::File;
use std::io::Read;
 
fn main() -> io::Result<()> {
   println!("***********  START  ************");
   run()?; 
   println!("***********   END   ************");
   Ok(())
}
 
fn run() -> io::Result<()> {
   let mut buffer = [0; 1];
   let mut file = File::open("testfile.txt")?;
   file.read(&mut buffer)?;
   Ok(())
}

To observe the system calls being made by the process after building we can run strace target/debug/sequential-reader

...
write(1, "***********  START  ************"..., 33***********  START  ************
) = 33
openat(AT_FDCWD, "testfile.txt", O_RDONLY|O_CLOEXEC) = 3
read(3, "L", 1)                         = 1
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
close(3)                                = 0
write(1, "***********   END   ************"..., 33***********   END   ************
) = 33
...

To observe the impact to the page cache we can use vmtouch testfile.txt to observe the number of pages (and bytes) that were cached to fulfill that request.

         Files: 1
    Directories: 0
 Resident Pages: 4/25  16K/100K  16%
        Elapsed: 8.6e-05 seconds

As expected, since we began a read from the first byte, 4 pages (16K) of our testfile.txt data file are resident in the cache after the single read and stay resident after the process has closed.

Example: Java

// SequentialReader.java
import java.io.File;
import java.io.FileInputStream;
 
public class SequentialReader {
 
   public static void main(String[] args) throws Exception {
       System.out.println("***********  START  ************");
       var app = new SequentialReader();
       app.run();
       System.out.println("***********   END   ************");
   }
 
   public void run() {
       var file = new File("testfile.txt");
       var buffer = new byte[1];
       try (var in = new FileInputStream(file)) {
           in.read(buffer);
       }
   }
}

To observe the system calls being made by the process after building we can run strace -f java SequentialReader or filter just the syscalls we care about for now with: strace -f -eread,write,close,/seek,/open java SequentialReader

...
[pid  5296] write(1, "***********  START  ************"..., 33***********  START  ************
) = 33
[pid  5296] openat(AT_FDCWD, "testfile.txt", O_RDONLY) = 4
[pid  5296] read(4, "L", 1)             = 1
[pid  5296] close(4)                    = 0
[pid  5296] write(1, "***********   END   ************"..., 33***********   END   ************
...

To observe the impact to the page cache we can use vmtouch testfile.txt to observe the number of pages (and bytes) that were cached to fulfill that request.

          Files: 1
    Directories: 0
 Resident Pages: 4/25  16K/100K  16%
        Elapsed: 3.2e-05 seconds

As expected, since we began a read from the first byte, 4 pages (16K) of our testfile.txt data file are resident in the cache after the single read and stay resident after the process has closed.

A sequential read

Fig 1: a single read from the first offset of a file will cache multiple pages to reduce future sequential read latency

Random Reads

If, however, we decide to first move the offset elsewhere in the file before making our the first read() from a file in linux, it will assume you intend to make random access reads so will load only the page relevant to the bytes being read at the current offset³ into the cache due to a cache miss (aka page fault and will behave similarly for subsequent reads.

Let’s take an example and see what happens when we open() a 100K file, move our offset with a lseek(), read() a single byte, and close() it.

Example: Rust

// nonsequential-reader/src/main.rs
use std::io;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
 
fn main() -> io::Result<()> {
   println!("***********  START  ************");
   run()?; 
   println!("***********   END   ************");
   Ok(())
}
 
fn run() -> io::Result<()> {
   let mut buffer = [0; 1];
   let mut file = File::open("../../testfile.txt")?;
   file.seek(SeekFrom::Start(4096))?;
   file.read(&mut buffer)?;
   Ok(())
}

To observe the system calls being made by the process after building we can run strace target/debug/nonsequential-reader

...
write(1, "***********  START  ************"..., 33***********  START  ************
) = 33
openat(AT_FDCWD, "../../testfile.txt", O_RDONLY|O_CLOEXEC) = 3
lseek(3, 4096, SEEK_SET)                = 4096
read(3, " ", 1)                         = 1
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
close(3)                                = 0
write(1, "***********   END   ************"..., 33***********   END   ************
) = 33
...

To observe the impact to the page cache we can use vmtouch testfile.txt to observe the number of pages (and bytes) that were cached to fulfill that request.

         Files: 1
    Directories: 0
 Resident Pages: 1/25  4K/100K  4%
        Elapsed: 7.2e-05 seconds

As expected, since we began a read after the first page (after a 4K seek) only 1 page (4K) of our testfile.txt data file is resident in the cache.

Example: Java

// NonSequentialReader.java
import java.io.File;
import java.io.FileInputStream;
 
public class NonSequentialReader {
 
   public static void main(String[] args) throws Exception {
       System.out.println("***********  START  ************");
       var app = new NonSequentialReader();
       app.run();
       System.out.println("***********   END   ************");
   }
 
   public void run() {
       var file = new File("testfile.txt");
       var buffer = new byte[1];
       try (var in = new FileInputStream(file)) {
	       in.skip(4096L); // Skip a page
           in.read(buffer);
       }
   }
}

To observe the system calls being made by the process after building we run strace -f java NonSequentialReader or filter just the syscalls we care about for now with: strace -f -eread,write,close,/seek,/open java NonSequentialReader

...
[pid  26077] write(1, "***********  START  ************"..., 33***********  START  ************
) = 33
[pid 31368] openat(AT_FDCWD, "testfile.txt", O_RDONLY) = 4
[pid 31368] lseek(4, 0, SEEK_CUR)       = 0
[pid 31368] lseek(4, 4096, SEEK_CUR)    = 4096
[pid 31368] read(4, " ", 1)             = 1
[pid 31368] close(4)                    = 0
[pid  26077] write(1, "***********   END   ************"..., 33***********   END   ************
...

To observe the impact to the page cache we can use vmtouch testfile.txt to observe the number of pages (and bytes) that were cached to fulfill that request.

          Files: 1
    Directories: 0
 Resident Pages: 1/25  4K/100K  4%
        Elapsed: 3.5e-05 seconds

As expected, since we began a read after the first page (after a 4K seek) only 1 page (4K) of our testfile.txt data file is resident in the cache.

Fig 2: a single read after a seek of a file will cache a single page as a random access pattern is assumed in future

Efficient Reads

So given what we now know, making good use of the page cache is the key to efficient reads.

For sequential reads: it would make sense that if we wanted to read a whole file then reading from the beginning to the end using a buffer sized as some page-aligned multiple of 4K would work on a file of any size given default options.

For example we could iterate using 16K byte buffer blocks over a 100K file:

Example: Rust

use std::io;
use std::fs::File;
use std::io::Read;
 
fn main() -> io::Result<()> {
    println!("*********** START ***********");
    let mut buffer = [0; 4 * 4096];
    {
        let mut file = File::open("testfile.bin")?;
        loop {
            if file.read(&mut buffer)? == 0 {
                break;
            }
        }
    }
    println!("***********  END  ***********");
    Ok(())
}

Produces the following strace output when executed on a 100K test data file.

...
write(1, "*********** START ***********\n", 30*********** START ***********
) = 30
openat(AT_FDCWD, "testfile.bin", O_RDONLY|O_CLOEXEC) = 3
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 4096
read(3, "", 16384)                      = 0
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
close(3)                                = 0
write(1, "***********  END  ***********\n", 30***********  END  ***********
) = 30
...

Example: Java

public class App {
 
    public void run() throws Exception {
        var buffer = new byte[4 * 4096];
        var file = new File("testfile.bin");
        try (var input = new FileInputStream(file)) {
            while (input.available() > 0) {
                input.read(buffer);
            }
        }
    }
 
    public static void main(String[] args) throws Exception {
        System.out.println("****** START ******");
        var app = new App();
        app.run();
	    System.out.println("******  END  ******");   
    }
}

Produces the following strace when executed on a 100K test data file:

[pid 42961] write(1, "****** START ******\n", 20****** START ******
) = 20
[pid 42961] openat(AT_FDCWD, "testfile.bin", O_RDONLY) = 5
[pid 42961] lseek(5, 0, SEEK_CUR)       = 0
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
[pid 42961] lseek(5, 0, SEEK_CUR)       = 16384
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
[pid 42961] lseek(5, 0, SEEK_CUR)       = 32768
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
[pid 42961] lseek(5, 0, SEEK_CUR)       = 49152
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
[pid 42961] lseek(5, 0, SEEK_CUR)       = 65536
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
[pid 42961] lseek(5, 0, SEEK_CUR)       = 81920
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
[pid 42961] lseek(5, 0, SEEK_CUR)       = 98304
[pid 42961] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 4096
[pid 42961] lseek(5, 0, SEEK_CUR)       = 102400
[pid 42961] close(5)                    = 0
[pid 42961] write(1, "******  END  ******\n", 20******  END  ******
) = 20

In practice, though, big data isn’t in the region of kilobytes, it’s gigabytes, terabytes, petabytes of data that must be read and processed efficiently. So picking buffer sizes of 64KB to 256KB for distributed systems, or 512KB to 1MB on local SDDs, or even 4MB buffers for large sequential reads is the norm.

What must be considered, as the buffer sizes gets larger, is the impact on other processes in the system and the cache as well as the CPU, and memory utilization.

For random reads: we would want to have way to have more control what it being cached and when, or at all. We could take advantage of the fact that a page cache is shared for a given having a process, or thread read it ahead of its use in the processing thread. We could even, employ the same concepts by pre-fetching and pooling the data ourselves to be read by our processing thread.

Whilst we won’t look at that here it’s interesting that most languages have some buffered calls for textual data or “read all” calls for binary data that further abstract away the back and forth process of calling read(), picking the optimal behavior for a given file size.

For, example to read a whole file rust has std::fs::read and Java has FileInputStream.readAllBytes (which will crash for anything over 2GB). What is telling is that they exhibit some similar characteristics when a file size goes beyond a given limit:

Example: Rust

use std::io;
 
fn main() -> io::Result<()> {
    println!("*********** START ***********");
    {
        std::fs::read("testfile.bin")?;
    }
    println!("***********  END  ***********");
    Ok(())
}

strace for 100K file:

write(1, "*********** START ***********\n", 30*********** START ***********
) = 30
openat(AT_FDCWD, "testfile.bin", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=102400, ...}) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 102400) = 102400
read(3, "", 32)                         = 0
close(3)                                = 0
write(1, "***********  END  ***********\n", 30***********  END  ***********

strace for 1GB file:

write(1, "*********** START ***********\n", 30*********** START ***********
) = 30
openat(AT_FDCWD, "testfile.bin", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=1073741824, ...}) = 0
mmap(NULL, 1073745920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff7828a000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1073741824) = 1073741824
read(3, "", 32)                         = 0
close(3)                                = 0
munmap(0xffff7828a000, 1073745920)      = 0
write(1, "***********  END  ***********\n", 30***********  END  ***********
) = 30

Automatically, rust has decided to make use of some additional system calls mmap and munmap.

Example: Java

import java.io.File;
import java.io.FileInputStream;
 
public class App {
 
    public void run() throws Exception {
        var file = new File("testfile.bin");
        try (var input = new FileInputStream(file)) {
            input.readAllBytes();
        }
    }
 
    public static void main(String[] args) throws Exception {
        System.out.println("****** START ******");
        var app = new App();
        app.run();
        System.out.println("******  END  ******");
    }
}

Running an strace for a 100K file:

...
[pid 74587] write(1, "****** START ******\n", 20****** START ******
) = 20
[pid 74587] openat(AT_FDCWD, "testfile.bin", O_RDONLY) = 5
[pid 74587] fstat(5, {st_mode=S_IFREG|0644, st_size=102400, ...}) = 0
[pid 74587] fstat(5, {st_mode=S_IFREG|0644, st_size=102400, ...}) = 0
[pid 74587] lseek(5, 0, SEEK_CUR)       = 0
[pid 74587] mprotect(0xffff7810d000, 102400, PROT_READ|PROT_WRITE) = 0
[pid 74587] read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 102400) = 102400
[pid 74587] read(5, "", 1)              = 0
[pid 74587] close(5)                    = 0
[pid 74587] write(1, "******  END  ******\n", 20******  END  ******
) = 20
...

Running an strace for a 1G file leads to lots of mmap calls (not all listed):

...
[pid 86723] mmap(0x743a00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x743a00000
[pid 86723] mmap(0xffff7d6c2000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff7d6c2000
[pid 86723] mmap(0xffff9cca8000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff9cca8000
[pid 86723] mmap(0xffff9c4a8000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff9c4a8000
[pid 86723] mmap(0x743c00000, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x743c00000
[pid 86723] mmap(0xffff7d6ca000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff7d6ca000
[pid 86723] mmap(0xffff9cca9000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff9cca9000
[pid 86723] mmap(0xffff9c4a9000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff9c4a9000
[pid 86723] close(5)                    = 0
[pid 86723] write(1, "******  END  ******\n", 20******  END  ******
) = 20
...
# Added counts (the majority are from the explicit mapping)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 61.43    0.063357          28      2251           mmap
 12.28    0.012670          76       166         1 close
 11.79    0.012161          81       149        44 openat
  9.06    0.009342          80       116           read
  2.35    0.002420          73        33           pread64
  1.52    0.001566          78        20           write
  1.43    0.001476          64        23           munmap
  0.14    0.000149          74         2           readlinkat
------ ----------- ----------- --------- --------- ----------------
100.00    0.103141          37      2760        45 total

When dealing with larger files, particularly larger-than-memory files, most I/O libraries tend to utilize mmap() — a system call that will create a direct mapping for the file in the virtual memory space of the process. Why? We’ll consider that in Part II of our series in Big Bytes, Big Data, Getting Data Moving Part II - Velocity.

Efficient Writes

Until now, we’ve only been considering reading data. However, the behavior behind the scenes when writing data is important to know. We might have read the data to make calculations, analyze, display it, or compress it, or even encrypt it. We may even be building an ETL pipeline or simply just be generating a huge amount of data ourselves.

Does a write() call just write directly to the device or does it use the page cache? Reasonably given a read() will go via the page cache, then it would make sense that the write() would too, since if process A is reading from a file that process B is writing to, then it would (eventually) see those writes and not just a past snapshot view.

Fig 3: a single write() will write data to the page cache to be later flushed to the system using a sync() or fsync()

We can verify this with a code example:

Example: Rust

use std::io;
use std::fs::File;
use std::io::Write;
 
fn main() -> io::Result<()> {
    println!("*** START ***"); 
    let buffer = [0; 1];
    {
        let mut file = File::create("testfile.bin")?;
        file.write(&buffer)?;
    }
    println!("***  END  ***"); 
    Ok(())
}

Output from strace:

...
openat(AT_FDCWD, "testfile.bin", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
write(3, "\0", 1)                       = 1
fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
close(3)                                = 0
...

$ vmtouch testfile.bin 
           Files: 1
     Directories: 0
  Resident Pages: 1/1  4K/4K  100%
         Elapsed: 5.1e-05 seconds

Example: Java

package org.example;
 
import java.io.File;
import java.io.FileOutputStream;
 
public class App {
 
    public void run() throws Exception {
        var buffer = new byte[1];
        var file = new File("testfile.bin");
        try (var output = new FileOutputStream(file)) {
            output.write(buffer);
        }
    }
 
    public static void main(String[] args) throws Exception {
        System.out.println("****** START ******");
        var app = new App();
        app.run();
        System.out.println("******  END  ******");
    }
}

Output from strace

...
[pid 18614] openat(AT_FDCWD, "testfile.bin", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
[pid 18614] fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 18614] write(5, "\0", 1)           = 1
[pid 18614] pread64(3, "\312\376\272\276\0\0\0C\0)\t\0\2\0\3\7\0\4\f\0\5\0\6\1\0\32java/i"..., 766, 9784899) = 766
[pid 18614] close(5)                    = 0
...

$ fincore testfile.bin 
RES PAGES SIZE FILE
 4K     1   1B testfile.bin

Note, in the output of our examples sync() or fsync() does not get called. Although a close() operation may flush our pages into the cache, it will not guarantee it makes it to the disk immediately. That responsibility is left to the kernel unless explicitly called by the process.

At this point we have a good idea about how we can get a large volume of data moving through a system. However, at this point it is still slow.

As we mentioned at the outset, every time we need to make a system call or switch from kernel to user mode and back it comes with cost — permissions checks, superfluous copies, and context switching. All of this can waste time in a data intensive application.

How can we handle volume whilst paying attention to velocity? The next article will discuss this Big Bytes, Big Data, Getting Data Moving Part II - Velocity.

Disclaimer:

Any views and opinions expressed in this blog are based on my personal experiences and knowledge acquired throughout my career. They do not necessarily reflect the views of or experiences at my current or past employers

“What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets.” doi: 10.1177/2053951716631130. Available: https://journals.sagepub.com/doi/epub/10.1177/2053951716631130. [Accessed: Dec. 07, 2024] ↩
“Linux Statistics: Dominance in Supercomputers, Cloud, and Mobile Devices | Gitnux.org,” Nov. 26, 2024. Available: https://gitnux.org/linux-statistics/. [Accessed: Dec. 07, 2024] ↩
D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. Sebastopol: O’Reilly Media, Inc, 2008. ↩ ↩² ↩³
This doesn’t come pre-installed on Linux systems but you can build it from scratch or use sudo apt-get install vmtouch. See https://github.com/hoytech/vmtouch. Alternatively, for checking the number of pages of a file in virtual memory you can use fincore - though this doesn’t have the ability to evict pages. ↩
“Essential Linux Page Cache theory,” Viacheslav Biriukov. Available: https://biriukov.dev/docs/page-cache/2-essential-page-cache-theory/. [Accessed: Dec. 07, 2024] ↩

The Life of Brian Corbin

Explorer

Big Bytes, Big Data, Getting Data Moving Part I - Volume

Introduction

Big Data

Linux

Spaces and Syscalls

Handling Volume

Caching is key

Sequential Reads

Random Reads

Efficient Reads

Efficient Writes

Graph View

Table of Contents

Backlinks

The Life of Brian Corbin

Explorer

Big Bytes, Big Data, Getting Data Moving Part I - Volume

Introduction

Big Data

Linux

Spaces and Syscalls

Handling Volume

Caching is key

Sequential Reads

Random Reads

Efficient Reads

Efficient Writes

Footnotes

Graph View

Table of Contents

Backlinks