Question: If you had to transfer a large amount of data from A to B how would you do it? Dealing with I/O intensive big data workload challenges: volume.
Introduction
Big Data
Data has quickly become one of the worlds most valuable assets. It powers social media, Large Language Models (LLMs), trading screens, security & risk analysis, manufacturing, healthcare, retail, advertising, agriculture, and more.
What makes data more than just typical data is when it comes with a few key traits: volume (comes in enormous quantities), velocity (created quickly, usually in real-time), and variety (comes structured, semi-structured, and unstructured). There are, of course, other related measures 1 but those three stand out in particular as posing engineering challenges beyond the norm due to often being both compute and I/O intensive.
Those challenges mean that simple data file I/O doesn’t work well and that traditional data management systems cannot store, process, or analyze it well. Since big data has its greatest value when it is accurate and quickly and consistently available to those who need to consume it, those are challenges that must be overcome.
Linux
Linux is the probably the most important operating system most people have never heard of; it holds only around 3% of the desktop market. Even if you are a developer my umami insights will tell me you’re likely reading this on either a Windows or MacOS PC or an iPhone or Android mobile device. (I challenge you to prove me wrong - pass it on!).
Despite that the overwhelming majority of servers used on the internet (96%) and cloud (90%) use Linux2. Aworldwithoutlinux is a world without big data. Suffice to say, if you want to talk volume, velocity, and variety of big data then the conversation starts in Linux.
Spaces and Syscalls
If you’ve never been exposed to a lot of low-level or high-performance engineering then you may not have given terms like “user space”, “kernel space”, or even “system calls” a second thought. However, understanding those is a pathway to many abilities, some consider to be [necessary] — dark arts that make all of the difference when it comes to performance and scalability.
Simply put, unless you’re developing an operating system or writing hardware level drivers, anti-virus, or anti-cheat software then the programs that you typically run or develop are running in what’s called “user space”, with their own virtualized memory. This is a siloed area that sits above a privileged area called the “kernel space” which is responsible for programs that do critical operations and I/O like interacting directly with hardware like disks and memory.
User space processes run in user mode, which requires that they perform a system call (syscall) to the kernel to perform the operation and return the results back to the user. This is the case whether your program is written in Java, Python, C/C++, or Rust. Every time you need to make a syscall or switch from kernel to user mode and back comes with cost. That cost includes permissions checks, extra data transfers, context switching, and time. Therefore, minimizing the time spent in making these calls or switching between spaces is key to efficient data handling.
Handling Volume
For data file handling, whether they’re text files or binary, large or small, the kernel provides an API3 that exposes syscalls like open/openat, lseek, close, read, write, and others for various interactions.
We’re going to use a couple of linux tools strace
and vmtouch
4 to help us track system calls and system cache behavior when we interact with files. These are key to for helping us see what’s happening behind the scenes and improving performance.
Caching is key
Linux maintains a shared page cache as part of a virtual file system in kernel space. This is designed to make multiple 4K (4096 byte) pages from the block storage available to processes that are reading/writing files to avoid frequent calls to storage improving I/O latency5. Until a read()
call is made on a file descriptor made available to a process with open()
then cache will not hold data from that file.
Sequential Reads
When making the first read()
from a file in linux, the system will check the offset that is being read from. If that offset is 0 (the beginning of the file), it will assume you intend to make sequential reads so will load at least 4 pages3 worth of data into the cache due to a cache miss (aka page fault). Similar behavior occurs for subsequent sequential reads taking advantage of the principle of locality which posits that if a particular data location is referenced at a particular time, then it is likely that nearby data locations will be referenced in the near future.
Let’s take an example and see what happens when we open()
a 100K file, read()
a single byte, and close()
it.
Example: Rust
To observe the system calls being made by the process after building we can run
strace target/debug/sequential-reader
To observe the impact to the page cache we can use
vmtouch testfile.txt
to observe the number of pages (and bytes) that were cached to fulfill that request.As expected, since we began a read from the first byte, 4 pages (16K) of our
testfile.txt
data file are resident in the cache after the single read and stay resident after the process has closed.
Example: Java
To observe the system calls being made by the process after building we can run
strace -f java SequentialReader
or filter just the syscalls we care about for now with:strace -f -eread,write,close,/seek,/open java SequentialReader
To observe the impact to the page cache we can use
vmtouch testfile.txt
to observe the number of pages (and bytes) that were cached to fulfill that request.As expected, since we began a read from the first byte, 4 pages (16K) of our
testfile.txt
data file are resident in the cache after the single read and stay resident after the process has closed.
Fig 1: a single read from the first offset of a file will cache multiple pages to reduce future sequential read latency
Random Reads
If, however, we decide to first move the offset elsewhere in the file before making our the first read()
from a file in linux, it will assume you intend to make random access reads so will load only the page relevant to the bytes being read at the current offset3 into the cache due to a cache miss (aka page fault and will behave similarly for subsequent reads.
Let’s take an example and see what happens when we open()
a 100K file, move our offset with a lseek()
, read()
a single byte, and close()
it.
Example: Rust
To observe the system calls being made by the process after building we can run
strace target/debug/nonsequential-reader
To observe the impact to the page cache we can use
vmtouch testfile.txt
to observe the number of pages (and bytes) that were cached to fulfill that request.As expected, since we began a read after the first page (after a 4K seek) only 1 page (4K) of our
testfile.txt
data file is resident in the cache.
Example: Java
To observe the system calls being made by the process after building we run
strace -f java NonSequentialReader
or filter just the syscalls we care about for now with:strace -f -eread,write,close,/seek,/open java NonSequentialReader
To observe the impact to the page cache we can use
vmtouch testfile.txt
to observe the number of pages (and bytes) that were cached to fulfill that request.As expected, since we began a read after the first page (after a 4K seek) only 1 page (4K) of our
testfile.txt
data file is resident in the cache.
Fig 2: a single read after a seek of a file will cache a single page as a random access pattern is assumed in future
Efficient Reads
So given what we now know, making good use of the page cache is the key to efficient reads.
For sequential reads: it would make sense that if we wanted to read a whole file then reading from the beginning to the end using a buffer sized as some page-aligned multiple of 4K would work on a file of any size given default options.
For example we could iterate using 16K byte buffer blocks over a 100K file:
Example: Rust
Produces the following
strace
output when executed on a 100K test data file.
Example: Java
Produces the following
strace
when executed on a 100K test data file:
In practice, though, big data isn’t in the region of kilobytes, it’s gigabytes, terabytes, petabytes of data that must be read and processed efficiently. So picking buffer sizes of 64KB to 256KB for distributed systems, or 512KB to 1MB on local SDDs, or even 4MB buffers for large sequential reads is the norm.
What must be considered, as the buffer sizes gets larger, is the impact on other processes in the system and the cache as well as the CPU, and memory utilization.
For random reads: we would want to have way to have more control what it being cached and when, or at all. We could take advantage of the fact that a page cache is shared for a given having a process, or thread read it ahead of its use in the processing thread. We could even, employ the same concepts by pre-fetching and pooling the data ourselves to be read by our processing thread.
Whilst we won’t look at that here it’s interesting that most languages have some buffered calls for textual data or “read all” calls for binary data that further abstract away the back and forth process of calling read()
, picking the optimal behavior for a given file size.
For, example to read a whole file rust has std::fs::read
and Java has FileInputStream.readAllBytes
(which will crash for anything over 2GB). What is telling is that they exhibit some similar characteristics when a file size goes beyond a given limit:
Example: Rust
strace
for 100K file:
strace
for 1GB file:Automatically,
rust
has decided to make use of some additional system callsmmap
andmunmap
.
Example: Java
Running an
strace
for a 100K file:Running an
strace
for a 1G file leads to lots ofmmap
calls (not all listed):
When dealing with larger files, particularly larger-than-memory files, most I/O libraries tend to utilize mmap() — a system call that will create a direct mapping for the file in the virtual memory space of the process. Why? We’ll consider that in Part II of our series in Big Bytes, Big Data, Intensive IO Part II - Velocity.
Efficient Writes
Until now, we’ve only been considering reading data. However, the behavior behind the scenes when writing data is important to know. We might have read the data to make calculations, analyze, display it, or compress it, or even encrypt it. We may even be building an ETL pipeline or simply just be generating a huge amount of data ourselves.
Does a write()
call just write directly to the device or does it use the page cache? Reasonably given a read()
will go via the page cache, then it would make sense that the write()
would too, since if process A is reading from a file that process B is writing to, then it would (eventually) see those writes and not just a past snapshot view.
Fig 3: a single write() will write data to the page cache to be later flushed to the system using a sync() or fsync()
We can verify this with a code example:
Example: Rust
Output from
strace
:
Example: Java
Output from
strace
Note, in the output of our examples sync()
or fsync()
does not get called. Although a close()
operation may flush our pages into the cache, it will not guarantee it makes it to the disk immediately. That responsibility is left to the kernel unless explicitly called by the process.
At this point we have a good idea about how we can get a large volume of data moving through a system. However, at this point it is still slow.
As we mentioned at the outset, every time we need to make a system call or switch from kernel to user mode and back it comes with cost — permissions checks, superfluous copies, and context switching. All of this can waste time in a data intensive application.
How can we handle volume whilst paying attention to velocity? The next article will discuss this Big Bytes, Big Data, Intensive IO Part II - Velocity.
Disclaimer:
Any views and opinions expressed in this blog are based on my personal experiences and knowledge acquired throughout my career. They do not necessarily reflect the views of or experiences at my current or past employers
Footnotes
-
“What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets.” doi: 10.1177/2053951716631130. Available: https://journals.sagepub.com/doi/epub/10.1177/2053951716631130. [Accessed: Dec. 07, 2024] ↩
-
“Linux Statistics: Dominance in Supercomputers, Cloud, and Mobile Devices | Gitnux.org,” Nov. 26, 2024. Available: https://gitnux.org/linux-statistics/. [Accessed: Dec. 07, 2024] ↩
-
D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. Sebastopol: O’Reilly Media, Inc, 2008. ↩ ↩2 ↩3
-
This doesn’t come pre-installed on Linux systems but you can build it from scratch or use
sudo apt-get install vmtouch
. See https://github.com/hoytech/vmtouch. Alternatively, for checking the number of pages of a file in virtual memory you can usefincore
- though this doesn’t have the ability to evict pages. ↩ -
“Essential Linux Page Cache theory,” Viacheslav Biriukov. Available: https://biriukov.dev/docs/page-cache/2-essential-page-cache-theory/. [Accessed: Dec. 07, 2024] ↩