Tudor Brindus

Sometimes, the kernel lies about process memory usage

2021-07-05T00:00:00+00:00

Here's a short systems debugging story.

On dmoj.ca, we run user-submitted solutions to algorithmic programming problems against a set of input files, and judge their output for correctness. One metric by which solutions are ranked on our leaderboards is memory usage. A user recently reported that some code they had submitted was shown as having consumed 4 KiB of memory, despite their code allocating a 128 KiB array. How come?

This is a story about how sometimes, the kernel lies about memory usage — all in the name of performance.

To start with, here's the (slightly edited) submission in question, solving¹ this problem in Zig:

const std = @import("std");

pub fn main() !void {
    @setRuntimeSafety(false);
    const allocator: *std.mem.Allocator = std.heap.page_allocator;
    const input: []u8 = try allocator.alloc(u8, 131072);
    
    const stdin = std.io.getStdIn().inStream();
    const n = try stdin.readAll(input);

    var s: u32 = 0;
    var t: u32 = 0;
    for (input[0..n]) |b| {
        if (b == 's' or b == 'S') s += 1;
        if (b == 't' or b == 'T') t += 1;
    }

    const stdout = std.io.getStdOut().outStream();
    try stdout.writeAll(if (s >= t) "French" else "English");
}

Now, I don't actually know Zig, but this code seems to pretty clearly allocate a 128 KiB array on the heap.

An early thought I had was, "what if the array only has virtual address space reserved for it, but is never faulted in in its entirety because the input to the program is small?" Then I checked, and it turns out the input to this problem is quite large.

The way the judge determines memory usage is by wait4(2)ing on the submission process until it exits, and then parsing the contents of /proc/${pid}/status for VmHWM² — the "high watermark RSS usage" of the process.
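For illustration, here is roughly what that parsing step looks like in C. This is a sketch and not the judge's actual code (the judge is written in Python); parse_status_kib is a made-up helper name, and it assumes the contents of the status file have already been read into a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find a "Key:   <n> kB" line in the text of
 * /proc/<pid>/status and return <n>, or -1 if the key is missing. */
long parse_status_kib(const char *status_text, const char *key)
{
    assert(status_text != NULL && key != NULL);
    size_t key_len = strlen(key);
    const char *line = status_text;

    while (*line != '\0') {
        /* The key must start a line and be followed directly by ':'. */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            long kib;
            /* %ld skips the tab and spaces that pad the value. */
            if (sscanf(line + key_len + 1, "%ld", &kib) == 1)
                return kib;
            return -1;
        }
        const char *newline = strchr(line, '\n');
        if (newline == NULL)
            break;
        line = newline + 1;
    }
    return -1;
}
```

Calling parse_status_kib(text, "VmHWM") on the status text of a process gives the high watermark in KiB.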

Anyway, enough beating around the bush, let's run some code in GDB.

$ zig build-exe --release-safe test.zig --name test
$ gdb ./test
(gdb) catch syscall exit_group
Catchpoint 1 (syscall 'exit_group' [231])
(gdb) run
Starting program: /tmp/test 

I set up a breakpoint on exit_group(2) so that we can inspect the state right before the process exits.

Since the program asked for some input, I gave it some sample test data from the problem.

3
The red cat sat on the mat.
Why are you so sad cat?
Don't ask that.
^D

English

Catchpoint 1 (call to syscall exit_group), std.os.linux.exit_group (status=<optimized out>)
    at /opt/zig/lib/zig/std/os/linux.zig:556
556	   unreachable;

The program read the input, outputted the answer (English), and hit our breakpoint on exit_group(2). Time to poke around and see what we can find.

First, we can confirm that VmRSS and VmHWM for this process indeed still say 4 KiB.

$ grep -E 'Vm(HWM|RSS)' /proc/9253/status 
VmHWM:	     4 kB
VmRSS:	     4 kB

That's certainly odd, but confirms what the user reported.

Back to GDB, where is this input array actually located?

(gdb) info proc mappings
process 9253
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x200000           0x202000     0x2000        0x0 /tmp/test
            0x202000           0x213000    0x11000     0x1000 /tmp/test
            0x213000           0x214000     0x1000    0x11000 /tmp/test
      0x7ffff7fd9000     0x7ffff7ff9000    0x20000        0x0 
      0x7ffff7ff9000     0x7ffff7ffd000     0x4000        0x0 [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0 [vdso]
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]

Making an educated guess, it's located at 0x7ffff7fd9000, since the size 0x20000 is 128 KiB.

(gdb) x/32c 0x7ffff7fd9000
0x7ffff7fd9000:	51 '3'	10 '\n'	84 'T'	104 'h'	101 'e'	32 ' '	114 'r'	101 'e'
0x7ffff7fd9008:	100 'd'	32 ' '	99 'c'	97 'a'	116 't'	32 ' '	115 's'	97 'a'
0x7ffff7fd9010:	116 't'	32 ' '	111 'o'	110 'n'	32 ' '	116 't'	104 'h'	101 'e'
0x7ffff7fd9018:	32 ' '	109 'm'	97 'a'	116 't'	46 '.'	10 '\n' 87  'W' 104 'h'

Bingo, that's our sample input.

After staring at this for a few minutes, I had a bit of inspiration and took a look at /proc/${pid}/smaps, which reports per-segment information in more detail.

$ cat /proc/9253/smaps
...
7ffff7fd9000-7ffff7ff9000 rw-p 00000000 00:00 0 
Size:                128 kB
...
Rss:                 128 kB
...

Here, Rss is clearly being reported as 128 KiB for our allocation. Why is VmRSS not agreeing?
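To cross-check the two files programmatically, one could add up every Rss line in smaps and compare the total against VmRSS. A minimal sketch, assuming the smaps contents are already in memory (sum_smaps_rss_kib is a name I made up for this post):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sum every "Rss: <n> kB" line in smaps-formatted text, in KiB. */
long sum_smaps_rss_kib(const char *smaps_text)
{
    assert(smaps_text != NULL);
    long total = 0;
    const char *line = smaps_text;

    while (*line != '\0') {
        long kib;
        /* Only lines that begin with "Rss:" count; Size:, Pss: etc. do not. */
        if (strncmp(line, "Rss:", 4) == 0 && sscanf(line + 4, "%ld", &kib) == 1)
            total += kib;
        const char *newline = strchr(line, '\n');
        if (newline == NULL)
            break;
        line = newline + 1;
    }
    return total;
}
```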

Armed with the knowledge that something funky is up with the RSS reporting, I turned to the time-honored tradition of grepping the kernel source for vague strings. In this case, searching for rss quickly brings up a likely culprit in lines 191-200 of mm/memory.c (as of kernel v5.10).

/* sync counter once per 64 page faults */
#define TASK_RSS_EVENTS_THRESH	(64)
static void check_sync_rss_stat(struct task_struct *task)
{
	if (unlikely(task != current))
		return;
	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
		sync_mm_rss(task->mm);
}
#else /* SPLIT_RSS_COUNTING */

"Sync counter once every 64 page faults", well, that'd do it. When, and why was this code added?

((v5.10)) $ git log -L191,+1:mm/memory.c
commit 34e55232e59f7b19050267a05ff1226e5cd122a5
Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date:   Fri Mar 5 13:41:40 2010 -0800

    mm: avoid false sharing of mm_counter
    
    Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.
    
    This patch adds per-thread cache for mm_counter.  RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.
    
    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.
    
     rough estimation with small benchmark on parallel thread (2threads) shows
     [before]
         4.5 cache-miss/faults
     [after]
         4.0 cache-miss/faults
     Anyway, the most contended object is mmap_sem if the number of threads grows.
    
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -125,0 +152,1 @@
+/* sync counter once per 64 page faults */

So, this is a performance optimization. Instead of all threads contending on updates to the same global RSS counters, each thread maintains its own counters and only updates the global counters every 64 page faults. From afar and without any deeper context, this seems reasonable.

Since Zig doesn't link libc in all its 1.8 MiB glory, our Zig program finishes executing having faulted fewer than 64 pages (the 128 KiB buffer itself is only 32 pages of 4 KiB each), and is therefore incorrectly reported in the global counters, and thus in VmHWM, as having only faulted a single 4 KiB page.

Three things stand out to me from this:

There is no real workaround if you care about the HWM, like we do. /proc/${pid}/smaps provides accurate info for RSS, but not HWM (which only makes sense globally, not per-segment).
There is no way to turn this off, even at compile time. For getting accurate results on dmoj.ca, we're now running a patched kernel where we quite literally comment this code out. Maintaining patched kernels makes me sad.
The global counters are not synced on thread exit. This one kind of sounds like a bug; one can imagine a program mmap(2)ing a large chunk of memory and spinning up many threads that each fault 63 pages before exiting. VmRSS and VmHWM sound like they'd be wildly off in this case.

It turns out we're not the first to run into this inaccuracy. Prior to a patch from October 2020 titled "Document inaccurate RSS due to SPLIT_RSS_COUNTING", this behavior was totally undocumented. (The patch updates man 5 proc with a note regarding the inaccuracy in RSS accounting, but as of this writing there's still no mention in man 2 getrusage³.)

The thread is worth a read, but in summary:

There are weird cases where the accounting can be off by more than 63 pages per thread; and
There is uncertainty among the kernel maintainers about whether the performance benefit of split counters is worth the poor accounting. That doubt gets a +1 from me, at least.

…and, that's all I got for today.

¹ This code does a "classic" competitive programming trick of pre-buffering the entire input in order to avoid calling read(2) more than necessary. System calls are expensive!
² Why not just use the struct rusage *rusage populated by wait4(2), and grab ru_maxrss from it instead of parsing VmHWM out of /proc/${pid}/status? This is subtle enough that Guanzhong had to point it out while reading an early draft of this post, but ru_maxrss is reset on fork(2), while VmHWM is reset on exec(2). If we were to use ru_maxrss, the minimum possible memory usage reported by a submission would be that of the judge process at fork(2) time. The judge is written in Python, so this would be tens of megabytes.
³ Time permitting, I intend to send in a patch updating man 2 getrusage, and maybe another for syncing the counters on thread exit. Or maybe getrusage(2) is somehow fine? I haven't checked this.

Peeking under the hood of GCC's `__builtin_expect`

2020-03-23T00:00:00+00:00

If you've ever poked at high-performance C code, you've probably seen GCC's __builtin_expect extension being used to manually hint the likelihood of a branch being taken a particular way.

The Linux kernel famously contains macros for likely and unlikely branches, which perform the appropriate __builtin_expect incantations.

#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr)   __builtin_expect(!!(expr), 1)

…but, how does this all work? What does "hinting" mean, exactly, and how does __builtin_expect translate to generated assembly?

Let's write a short exploratory program to find out.

int main(int argc, char **argv) {
  volatile int x;

  if (__builtin_expect(argc % 2, EXPECT)) {
    x = 1;
  } else {
    x = 0;
  }

  return x;
}

The program returns 1 if the number of parameters it is passed is even, and 0 otherwise (remember that the executable name is, by convention, always passed in argv[0]). We use a compile-time define, EXPECT, as a parameter to __builtin_expect. Our return value, x, is marked as volatile to prevent the compiler from optimizing it out.

We can compile two versions of this binary: one with EXPECT = 1 and one with EXPECT = 0, and see how they differ. Recall that for EXPECT = 1, we are telling the compiler that we expect the x = 1 branch to be more likely, and vice-versa.

$ gcc -g -O2 -DEXPECT=1 expect.c -o expect1
$ gcc -g -O2 -DEXPECT=0 expect.c -o expect0

There are many ways to view the generated assembly of a binary, but for this post I'll be using an invocation of gdb.

$ gdb -batch -ex "disassemble/m main" ./expect1

(If you are following along online, you can check out this code on gcc.godbolt.org, and play with the value of EXPECT in the top-right box.)

Without further ado, below is the disassembly of expect1:

Dump of assembler code for function main:
1	int main(int argc, char **argv) {

2	  volatile int x;

3	
4	  if (__builtin_expect(argc % 2, EXPECT)) {
   0x0000000000001040 <+0>:	and    edi,0x1
   0x0000000000001043 <+3>:	je     0x1052 <main+18>

5	    x = 1;
   0x0000000000001045 <+5>:	mov    DWORD PTR [rsp-0x4],0x1

6	  } else {
7	    x = 0;
   0x0000000000001052 <+18>:	mov    DWORD PTR [rsp-0x4],0x0
   0x000000000000105a <+26>:	jmp    0x104d <main+13>

8	  }
9	
10	  return x;
   0x000000000000104d <+13>:	mov    eax,DWORD PTR [rsp-0x4]
   0x0000000000001051 <+17>:	ret

End of assembler dump.

A short summary of what's happening here:

edi stores the value of argc (System V x86-64 ABI; recall edi is the lower 32 bits of rdi).
Since division is expensive, GCC has replaced our % 2 with & 1.
If the lowest bit in argc is 0, je will jump to a mov for x = 0.
Otherwise, x = 1 will be executed, falling straight through to the end of main and ret-urning; it is the x = 0 path that has to jmp back to the return.
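The % 2 to & 1 rewrite is only valid here because the result is used as a truth value. For a negative odd int, n % 2 is -1 while n & 1 is 1, but both are non-zero. A quick sketch to convince yourself (the function names are mine, and it assumes a two's complement int, which is what GCC targets use):

```c
#include <assert.h>
#include <stdbool.h>

/* The condition as written in the source. */
bool is_odd_mod(int n) { return n % 2; }

/* The condition as GCC emitted it. */
bool is_odd_and(int n) { return n & 1; }

/* Returns true if both forms agree on every value in [lo, hi]. */
bool forms_agree(int lo, int hi)
{
    assert(lo <= hi);
    for (long n = lo; n <= hi; n++)
        if (is_odd_mod((int)n) != is_odd_and((int)n))
            return false;
    return true;
}
```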

Still, there's nothing that immediately stands out for where the branch predictor hinting is happening. There's no magical hint instruction, at any rate.

What if we were to diff the disassembly of expect0 and expect1 instead?

$ gdb -batch -ex "disassemble/m main" ./expect1 > expect_1
$ gdb -batch -ex "disassemble/m main" ./expect0 > expect_0
$ git diff expect_0 expect_1

Now we're getting somewhere!

diff --git a/expect_0 b/expect_1
index a80a1bd..17f3458 100644
--- a/expect_0
+++ b/expect_1
@@ -6,15 +6,15 @@ Dump of assembler code for function main:
 3
 4        if (__builtin_expect(argc & 1, EXPECT)) {
    0x0000000000001040 <+0>:    and    edi,0x1
-   0x0000000000001043 <+3>:    jne    0x1052 <main+18>
+   0x0000000000001043 <+3>:    je     0x1052 <main+18>

 5          x = 1;
-   0x0000000000001052 <+18>:   mov    DWORD PTR [rsp-0x4],0x1
-   0x000000000000105a <+26>:   jmp    0x104d <main+13>
+   0x0000000000001045 <+5>:    mov    DWORD PTR [rsp-0x4],0x1

 6        } else {
 7          x = 0;
-   0x0000000000001045 <+5>:    mov    DWORD PTR [rsp-0x4],0x0
+   0x0000000000001052 <+18>:   mov    DWORD PTR [rsp-0x4],0x0
+   0x000000000000105a <+26>:   jmp    0x104d <main+13>

 8        }
 9

Clearly, the branch order is reversed, and jne is used in place of je.

In other words, the "preferred path" for the branch predictor, when it has no historical branch data to base a prediction off of, is the fall-through path (i.e. the code immediately following the conditional jump). __builtin_expect then simply reorders code such that the fall-through path contains the programmer-specified most-likely path, and negates the jump condition as necessary.
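It's worth stressing that the hint only affects code layout, never results: __builtin_expect evaluates to its first argument. Here's a small sketch using the kernel-style macros from earlier. The function is something I made up for illustration, and it needs GCC or Clang to compile.

```c
#include <assert.h>
#include <stddef.h>

#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr)   __builtin_expect(!!(expr), 1)

/* Sum the non-negative entries, hinting that negatives are the rare case. */
long sum_nonnegative(const int *values, size_t count)
{
    assert(values != NULL || count == 0);
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (unlikely(values[i] < 0))
            continue;            /* hinted as the cold path */
        total += values[i];      /* hinted as the hot path */
    }
    return total;
}
```

Swapping unlikely for likely here would change which block the compiler prefers to place on the fall-through path, but not the sum.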

To me at least, this behaviour does seem pretty magical, and I was surprised not to be able to readily find it mentioned online in the conventional sources of programmer wisdom (i.e. StackOverflow).

If one digs deep enough in the Intel 64 and IA-32 Architectures Optimization Reference Manual, one can find a reference for this behaviour on page 105 (emphasis mine):

3.4.1.6 Branch Type Selection

The default predicted target for indirect branches and calls is the fall-through path. Fall-through prediction is overridden if and when a hardware prediction is available for that branch. The predicted branch target from branch prediction hardware for an indirect branch is the previously executed branch target.

The more you know.

On online judging, part 5: optimizing `ptrace` filtering with `seccomp`

2019-01-04T00:00:00+00:00

In part 1 of this series, I mentioned that the overhead of a pure ptrace-based sandbox is about 10%. In hindsight, this number is very optimistic — it can be as high as 50% for some workloads — but understanding why requires a bit of background on how the judge keeps track of submission time.

In this post, we'll discuss both submission time-keeping, and a simple but effective method to reduce sandboxing overhead using seccomp alongside ptrace.

Understanding the Problem

Any judge needs to keep track of how long a submission runs, to implement things like time limits.

A simple method of accounting for time spent in a submission involves continuously waiting for a process to be signalled, and keeping track of how long was spent wait-ing. This is more "fair" than strictly timing how long it takes until the process exits, since it excludes the time spent filtering syscalls in the judge code.

total_time = 0
while True:
  start = time()
  wait for process to be signalled
  end = time()

  total_time += end - start
  if total_time > time_limit:
    kill process

  # If signal was for a syscall event (SIGTRAP) and not SIGWINCH, validate it
  resume process
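In C, the "time only the wait" part of the pseudocode above might look like the sketch below. It uses plain waitpid(2) and a monotonic clock; timed_wait and demo_time_exiting_child are hypothetical names, and the ptrace resume and syscall validation steps are left out entirely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Current time on the monotonic clock, in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Block until `pid` changes state and add the time spent blocked to
 * *total_time. Returns whatever waitpid returned. */
pid_t timed_wait(pid_t pid, int *status, double *total_time)
{
    assert(status != NULL && total_time != NULL);
    double start = now_seconds();
    pid_t result = waitpid(pid, status, 0);
    *total_time += now_seconds() - start;
    return result;
}

/* Demo: fork a child that exits at once with `code`, and time the wait. */
double demo_time_exiting_child(int code, int *status)
{
    double total_time = 0.0;
    pid_t child = fork();
    if (child == 0)
        _exit(code);             /* child: terminate immediately */
    if (child > 0)
        timed_wait(child, status, &total_time);
    return total_time;
}
```

Anything the tracer does between two calls to timed_wait is not charged to the submission, which is the point of the scheme.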

We receive signals for all syscall invocations when ptrace-ing, but in case a process doesn't use any for a long time, we have also set up a task to periodically send SIGWINCH (a harmless signal that's normally ignored) just to force wait to return for time-keeping purposes.

while True:
  signal process with SIGWINCH
  sleep(0.1)

Then, it's very simple to compute a naive measure of overhead: divide total_time by the total CPU time used by the process. Indeed, for most submissions, this figure will be less than 10% — but it doesn't tell the whole story. What we don't (and can't) easily measure is the overhead of the context switch from the submission to the tracer.

Every time a context switch happens, the submission's performance suffers from the invalidation of the memory it was relying on to be cached. For instance, tight loops over multi-dimensional matrices are very common, so you'll often see user code that looks like this:

matrix = ... # 10005 by 10005 matrix

for i in range(10005):
  for j in range(10005):
    val = ... matrix[i][j] ... # Some computation using the indices
    if <some condition>:
      print(val)  # This will cause a `write` syscall

This code should be very easy for the processor to optimize: the memory accesses are predictable, and there would likely be very few TLB misses when running, since the translations for the memory pages used (for matrix) stay in the processor's TLB (a cache of address translations).

When we force a context switch, however, we greatly increase the likelihood of those cache entries being flushed by the time we resume the process (on processors without process-context identifiers, e.g. pre-2010 Intel processors, we basically guarantee a TLB flush). So, when the process resumes, it will bear the overhead of having to miss the cache for memory accesses that should have been in cache if running without the tracer.

As a result, when comparing the time the process takes to execute without the tracer versus with the tracer, we can (for some extreme cases) see 2x or greater speedups, despite the overhead continuing to be reported as less than 10%.

A Brief Introduction to `seccomp`

Now that we've dissected the problem, let's talk about the solution: seccomp.

seccomp was initially introduced in kernel version 2.6.12 (2005), and allowed a one-way transition for a process to a secure state where it could only invoke the read, write, exit, and sigreturn syscalls. This may be useful for some secure computing projects, but it's not enough for running modern runtimes like Python. Thankfully, that's no longer the case.

In 2012, as part of the 3.5 kernel, seccomp was extended to provide programmable BPF filters. With this new functionality, it's possible to write more expressive filters to implement syscall validation in the kernel, without trapping into ptrace. This LKML article provides a good, accessible overview of how seccomp BPF works.

The key point is that we can add rules to seccomp to unconditionally allow very frequent, safe syscalls, and instruct it to hand control of the process over to a ptrace tracer when a dangerous syscall is encountered.

// In the tracer:
ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACESECCOMP /* instead of PTRACE_O_TRACESYSGOOD */);

// In the submission, before `exec`-ing:
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRACE(0));
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_load(ctx);
seccomp_release(ctx);
execve(...);

With the above filter, read and write will always be unconditionally allowed, while e.g. open will cause a ptrace event and stop the process for inspection. That's just what we need!
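For the curious, here's roughly what such a filter looks like as a raw BPF program, without libseccomp. This is a sketch for illustration: toy_filter and toy_run_filter are names I made up, the tiny interpreter only understands the three instruction kinds used, and a real filter should also check seccomp_data.arch before trusting the syscall number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Allow read and write; hand every other syscall to the ptrace tracer. */
static const struct sock_filter toy_filter[] = {
    /* A = seccomp_data.nr (the syscall number) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* if (A == __NR_read) skip 2 instructions, landing on ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 2, 0),
    /* if (A == __NR_write) skip 1 instruction, landing on ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Userspace interpreter for just the instructions above, to show what
 * the kernel would decide for a given syscall number. */
uint32_t toy_run_filter(int syscall_nr)
{
    struct seccomp_data data = { .nr = syscall_nr };
    uint32_t acc = 0;
    size_t count = sizeof(toy_filter) / sizeof(toy_filter[0]);

    for (size_t pc = 0; pc < count; pc++) {
        const struct sock_filter *insn = &toy_filter[pc];
        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            assert(insn->k == offsetof(struct seccomp_data, nr));
            acc = (uint32_t)data.nr;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Jump offsets are relative to the next instruction. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        }
    }
    return SECCOMP_RET_KILL; /* not reached for toy_filter */
}
```

Actually installing a program like this goes through prctl(2) with PR_SET_NO_NEW_PRIVS and PR_SET_SECCOMP in SECCOMP_MODE_FILTER; the libseccomp calls shown earlier take care of that for you.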

Comparing `ptrace` and `ptrace` + `seccomp` Tracing

So far, we've established that ptrace is slow, and that seccomp can make things better.

However, their interfaces are somewhat different, so it's worth discussing how standard ptrace actions (like cancelling syscalls and changing their return values) can be implemented with seccomp-enabled ptrace.

Let's say a user submission wants to open a file, and we end up denying it because it's a file they shouldn't be accessing. We want open to return ENOENT, so that the user submission can respond accordingly. Here's roughly what happens under the hood with ptrace:

user submission calls open
[pre-syscall event transfers control to judge]
judge reads syscall number and arguments from registers; validates them
judge sets syscall number to something harmless and fast (getpid)
[judge resumes process]
kernel executes getpid
[post-syscall event transfers control to judge]
judge sets return value register to -ENOENT, thereby "cancelling" the open syscall
[judge resumes process]
user submission's open call returns with ENOENT

We can see that for every syscall the user submission performs, we trap and stop the process two times. This is not at all cache-friendly to the submission.

When tracing with seccomp-enabled ptrace, however, we do not have pre- and post-syscall events: we are only notified before a syscall takes place. To support functionality like cancelling syscalls, seccomp allows tracers to set the syscall register to -1 on the pre-syscall event. This instructs the kernel to skip the syscall, returning with the register set as the tracer left it.

user submission calls open
[pre-syscall event transfers control to judge]
judge reads syscall number and arguments from registers; validates them
judge sets syscall number to -1, and return value register to -ENOENT
[judge resumes process]
user submission's open call returns with ENOENT

Here, we only have two context switches as opposed to four, but more importantly, we only run through this logic on unsafe syscalls like open. Syscalls that dominate a submission's lifespan, like read or write, never trigger seccomp to signal the process. That's a huge win for performance.
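On x86-64, the register fiddling in the seccomp flow comes down to two fields of struct user_regs_struct. The sketch below is x86-64 only and is not lifted from the DMOJ sandbox; both function names are made up. Note that at this level the return register carries the negated errno, which libc later turns into errno = ENOENT and a -1 return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Rewrite a saved register set so that the pending syscall is skipped
 * and the caller sees -err as the raw result. x86-64 only. */
void mark_syscall_denied(struct user_regs_struct *regs, int err)
{
    assert(regs != NULL && err > 0);
    regs->orig_rax = (unsigned long long)-1;            /* syscall number -1: skip it */
    regs->rax = (unsigned long long)(-(long long)err);  /* raw result, e.g. -ENOENT */
}

/* Apply the rewrite to a tracee stopped at a seccomp event.
 * Returns 0 on success, -1 if either ptrace call fails. */
int deny_pending_syscall(pid_t pid, int err)
{
    struct user_regs_struct regs;
    if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1)
        return -1;
    mark_syscall_denied(&regs, err);
    if (ptrace(PTRACE_SETREGS, pid, NULL, &regs) == -1)
        return -1;
    return 0;
}
```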

Conclusion

As of January 2019, we've been taking a seccomp + ptrace approach in the sandbox for the DMOJ, so as always you may peek at the source code to see how the ideas expressed in this post can be implemented in practice.

Empirically, speedups have been noticeable since upgrading the sandbox to use seccomp. They are particularly apparent for interactive tasks, which require frequent flushing of standard output, but can be felt to some extent across most problems.

There are further optimizations possible with this approach (for instance, the order rules are added to the filter matters, since they're evaluated in the order they were added — so there's an incentive to having the most common syscalls listed first), but perhaps they'll be the subject of a future post.

Emulating microprocessors with macros

2018-12-11T00:00:00+00:00

Whenever I work on an emulator (having written several in the past), I try to make my life as interesting as possible. After all, implementing hundreds of opcodes can be a very dull task.

Most recently, I joked that C macros were powerful enough for it to be feasible to implement a simple architecture in them. One thing led to another, with the result being an Intel 8080 emulator core implemented purely with C macros.

In this post, I'll go over the awful hacks that helped make this monstrosity a reality… and why perhaps it's not such a bad idea to write an emulator in macros.

Quick Intel 8080 Refresher

The Intel 8080 is an 8-bit microprocessor with a 16-bit address bus, released back in 1974. It features 8-bit general-purpose registers B, C, D, E, H, and L, alongside an accumulator A. The general-purpose registers can be treated as 16-bit "pairs" BC, DE, and HL, allowing for 16-bit operations to be performed. Memory is addressed through the register pair HL.

A table of supported opcodes and their mnemonics may be found here.

Motivation for Macros

Macros don't just needlessly complicate the development of an emulator. There is, in fact, a very tangible benefit to using macros in implementing instructions — performance.

A typical 8080 instruction is encoded as an 8-bit value, with register operands embedded in the opcode. The encoding for all MOV instructions is shown below.

01 aaa bbb
|   |   └── source register
|   └── destination register
└── MOV prefix   

For instance, the assembly MOV A, B is encoded as 01 111 000; 111 identifies register A, and 000 identifies register B. A traditional emulator might implement MOV with a generic, lookup-based approach:

void MOV(uint8_t opcode) {
  uint8_t src = opcode & 0x7;
  uint8_t dst = (opcode >> 3) & 0x7;
  set_reg(dst, get_reg(src));  // set_reg and get_reg are switch-based lookups
}

From a software engineering standpoint, this is an ideal implementation. MOV is succinct, and set_reg/get_reg are dedicated helpers that can be reused in future code. A+ for code quality.

And yet, this approach is suboptimal in the event that we're optimizing for performance, rather than readability. There is a large overhead (in terms of host machine cycles) for the set_reg and get_reg routines that can't easily be eliminated. The compiler might end up inlining the routines and removing the overhead of the 2 calls, but the approach still requires mapping a 3-bit register ID to the actual register, twice per instruction.
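To make that mapping cost concrete, here is one way the switch-based helpers could be written. It's a sketch: struct cpu and reg_slot are hypothetical, the helpers take an explicit cpu pointer (unlike the two-argument calls above), and register ID 110 (M, memory through HL) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file for the lookup-based approach. */
struct cpu { uint8_t a, b, c, d, e, h, l; };

/* Map a 3-bit register ID to its storage; NULL for unsupported IDs. */
static uint8_t *reg_slot(struct cpu *cpu, uint8_t id)
{
    switch (id) {
    case 0: return &cpu->b;
    case 1: return &cpu->c;
    case 2: return &cpu->d;
    case 3: return &cpu->e;
    case 4: return &cpu->h;
    case 5: return &cpu->l;
    case 7: return &cpu->a;
    default: return NULL;
    }
}

uint8_t get_reg(struct cpu *cpu, uint8_t id)
{
    uint8_t *slot = reg_slot(cpu, id);
    assert(slot != NULL);
    return *slot;
}

void set_reg(struct cpu *cpu, uint8_t id, uint8_t value)
{
    uint8_t *slot = reg_slot(cpu, id);
    assert(slot != NULL);
    *slot = value;
}

/* Generic MOV: two lookups on every single execution. */
void mov(struct cpu *cpu, uint8_t opcode)
{
    uint8_t src = opcode & 0x7;
    uint8_t dst = (opcode >> 3) & 0x7;
    set_reg(cpu, dst, get_reg(cpu, src));
}
```

Every MOV goes through reg_slot twice, even though the opcode byte already says exactly which two registers are involved.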

Instead, what if we used macros to generate code for all possible variants of an instruction? Illustrating with an example, MOV can be implemented like this:

#define MOV(X, Y) \
{                 \
    X = Y;        \
}

Of course, we need to make sure that we generate code for all valid permutations of X and Y, but this is easy to do programmatically.
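One way to do that generation without leaving the preprocessor is an X-macro over the register list. This is just a sketch of the idea: FOR_EACH_SRC, MOV_CASE, MOV_ROW, struct regs and exec_mov are all names invented here, registers live in a struct for the sake of a testable function, and the M operand (code 110) is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOV(X, Y) \
{                 \
    X = Y;        \
}

/* Every source register with its 3-bit code, paired with one destination. */
#define FOR_EACH_SRC(CASE, dst, dst_code)               \
    CASE(dst, dst_code, B, 0) CASE(dst, dst_code, C, 1) \
    CASE(dst, dst_code, D, 2) CASE(dst, dst_code, E, 3) \
    CASE(dst, dst_code, H, 4) CASE(dst, dst_code, L, 5) \
    CASE(dst, dst_code, A, 7)

/* One `case` per opcode of the form 01 ddd sss. */
#define MOV_CASE(dst, dst_code, src, src_code) \
    case 0x40 | (dst_code << 3) | src_code: MOV(r->dst, r->src) break;

#define MOV_ROW(dst, dst_code) FOR_EACH_SRC(MOV_CASE, dst, dst_code)

struct regs { uint8_t A, B, C, D, E, H, L; };

/* Execute one register-to-register MOV. Returns 1 if the opcode was one
 * of the 49 generated variants, 0 otherwise. */
int exec_mov(struct regs *r, uint8_t opcode)
{
    assert(r != NULL);
    switch (opcode) {
    MOV_ROW(B, 0) MOV_ROW(C, 1) MOV_ROW(D, 2) MOV_ROW(E, 3)
    MOV_ROW(H, 4) MOV_ROW(L, 5) MOV_ROW(A, 7)
    default:
        return 0;
    }
    return 1;
}
```

Each generated case body is a plain assignment between two fixed registers, so there is no register ID decoding left at run time.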

Dispatching Instructions to Macros

At the core of every emulator is a tight loop that increments the program counter, fetches an opcode, and transfers control to the appropriate opcode handler. If all our opcodes are implemented as macros, how can we achieve this?

One option is we can make use of GCC's support for taking the address of labels by generating a label for each opcode, and storing its address in a lookup table that we can later branch into. A straightforward application of this idea would look something like this:

#define DONE goto done_opcode;
#define MOV(X, Y) \
{                 \
  X = Y;          \
  DONE;           \
}

void run_forever(void) {
  // Declare all registers, memory, etc.
  ...
  
  static void *ops[] = {
    ... &&MOV_A_B, &&MOV_A_C, &&MOV_A_D, ...
  };

  while (1) {
    goto *ops[memory[PC++]];
    done_opcode: ;
  }

  ...
  MOV_A_B:
    MOV(A, B)
  MOV_A_C:
    MOV(A, C)
  MOV_A_D:
    MOV(A, D)
  ...
}

The observant reader will notice that it's almost as if we're treating MOV as a function, substituting its invocation and return with gotos. If we didn't want to rely on GCC-specific extensions, we could instead implement this functionality with regular functions. That said, by branching within the same function, we save the overhead stack frame management imposes.

You can check out this example on godbolt.org to view a colorized disassembly of the above code. It's worth noting that in the end, the greatest overhead for our MOV instructions becomes goto *ops[memory[PC++]], which is an operation we'd have to perform regardless of how our opcode was implemented — good!
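If you'd rather run something locally, here is a cut-down version of the dispatch loop that actually compiles (with GCC or Clang, since it relies on labels as values). It's a sketch under a few assumptions of mine: only MOV A,B (0x78) and MOV A,C (0x79) are wired up, every other byte is treated as a halt, and the program must contain such a byte so the loop terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DONE goto done_opcode;
#define MOV(X, Y) \
{                 \
  X = Y;          \
  DONE;           \
}

/* Run `program` until the first byte that is not a wired-up MOV and
 * return the final value of A. */
uint8_t run_until_halt(const uint8_t *program, uint8_t b_init, uint8_t c_init)
{
    assert(program != NULL);
    uint8_t A = 0, B = b_init, C = c_init;
    size_t PC = 0;

    /* Dispatch table: default every opcode to HALT, then fill in MOVs. */
    void *ops[256];
    for (size_t i = 0; i < 256; i++)
        ops[i] = &&HALT;
    ops[0x78] = &&MOV_A_B;
    ops[0x79] = &&MOV_A_C;

    while (1) {
        goto *ops[program[PC++]];
        done_opcode: ;
    }

MOV_A_B:
    MOV(A, B)
MOV_A_C:
    MOV(A, C)
HALT:
    return A;
}
```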

Handling Register Pairs

As I mentioned earlier, the 8080 is an 8-bit processor that can work on 16-bit data in the form of "register pairs" BC, DE, and HL.

If we want to be able to use macros everywhere, we need some efficient way to enforce consistency between the values assigned to BC, and the values assigned to the individual B and C registers. A simple solution here is to make use of a union of two 8-bit values and a 16-bit value, and some macros to hide the underlying complexity. Thankfully, GCC does a good job at optimizing these accesses.

typedef union {
  uint16_t pair;
  uint8_t reg[2];
} regpair_t;

register uint8_t A;
register regpair_t bc, de, hl;

#define B bc.reg[1]
#define C bc.reg[0]
#define BC bc.pair
#define D de.reg[1]
#define E de.reg[0]
#define DE de.pair
#define H hl.reg[1]
#define L hl.reg[0]
#define HL hl.pair

This will magically work on little-endian systems, due to the order in which the bytes of pair are stored. You're out of luck on a big-endian system, but those are pretty rare to come by these days.
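A quick way to see the byte-order dependence is to write through pair and read back through reg. The helpers below are hypothetical and exist only for this demonstration; reading a union member other than the one last written is fine in C, though not in C++.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef union {
  uint16_t pair;
  uint8_t reg[2];
} regpair_t;

/* Returns 1 on a little-endian host, where reg[0] aliases the low byte. */
int host_is_little_endian(void)
{
    regpair_t probe;
    probe.pair = 0x0001;
    return probe.reg[0] == 0x01;
}

/* Split a 16-bit value using the same indices as the B and C macros. */
void split_pair(uint16_t value, uint8_t *high, uint8_t *low)
{
    assert(high != NULL && low != NULL);
    regpair_t bc;
    bc.pair = value;
    *high = bc.reg[1];   /* what the B macro reads */
    *low  = bc.reg[0];   /* what the C macro reads */
}
```

On a little-endian machine, splitting 0x1234 gives 0x12 for B and 0x34 for C; on a big-endian one the two come out swapped, which is exactly the failure mode described above.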

Putting it All Together

With all the plumbing done, all that's left is to implement the rest of the opcodes. This isn't particularly hard, nor is it enlightening. One thing that deserves special attention is that due to our macros resembling functions, it is easy to forget that they are, in fact, macros — and that any use of a parameter re-evaluates it. This can lead to very subtle, hard-to-find bugs.
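Here's the kind of bug I mean, boiled down. None of this is from macro8080: PUSH16_BUGGY, PUSH16, fetch16 and the tiny stack are invented for the example, with a call counter standing in for a side effect such as advancing the program counter.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t stack_mem[16];
static int sp = 16;
static int fetch_calls = 0;

/* Stand-in for an operand fetch with a side effect (a call counter). */
static uint16_t fetch16(void)
{
    fetch_calls++;
    return 0x1234;
}

/* Buggy: V appears twice, so an argument with side effects runs twice. */
#define PUSH16_BUGGY(V) do {                    \
    stack_mem[--sp] = (uint8_t)((V) >> 8);      \
    stack_mem[--sp] = (uint8_t)((V) & 0xff);    \
} while (0)

/* Fixed: evaluate V exactly once into a temporary. */
#define PUSH16(V) do {                          \
    uint16_t value_ = (V);                      \
    stack_mem[--sp] = (uint8_t)(value_ >> 8);   \
    stack_mem[--sp] = (uint8_t)(value_ & 0xff); \
} while (0)

/* Number of times fetch16() runs for a single buggy push. */
int fetch_calls_buggy(void)
{
    fetch_calls = 0;
    sp = 16;
    PUSH16_BUGGY(fetch16());
    assert(sp == 14);
    return fetch_calls;
}

/* Number of times fetch16() runs for a single fixed push. */
int fetch_calls_fixed(void)
{
    fetch_calls = 0;
    sp = 16;
    PUSH16(fetch16());
    assert(sp == 14);
    return fetch_calls;
}
```

Both versions push the same two bytes here, which is what makes this class of bug hard to spot: the damage is in the extra side effect, not in the value.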

In the end, a purely macro-based approach appears to be about an order of magnitude faster than a traditional function approach. This was tested (admittedly not rigorously) by measuring the time it took to complete all the test runs of the famous 8080EXER.COM program between macro8080 (the embodiment of the ideas expressed in this post) and several other hobby emulators found on GitHub.

I guess you can sacrifice readability for performance!

Correct usage of `LD_PRELOAD` for hooking `libc` functions

2018-11-18T00:00:00+00:00

LD_PRELOAD is a very powerful feature supported by the dynamic linker on most Unixes that allows shared libraries to be loaded before others (including libc). This makes it very useful for hooking libc functions to observe or modify the behaviour of 3rd-party applications whose source you do not control.

Unfortunately, a lot of what's been written on the subject online is subtly wrong: not wrong enough to fail outright, but just wrong enough to bite you when you least expect it. In this post I'll first go over the incorrect approach often described, analyze why it's wrong, and then describe the easy fix.

A simple program

Let's consider a simple C program that we'll be using to test. Our goal will be to track what files it's opening using LD_PRELOAD.

#include <stdio.h>

int main(int argc, char **argv) {
  FILE *ptr = fopen("/etc/hosts", "r");
  fclose(ptr);
  return 0;
}

Nothing special going on here — we can save it to test.c and compile with:

$ gcc test.c -o test

(Incorrectly) using `LD_PRELOAD` to hook `fopen`

Strictly speaking, fopen is not the lowest level you can get for opening files. open(2) (and friends) is the syscall everything eventually trickles down to, but we can't intercept the syscall directly with an LD_PRELOAD hook; that's what ptrace(2) is for. At most, we could intercept its libc wrapper. Nonetheless, hooking fopen is enough for demonstration purposes.

#define _GNU_SOURCE

#include <stdio.h>
#include <dlfcn.h>

typedef FILE *(*fopen_t)(const char *pathname, const char *mode);
fopen_t real_fopen;

FILE *fopen(const char *pathname, const char *mode) {
  fprintf(stderr, "called fopen(%s, %s)\n", pathname, mode);
  return real_fopen(pathname, mode);
}

__attribute__((constructor)) static void setup(void) {
  real_fopen = dlsym(RTLD_NEXT, "fopen"); 
  fprintf(stderr, "called setup()\n");
}

We can compile this code as a position-independent shared library, linking libdl for dlsym. Then by passing the full path to it into the LD_PRELOAD environment variable, it gets loaded before libc so fopen gets resolved to our declaration.

$ gcc -shared -fPIC preload_test.c -o preload_test.so -ldl
$ LD_PRELOAD=$PWD/preload_test.so ./test
called setup()
called fopen(/etc/hosts, r)

Let's provide a bit more background on what's going on. __attribute__((constructor)) is a GCC extension (that's supported by Clang too) which places a pointer to setup in preload_test.so's .ctors section. The loader then knows to execute the function before anything else (in particular, before main is called). In our setup function, we ask libdl for the next (RTLD_NEXT) resolution of fopen (this should be libc's) and keep a pointer to it. When our test executable runs and opens /etc/hosts, our hooked fopen is called.

This is what a lot of articles online get wrong. Sure, it seems to work for our simple test, but let's try a "real" application, like ssh. If you're following along on your own, note that ssh may or may not exhibit this behaviour on your system, depending on how it was compiled and how your system is set up.

$ LD_PRELOAD=$PWD/preload_test.so ssh
called fopen(/proc/filesystems, r)
Segmentation fault

Oops!

So, what's going on? It's clear that our setup was never called, which means that when we try to invoke real_fopen, we're dealing with a null pointer. Basic stuff, but why? We can use valgrind to get a better idea of what's going on (some valgrind output omitted for brevity).

$ LD_PRELOAD=$PWD/preload_test.so valgrind --tool=memcheck ssh
called setup()
called setup()
called fopen(/proc/filesystems, r)
==2108== Jump to the invalid address stated on the next line
==2108==    at 0x0: ???
==2108==    by 0x5048B0D: selinuxfs_exists (in /lib/x86_64-linux-gnu/libselinux.so.1)
==2108==    by 0x5040D97: ??? (in /lib/x86_64-linux-gnu/libselinux.so.1)
==2108==    by 0x400F859: call_init.part.0 (dl-init.c:72)
==2108==    by 0x400F96A: call_init (dl-init.c:30)
==2108==    by 0x400F96A: _dl_init (dl-init.c:120)
==2108==    by 0x4000C59: ??? (in /lib/x86_64-linux-gnu/ld-2.24.so)
==2108==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2108== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

This paints a clear picture of what's happening. ssh depends on libselinux, which defines its own constructor that tries fopen-ing /proc/filesystems. At this point in time, our setup has not been called by the linker, but fopen has been resolved to ours. As a result, we end up invoking an uninitialized pointer and segfault.

(Correctly) using `LD_PRELOAD` to hook `fopen`

With our investigation over, the fix is very simple: don't depend on a constructor to resolve libc's fopen, and do it on demand when it's first needed.

#define _GNU_SOURCE

#include <stdio.h>
#include <dlfcn.h>

typedef FILE *(*fopen_t)(const char *pathname, const char *mode);
fopen_t real_fopen;

FILE *fopen(const char *pathname, const char *mode) {
  fprintf(stderr, "called fopen(%s, %s)\n", pathname, mode);
  if (!real_fopen) {
    real_fopen = dlsym(RTLD_NEXT, "fopen");
  }

  return real_fopen(pathname, mode);
}

__attribute__((constructor)) static void setup(void) {
  fprintf(stderr, "called setup()\n");
}

And now, after recompiling we can see that it works as expected:

$ gcc -shared -fPIC preload_test.c -o preload_test.so -ldl
$ LD_PRELOAD=$PWD/preload_test.so ssh
called fopen(/proc/filesystems, r)
called fopen(/proc/mounts, r)
called setup()
called fopen(/etc/passwd, rme)
usage: ssh [-1246AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-E log_file] [-e escape_char]
           [-F configfile] [-I pkcs11] [-i identity_file]
           [-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
           [-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
           [-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
           [user@]hostname [command]

libselinux's constructor opens /proc/filesystems and /proc/mounts before the linker passes control to our setup, and /etc/passwd is read as part of ssh's initialization procedures.
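The ordering problem and the fix can also be reproduced in a single file, without ssh or libselinux. What follows is only a stand-in sketch, not the hook itself: real_impl plays the role of libc's fopen, early_ctor plays the role of libselinux's constructor, and the constructor priorities 101 and 102 (a GCC/Clang extension) are used here purely to force early_ctor to run before setup. All of these names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*impl_t)(int);

/* Stand-in for the function we want to forward to (libc's fopen in the post). */
static int real_impl(int x) { return x + 1; }

static impl_t eager_ptr; /* assigned only in setup(), like real_fopen in the first version */
static impl_t lazy_ptr;  /* assigned on first use, like real_fopen in the fixed version */

/* What the early constructor observed; -1 means "never ran". */
static int eager_was_null = -1;
static int lazy_result = -1;

/* Fixed pattern: resolve the target the first time it is needed. */
static int lazy_call(int x) {
  if (!lazy_ptr) {
    lazy_ptr = real_impl; /* the real hook calls dlsym(RTLD_NEXT, "fopen") here */
  }
  assert(lazy_ptr != NULL);
  return lazy_ptr(x);
}

/* Plays the role of libselinux's constructor: runs before setup(). */
__attribute__((constructor(101))) static void early_ctor(void) {
  /* Calling through eager_ptr here would jump to address 0. */
  eager_was_null = (eager_ptr == NULL);
  lazy_result = lazy_call(41);
}

/* Plays the role of our own constructor: runs too late to help early_ctor(). */
__attribute__((constructor(102))) static void setup(void) {
  eager_ptr = real_impl;
}
```

By the time main starts, eager_was_null is 1 and lazy_result is 42: the pointer that depends on setup was still null when the earlier constructor ran, while the lazily resolved one worked regardless of ordering.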

Overall, this is a simple fix to a problem that might otherwise go undetected during testing, but I hope the analysis of what can go wrong when relying on constructors to execute in a particular order was entertaining to read.

Low-latency static sites with Scaleway and Cloudflare

2018-09-03T00:00:00+00:00

For a while now, I'd been searching for a cheap but reliable hosting solution for this website.

The option of hosting with GitHub Pages and similar services exists and has a minimal barrier to entry, but I like to be in control of my servers, so that I can occasionally use them for tasks other than hosting. For instance, the machine serving this page runs both a Tor relay and acts as a backup for my large but non-sensitive files.

Now, I think I've found a good solution: a €2.99/mo Scaleway plan coupled with Cloudflare for fast page load times worldwide.

Setting up the Server

Scaleway's lowest-tier C1 plan offers 4 baremetal ARMv7 cores, 2GB RAM, 50GB SSD and unmetered 200 Mbit/s bandwidth for €2.99/mo. (There are x86 plans too, but ARM is cool.) They also offer extra SSD storage priced at €1/50GB/mo. That's a pretty sweet deal, with the downside that their only datacenters are located in Paris and Amsterdam, at least an extra 100 ms away for users in North America compared to more traditional hosting options like New York or Montreal.

That's where Cloudflare comes in.

If you're hosting a site, chances are you're already using Cloudflare, or have at least heard of it. In short, it acts as a proxy in front of your site, so that requests to your domain are routed through Cloudflare before hitting your server. This allows Cloudflare to filter traffic and protect you against DDoS attacks, but at first glance it would seem that an extra proxy step would only increase latency to your content.

For dynamic websites, this may well be true. However, if you're running a mostly-static site, you can leverage Cloudflare's edge node caching to speed things up tremendously. By default, Cloudflare will cache typically-static content like images, CSS, JavaScript, etc. — that means that your server would only be hit for the HTML markup of your site, while static content would be served directly from Cloudflare's edge nodes around the world. It's also free.

One upside of running a static website is you can easily get Cloudflare to cache your HTML, too. For most requests, your users would experience only the latency to their local Cloudflare edge node (check here to see yours). In principle, a user from Australia should have the same fast loading time as a user from Canada, despite your server being a cheapo €2.99/mo box in Europe.

Configuring the Site

Configuration on Cloudflare's end is easy. Simply navigate to Page Rules and add a new rule targeting your desired pages. Specify Cache Level as Cache Everything to force HTML caching and Edge Cache TTL: a day (or something similarly long), and you're off!

That leaves making your site interact nicely with such aggressive caching. In particular, you probably don't want changes to your site to take an entire day to propagate to your users. This is easy to deal with by triggering Cloudflare's cache purge API whenever your site is rebuilt. You can obtain an API token from the bottom of your profile page, and your site's zone ID from its main overview page, after which purging Cloudflare's cache is just a simple cURL away:

curl -X POST "https://api.cloudflare.com/client/v4/zones/${zone_id}/purge_cache" \
     -H "X-Auth-Email: ${auth_email}" \
     -H "X-Auth-Key: ${auth_key}" \
     -H "Content-Type: application/json" \
     --data '{"purge_everything":true}'

This should play well with most static site generators.

You could build upon this to only purge pages that were changed via a filesystem watching process, and so on — for my purposes, purging everything was acceptable.
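As a rough sketch of the selective variant: Cloudflare's purge endpoint also accepts a "files" array of URLs in place of purge_everything, so the only thing that changes is the request body. The helper below builds that body for a list of changed pages. The function name and the example.com URLs are made up for illustration, and it assumes the URLs contain no quotes or backslashes, since it does no JSON escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Writes {"files":["url1","url2",...]} into buf.
 * Returns the payload length, or -1 if buf is too small.
 * Assumption: URLs contain no '"' or '\' characters (no JSON escaping is done).
 */
static int build_purge_payload(const char *const *urls, size_t n_urls,
                               char *buf, size_t buf_size) {
  assert(buf != NULL && buf_size > 0);

  size_t used = 0;
  int written = snprintf(buf, buf_size, "{\"files\":[");
  if (written < 0 || (size_t)written >= buf_size) return -1;
  used = (size_t)written;

  for (size_t i = 0; i < n_urls; i++) {
    /* Separate entries with a comma, and wrap each URL in quotes. */
    written = snprintf(buf + used, buf_size - used, "%s\"%s\"",
                       i == 0 ? "" : ",", urls[i]);
    if (written < 0 || (size_t)written >= buf_size - used) return -1;
    used += (size_t)written;
  }

  written = snprintf(buf + used, buf_size - used, "]}");
  if (written < 0 || (size_t)written >= buf_size - used) return -1;
  used += (size_t)written;

  return (int)used;
}
```

The resulting string would be passed to --data in the cURL call above instead of '{"purge_everything":true}'.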

Wrapping Up

That's all I have to say on this subject. Hopefully you found it interesting :)

It's not revolutionary by any means, but I know that some of my colleagues and I were surprised at just how effective this approach was at lowering page load times while saving on hosting bills.

Till next time!

Mining for Tor v3 onions in the cloud

2018-03-22T00:00:00+00:00

Tor supports a new hidden service protocol as of v0.3.2.1-alpha, released back in October 2017, and the protocol is now in stable branches. Dubbed the "v3" onion service protocol, among other changes, it replaces SHA1/DH/RSA1024 with SHA3/ed25519/curve25519 for much improved cryptographic security.

I already had a v2 onion site up at tbrindus6tjv6wpi.onion, so I thought it would be an interesting exercise to mine a v3 vanity domain prefixed with tbrindus. For this, I set up 15 servers to mine for a matching prefix — more on this below!

It took well over a week of mining, but as of today, this site can also be accessed through the v3 hidden service tbrindusxnnqwmzov5qof56hyion6usmciqwykffxqsawswhk73aq5yd.onion!

A bit of background

Tor hidden service domain "names" aren't really domain names as most are used to. You can enter them in your (Tor) browser, but you can't buy a particular domain you want — a hidden service hostname is a prefix of the base32-encoded public key of the service.

If you want a particular onion, you must randomly generate billions of keys until one happens to hash into a string starting with the prefix you're looking for. In the case of tbrindus, an 8-letter prefix, there are $32^8 = 1\,099\,511\,627\,776$ possible combinations. Every additional letter increases the space (and hence expected computation time) by a factor of 32.
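To make the growth concrete, here is a small sketch of the search-space arithmetic. The function name is made up; it relies only on the fact that each base32 character carries 5 bits, so $32^n = 2^{5n}$.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct base32 prefixes of the given length: 32^len = 2^(5*len). */
static uint64_t prefix_search_space(unsigned len) {
  assert(len <= 12); /* 5 * 12 = 60 bits still fits in a uint64_t */
  return UINT64_C(1) << (5 * len);
}
```

prefix_search_space(8) gives 1,099,511,627,776, and going to 9 characters multiplies that by 32.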

V2 onions have been around for a long time, so there exist GPU-based miners like Scallion which can hash at frightening (several gigahashes a second) rates. In fact, Scallion was used to brute force 32-bit GPG key ids to demonstrate that 32-bit ids are insecure (see evil32.com for more on that).

Tor's switch to ed25519 means that existing tools for generating vanity names like Scallion can't be used — at the time of writing, the best bet for v3 vanity names is mkp224o, a CPU-based miner.

I expected mkp224o to be orders of magnitude slower than GPU-based mining, so I spun up 15 servers across several providers (I'm looking for a new host, and thought this would be a good opportunity to test some new ones out).

Setting up the servers

Getting mkp224o set up and running is fairly simple. On most development machines you'd probably have everything required preinstalled, with perhaps the exception of libsodium-dev.

On a typical Debian-based distro, you can get everything you need to get running with:

$ apt install autoconf build-essential git libsodium-dev
$ git clone https://github.com/cathugger/mkp224o.git
$ cd mkp224o
$ ./autogen.sh
$ ./configure # see below
$ make

For ARM servers, I passed --enable-donna to configure, while for x86_64 boxes I used either --enable-amd64-51-30k or --enable-amd64-64-24k, whichever provided the greatest hashrate.

For mining, I specified a filter for tbrindus:

$ ./mkp224o -s -T tbrindus

…and waited. I waited a long time.

Mining results

V2 onions can be hashed incredibly fast on common GPUs with Scallion, with many cards capable of several gigahashes per second. On my laptop's GTX 960M, Scallion pulled in 1 GH/s, and mined tbrindus6tjv6wpi.onion in under 10 minutes.

For comparison, the 15 servers I ran mkp224o on for 6 days pulled in an aggregate 5 MH/s, or 0.5% of what my fairly standard laptop graphics card can compute.

Below, I've put together a table of the setups I ran to compute tbrindusxnnqwmzov5qof56hyion6usmciqwykffxqsawswhk73aq5yd.onion.

Host	Plan	OS	CPU	RAM	Hashes/s	Contrib.
Scaleway¹	C2S	Debian 9.0	4x Intel Atom C2550 @ 2.3GHz	8GB	229,400	4.76%
Scaleway	ARM64-16GB	Debian 9.0	16x ARMv8 Cavium ThunderX	16GB	1,300,000	26.97%
Scaleway	ARM64-8GB	Ubuntu 16.04	8x ARMv8 Cavium ThunderX	8GB	626,000	12.99%
Scaleway	ARM64-2GB	Ubuntu 16.04	4x ARMv8 Cavium ThunderX	2GB	314,000	6.51%
Scaleway²	ARM64-2GB	Debian 9.3	4x ARMv8 Cavium ThunderX	2GB	218,000	4.52%
Scaleway	C1	Debian 9.0	2x Intel Atom C2750 @ 2.3GHz	2GB	113,500	2.35%
DigitalOcean	Compute 4GB	Debian 9.4	2x Intel Xeon E5-2697A v4 @ 2.5GHz	4GB	470,000	9.75%
Azure	Standard B2s	Ubuntu 16.04	2x Intel Xeon E5-2673 v4 @ 2.294GHz	4GB	68,000	1.41%
Azure	Standard B2s	Debian 9.3	2x Intel Xeon E5-2673 v4 @ 2.294GHz	4GB	80,000	1.66%
Azure	Standard B2s	FreeBSD 11.1	2x Intel Xeon E5-2673 v4 @ 2.294GHz	4GB	69,000	1.43%
SSDNodes³	8GB KVM	Debian 9.3	2x Intel (Skylake, IBRS) @ 2.299GHz	8GB	274,500	5.69%
SSDNodes³	16GB KVM	Debian 9.3	4x Intel (Skylake, IBRS) @ 2.299GHz	16GB	540,000	11.20%
SSDNodes	8GB Container	Debian 9.4	4x Intel Xeon E5-2697 v3 @ 766MHz	8GB	78,000	1.62%
—⁴	Raspberry Pi 3	Raspbian 9.1	4x ARM Cortex-A53 @ 1.2GHz	1GB	70,000	1.45%
—⁴	Optiplex 960	Ubuntu 16.04	4x Intel Core 2 Quad Q9400 @ 2.659GHz	4GB	370,000	7.68%
—	—	—	—	—	4,820,400	100.00%

1. This was a dedicated machine.

2. This machine was provisioned with the same specs as the other ARM64-2GB instance, but was also running a Tor relay, which explains the difference in hashrate.

3. CPU steal time on these machines was constantly at 20% or higher.

4. I ran these machines uninterrupted at home.

A quick statistical analysis

OK, so it took a long time. I accumulated far more in server expenses than I had originally planned on, but at least I got a sense of pride and accomplishment from it.

The search for a hash prefix of tbrindus is probabilistic and memoryless: you never get "closer" to mining a hash; every hash has an equal probability $\frac 1 {32^{\text{length(prefix)}}} = \frac 1 {32^8}$ of matching. It's essentially a Poisson process, so we can use an exponential distribution to estimate how long it takes, on average, for a match to be found.

The CDF of an exponential distribution has the form $1 - e^{-\lambda x}$.

We can perform 4,820,400 hashes per second (86,400 seconds in a day) with each hash having a probability of $\frac 1 {32^8}$, so we can determine the probability that we'll find a match in $x$ days (let's call it $f(x)$ for simplicity) by taking $\lambda = \frac{86\,400 \times 4\,820\,400}{32^8}$.

$f(x) = 1 - e^{-\lambda x} = 1 - e^{-\frac{86\,400 \times 4\,820\,400}{32^8} x}$

Since I like graphs, let's graph this function.
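For a rough feel of that curve, here are a few points on it, using $\lambda \approx 0.379$ per day (computed from the hashrate and search space above; all values are rounded):

$f(1) = 1 - e^{-0.379} \approx 0.32$

$f(3) = 1 - e^{-1.136} \approx 0.68$

$f(6) = 1 - e^{-2.273} \approx 0.90$

Put differently, under this model there's roughly a one in ten chance that a run takes longer than 6 days.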

The expected value of an exponential distribution is given by $\frac 1 \lambda$, so we can take this and plug in our $\lambda$ to find out the expected number of days for generating a prefix of 8 characters:

$\frac 1 \lambda = \frac 1 {\frac{86\,400 \times 4\,820\,400}{32^8}} \approx 2.64\text{ days}$

Alright, so I definitely overshot that.
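The same estimate is easy to redo for other hashrates or prefix lengths. Here's a minimal sketch of the $\frac 1 \lambda$ calculation; the function name and its parameters are made up for illustration, and it assumes a constant aggregate hashrate for the whole run.

```c
#include <assert.h>

#define SECONDS_PER_DAY 86400.0

/* Expected days until the first match: 1/lambda = 32^len / (86400 * hashes_per_sec). */
static double expected_days(double hashes_per_sec, unsigned prefix_len) {
  assert(hashes_per_sec > 0.0);

  double search_space = 1.0;
  for (unsigned i = 0; i < prefix_len; i++) {
    search_space *= 32.0; /* one more base32 character */
  }

  return search_space / (SECONDS_PER_DAY * hashes_per_sec);
}
```

expected_days(4820400.0, 8) comes out to about 2.64, matching the figure above, while a 9-character prefix at the same hashrate would be expected to take about 84.5 days.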

Bonus: UnixBench of the servers

Since I had all these servers up and running already, I figured it'd be interesting to compare UnixBench scores to see how they correlated to hashrate. In the table below, I've included the hashrate of several servers I was particularly interested in, as well as their single core and multi-core performance determined by running UnixBench on an unloaded system.

Host	Plan	OS	Hashes/s	Num. Cores	Single core perf.	Multi-core perf.
Scaleway	ARM64-16GB	Debian 9.0	1,300,000	16	401.2	1641.6
Scaleway	ARM64-8GB	Ubuntu 16.04 LTS	626,000	8	380.5	1514.1
Scaleway	ARM64-2GB	Ubuntu 16.04 LTS	314,000	4	400.9	1020.3
Scaleway	C1	Debian 9.0	113,500	2	621.0	1047.7
Azure	Standard B2s	Ubuntu 16.04	68,000	2	472.2	340.0
SSDNodes	16GB KVM	Debian 9.3	540,000	4	472.3	1363.2
SSDNodes	8GB KVM	Debian 9.3	274,500	4	616.8	1382.8

I've also attached the raw UnixBench logs below, for convenience.

Scaleway — ARM64-16GB

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.9.23-std-1 -- #1 SMP Mon Apr 24 13:18:14 UTC 2017
   Machine: aarch64 (unknown)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   05:13:38 up 3 days,  1:08,  1 user,  load average: 11.74, 15.14, 15.76; runlevel 2018-03-15

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:13:38 - 05:41:33
16 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables        8372406.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     1825.0 MWIPS (9.9 s, 7 samples)
Execl Throughput                               1014.4 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        181638.7 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           51750.8 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        422317.9 KBps  (30.0 s, 2 samples)
Pipe Throughput                              476739.6 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  29308.4 lps   (10.0 s, 7 samples)
Process Creation                               2046.2 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   2597.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1107.5 lpm   (60.0 s, 2 samples)
System Call Overhead                         863802.9 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0    8372406.5    717.4
Double-Precision Whetstone                       55.0       1825.0    331.8
Execl Throughput                                 43.0       1014.4    235.9
File Copy 1024 bufsize 2000 maxblocks          3960.0     181638.7    458.7
File Copy 256 bufsize 500 maxblocks            1655.0      51750.8    312.7
File Copy 4096 bufsize 8000 maxblocks          5800.0     422317.9    728.1
Pipe Throughput                               12440.0     476739.6    383.2
Pipe-based Context Switching                   4000.0      29308.4     73.3
Process Creation                                126.0       2046.2    162.4
Shell Scripts (1 concurrent)                     42.4       2597.0    612.5
Shell Scripts (8 concurrent)                      6.0       1107.5   1845.8
System Call Overhead                          15000.0     863802.9    575.9
                                                                   ========
System Benchmarks Index Score                                         401.2

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:41:33 - 06:09:37
16 CPUs in system; running 16 parallel copies of tests

Dhrystone 2 using register variables      132993486.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    29057.8 MWIPS (10.0 s, 7 samples)
Execl Throughput                               7995.3 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        137360.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           29373.3 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        630759.5 KBps  (30.0 s, 2 samples)
Pipe Throughput                             7424668.0 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 401144.7 lps   (10.0 s, 7 samples)
Process Creation                              10546.1 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  15213.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   2003.4 lpm   (60.2 s, 2 samples)
System Call Overhead                        1277419.8 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  132993486.5  11396.2
Double-Precision Whetstone                       55.0      29057.8   5283.2
Execl Throughput                                 43.0       7995.3   1859.4
File Copy 1024 bufsize 2000 maxblocks          3960.0     137360.5    346.9
File Copy 256 bufsize 500 maxblocks            1655.0      29373.3    177.5
File Copy 4096 bufsize 8000 maxblocks          5800.0     630759.5   1087.5
Pipe Throughput                               12440.0    7424668.0   5968.4
Pipe-based Context Switching                   4000.0     401144.7   1002.9
Process Creation                                126.0      10546.1    837.0
Shell Scripts (1 concurrent)                     42.4      15213.0   3588.0
Shell Scripts (8 concurrent)                      6.0       2003.4   3339.0
System Call Overhead                          15000.0    1277419.8    851.6
                                                                   ========
System Benchmarks Index Score                                        1641.6

Scaleway — ARM64-8GB

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.4.121-mainline-rev1 -- #1 SMP Sun Mar 11 16:44:34 UTC 2018
   Machine: aarch64 (aarch64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   05:13:17 up 2 days, 53 min,  1 user,  load average: 5.56, 7.47, 7.82; runlevel 2018-03-16

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:13:17 - 05:41:24
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables        8502417.0 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     1741.0 MWIPS (10.1 s, 7 samples)
Execl Throughput                               1112.8 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        165427.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           54377.8 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        343939.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                              462211.7 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  14746.0 lps   (10.0 s, 7 samples)
Process Creation                               2370.8 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   2677.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1050.2 lpm   (60.0 s, 2 samples)
System Call Overhead                         998124.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0    8502417.0    728.6
Double-Precision Whetstone                       55.0       1741.0    316.6
Execl Throughput                                 43.0       1112.8    258.8
File Copy 1024 bufsize 2000 maxblocks          3960.0     165427.5    417.7
File Copy 256 bufsize 500 maxblocks            1655.0      54377.8    328.6
File Copy 4096 bufsize 8000 maxblocks          5800.0     343939.2    593.0
Pipe Throughput                               12440.0     462211.7    371.6
Pipe-based Context Switching                   4000.0      14746.0     36.9
Process Creation                                126.0       2370.8    188.2
Shell Scripts (1 concurrent)                     42.4       2677.5    631.5
Shell Scripts (8 concurrent)                      6.0       1050.2   1750.4
System Call Overhead                          15000.0     998124.5    665.4
                                                                   ========
System Benchmarks Index Score                                         380.5

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:41:24 - 06:09:38
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables       67785992.9 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    13990.1 MWIPS (10.1 s, 7 samples)
Execl Throughput                               5098.5 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        285233.4 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           73046.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1005166.1 KBps  (30.0 s, 2 samples)
Pipe Throughput                             3663311.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 222918.1 lps   (10.0 s, 7 samples)
Process Creation                               8125.0 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  10717.2 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1391.7 lpm   (60.2 s, 2 samples)
System Call Overhead                        3636949.3 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   67785992.9   5808.6
Double-Precision Whetstone                       55.0      13990.1   2543.7
Execl Throughput                                 43.0       5098.5   1185.7
File Copy 1024 bufsize 2000 maxblocks          3960.0     285233.4    720.3
File Copy 256 bufsize 500 maxblocks            1655.0      73046.0    441.4
File Copy 4096 bufsize 8000 maxblocks          5800.0    1005166.1   1733.0
Pipe Throughput                               12440.0    3663311.5   2944.8
Pipe-based Context Switching                   4000.0     222918.1    557.3
Process Creation                                126.0       8125.0    644.8
Shell Scripts (1 concurrent)                     42.4      10717.2   2527.6
Shell Scripts (8 concurrent)                      6.0       1391.7   2319.6
System Call Overhead                          15000.0    3636949.3   2424.6
                                                                   ========
System Benchmarks Index Score                                        1514.1

Scaleway — ARM64-2GB

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.4.121-mainline-rev1 -- #1 SMP Sun Mar 11 16:44:34 UTC 2018
   Machine: aarch64 (aarch64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   05:14:10 up 3 days,  7:45,  1 user,  load average: 2.75, 3.74, 3.91; runlevel 2018-03-14

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:14:10 - 05:42:12
4 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables        8555429.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     1747.9 MWIPS (10.1 s, 7 samples)
Execl Throughput                               1224.4 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        184524.9 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           58246.7 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        438788.5 KBps  (30.0 s, 2 samples)
Pipe Throughput                              465226.2 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  14792.3 lps   (10.0 s, 7 samples)
Process Creation                               2629.9 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3095.2 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    884.2 lpm   (60.0 s, 2 samples)
System Call Overhead                        1011139.0 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0    8555429.5    733.1
Double-Precision Whetstone                       55.0       1747.9    317.8
Execl Throughput                                 43.0       1224.4    284.7
File Copy 1024 bufsize 2000 maxblocks          3960.0     184524.9    466.0
File Copy 256 bufsize 500 maxblocks            1655.0      58246.7    351.9
File Copy 4096 bufsize 8000 maxblocks          5800.0     438788.5    756.5
Pipe Throughput                               12440.0     465226.2    374.0
Pipe-based Context Switching                   4000.0      14792.3     37.0
Process Creation                                126.0       2629.9    208.7
Shell Scripts (1 concurrent)                     42.4       3095.2    730.0
Shell Scripts (8 concurrent)                      6.0        884.2   1473.6
System Call Overhead                          15000.0    1011139.0    674.1
                                                                   ========
System Benchmarks Index Score                                         400.9

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:42:12 - 06:10:18
4 CPUs in system; running 4 parallel copies of tests

Dhrystone 2 using register variables       34136207.1 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     6989.1 MWIPS (10.2 s, 7 samples)
Execl Throughput                               3526.3 lps   (29.6 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        218968.8 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           61412.5 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        830973.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1848545.0 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 121851.3 lps   (10.0 s, 7 samples)
Process Creation                               6271.4 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   7046.2 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    955.8 lpm   (60.1 s, 2 samples)
System Call Overhead                        3570647.2 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   34136207.1   2925.1
Double-Precision Whetstone                       55.0       6989.1   1270.7
Execl Throughput                                 43.0       3526.3    820.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     218968.8    553.0
File Copy 256 bufsize 500 maxblocks            1655.0      61412.5    371.1
File Copy 4096 bufsize 8000 maxblocks          5800.0     830973.8   1432.7
Pipe Throughput                               12440.0    1848545.0   1486.0
Pipe-based Context Switching                   4000.0     121851.3    304.6
Process Creation                                126.0       6271.4    497.7
Shell Scripts (1 concurrent)                     42.4       7046.2   1661.8
Shell Scripts (8 concurrent)                      6.0        955.8   1593.0
System Call Overhead                          15000.0    3570647.2   2380.4
                                                                   ========
System Benchmarks Index Score                                        1020.3

Scaleway — C1

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.9.20-std-1 -- #1 SMP Tue Apr 4 12:56:17 UTC 2017
   Machine: x86_64 (unknown)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz (4787.8 bogomips)
          x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   CPU 1: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz (4787.8 bogomips)
          x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   05:14:11 up 3 days,  1:28,  1 user,  load average: 2.01, 2.14, 2.06; runlevel 2018-03-15

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:14:12 - 05:42:08
2 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       12323865.3 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     2014.1 MWIPS (9.9 s, 7 samples)
Execl Throughput                               1223.1 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        415672.5 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          120361.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        985611.5 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1170708.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  46541.0 lps   (10.0 s, 7 samples)
Process Creation                               3049.4 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3348.8 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    685.8 lpm   (60.1 s, 2 samples)
System Call Overhead                        1446516.0 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   12323865.3   1056.0
Double-Precision Whetstone                       55.0       2014.1    366.2
Execl Throughput                                 43.0       1223.1    284.4
File Copy 1024 bufsize 2000 maxblocks          3960.0     415672.5   1049.7
File Copy 256 bufsize 500 maxblocks            1655.0     120361.9    727.3
File Copy 4096 bufsize 8000 maxblocks          5800.0     985611.5   1699.3
Pipe Throughput                               12440.0    1170708.3    941.1
Pipe-based Context Switching                   4000.0      46541.0    116.4
Process Creation                                126.0       3049.4    242.0
Shell Scripts (1 concurrent)                     42.4       3348.8    789.8
Shell Scripts (8 concurrent)                      6.0        685.8   1142.9
System Call Overhead                          15000.0    1446516.0    964.3
                                                                   ========
System Benchmarks Index Score                                         621.0

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:42:08 - 06:10:06
2 CPUs in system; running 2 parallel copies of tests

Dhrystone 2 using register variables       24552470.3 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     4016.2 MWIPS (10.0 s, 7 samples)
Execl Throughput                               2918.3 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        485532.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          131304.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1365028.4 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2329059.7 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 116038.9 lps   (10.0 s, 7 samples)
Process Creation                               7104.8 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   5589.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    722.6 lpm   (60.1 s, 2 samples)
System Call Overhead                        2260798.6 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   24552470.3   2103.9
Double-Precision Whetstone                       55.0       4016.2    730.2
Execl Throughput                                 43.0       2918.3    678.7
File Copy 1024 bufsize 2000 maxblocks          3960.0     485532.6   1226.1
File Copy 256 bufsize 500 maxblocks            1655.0     131304.9    793.4
File Copy 4096 bufsize 8000 maxblocks          5800.0    1365028.4   2353.5
Pipe Throughput                               12440.0    2329059.7   1872.2
Pipe-based Context Switching                   4000.0     116038.9    290.1
Process Creation                                126.0       7104.8    563.9
Shell Scripts (1 concurrent)                     42.4       5589.5   1318.3
Shell Scripts (8 concurrent)                      6.0        722.6   1204.3
System Call Overhead                          15000.0    2260798.6   1507.2
                                                                   ========
System Benchmarks Index Score                                        1047.7

Azure — Standard B2S

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.13.0-1011-azure -- #14-Ubuntu SMP Thu Feb 15 16:15:39 UTC 2018
   Machine: x86_64 (x86_64)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (4589.4 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   CPU 1: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (4589.4 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   05:22:33 up 6 days,  8:37,  1 user,  load average: 0.08, 0.62, 1.38; runlevel 2018-03-11

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:22:33 - 05:50:38
2 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       28065805.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3310.3 MWIPS (8.7 s, 7 samples)
Execl Throughput                               2546.1 lps   (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        257690.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           55889.7 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        535177.7 KBps  (30.0 s, 2 samples)
Pipe Throughput                              315663.7 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  25281.3 lps   (10.0 s, 7 samples)
Process Creation                               3911.9 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   2343.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    862.3 lpm   (60.0 s, 2 samples)
System Call Overhead                         268361.9 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   28065805.5   2405.0
Double-Precision Whetstone                       55.0       3310.3    601.9
Execl Throughput                                 43.0       2546.1    592.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     257690.1    650.7
File Copy 256 bufsize 500 maxblocks            1655.0      55889.7    337.7
File Copy 4096 bufsize 8000 maxblocks          5800.0     535177.7    922.7
Pipe Throughput                               12440.0     315663.7    253.7
Pipe-based Context Switching                   4000.0      25281.3     63.2
Process Creation                                126.0       3911.9    310.5
Shell Scripts (1 concurrent)                     42.4       2343.0    552.6
Shell Scripts (8 concurrent)                      6.0        862.3   1437.2
System Call Overhead                          15000.0     268361.9    178.9
                                                                   ========
System Benchmarks Index Score                                         472.2

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:50:38 - 06:18:55
2 CPUs in system; running 2 parallel copies of tests

Dhrystone 2 using register variables       12561408.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     1364.4 MWIPS (10.5 s, 7 samples)
Execl Throughput                               1285.0 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        108284.8 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           29067.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        813617.5 KBps  (30.0 s, 2 samples)
Pipe Throughput                              195193.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  59307.3 lps   (10.0 s, 7 samples)
Process Creation                               2751.5 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3681.4 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    322.3 lpm   (60.1 s, 2 samples)
System Call Overhead                         280762.9 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   12561408.5   1076.4
Double-Precision Whetstone                       55.0       1364.4    248.1
Execl Throughput                                 43.0       1285.0    298.8
File Copy 1024 bufsize 2000 maxblocks          3960.0     108284.8    273.4
File Copy 256 bufsize 500 maxblocks            1655.0      29067.9    175.6
File Copy 4096 bufsize 8000 maxblocks          5800.0     813617.5   1402.8
Pipe Throughput                               12440.0     195193.3    156.9
Pipe-based Context Switching                   4000.0      59307.3    148.3
Process Creation                                126.0       2751.5    218.4
Shell Scripts (1 concurrent)                     42.4       3681.4    868.3
Shell Scripts (8 concurrent)                      6.0        322.3    537.1
System Call Overhead                          15000.0     280762.9    187.2
                                                                   ========
System Benchmarks Index Score                                         340.0

SSDNodes — KVM 16GB

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.9.0-5-amd64 -- #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
   Machine: x86_64 (unknown)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel Core Processor (Skylake, IBRS) (4600.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   CPU 1: Intel Core Processor (Skylake, IBRS) (4600.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   CPU 2: Intel Core Processor (Skylake, IBRS) (4600.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   CPU 3: Intel Core Processor (Skylake, IBRS) (4600.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   05:32:38 up 24 days,  9:44,  2 users,  load average: 0.86, 0.95, 2.01; runlevel 2018-02-21

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:32:39 - 06:00:50
4 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       18638854.5 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3603.5 MWIPS (9.3 s, 7 samples)
Execl Throughput                                543.0 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        326203.0 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          107831.8 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        782124.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                              772372.4 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  27040.7 lps   (10.0 s, 7 samples)
Process Creation                               1912.9 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   1867.0 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    685.7 lpm   (60.1 s, 2 samples)
System Call Overhead                         603214.3 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   18638854.5   1597.2
Double-Precision Whetstone                       55.0       3603.5    655.2
Execl Throughput                                 43.0        543.0    126.3
File Copy 1024 bufsize 2000 maxblocks          3960.0     326203.0    823.7
File Copy 256 bufsize 500 maxblocks            1655.0     107831.8    651.6
File Copy 4096 bufsize 8000 maxblocks          5800.0     782124.6   1348.5
Pipe Throughput                               12440.0     772372.4    620.9
Pipe-based Context Switching                   4000.0      27040.7     67.6
Process Creation                                126.0       1912.9    151.8
Shell Scripts (1 concurrent)                     42.4       1867.0    440.3
Shell Scripts (8 concurrent)                      6.0        685.7   1142.9
System Call Overhead                          15000.0     603214.3    402.1
                                                                   ========
System Benchmarks Index Score                                         472.3

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 06:00:50 - 06:29:14
4 CPUs in system; running 4 parallel copies of tests

Dhrystone 2 using register variables       63227839.0 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    14671.5 MWIPS (9.4 s, 7 samples)
Execl Throughput                               4394.5 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        347374.8 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          109273.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        830966.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2702931.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 307088.8 lps   (10.0 s, 7 samples)
Process Creation                               4009.3 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   6331.9 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1825.1 lpm   (60.1 s, 2 samples)
System Call Overhead                        2090415.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   63227839.0   5418.0
Double-Precision Whetstone                       55.0      14671.5   2667.5
Execl Throughput                                 43.0       4394.5   1022.0
File Copy 1024 bufsize 2000 maxblocks          3960.0     347374.8    877.2
File Copy 256 bufsize 500 maxblocks            1655.0     109273.0    660.3
File Copy 4096 bufsize 8000 maxblocks          5800.0     830966.2   1432.7
Pipe Throughput                               12440.0    2702931.3   2172.8
Pipe-based Context Switching                   4000.0     307088.8    767.7
Process Creation                                126.0       4009.3    318.2
Shell Scripts (1 concurrent)                     42.4       6331.9   1493.4
Shell Scripts (8 concurrent)                      6.0       1825.1   3041.8
System Call Overhead                          15000.0    2090415.5   1393.6
                                                                   ========
System Benchmarks Index Score                                        1363.2

SSDNodes — KVM 8GB

========================================================================
   BYTE UNIX Benchmarks (Version 5.1.3)

   System: redacted: GNU/Linux
   OS: GNU/Linux -- 4.9.0-5-amd64 -- #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
   Machine: x86_64 (unknown)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel Core Processor (Skylake, IBRS) (4600.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   CPU 1: Intel Core Processor (Skylake, IBRS) (4600.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET
   05:27:18 up 24 days,  9:39,  2 users,  load average: 1.83, 2.76, 2.63; runlevel 2018-02-21

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:27:18 - 05:55:29
2 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       20712375.2 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     4089.5 MWIPS (10.0 s, 7 samples)
Execl Throughput                                869.8 lps   (29.6 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        414717.4 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          118528.4 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1037781.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                              839599.1 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  39673.2 lps   (10.0 s, 7 samples)
Process Creation                               2367.3 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3917.3 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1015.1 lpm   (60.0 s, 2 samples)
System Call Overhead                         646058.8 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   20712375.2   1774.8
Double-Precision Whetstone                       55.0       4089.5    743.5
Execl Throughput                                 43.0        869.8    202.3
File Copy 1024 bufsize 2000 maxblocks          3960.0     414717.4   1047.3
File Copy 256 bufsize 500 maxblocks            1655.0     118528.4    716.2
File Copy 4096 bufsize 8000 maxblocks          5800.0    1037781.8   1789.3
Pipe Throughput                               12440.0     839599.1    674.9
Pipe-based Context Switching                   4000.0      39673.2     99.2
Process Creation                                126.0       2367.3    187.9
Shell Scripts (1 concurrent)                     42.4       3917.3    923.9
Shell Scripts (8 concurrent)                      6.0       1015.1   1691.8
System Call Overhead                          15000.0     646058.8    430.7
                                                                   ========
System Benchmarks Index Score                                         616.8

------------------------------------------------------------------------
Benchmark Run: Sun Mar 18 2018 05:55:29 - 06:23:42
2 CPUs in system; running 2 parallel copies of tests

Dhrystone 2 using register variables       38935462.3 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     8156.1 MWIPS (10.0 s, 7 samples)
Execl Throughput                               4726.3 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks        692577.9 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          203840.1 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       1799195.8 KBps  (30.0 s, 2 samples)
Pipe Throughput                             1621602.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 211656.3 lps   (10.0 s, 7 samples)
Process Creation                               9135.5 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   7138.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   1195.9 lpm   (60.1 s, 2 samples)
System Call Overhead                        1202392.1 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   38935462.3   3336.4
Double-Precision Whetstone                       55.0       8156.1   1482.9
Execl Throughput                                 43.0       4726.3   1099.1
File Copy 1024 bufsize 2000 maxblocks          3960.0     692577.9   1748.9
File Copy 256 bufsize 500 maxblocks            1655.0     203840.1   1231.7
File Copy 4096 bufsize 8000 maxblocks          5800.0    1799195.8   3102.1
Pipe Throughput                               12440.0    1621602.5   1303.5
Pipe-based Context Switching                   4000.0     211656.3    529.1
Process Creation                                126.0       9135.5    725.0
Shell Scripts (1 concurrent)                     42.4       7138.5   1683.6
Shell Scripts (8 concurrent)                      6.0       1195.9   1993.2
System Call Overhead                          15000.0    1202392.1    801.6
                                                                   ========
System Benchmarks Index Score                                        1382.8

These benchmarks should be taken with a grain of salt, since UnixBench tests a fair bit more than just CPU throughput. However, what appears to be fairly clear is that though the ARMv8 cores are 20-30% slower than the mixture of competing x86_64 cores in a contest of single core performance, they win out in multi-core hashrate simply due to their number.

I suppose this isn't really a thrilling discovery — it makes immediate sense — but I found it fairly interesting that it's cheaper to scale out in number of cores rather than up in per-core performance… at least when it comes to mining vanity Tor domains.
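For anyone wanting to sanity-check the tables above: each INDEX value appears to be RESULT divided by BASELINE, times 10, and the final "System Benchmarks Index Score" appears to be the geometric mean of those indices. The sketch below recomputes the score for the single-copy Atom C2750 run from its BASELINE and RESULT columns. The class and method names are made up for illustration; they are not part of UnixBench.

```java
class UnixBenchIndex {
    // BASELINE and RESULT columns copied from the single-copy Atom C2750 run above,
    // in the same row order as the table.
    static final double[] BASELINE = {
        116700.0, 55.0, 43.0, 3960.0, 1655.0, 5800.0,
        12440.0, 4000.0, 126.0, 42.4, 6.0, 15000.0
    };
    static final double[] RESULT = {
        12323865.3, 2014.1, 1223.1, 415672.5, 120361.9, 985611.5,
        1170708.3, 46541.0, 3049.4, 3348.8, 685.8, 1446516.0
    };

    // Per-test index: how many times faster than the baseline machine, scaled by 10.
    static double index(double result, double baseline) {
        return result / baseline * 10.0;
    }

    // Overall score: geometric mean of the per-test indices, computed via logarithms.
    static double score(double[] results, double[] baselines) {
        double logSum = 0.0;
        for (int i = 0; i < results.length; i++) {
            logSum += Math.log(index(results[i], baselines[i]));
        }
        return Math.exp(logSum / results.length);
    }

    public static void main(String[] args) {
        System.out.printf("Dhrystone index: %.1f%n", index(RESULT[0], BASELINE[0]));
        System.out.printf("Overall score:   %.1f%n", score(RESULT, BASELINE));
    }
}
```

Because the score is a geometric mean, one very low index (such as context switching here) pulls the total down more than it would in an arithmetic average, which is one more reason to read the headline number with care.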

Conclusion

Overall, this was a larger undertaking than I would have assumed at first, and I spent a long time monitoring (nonexistent) progress. In the end, it was fun to do, so hopefully it was fun to read about too!

Setting up an SSTP VPN on Windows Server with LetsEncrypt

2018-03-08T00:00:00+00:00

Setting up a VPN on Windows Server for remote access to company resources comes up often enough, and a great deal has been written on the subject online.

However, back when I first went through the whole process, I found it time-consuming to sift through all the outdated information floating around, so I created this document for personal reference. I've had the opportunity to test these instructions on a number of fresh installs, and worked out a bunch of kinks that way.

These instructions assume a brand new install of Windows Server 2016, but they should be easily adaptable to other scenarios.

Installing the Necessary Software

IIS and RRAS

In Server Manager, Manage → Add Roles and Features, check Remote Access and Web Server (IIS).

Under the Features pane, select Remote Server Administration Tools and all submodules, and under Remote Access Role Services, select DirectAccess and VPN and Routing.

Install.

win-acme

Grab a copy of win-acme from GitHub; we'll be using it to streamline the requesting of SSL certificates from LetsEncrypt.

Setting up the Routing and Remote Access Service

First, we must get RRAS set up.

Run rrasmgmt.msc.
Right click server → Configure → Custom Configuration → VPN Access & Demand-dial connections
Start the service
Right click the server → Properties
IPv4 tab, select static address pool and choose an appropriate IP range for VPN clients (e.g. 192.168.26.0 — 192.168.26.50)

Next, ensure that the Default Web Site host in IIS has an HTTPS binding, and furthermore has its Require Server Name Indication box unticked: the host used for an SSTP VPN must not require SNI.

To begin, we should get rid of any certificates for the VPN host.

$ $hostname = "vpn.company.com"
$ Get-ChildItem -Path Cert:\LocalMachine\My | Where-Object {$_.Subject -match $hostname} | Remove-Item

Typically, certificates for IIS are stored in the WebHosting certificate store. However, RRAS can only use certificates under the Personal certificate store, so we must ask win-acme to place the certificate in the Personal store explicitly.

$ ./letsencrypt.exe --plugin iisbinding --manualhost $hostname --certificatestore My --notaskscheduler

Then, we can fetch a PowerShell object referencing our certificate. By default win-acme removes old (expired) certificates when requesting a new one with the same host, so we can just filter by hostname.

$ $cert = Get-ChildItem -Path Cert:\LocalMachine\My | Where-Object {$_.Subject -match $hostname}

Finally, we can import the RRAS module, and set our RRAS cert to the one we just created.

$ Import-Module RemoteAccess
$ Stop-Service RemoteAccess
$ Set-RemoteAccess -SslCertificate $cert
$ Start-Service RemoteAccess

LetsEncrypt certificates expire every 3 months, so it's a good idea to make this script run periodically in Task Scheduler, so that you're not faced with unexpected VPN outages.

Next, since RRAS doesn't start up by default on a machine boot, we should make it do so:

Open up Services
Find the Remote Access Connection Manager service, right-click → Properties → Startup type: Automatic

Testing the VPN

Now, you may set up the VPN on a Windows machine, and attempt connecting. The VPN should connect, without any connectivity to the internet or the host machine. If the VPN immediately disconnects upon a connection attempt, you can use the rasdial (rasdial /? for usage help) command in a command prompt on the client to get more detailed error information than from the regular Windows interface.

At this point, you may or may not be able to ping the host machine from your client when connected to the VPN (you can use ipconfig /all on the host to determine its VPN IP, and try ping-ing it from the client).

Note that you will not be able to access the internet. To fix this, you must configure your client not to attempt to use the server gateway (because it doesn't exist). Open the Network and Sharing Center, and click into Change adapter settings.

Right-click the VPN connection you just created, and select "Properties". Switch to the Networking tab. Select the Internet Protocol Version 4 (TCP/IPv4) list item, then click the Properties button. Click Advanced, and uncheck Use default gateway on remote network.

Troubleshooting

VPN user must be allowed to dial-in

Run mmc.exe
Add the Local Users and Groups snap-in from the File menu
Click into your user account, then right-click Properties
Dial-in tab, Allow access under Network Access Permission

Network Policy Server must allow VPN connections

If you have NPS enabled, you will have to configure it to allow VPN connections.

Under the NPS snap-in from mmc.exe → Advanced Configuration → Network Policies → Grant access to both policies relating to VPN connections (they are deny by default).

Host machine must be discoverable

Open up the Network and Sharing Center
Click Advanced sharing settings
Expand the Private and Guest or Public groups, and turn on Network Discovery and File and printer sharing on both

Wrapping Up

At this point, clients should be able to connect to the VPN host, and any file shares created on it should be mountable. A minor caveat to be aware of is that LetsEncrypt certificates expire every 3 months, so you must either have a reminder in your calendar to renew the certificate, or have a scheduled script to request a new certificate and reconfigure RRAS to use it.

Blazing-fast Java2D rendering

2017-10-18T00:00:00+00:00

Anyone who has ever attempted to draw anything more than almost-static scenes with Java2D can attest that it sluggishly chugs along. Some will even say it's unusable for repainting at 60Hz or higher without taking a toll on CPU.

Today, we'll look at what we can do to speed up rendering, in ways that (at the time of writing) I have not seen discussed anywhere online. Probably because it's a big hack.

Vanilla Java2D Rendering, and Caveats

The reader may be familiar with the way Java handles drawing in Swing. If not, the official documentation is a good starting place.

What's important to note is that when you wish to update a frame in your Java2D application, you must first request a repaint, which is processed by putting a repaint event onto the event queue. If there are things earlier in the event queue, they must first get processed before you can repaint.

It's not even guaranteed that a repaint request will cause a repaint — sometimes, multiple repaint events can get "squashed" into one, causing jittery animations. Of course, there are workarounds like paintImmediately, for example, but none provide outstanding performance for even the simplest scenes: there's simply too much abstraction, which is a killer when every millisecond counts to obtain an immersive rendering experience.

A simple benchmark

Below is a simple Swing application that does nothing more than draw a red, full-window rectangle. We'll be using it as a benchmark for the purposes of this post — though it is not a particularly good real-life example, the effects of Java2D abstraction are fairly uniform across the entire API: if we can get this to run quickly, so too will everything else.

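import java.awt.BorderLayout;
import java.awt.Color;
import java.awt.Graphics;
import javax.swing.JFrame;
import javax.swing.JPanel;

class PaintFrame extends JFrame {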
    int frameCount;

    {
        setSize(720, 680);
        JPanel canvas;
        add(canvas = new JPanel() {
            @Override
            public void paint(Graphics gfx) {
                gfx.setColor(Color.RED);
                gfx.fillRect(0, 0, getWidth(), getHeight());
                frameCount++;
            }
        }, BorderLayout.CENTER);
        setLocationRelativeTo(null);
        setVisible(true);

        new Thread(() -> {
            while (true) {
                canvas.paintImmediately(0, 0, getWidth(), getHeight());
            }
        }).start();
    }
}

We can also hook our frame up to a simple, but illustrative, test.

import java.util.Timer;
import java.util.TimerTask;

public class PaintTest {
    public static void main(String[] argv) {
        PaintFrame frame = new PaintFrame();
        new Timer().schedule(new TimerTask() {
            double seconds;

            @Override
            public void run() {
                seconds++;
                System.out.printf("Averaging %.2f fps!\n", frame.frameCount / seconds);
            }
        }, 1000, 1000);
    }
}

Running on my Intel HD 5500 integrated graphics, the example code above averages ~1,200 frames per second. This quickly drops to below 400fps when making the window fullscreen, which is unacceptable for anything where framerate matters.

The significance of this deserves a bit more explanation (400fps is far more than the human eye can see!) Here, we're doing nothing more than drawing a red rectangle as fast as possible; there is no application logic taking up resources at the same time, and a single frame takes 2.5ms to process.

To maintain a 120fps framerate, each frame should be processed in ~8.3ms. If we're taking 2.5ms just to draw a single red rectangle, that leaves 5.8ms per frame for application logic: rendering would consume ~30% of application time. Naturally, rendering time increases the more you have to draw per frame, and our 2.5ms measurement is for a single rectangle.
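The arithmetic above is easy to redo for other targets. Here is a minimal sketch, with hypothetical class and method names, that converts a frame rate into a per-frame time and works out what share of a target frame budget a given render time eats up:

```java
class FrameBudget {
    // Time available (or spent) per frame, in milliseconds, at a given frame rate.
    static double frameTimeMs(double fps) {
        return 1000.0 / fps;
    }

    // Fraction of the per-frame budget at targetFps that renderMs of drawing consumes.
    static double renderShare(double renderMs, double targetFps) {
        return renderMs / frameTimeMs(targetFps);
    }

    public static void main(String[] args) {
        double renderMs = frameTimeMs(400);   // fullscreen benchmark: 400fps
        double budgetMs = frameTimeMs(120);   // target: 120fps
        System.out.printf("render time per frame: %.1f ms%n", renderMs);
        System.out.printf("budget per frame:      %.1f ms%n", budgetMs);
        System.out.printf("left for app logic:    %.1f ms%n", budgetMs - renderMs);
        System.out.printf("share spent rendering: %.0f%%%n",
                100.0 * renderShare(renderMs, 120));
    }
}
```

Plugging in a 60fps target instead gives a budget of about 16.7ms, so the same 2.5ms rectangle would take roughly 15% of each frame.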

Now that we've seen how vanilla Java2D rendering performs, let's see if we can do better.

A Hack for Fast Rendering

Java2D provides output with OpenGL, Direct3D, GDI, and more, depending on platform. Most of these are inherently active-rendering APIs, so there should be no technical barrier preventing us from rendering directly to them… except for abstraction.

Disclaimer: if your application needs to run on more than just Oracle's VM (or equivalently, OpenJDK), your mileage may vary with this approach. As I mentioned earlier, it's a hack specific to the internals of the Oracle API implementation, so it's unlikely to work anywhere else.

Let's start off with an observation. If we try printing out a Graphics object passed to paint, we'll see that it's implemented by sun.java2d.SunGraphics2D. We can also see that we're passed a different object each frame, so that's already a waste of GC resources, if we're pumping out hundreds of frames a second.
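A quick way to poke at this without opening a window is to ask a BufferedImage for its graphics context. This is only a stand-in for the Graphics handed to paint (an off-screen image, not a Swing component), and the class name it prints is an implementation detail of the JVM in use, so treat it as an illustration rather than a guarantee.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class GraphicsIdentity {
    // Asks the image for a graphics context, much as Swing obtains one per paint.
    static Graphics2D freshGraphics(BufferedImage image) {
        return image.createGraphics();
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        Graphics2D first = freshGraphics(image);
        Graphics2D second = freshGraphics(image);

        // The concrete class behind the abstract Graphics2D type.
        System.out.println("implementation: " + first.getClass().getName());
        // Each call hands back a new object rather than a shared one.
        System.out.println("same object?    " + (first == second));

        first.dispose();
        second.dispose();
    }
}
```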

If we could construct our own SunGraphics2D object, we'd be able to reuse it and any underlying resources outside of our paint method, directly in our rendering thread. The SunGraphics2D constructor is pretty benign, so that's promising.

public SunGraphics2D(SurfaceData sd, Color fg, Color bg, Font f)

At first glance, this seems fairly mild for such a fundamental class. The only thing that appears tricky is the SurfaceData parameter.

Obtaining a `SurfaceData`

SurfaceData sounds exactly like what one would expect an abstraction of a native surface to be called, and if we dig into its source, it becomes evident that SurfaceData implementations (the class itself is marked abstract) do the heavy lifting in rendering Java2D. If we search for implementations, we get names like D3DWindowSurfaceData, GDIWindowSurfaceData, XSurfaceData, and so on.

It's clear that any rendering we do will have to be platform-dependent, so let's stick to the GDIWindowSurfaceData for now. Naturally, this will work only on Windows, but the idea is what's important, and it generalizes to other platform-specific surface implementations.

If we take a look at the source for GDIWindowSurfaceData, we find a very helpful function:

public static GDIWindowSurfaceData createData(WComponentPeer peer) {
    SurfaceType sType = getSurfaceType(peer.getDeviceColorModel());
    return new GDIWindowSurfaceData(peer, sType);
}

…and all we need to use it is a WComponentPeer, which we can obtain from our panel's (deprecated) getPeer method! Note: getPeer is removed in the Java 9 EAP; equivalently, you can use reflection to fetch the peer field directly.

Importantly, all SurfaceData implementations provide a createData static method, so it's possible to use reflection to make accessing code more portable and elegant. But, that's beyond the scope of this post.
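For the curious, the reflective lookup could look roughly like the sketch below. The helper name callStaticFactory is hypothetical, and the demo runs it against Integer.valueOf only so that it works on any JVM. Pointing it at a real surface class is untested here, and on Java 9 and later the internal sun.* packages are typically not exported, so extra flags such as --add-exports would likely be needed.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class StaticFactoryCall {
    // Finds a public static one-argument method by name whose parameter type
    // accepts the given argument, and invokes it.
    static Object callStaticFactory(String className, String methodName, Object arg) {
        try {
            Class<?> type = Class.forName(className);
            for (Method m : type.getMethods()) {
                if (m.getName().equals(methodName)
                        && Modifier.isStatic(m.getModifiers())
                        && m.getParameterCount() == 1
                        && m.getParameterTypes()[0].isInstance(arg)) {
                    return m.invoke(null, arg);
                }
            }
            throw new IllegalStateException("no matching " + className + "." + methodName);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in target so the example runs anywhere.
        System.out.println(callStaticFactory("java.lang.Integer", "valueOf", "42"));

        // Assumed usage for the post's case (Windows only, not exercised here):
        // callStaticFactory("sun.java2d.windows.GDIWindowSurfaceData", "createData", peer);
    }
}
```

Matching on the parameter type rather than hard-coding it is what would let the same call site pick up whichever peer class a platform's createData expects.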

An improved benchmark

Putting these together, we can come up with a solution that allows us to draw at our own pace, outside of Swing/AWT entirely.

class PaintFrame extends JFrame {
    int frameCount;

    {
        setSize(720, 680);
        // Note that this is now a heavyweight Panel: JPanels don't have real native peers
        Panel canvas = new Panel();
        add(canvas, BorderLayout.CENTER);
        setLocationRelativeTo(null);
        setVisible(true);

        ComponentPeer peer = canvas.getPeer();
        SurfaceData surfaceData = GDIWindowSurfaceData.createData((WComponentPeer) peer);
        SunGraphics2D gfx = new SunGraphics2D(surfaceData, Color.BLACK, Color.BLACK, null);

        new Thread(() -> {
            gfx.setColor(Color.RED);
            while(true) {
                gfx.fillRect(0, 0, getWidth(), getHeight());
                frameCount++;
            }
        }).start();
    }
}

On the same hardware, our new rendering approach can pump out ~14,000 frames per second, which drops to ~6,000fps when fullscreen. That's a 20x speed improvement over regular rendering!
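One way to turn a counter like frameCount into a frames-per-second figure is to sample it periodically and divide by the elapsed time. The FpsMeter class below is a sketch written for this purpose, not the measurement code behind the numbers above. It uses an AtomicLong rather than a plain int so the count can be read safely from a thread other than the rendering one.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: counts frames on the render thread and reports
// the average rate between two samples.
final class FpsMeter {
    private final AtomicLong frames = new AtomicLong();
    private long lastCount;
    private long lastNanos;

    FpsMeter(long startNanos) {
        this.lastNanos = startNanos;
    }

    // Call once per rendered frame (e.g. right after fillRect).
    void frameRendered() {
        frames.incrementAndGet();
    }

    // Returns frames per second since the previous sample, given the
    // current time in nanoseconds (e.g. from System.nanoTime()).
    double sample(long nowNanos) {
        long count = frames.get();
        long elapsed = nowNanos - lastNanos;
        double fps = elapsed <= 0 ? 0.0 : (count - lastCount) * 1e9 / elapsed;
        lastCount = count;
        lastNanos = nowNanos;
        return fps;
    }
}
```

The rendering loop would call frameRendered() where it currently increments frameCount, and a separate timer thread would call sample(System.nanoTime()) once a second or so.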

A Practical Conclusion

It's nice to be able to say we can render 20x faster by employing this approach. But, it's not without caveats: you need to implement a backend for each platform you wish to be able to render on, or at least provide a fallback to regular Swing drawing when you cannot.

In other words, it's not practical for simple one-off Swing applications. Nor is it practical for speeding up general rendering of Swing components. However, if your task involves repainting a large component as fast as possible, this is definitely the fastest you can get without linking 3rd party libraries to perform Direct3D/OpenGL rendering yourself.

You can view a more complete implementation of the ideas expressed in this post in a Gameboy Color emulator I wrote, where Java2D was taking more time to render than the rest of the emulation combined (which spurred me to develop this approach).

Java internals, or when `true != true`

2017-10-14T00:00:00+00:00

Most programmers have heard jokes about inserting a Greek question mark (;, U+037E) into Java code in place of a semicolon to cause "inexplicable" compilation errors.

But, it's too easy to discover. What about something that manifests itself at runtime, but when inspected — either by printing to stdout or through a debugger — shows nothing amiss?

Using the internal sun.misc.Unsafe (and targeting HotSpot VMs), we can create a boolean that compares equal to neither true nor false, but when inspected, will always manifest itself as true.

Let's take a look.

import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class Tainted {
    public static boolean toTaint = true;

    public static void main(String[] argv) throws Exception {
        Field _unsafe = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
        _unsafe.setAccessible(true);

        Unsafe unsafe = (Unsafe) _unsafe.get(null);
        unsafe.putInt(Tainted.class, unsafe.staticFieldOffset(Tainted.class.getDeclaredField("toTaint")), 2);

        test(toTaint, false);
        test(toTaint, true);
        test(toTaint, toTaint);
    }

    public static void test(boolean a, boolean b) {
        System.out.printf("%s == %s: %s\n", a, b, a == b);
    }
}

The output of the above code is shown below.

true == false: false
true == true: false
true == true: true

So, what's going on?

The Unsafe class allows us to play around with the raw data backing Java objects. Since this is inherently unsafe, we have to jump through a few hoops: specifically, we must use reflection to grab the Unsafe instance (this can be blocked by a security manager, for security-conscious applications). The alternative is to set our classes as part of the bootclasspath and use Unsafe.getUnsafe() directly, but that's less elegant.

Once we have our Unsafe instance, we can use it to determine the offset in memory of our toTaint boolean from the base of our class. Then, we can use putInt to set the value of toTaint to the integer 2.

But what does this mean?

If we look into the internals of the JVM, we can find the declaration of jboolean (the internal representation of a boolean value) in jni.h as an unsigned char.

...
typedef unsigned char   jboolean;
typedef unsigned short  jchar;
typedef short           jshort;
typedef float           jfloat;
typedef double          jdouble;
...

This makes sense: there's no data type for storing just one bit of data, and an unsigned char is guaranteed to be at least 8 bits. That is, a boolean can actually store any number in the range 0 to 255, and we're setting it to the integer value 2.

Internally, when the JVM does equality comparisons, it doesn't check only one specific bit of each boolean value (that'd be a silly waste of time); instead, it simply compares all 8 bits. A real true value has only the least significant bit set (i.e., is equal to the integer 1). So our tainted boolean (set to 2) compares equal to neither a real true nor a real false (stored as 0).

However, this boolean is functionally equivalent otherwise: conditional branching operations look to see only if the value is nonzero, so an if (toTaint) block of code would still execute as expected.

With that in mind, we can take a look at the code of the Boolean class to explain the final bit of the puzzle:

...
public String toString() {
    return value ? "true" : "false";
}
...

When we're printing out our boolean, internally toString must be called on our object, so the boolean is autoboxed to a Boolean, and the above code is called. As we've discussed already, branch operations treat any nonzero value as true, so our boolean will always be represented by the string true.
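To make the three rules concrete without touching Unsafe at all, here's a small model of the behaviour described above. It is a simulation only: TaintedModel and its methods are invented for this sketch, and plain ints stand in for the raw byte backing each boolean. Equality compares the whole byte, while branching and printing only ask whether it is nonzero.

```java
// Hypothetical model of the explanation above; it does not modify any
// real boolean, it just mimics the comparisons on raw byte values.
final class TaintedModel {
    static final int REAL_FALSE = 0;
    static final int REAL_TRUE = 1;
    static final int TAINTED = 2;

    // Equality compares all 8 bits of the two values.
    static boolean rawEquals(int a, int b) {
        return (a & 0xFF) == (b & 0xFF);
    }

    // Conditional branches only check whether the value is nonzero.
    static boolean branchTaken(int raw) {
        return (raw & 0xFF) != 0;
    }

    // Mirrors `value ? "true" : "false"`, which is itself a branch.
    static String render(int raw) {
        return branchTaken(raw) ? "true" : "false";
    }

    static String test(int a, int b) {
        return render(a) + " == " + render(b) + ": " + rawEquals(a, b);
    }

    public static void main(String[] args) {
        System.out.println(test(TAINTED, REAL_FALSE));
        System.out.println(test(TAINTED, REAL_TRUE));
        System.out.println(test(TAINTED, TAINTED));
    }
}
```

Running main prints the same three lines as the real program: the tainted value always renders as true, but only compares equal to itself.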

And that wraps up our goal! The Unsafe class has many practical uses for legitimate applications, but sometimes trying out illegitimate things is the best way to learn something new — which hopefully this post has helped with!
