Planet HantsLUG

Planet HantsLUG - http://hantslug.org.uk/planet/

Andy Smith: bcache and lvmcache

Wed, 19/07/2017 - 07:06
Background

Over at BitFolk we offer both SSD-backed storage and HDD-backed archive storage. The SSDs we use are really nice and basically have removed all IO performance problems we have ever encountered in the past. I can’t begin to describe how pleasant it is to just never have to think about that whole class of problems.

The main downside of course is that SSD capacity is still really expensive. That’s why we introduced the HDD-backed archive storage: for bulk storage of things that didn’t need to have high performance.

You’d really think though that by now there would be some kind of commodity tiered storage that would allow a relatively small amount of fast SSD storage to accelerate a much larger pool of HDDs. Of course there are various enterprise solutions, and there is also ZFS where SSDs could be used for the ZIL and L2ARC while HDDs are used for the pool.
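
For reference, attaching fast devices to a ZFS pool only takes a couple of commands; a minimal sketch, with hypothetical pool and device names:

# Add an SSD partition as the separate intent log (ZIL/SLOG)...
sudo zpool add tank log /dev/nvme0n1p1
# ...and another as an L2ARC read cache.
sudo zpool add tank cache /dev/nvme0n1p2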

ZFS is quite appealing but I’m not ready to switch everything to that yet, and I’m certainly not going to buy some enterprise storage solution. I also don’t necessarily want to commit to putting all storage behind such a system.

I decided to explore Linux’s own block device caching solutions.

Scope

I’ve restricted the scope to the two solutions which are part of the mainline Linux kernel as of July 2017, these being bcache and lvmcache.

lvmcache is based upon dm-cache which has been included with the mainline kernel since April 2013. It’s quite conservative, and having been around for quite a while is considered stable. It has the advantage that it can work with any LVM logical volume no matter what the contents. That brings the disadvantage that you do need to run LVM.

bcache has been around for a little longer but is a much more ambitious project. Being completely dedicated to accelerating slow block devices with fast ones, it is claimed to achieve higher performance than other caching solutions, but as it’s much more complicated than dm-cache there are still bugs being found. It also requires you to format your block devices as bcache before you can use them for anything.

Test environment

I’m testing this on a Debian testing (buster) Xen virtual machine with a 20GiB xvda virtual disk containing the main operating system. That disk is backed by a software (md) RAID-10 composed of two Samsung sm863 SSDs. It was also used for testing the baseline SSD performance from the directory /srv/ssd.

The virtual machine had 1GiB of memory but the pagecache was cleared between each test run in an attempt to prevent anything being cached in memory.
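
To be clear about what “cleared” means here, it is the same drop_caches command that appears throughout the test transcripts below:

# Drop the pagecache (plus dentries and inodes) so each run starts cold.
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null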

A 5GiB xvdc virtual disk was provided, backed again on the SSD RAID. This was used for the cache role both in bcache and lvmcache.

A 50GiB xvdd virtual disk was provided, backed by a pair of Seagate ST4000LM016-1N2170 HDDs in software RAID-1. This was used for the HDD backing store in each of the caching implementations. The resulting cache device was mounted at /srv/cache.

Finally a 50GiB xvde virtual disk also backed on HDD was used to test baseline HDD performance, mounted at /srv/slow.

The filesystem in use in all cases was ext4 with default options. In dom0, the deadline I/O scheduler was used in all cases.
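
If you want to check or change the scheduler yourself, it is exposed per block device in sysfs; a minimal sketch, with an illustrative device name:

# The active scheduler is shown in brackets.
cat /sys/block/sda/queue/scheduler
echo deadline | sudo tee /sys/block/sda/queue/scheduler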

TL;DR, I just want graphs

In case you can’t be bothered to read the rest of this article, here are just the graphs with some attempt at interpreting them. Down in the tests section you’ll find details of the actual testing process and more commentary on why certain graphs were produced.

git test graphs

Times to git clone and git grep.

fio IOPS graphs

These are graphs of IOPS across the 30 minutes of testing. There are two important things to note about these graphs:

  1. They’re Bézier curves fitted to the data points, which are one per second. The actual data points are all over the place, because achieved IOPS depends on how many cache hits/misses there were, which is statistical.
  2. Only the IOPS for the first job is graphed. Even when using the per_job_logs=0 setting my copy of fio writes a set of results for each job. I couldn’t work out how to easily combine these so I’ve shown only the results for the first job.

    For all tests except bcache (sequential_cutoff=0) you just have to bear in mind that there is a second job working in parallel doing pretty much the same amount of IOPS. Strangely for that second bcache test the second job only managed a fraction of the IOPS (though still more than 10k IOPS) and I don’t know why.

IOPS over time for all tests

Well, those results are so extreme that they make it hard to distinguish between the low-end results.

A couple of observations:

  • SSD is incredibly and consistently fast.
  • For everything else there is a short steep section at the beginning which is likely to be the effect of HDD drive cache.
  • With sequential_cutoff set to 0, bcache very quickly reaches near-SSD performance for this workload (4k reads, 90% hitting 10% of data that fits entirely in the bcache). This is probably because the initial write put data straight into cache as it’s set to writeback.
  • When starting with a completely empty cache, lvmcache is no slouch either. It’s not quite as performant as bcache but that is still up near the 48k IOPS per process region, and very predictable.
  • When sequential_cutoff is left at its default of 4M, bcache performs much worse, though still blazing compared to an HDD on its own. At the end of this 30 minute test performance was still increasing, so it might be worth performing a longer test.
  • The performance of lvmcache when starting with a cache already full of junk data seems to be not that much better than HDD baseline.
IOPS over time for low-end results

Leaving the high-performers out to see if there is anything interesting going on near the bottom of the previous graph.

Apart from the initial spike, HDD results are flat as expected.

Although the lvmcache (full cache) results in the previous graph seemed flat too, looking closer we can see that performance is still increasing, just very slowly. It may be interesting to test for longer to see if performance does continue to increase.

Both HDD and lvmcache have a very similar spike at the start of the test so let’s look closer at that.

IOPS for first 30 seconds

For all the lower-end performers the first 19 seconds are steeper and I can only think this is the effect of HDD drive cache. Once that is filled, HDD remains basically flat, lvmcache (full cache) increases performance more slowly and bcache with the default sequential_cutoff starts to take off.

SSDs don’t have the same sort of cache and bcache with no sequential_cutoff spikes up too quickly to really be noticeable at this scale.

3-hour lvmcache test

Since it seemed like lvmcache with a full cache device was still slowly increasing in performance, I ran a 3-hour test on that one.

Skipping the first 20 minutes which show stronger growth, even after 3 hours there is still some performance increase happening. It seems like even a full cache would eventually promote read hot spots, but it could take a very very long time.

The tests

This being caching, we would only really expect it to be effective for reads of already-known data. Sequential reads on an HDD aren’t too bad but what really makes things feel slow are random reads all over the place: the time taken to seek to a new location on an HDD is really a killer for performance. So, we’d like a test that involves multiple reads of the same data but non-sequential.

git tests

I decided to do a git clone of the Linux kernel and then do a git grep of this, which would be guaranteed to be random access. The git grep step was done repeatedly to see if we could spot the cache being warmed up, also with the kernel’s pagecache being cleared after each step to avoid a situation where the entirety of the kernel source ended up in RAM. The clone process was also timed just to see if there would be any effect on write time.

fio tests

Despite git being a real application, the workload itself (searching for a string over and over) seemed rather too synthetic and simplistic. I decided to also try using the excellent fio to gather some detailed statistics. This is also synthetic of course, relating to no real workload, but it’s quite sophisticated and hopefully of interest.

The fio job spec used in all cases was as follows:

[partial]
ioengine=libaio
direct=1
iodepth=8
numjobs=2
readwrite=randread
random_distribution=zipf:1.2
bs=4k
size=2g
unlink=1
runtime=1800
time_based=1
per_job_logs=1
log_avg_msec=1000
write_iops_log=/var/tmp/fio-${FIOTEST}

This causes two 2GiB files of random data to be written out and then two processes read from them for 30 minutes with up to 8 IOs in flight at once for each process. Between the use of direct=1 and the fact that there is 4GiB of data and only 1GiB of RAM, that should be enough to avoid testing any memory cache.

log_avg_msec=1000 is specified in order to restrict the IOPS log to one entry per second; otherwise you get one entry per IO. The log is kept in memory until fio completes, so given the sheer number of IOs the SSDs can service, an unreduced log would cause the fio process to run the VM out of memory. This directive causes just one averaged entry per 1000ms, which is adequate for graphing.

unlink is used to make sure that data files are deleted after each test run. Leaving the files in place would cause additional promotions into cache, which would favour tests that are run multiple times.

By default when told to do a random read, fio would ensure it reads the entire file once only (before starting again, if the test is time-based). That would not be any use for testing cache warming as there would only ever be one read from each part of the file. To avoid this, a Zipf distribution is used so that read locations are chosen according to a somewhat repetitive distribution, simulating hotspots. I have no real idea about this part so if anyone has any input on that I’d be interested to hear, but my understanding is that a Zipf distribution with exponent 1.2 should cause about 91% of the reads to be from about 10% of the blocks.

In fact I would be very interested to hear of other fio job configurations that people think would be useful here.

If you’re unfamiliar with the tool, fio output can appear quite daunting. There’s a wealth of information in there though, so it’s worth looking into if you ever need to test storage performance. Too many times people just reach for dd or bonnie as performance testing tools, but apart from the most trivial of tests they are just too simple. I found this article by Albert Tobey useful in understanding fio’s output.

Each test run was performed against:

  1. A purely SSD-backed filesystem.
  2. A purely HDD-backed filesystem.
  3. A filesystem on top of a bcache device, sequential_cutoff=4M (default).
  4. A filesystem on top of a bcache device, sequential_cutoff=0.
  5. A filesystem on top of an lvmcache device, empty cache.
  6. A filesystem on top of an lvmcache device, cache full of junk.
Baseline SSD git grep test $ cd /srv/ssd $ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git Cloning into 'linux-stable'... remote: Counting objects: 5948690, done. remote: Compressing objects: 100% (124799/124799), done. remote: Total 5948690 (delta 427756), reused 549517 (delta 425675) Receiving objects: 100% (5948690/5948690), 1.21 GiB | 39.61 MiB/s, done. Resolving deltas: 100% (4961965/4961965), done. Checking out files: 100% (59844/59844), done. 195.45 user 22.10 system 2:37.89 elapsed 137% CPU $ cd /srv/ssd/linux-stable $ for i in {1..20}; do > echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \ > /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git --no-pager grep -q blah ; \ > sleep 1; \ > done 0.98 user 1.28 system 0:02.28 elapsed 99% CPU 0.95 user 1.15 system 0:02.18 elapsed 96% CPU 0.89 user 1.30 system 0:02.23 elapsed 98% CPU 1.04 user 1.13 system 0:02.25 elapsed 96% CPU 0.95 user 1.22 system 0:02.24 elapsed 96% CPU 0.95 user 1.23 system 0:02.23 elapsed 97% CPU 1.00 user 1.16 system 0:02.24 elapsed 96% CPU 1.00 user 1.16 system 0:02.23 elapsed 96% CPU 0.89 user 1.28 system 0:02.22 elapsed 97% CPU 0.97 user 1.20 system 0:02.24 elapsed 97% CPU 1.00 user 1.18 system 0:02.23 elapsed 97% CPU 0.96 user 1.21 system 0:02.23 elapsed 97% CPU 0.96 user 1.22 system 0:02.24 elapsed 97% CPU 1.01 user 1.17 system 0:02.24 elapsed 97% CPU 0.94 user 1.22 system 0:02.24 elapsed 96% CPU 1.00 user 1.16 system 0:02.24 elapsed 96% CPU 0.98 user 1.18 system 0:02.24 elapsed 96% CPU 0.92 user 1.24 system 0:02.24 elapsed 97% CPU 0.92 user 1.28 system 0:02.25 elapsed 98% CPU 1.09 user 1.07 system 0:02.23 elapsed 97% CPU

The initial git clone took 157.89 seconds and each git grep reliably took about 2.24 seconds.

fio test $ cd /srv/ssd/fio && FIOTEST=ssd fio ~/partial.fio partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8 ... fio-2.16 Starting 2 processes partial: Laying out IO file(s) (1 file(s) / 2048MB) partial: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 2 (f=2): [r(2)] [100.0% done] [409.8MB/0KB/0KB /s] [105K/0/0 iops] [eta 00m:00s] partial: (groupid=0, jobs=1): err= 0: pid=29306: Wed Jul 19 05:17:39 2017 read : io=379033MB, bw=215628KB/s, iops=53906, runt=1800001msec slat (usec): min=2, max=1547, avg= 4.38, stdev= 1.68 clat (usec): min=26, max=4350, avg=141.54, stdev=29.67 lat (usec): min=94, max=4353, avg=146.39, stdev=29.65 clat percentiles (usec): | 1.00th=[ 103], 5.00th=[ 108], 10.00th=[ 112], 20.00th=[ 119], | 30.00th=[ 124], 40.00th=[ 129], 50.00th=[ 135], 60.00th=[ 141], | 70.00th=[ 149], 80.00th=[ 161], 90.00th=[ 179], 95.00th=[ 197], | 99.00th=[ 243], 99.50th=[ 262], 99.90th=[ 314], 99.95th=[ 338], | 99.99th=[ 406] lat (usec) : 50=0.01%, 100=0.14%, 250=99.08%, 500=0.77%, 750=0.01% lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% cpu : usr=14.30%, sys=39.70%, ctx=50601996, majf=0, minf=16 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=97032448/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 partial: (groupid=0, jobs=1): err= 0: pid=29307: Wed Jul 19 05:17:39 2017 read : io=358188MB, bw=203769KB/s, iops=50942, runt=1800001msec slat (usec): min=2, max=225, avg= 4.46, stdev= 1.71 clat (usec): min=21, max=4385, avg=150.08, stdev=43.00 lat (usec): min=69, max=4391, avg=155.01, stdev=42.98 clat percentiles (usec): | 1.00th=[ 104], 5.00th=[ 110], 10.00th=[ 115], 20.00th=[ 121], | 30.00th=[ 127], 40.00th=[ 133], 50.00th=[ 137], 60.00th=[ 145], | 70.00th=[ 155], 80.00th=[ 171], 90.00th=[ 199], 95.00th=[ 237], | 99.00th=[ 322], 99.50th=[ 350], 99.90th=[ 414], 99.95th=[ 438], | 99.99th=[ 498] lat (usec) : 50=0.01%, 100=0.10%, 250=95.82%, 500=4.06%, 750=0.01% lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% cpu : usr=13.80%, sys=37.97%, ctx=48695297, majf=0, minf=15 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=91696255/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 Run status group 0 (all jobs): READ: io=737221MB, aggrb=419396KB/s, minb=203769KB/s, maxb=215627KB/s, mint=1800001msec, maxt=1800001msec Disk stats (read/write): xvda: ios=188724848/22, merge=0/28, ticks=27221784/0, in_queue=27267336, util=100.00%

This provides a nice illustration of just how fast the SSDs are: each process managed a very consistent 52 thousand IOPS and between the two of them they managed to read ~720GiB of data in 30 minutes.

Baseline HDD git grep test $ cd /srv/slow $ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git Cloning into 'linux-stable'... remote: Counting objects: 5948690, done. remote: Compressing objects: 100% (124799/124799), done. remote: Total 5948690 (delta 427756), reused 549517 (delta 425675) Receiving objects: 100% (5948690/5948690), 1.21 GiB | 38.36 MiB/s, done. Resolving deltas: 100% (4961965/4961965), done. Checking out files: 100% (59844/59844), done. 194.61 user 21.48 system 3:42.23 elapsed 97% CPU $ cd /srv/slow/linux-stable $ for i in {1..20}; do > echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \ > /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git --no-pager grep -q blah ; \ > sleep 1; \ > done 1.22 user 1.60 system 0:23.60 elapsed 11% CPU 1.35 user 1.32 system 0:23.08 elapsed 11% CPU 1.24 user 1.43 system 0:22.49 elapsed 11% CPU 1.15 user 1.57 system 0:22.37 elapsed 12% CPU 1.25 user 1.44 system 0:22.09 elapsed 12% CPU 1.26 user 1.48 system 0:22.60 elapsed 12% CPU 1.20 user 1.52 system 0:22.47 elapsed 12% CPU 1.21 user 1.48 system 0:22.82 elapsed 11% CPU 1.14 user 1.55 system 0:22.45 elapsed 12% CPU 1.28 user 1.49 system 0:22.32 elapsed 12% CPU 1.21 user 1.48 system 0:22.43 elapsed 12% CPU 1.26 user 1.51 system 0:22.95 elapsed 12% CPU 1.21 user 1.49 system 0:22.33 elapsed 12% CPU 1.23 user 1.45 system 0:22.54 elapsed 11% CPU 1.19 user 1.51 system 0:22.08 elapsed 12% CPU 1.26 user 1.42 system 0:22.42 elapsed 12% CPU 1.20 user 1.51 system 0:22.47 elapsed 12% CPU 1.19 user 1.53 system 0:22.42 elapsed 12% CPU 1.15 user 1.56 system 0:22.48 elapsed 12% CPU 1.24 user 1.47 system 0:22.10 elapsed 12% CPU

The initial git clone took 222.23 seconds and each git grep reliably took about 22.53 seconds.

Of course, a significant part of the clone time was spent downloading the data.

fio test $ cd /srv/slow/fio && FIOTEST=hdd fio ~/partial.fio partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8 ... fio-2.16 Starting 2 processes partial: Laying out IO file(s) (1 file(s) / 2048MB) partial: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 2 (f=2): [r(2)] [100.0% done] [3795KB/0KB/0KB /s] [948/0/0 iops] [eta 00m:00s] partial: (groupid=0, jobs=1): err= 0: pid=29158: Wed Jul 19 04:33:19 2017 read : io=3276.3MB, bw=1863.8KB/s, iops=465, runt=1800078msec slat (usec): min=2, max=132, avg=13.00, stdev= 8.10 clat (usec): min=59, max=830656, avg=17151.07, stdev=36630.33 lat (usec): min=64, max=830668, avg=17164.94, stdev=36630.39 clat percentiles (usec): | 1.00th=[ 70], 5.00th=[ 78], 10.00th=[ 88], 20.00th=[ 111], | 30.00th=[ 143], 40.00th=[ 205], 50.00th=[ 988], 60.00th=[ 1496], | 70.00th=[15424], 80.00th=[26240], 90.00th=[50432], 95.00th=[82432], | 99.00th=[177152], 99.50th=[224256], 99.90th=[346112], 99.95th=[403456], | 99.99th=[544768] lat (usec) : 100=15.10%, 250=26.82%, 500=3.79%, 750=1.99%, 1000=2.47% lat (msec) : 2=10.84%, 4=0.39%, 10=2.91%, 20=10.92%, 50=14.68% lat (msec) : 100=6.53%, 250=3.21%, 500=0.34%, 750=0.02%, 1000=0.01% cpu : usr=0.39%, sys=1.17%, ctx=822685, majf=0, minf=16 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=838728/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 partial: (groupid=0, jobs=1): err= 0: pid=29159: Wed Jul 19 04:33:19 2017 read : io=3328.4MB, bw=1893.4KB/s, iops=473, runt=1800076msec slat (usec): min=2, max=157, avg=13.20, stdev= 8.22 clat (usec): min=4, max=1046.5K, avg=16882.45, stdev=35737.35 lat (usec): min=64, max=1046.6K, avg=16896.52, stdev=35737.42 clat percentiles (usec): | 1.00th=[ 70], 5.00th=[ 78], 10.00th=[ 88], 20.00th=[ 111], | 30.00th=[ 143], 40.00th=[ 211], 50.00th=[ 996], 60.00th=[ 1528], | 70.00th=[15296], 80.00th=[25728], 90.00th=[49920], 95.00th=[81408], | 99.00th=[173056], 99.50th=[218112], 99.90th=[329728], 99.95th=[391168], | 99.99th=[522240] lat (usec) : 10=0.01%, 50=0.01%, 100=15.03%, 250=26.59%, 500=3.86% lat (usec) : 750=2.07%, 1000=2.49% lat (msec) : 2=10.70%, 4=0.36%, 10=3.03%, 20=11.45%, 50=14.46% lat (msec) : 100=6.49%, 250=3.17%, 500=0.30%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=0.47%, sys=1.15%, ctx=834258, majf=0, minf=14 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=852044/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 Run status group 0 (all jobs): READ: io=6604.6MB, aggrb=3757KB/s, minb=1863KB/s, maxb=1893KB/s, mint=1800076msec, maxt=1800078msec Disk stats (read/write): xvde: ios=1690665/12, merge=0/15, ticks=28764332/2848, in_queue=28767992, util=100.00%

Both jobs together managed an average of 938 IOPS here, which is pretty weak but still considerably more than a pair of HDDs should be able to do (100–150 IOPS each, so 300 if you were amazingly lucky). This indicates that the drive cache must have been involved, and indeed the IOPS rate does very rapidly increase within the first few seconds of the test run but reaches a stable maximum within 20 seconds. That marks the time when the disk cache is full and cache hit rate can’t be maintained.

A total of 6,604.6MiB was read during the 30 minute test.

bcache Setup

By default bcache uses writethrough cache mode, where writes go to both the backing store and the cache and are only acknowledged once they’ve completed at the backing store. In this mode bcache will never accelerate writes, but it is safe to lose the cache device at any time.

I felt comfortable switching the bcache device to writeback mode as the underlying storage used for the cache is already reliable. In writeback mode, writes are acknowledged as soon as they have completed on the cache device and are later written sequentially to the backing device.

This should result in higher write performance as well as the higher read performance we’d expect from using a cache.
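
The cache mode can be inspected and changed at runtime through sysfs, and the setup transcript below does exactly this once the cache set is attached:

# The active mode is shown in brackets.
cat /sys/block/bcache0/bcache/cache_mode
echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode >/dev/null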

$ sudo make-bcache -B /dev/xvdd UUID: cf0353eb-ac08-45bd-a3d5-f8b92fbc0fb8 Set UUID: 8c06d317-e735-41f5-8dec-52da26a37883 version: 1 block_size: 1 data_offset: 16 $ sudo mkfs.ext4 -L cached /dev/bcache0 mke2fs 1.43.4 (31-Jan-2017) Discarding device blocks: done Creating filesystem with 13107198 4k blocks and 3276800 inodes Filesystem UUID: 7db8568e-282b-4fa2-bf19-a42143708a0d Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424 Allocating group tables: done Writing inode tables: done Creating journal (65536 blocks): done Writing superblocks and filesystem accounting information: done $ sudo make-bcache -C /dev/xvdc UUID: ae581052-4449-460d-8cf5-b11d6467403d Set UUID: 2fcc7937-9f5f-4089-9400-552675929ea5 version: 0 nbuckets: 10240 block_size: 1 bucket_size: 1024 nr_in_set: 1 nr_this_dev: 0 first_bucket: 1 $ sudo bcache-super-show /dev/xvdc sb.magic ok sb.first_sector 8 [match] sb.csum 206ADF979437131E [match] sb.version 3 [cache device] dev.label (empty) dev.uuid ae581052-4449-460d-8cf5-b11d6467403d dev.sectors_per_block 1 dev.sectors_per_bucket 1024 dev.cache.first_sector 1024 dev.cache.cache_sectors 10484736 dev.cache.total_sectors 10485760 dev.cache.ordered yes dev.cache.discard no dev.cache.pos 0 dev.cache.replacement 0 [lru] cset.uuid 2fcc7937-9f5f-4089-9400-552675929ea5 $ echo 2fcc7937-9f5f-4089-9400-552675929ea5 | sudo tee /sys/block/bcache0/bcache/attach >/dev/null $ echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode >/dev/null $ sudo mount /dev/bcache0 /srv/cache $ sudo chown andy: /srv/cache git grep test $ cd /srv/cache $ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git Cloning into 'linux-stable'... remote: Counting objects: 5948690, done. remote: Compressing objects: 100% (124799/124799), done. remote: Total 5948690 (delta 427755), reused 549518 (delta 425675) Receiving objects: 100% (5948690/5948690), 1.21 GiB | 40.30 MiB/s, done. Resolving deltas: 100% (4961964/4961964), done. Checking out files: 100% (59844/59844), done. 194.88 user 21.30 system 3:34.10 elapsed 100% CPU $ cd /srv/cache/linux-stable $ for i in {1..20}; do > echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \ > /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git --no-pager grep -q blah ; \ > sleep 1; \ > done 1.03 user 1.40 system 0:03.02 elapsed 80% CPU 1.01 user 1.25 system 0:02.55 elapsed 88% CPU 1.05 user 1.21 system 0:02.63 elapsed 85% CPU 0.90 user 1.34 system 0:02.56 elapsed 87% CPU 0.96 user 1.28 system 0:02.59 elapsed 86% CPU 0.90 user 1.35 system 0:02.54 elapsed 88% CPU 1.04 user 1.22 system 0:02.51 elapsed 90% CPU 1.00 user 1.25 system 0:02.50 elapsed 89% CPU 0.93 user 1.34 system 0:02.54 elapsed 89% CPU 1.02 user 1.26 system 0:02.53 elapsed 90% CPU 0.91 user 1.43 system 0:02.67 elapsed 88% CPU 0.94 user 1.39 system 0:02.57 elapsed 90% CPU 1.04 user 1.28 system 0:02.57 elapsed 90% CPU 1.00 user 1.37 system 0:02.58 elapsed 91% CPU 1.02 user 1.33 system 0:02.84 elapsed 82% CPU 1.11 user 1.21 system 0:02.58 elapsed 90% CPU 1.03 user 1.30 system 0:02.57 elapsed 90% CPU 0.96 user 1.37 system 0:02.58 elapsed 90% CPU 1.02 user 1.36 system 0:02.62 elapsed 90% CPU 0.92 user 1.41 system 0:02.62 elapsed 89% CPU

The initial git clone took 214.10 seconds and each git grep on average took about 2.61 seconds. It can be seen that the first run is slightly slower but very quickly things settle down.

The git grep results are very encouraging being only 11.6% of the time taken for the baseline HDD tests, or almost 9x faster. Each run was only 0.37 seconds slower than when run directly against SSD.

The initial clone results are disappointing though, being only 8.13 seconds faster than when run against baseline HDD. It looks like most data must have bypassed the cache and waited for HDD. bcache does bypass cache for sequential writes which is what the git clone would have done, and we can prove this:

$ cat /sys/block/bcache0/bcache/stats_total/bypassed
270M

So yes, there we go, all 270MiB bypassed the cache and waited on HDD completion.

We can give bcache hints about when to consider writes sequential, and at the extreme end tell it to never consider writes sequential, so let’s try that.

$ echo 0 | sudo tee /sys/block/bcache0/bcache/sequential_cutoff >/dev/null
$ cd /srv/cache
$ rm -r linux-stable
$ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Cloning into 'linux-stable'...
remote: Counting objects: 5948690, done.
remote: Compressing objects: 100% (124799/124799), done.
remote: Total 5948690 (delta 427756), reused 549517 (delta 425675)
Receiving objects: 100% (5948690/5948690), 1.21 GiB | 39.51 MiB/s, done.
Resolving deltas: 100% (4961965/4961965), done.
Checking out files: 100% (59844/59844), done.
195.10 user 23.53 system 2:39.87 elapsed 136% CPU
$ cat /sys/block/bcache0/bcache/stats_total/bypassed
299M

This time only 29MiB of data bypassed cache, and the time for clone was 159.87 seconds. That’s 62.36 seconds faster than against HDD, and only 1.98 seconds slower than against SSD!

fio test – sequential_cutoff=4M $ cd /srv/cache/fio && FIOTEST=bcache-sc4m fio ~/partial.fio partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8 ... fio-2.16 Starting 2 processes partial: Laying out IO file(s) (1 file(s) / 2048MB) partial: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 2 (f=2): [r(2)] [100.0% done] [119.9MB/0KB/0KB /s] [30.7K/0/0 iops] [eta 00m:00s] partial: (groupid=0, jobs=1): err= 0: pid=8640: Thu Jul 20 18:13:08 2017 read : io=45237MB, bw=25734KB/s, iops=6433, runt=1800041msec slat (usec): min=3, max=5707, avg= 6.14, stdev=10.97 clat (usec): min=2, max=1238.4K, avg=1235.14, stdev=10675.04 lat (usec): min=70, max=1238.5K, avg=1241.70, stdev=10675.36 clat percentiles (usec): | 1.00th=[ 101], 5.00th=[ 104], 10.00th=[ 106], 20.00th=[ 108], | 30.00th=[ 111], 40.00th=[ 117], 50.00th=[ 119], 60.00th=[ 121], | 70.00th=[ 122], 80.00th=[ 124], 90.00th=[ 133], 95.00th=[ 151], | 99.00th=[35584], 99.50th=[67072], 99.90th=[148480], 99.95th=[185344], | 99.99th=[305152] lat (usec) : 4=0.01%, 50=0.01%, 100=0.37%, 250=97.11%, 500=0.08% lat (usec) : 750=0.02%, 1000=0.02% lat (msec) : 2=0.03%, 4=0.01%, 10=0.13%, 20=0.65%, 50=0.87% lat (msec) : 100=0.46%, 250=0.24%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=1.81%, sys=6.79%, ctx=10722687, majf=0, minf=17 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=11580733/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 partial: (groupid=0, jobs=1): err= 0: pid=8641: Thu Jul 20 18:13:08 2017 read : io=43743MB, bw=24885KB/s, iops=6221, runt=1800037msec slat (usec): min=3, max=5703, avg= 6.27, stdev=10.53 clat (usec): min=2, max=1081.7K, avg=1277.44, stdev=10884.69 lat (usec): min=70, max=1081.8K, avg=1284.13, stdev=10884.99 clat percentiles (usec): | 1.00th=[ 100], 5.00th=[ 103], 10.00th=[ 104], 20.00th=[ 106], | 30.00th=[ 107], 40.00th=[ 109], 50.00th=[ 111], 60.00th=[ 117], | 70.00th=[ 120], 80.00th=[ 122], 90.00th=[ 129], 95.00th=[ 145], | 99.00th=[37632], 99.50th=[69120], 99.90th=[150528], 99.95th=[189440], | 99.99th=[317440] lat (usec) : 4=0.01%, 20=0.01%, 50=0.01%, 100=0.68%, 250=96.73% lat (usec) : 500=0.07%, 750=0.02%, 1000=0.02% lat (msec) : 2=0.03%, 4=0.01%, 10=0.13%, 20=0.68%, 50=0.90% lat (msec) : 100=0.48%, 250=0.25%, 500=0.02%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=1.74%, sys=6.57%, ctx=10425775, majf=0, minf=15 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=11198290/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 Run status group 0 (all jobs): READ: io=88981MB, aggrb=50618KB/s, minb=24884KB/s, maxb=25734KB/s, mint=1800037msec, maxt=1800041msec Disk stats (read/write): bcache0: ios=22775854/880, merge=0/0, ticks=28668080/36768, in_queue=197531580, util=100.00%, aggrios=11393419/289009, aggrmerge=0/182, aggrticks=14329508/435284, aggrin_queue=14764868, aggrutil=100.00% xvdc: ios=22230796/570128, merge=0/348, ticks=2656456/37312, in_queue=2693812, util=67.39% xvdd: ios=556042/7890, merge=0/17, ticks=26002560/833256, in_queue=26835924, util=100.00%

The ~15k IOPS per process achieved towards the end of this test are quite impressive and suggest that there is more scope for performance increase in a longer test. The combined average IOPS across 30 minutes were “only” 12.7k, though that is still vastly superior to an HDD and we would gladly take it.

22,779,023 IOs were issued (11,580,733+11,198,290) of which 22,230,796 (97.6%) hit the cache. That’s a really great result, but we learned earlier that setting sequential_cutoff to 0 allows us to get data we write straight into the cache, so let’s see how that changes things.

fio test – sequential_cutoff=0 $ cd /srv/cache/fio && FIOTEST=bcache-sc0 fio ~/partial.fio partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8 ... fio-2.16 Starting 2 processes partial: Laying out IO file(s) (1 file(s) / 2048MB) partial: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 2 (f=2): [r(2)] [100.0% done] [276.1MB/0KB/0KB /s] [70.1K/0/0 iops] [eta 00m:00s] partial: (groupid=0, jobs=1): err= 0: pid=9190: Thu Jul 20 20:34:49 2017 read : io=367029MB, bw=208799KB/s, iops=52199, runt=1800001msec slat (usec): min=3, max=350673, avg= 5.49, stdev=54.90 clat (usec): min=22, max=558335, avg=145.49, stdev=593.01 lat (usec): min=67, max=558342, avg=151.41, stdev=597.19 clat percentiles (usec): | 1.00th=[ 102], 5.00th=[ 106], 10.00th=[ 109], 20.00th=[ 116], | 30.00th=[ 120], 40.00th=[ 123], 50.00th=[ 127], 60.00th=[ 133], | 70.00th=[ 139], 80.00th=[ 153], 90.00th=[ 185], 95.00th=[ 223], | 99.00th=[ 310], 99.50th=[ 338], 99.90th=[ 442], 99.95th=[ 868], | 99.99th=[20352] lat (usec) : 50=0.01%, 100=0.27%, 250=96.55%, 500=3.11%, 750=0.02% lat (usec) : 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01% cpu : usr=13.23%, sys=41.79%, ctx=46793234, majf=0, minf=17 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=93959531/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 partial: (groupid=0, jobs=1): err= 0: pid=9191: Thu Jul 20 20:34:49 2017 read : io=116645MB, bw=66357KB/s, iops=16589, runt=1800020msec slat (usec): min=3, max=86150, avg= 6.31, stdev=22.07 clat (usec): min=16, max=581877, avg=473.49, stdev=3466.14 lat (usec): min=67, max=581884, avg=480.26, stdev=3466.62 clat percentiles (usec): | 1.00th=[ 102], 5.00th=[ 107], 10.00th=[ 110], 20.00th=[ 116], | 30.00th=[ 120], 40.00th=[ 123], 50.00th=[ 126], 60.00th=[ 131], | 70.00th=[ 135], 80.00th=[ 143], 90.00th=[ 161], 95.00th=[ 183], | 99.00th=[14656], 99.50th=[23680], 99.90th=[47360], 99.95th=[59648], | 99.99th=[94720] lat (usec) : 20=0.01%, 50=0.01%, 100=0.23%, 250=97.81%, 500=0.37% lat (usec) : 750=0.03%, 1000=0.02% lat (msec) : 2=0.02%, 4=0.02%, 10=0.18%, 20=0.66%, 50=0.58% lat (msec) : 100=0.08%, 250=0.01%, 500=0.01%, 750=0.01% cpu : usr=4.86%, sys=16.81%, ctx=24164179, majf=0, minf=15 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=29861123/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 Run status group 0 (all jobs): READ: io=483674MB, aggrb=275153KB/s, minb=66357KB/s, maxb=208798KB/s, mint=1800001msec, maxt=1800020msec Disk stats (read/write): bcache0: ios=123818417/28, merge=0/0, ticks=27945560/5768, in_queue=429639892, util=100.00%, aggrios=61919282/259095, aggrmerge=0/25, aggrticks=13927904/125270, aggrin_queue=14059068, aggrutil=100.00% xvdc: ios=123345393/500268, merge=0/35, ticks=17064408/30904, in_queue=17107112, util=99.99% xvdd: ios=493172/17923, merge=0/15, ticks=10791400/219636, in_queue=11011024, util=100.00%

123,820,654 (93,959,531+29,861,123) IOs issued, 123,345,393 (99.62%) hit cache. It’s just ~2% higher than the previous test, but what a difference it makes. The IOPS log indicates that performance very quickly increases to near-SSD levels, averaging 68.8k IOPS across 30 minutes.

Interestingly unlike all other tests (even unlike bcache with the default sequential_cutoff), the second job doesn’t manage anywhere near the same level of performance as the first job. It’s 52.2k IOPS vs. 16.6k IOPS. Perhaps at this high level of cache utilisation there is some sort of locking going on. Might be worth following up with upstream.
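
If you want to cross-check these hit rates without digging through fio’s disk stats, bcache keeps its own counters under sysfs; a sketch, assuming the usual stats_total layout:

# Cumulative hits, misses and the derived hit ratio for this bcache device.
cat /sys/block/bcache0/bcache/stats_total/cache_hits
cat /sys/block/bcache0/bcache/stats_total/cache_misses
cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio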

lvmcache Setup $ sudo pvcreate /dev/xvdc Physical volume "/dev/xvdc" successfully created. $ sudo pvcreate /dev/xvdd Physical volume "/dev/xvdd" successfully created. $ sudo vgcreate cachedvg /dev/xvd{d,c} Volume group "cachedvg" successfully created $ sudo lvcreate -L50m -n lv_cache_meta cachedvg /dev/xvdc Rounding up size to full physical extent 52.00 MiB Logical volume "lv_cache_meta" created. $ sudo lvcreate -L4g -n lv_cache cachedvg /dev/xvdc Logical volume "lv_cache" created. $ sudo lvcreate -l100%PVS -n lv cachedvg /dev/xvdd Logical volume "lv" created. $ sudo lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert lv cachedvg -wi-a----- 50.00g lv_cache cachedvg -wi-a----- 4.00g lv_cache_meta cachedvg -wi-a----- 52.00m $ sudo lvconvert --type cache-pool --cachemode writeback --poolmetadata cachedvg/lv_cache_meta cachedvg/lv_cache Converted cachedvg/lv_cache to cache pool. $ sudo lvconvert --type cache --cachepool cachedvg/lv_cache cachedvg/lv Logical volume cachedvg/lv is now cached. $ sudo lvs -a -o +devices LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices lv cachedvg Cwi-a-C--- 50.00g [lv_cache] [lv_corig] 0.00 1.01 0.00 lv_corig(0) [lv_cache] cachedvg Cwi---C--- 4.00g 0.00 1.01 0.00 lv_cache_cdata(0) [lv_cache_cdata] cachedvg Cwi-ao---- 4.00g /dev/xvdc(13) [lv_cache_cmeta] cachedvg ewi-ao---- 52.00m /dev/xvdc(0) [lv_corig] cachedvg owi-aoC--- 50.00g /dev/xvdd(0) [lvol0_pmspare] cachedvg ewi------- 52.00m /dev/xvdc(1037) $ sudo mkfs.ext4 -L cached /dev/cachedvg/lv mke2fs 1.43.4 (31-Jan-2017) Creating filesystem with 13106176 4k blocks and 3276800 inodes Filesystem UUID: bff28530-339a-403d-889d-dd42f4465ab5 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424 Allocating group tables: done Writing inode tables: done Creating journal (65536 blocks): done Writing superblocks and filesystem accounting information: done $ sudo mount /dev/cachedvg/lv /srv/cache $ sudo chown andy: /srv/cache git grep test $ cd /srv/cache $ /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git Cloning into 'linux-stable'... remote: Counting objects: 5948690, done. remote: Compressing objects: 100% (124799/124799), done. remote: Total 5948690 (delta 427755), reused 549518 (delta 425675) Receiving objects: 100% (5948690/5948690), 1.21 GiB | 35.05 MiB/s, done. Resolving deltas: 100% (4961964/4961964), done. Checking out files: 100% (59844/59844), done. 
196.00 user 23.40 system 2:43.13 elapsed 134% CPU $ cd /srv/cache/linux-stable $ for i in {1..20}; do > echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null ; \ > /usr/bin/time -f '%U user %S system %E elapsed %P CPU' \ > git --no-pager grep -q blah ; \ > sleep 1; \ > done 1.00 user 1.58 system 0:06.41 elapsed 40% CPU 0.93 user 1.46 system 0:04.22 elapsed 56% CPU 1.09 user 1.30 system 0:04.09 elapsed 58% CPU 1.02 user 1.39 system 0:03.89 elapsed 62% CPU 0.97 user 1.41 system 0:03.89 elapsed 61% CPU 1.03 user 1.34 system 0:03.52 elapsed 67% CPU 1.06 user 1.30 system 0:03.60 elapsed 65% CPU 0.94 user 1.46 system 0:03.79 elapsed 63% CPU 1.05 user 1.33 system 0:03.28 elapsed 72% CPU 1.00 user 1.39 system 0:03.81 elapsed 62% CPU 1.03 user 1.37 system 0:03.25 elapsed 74% CPU 1.05 user 1.37 system 0:03.30 elapsed 73% CPU 1.07 user 1.30 system 0:03.18 elapsed 74% CPU 1.07 user 1.32 system 0:03.28 elapsed 72% CPU 0.96 user 1.46 system 0:02.95 elapsed 82% CPU 1.07 user 1.31 system 0:03.14 elapsed 76% CPU 0.98 user 1.40 system 0:03.04 elapsed 78% CPU 0.96 user 1.42 system 0:03.19 elapsed 74% CPU 1.00 user 1.40 system 0:03.12 elapsed 77% CPU 0.89 user 1.48 system 0:02.89 elapsed 82% CPU

The initial git clone took 163.13 seconds, not far off the native SSD timing (157.89 seconds). Each git grep on average took about 3.35 seconds. These started off slower (6.41 seconds) but settled down, with the last 15 runs averaging 3.29 seconds.

fio test – empty cache $ cd /srv/cache/fio && FIOTEST=lvmcache-empty fio ~/partial.fio partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8 ... fio-2.16 Starting 2 processes partial: Laying out IO file(s) (1 file(s) / 2048MB) partial: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 2 (f=2): [r(2)] [100.0% done] [373.5MB/0KB/0KB /s] [95.7K/0/0 iops] [eta 00m:00s] partial: (groupid=0, jobs=1): err= 0: pid=6652: Thu Jul 20 12:14:06 2017 read : io=333831MB, bw=189912KB/s, iops=47478, runt=1800001msec slat (usec): min=2, max=245, avg= 6.41, stdev= 2.17 clat (usec): min=29, max=234093, avg=159.69, stdev=401.74 lat (usec): min=69, max=234098, avg=166.57, stdev=401.75 clat percentiles (usec): | 1.00th=[ 103], 5.00th=[ 108], 10.00th=[ 113], 20.00th=[ 119], | 30.00th=[ 124], 40.00th=[ 129], 50.00th=[ 135], 60.00th=[ 143], | 70.00th=[ 153], 80.00th=[ 175], 90.00th=[ 221], 95.00th=[ 274], | 99.00th=[ 370], 99.50th=[ 410], 99.90th=[ 502], 99.95th=[ 996], | 99.99th=[16768] lat (usec) : 50=0.01%, 100=0.16%, 250=92.76%, 500=6.97%, 750=0.05% lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.02%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=12.04%, sys=44.39%, ctx=42073879, majf=0, minf=16 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=85460611/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 partial: (groupid=0, jobs=1): err= 0: pid=6653: Thu Jul 20 12:14:06 2017 read : io=313896MB, bw=178572KB/s, iops=44642, runt=1800001msec slat (usec): min=2, max=335, avg= 6.55, stdev= 2.25 clat (usec): min=22, max=212646, avg=170.22, stdev=721.29 lat (usec): min=63, max=212656, avg=177.25, stdev=721.32 clat percentiles (usec): | 1.00th=[ 103], 5.00th=[ 108], 10.00th=[ 112], 20.00th=[ 118], | 30.00th=[ 122], 40.00th=[ 127], 50.00th=[ 131], 60.00th=[ 137], | 70.00th=[ 145], 80.00th=[ 157], 90.00th=[ 177], 95.00th=[ 197], | 99.00th=[ 258], 99.50th=[ 314], 99.90th=[12096], 99.95th=[15424], | 99.99th=[28032] lat (usec) : 50=0.01%, 100=0.22%, 250=98.57%, 500=0.91%, 750=0.01% lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.02%, 10=0.11%, 20=0.11%, 50=0.03% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=11.36%, sys=42.77%, ctx=41146564, majf=0, minf=14 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=80357365/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 Run status group 0 (all jobs): READ: io=647726MB, aggrb=368484KB/s, minb=178571KB/s, maxb=189912KB/s, mint=1800001msec, maxt=1800001msec Disk stats (read/write): dm-2: ios=165815070/5, merge=0/0, ticks=27172896/564, in_queue=27297688, util=100.00%, aggrios=55273566/7259, aggrmerge=0/0, aggrticks=8986252/1498, aggrin_queue=9007888, aggrutil=100.00% dm-1: ios=2/19047, merge=0/0, ticks=0/3072, in_queue=3072, util=0.14%, aggrios=165379217/17690, aggrmerge=0/4083, aggrticks=23890596/3616, aggrin_queue=23926592, aggrutil=100.00% xvdc: ios=165379217/17690, merge=0/4083, ticks=23890596/3616, in_queue=23926592, util=100.00% dm-0: ios=165379215/2726, merge=0/0, ticks=23917816/860, in_queue=23979056, util=100.00% dm-3: ios=441483/6, merge=0/0, 
ticks=3040940/564, in_queue=3041536, util=75.31%, aggrios=441483/5, aggrmerge=0/1, aggrticks=3040896/464, aggrin_queue=3041312, aggrutil=75.30% xvdd: ios=441483/5, merge=0/1, ticks=3040896/464, in_queue=3041312, util=75.30%

We can see that the first job issued 85,460,611 IOs and the second job issued 80,357,365 IOs, for a total of 165,817,976 IOs. Of these, 165,379,217 (99.74%) came from xvdc, the cache (SSD) device. Only 441,483 (0.27%) came from xvdd, the backing (HDD) device. Again this is very aggressive cache population, with a combined average of 92k IOPS being achieved over 30 minutes.

80k IOPS are achieved within 20 seconds of the start of testing, which is probably helped by the disk cache, but the steady maximum of ~90k IOPS is still achieved by 40 seconds in. Average IOPS ended up at 88% that of the SSD.

This is incredible performance, but I could only get it to happen the first time, against an empty cache device. Since the next test run will unlink the previous run’s files, the cache device will at that point be full of data that is of no use as it’s been deleted. The next test will write out 4GiB of new, uncached data.
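
Cache occupancy can be checked between runs via the Data% column that lvs already reports for cache LVs (shown as 0.00 in the setup transcript above):

# Data% against the cached LV shows how full the cache-pool’s data LV is.
sudo lvs -a cachedvg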

fio test – full cache $ cd /srv/cache/fio && FIOTEST=lvmcache-full fio ~/partial.fio partial: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8 ... fio-2.16 Starting 2 processes partial: Laying out IO file(s) (1 file(s) / 2048MB) partial: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 2 (f=2): [r(2)] [100.0% done] [6072KB/0KB/0KB /s] [1518/0/0 iops] [eta 00m:00s] partial: (groupid=0, jobs=1): err= 0: pid=6857: Thu Jul 20 12:58:00 2017 read : io=5266.1MB, bw=2996.2KB/s, iops=749, runt=1800111msec slat (usec): min=3, max=117, avg=12.93, stdev= 7.50 clat (usec): min=58, max=1060.8K, avg=10663.38, stdev=31025.49 lat (usec): min=73, max=1060.8K, avg=10676.99, stdev=31025.92 clat percentiles (usec): | 1.00th=[ 105], 5.00th=[ 112], 10.00th=[ 117], 20.00th=[ 125], | 30.00th=[ 127], 40.00th=[ 133], 50.00th=[ 137], 60.00th=[ 153], | 70.00th=[ 179], 80.00th=[14144], 90.00th=[33024], 95.00th=[59136], | 99.00th=[144384], 99.50th=[191488], 99.90th=[337920], 99.95th=[411648], | 99.99th=[618496] lat (usec) : 100=0.29%, 250=74.34%, 500=0.46%, 750=0.20%, 1000=0.21% lat (msec) : 2=0.47%, 4=0.03%, 10=1.30%, 20=7.26%, 50=9.23% lat (msec) : 100=4.09%, 250=1.88%, 500=0.22%, 750=0.02%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=0.48%, sys=1.52%, ctx=1335232, majf=0, minf=17 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=1348336/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 partial: (groupid=0, jobs=1): err= 0: pid=6858: Thu Jul 20 12:58:00 2017 read : io=5171.2MB, bw=2941.7KB/s, iops=735, runt=1800133msec slat (usec): min=3, max=320, avg=13.05, stdev= 7.56 clat (usec): min=63, max=1145.3K, avg=10861.20, stdev=30827.62 lat (usec): min=72, max=1145.3K, avg=10874.93, stdev=30828.07 clat percentiles (usec): | 1.00th=[ 109], 5.00th=[ 116], 10.00th=[ 120], 20.00th=[ 126], | 30.00th=[ 131], 40.00th=[ 133], 50.00th=[ 141], 60.00th=[ 155], | 70.00th=[ 185], 80.00th=[14912], 90.00th=[33536], 95.00th=[59136], | 99.00th=[142336], 99.50th=[189440], 99.90th=[333824], 99.95th=[411648], | 99.99th=[602112] lat (usec) : 100=0.33%, 250=73.29%, 500=0.50%, 750=0.22%, 1000=0.22% lat (msec) : 2=0.52%, 4=0.03%, 10=1.38%, 20=7.58%, 50=9.63% lat (msec) : 100=4.18%, 250=1.88%, 500=0.21%, 750=0.02%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=0.43%, sys=1.55%, ctx=1310390, majf=0, minf=15 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=1323812/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=8 Run status group 0 (all jobs): READ: io=10438MB, aggrb=5937KB/s, minb=2941KB/s, maxb=2996KB/s, mint=1800111msec, maxt=1800133msec Disk stats (read/write): dm-2: ios=2671993/27, merge=0/0, ticks=28764200/8380, in_queue=28773988, util=100.00%, aggrios=891886/15335, aggrmerge=0/0, aggrticks=9387100/6312, aggrin_queue=9393456, aggrutil=100.00% dm-1: ios=173/42639, merge=0/0, ticks=16/9316, in_queue=9336, util=0.24%, aggrios=1967684/42191, aggrmerge=0/3788, aggrticks=291076/9440, aggrin_queue=300476, aggrutil=15.68% xvdc: ios=1967684/42191, merge=0/3788, ticks=291076/9440, in_queue=300476, util=15.68% dm-0: ios=1967511/3340, merge=0/0, ticks=291184/1240, 
in_queue=292496, util=15.55% dm-3: ios=707976/28, merge=0/0, ticks=27870100/8380, in_queue=27878536, util=100.00%, aggrios=707976/12, aggrmerge=0/16, aggrticks=27870028/2908, aggrin_queue=27872968, aggrutil=100.00% xvdd: ios=707976/12, merge=0/16, ticks=27870028/2908, in_queue=27872968, util=100.00%

Testing with an already-full cache produces dramatically different results. We can see that a total of 2,672,148 (1,348,336+1,323,812) IOs were issued, with 1,967,684 (73.64%) coming from the cache. 707,976 (26.49%) came from the backing device. Although on the face of it that sounds like a good hit rate, it was poor enough to severely limit the average IOPS to 1484 across both jobs. We saw how much the difference was between 99.6% and 97.6% cache hit rate, and now we see what the difference is between ~99% and ~73%.

1,484 IOPS represents a 58% increase on the baseline HDD figure of 938 IOPS, so it’s nothing to be sniffed at, and as I say it may be the case that it only gets better over a much longer time span.

Even after this 30 minute test the performance of the lvmcache was still very slowly going up, so I decided to run another test across 3 hours instead of 30 minutes. The performance did continue to increase very, very slowly, peaking at about 1,640 aggregate IOPS after 3 hours. It seems like it is quite difficult to evict data from the cache in favour of new data that is being read, whereas when the cache was empty it promoted everything that was read immediately. Possibly if this was left for long enough there would eventually be enough cache hits to achieve similar performance to the first run.

bcache did not suffer from this problem: every test run against bcache was the same, except for the effect of preloading the cache device by messing with sequential_cutoff. I suspect that this is because bcache is a write cache; in writethrough or writeback mode, writes go through the cache and are immediately cached. lvmcache is not a write cache; it will only promote things when they are read.

It seems like lvmcache needs a very large number of reads before it will evict data from the cache and bring new data in.
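
One way to watch this behaviour is the raw dm-cache status line, which reports read/write hits and misses plus promotions and demotions; a sketch, assuming the device-mapper name that LVM generates for cachedvg/lv:

# Fields include metadata/cache block usage, read and write hits/misses,
# demotions, promotions and the dirty block count.
sudo dmsetup status cachedvg-lv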

Further tests?

If you have any further tests you’d like to suggest please leave a comment or a tweet. I think it would be good to come up with more fio tests for this, but bear in mind that it really needs to test random read IO with a repetitive element so that a cache can actually be effective. If you read every part of 100GiB equally often then there will be no hot spots and a comparatively small cache will not be effective, no matter what technology is used.

Also real-world benchmarks would be good, so perhaps a really complex SQL SELECT on a data set that won’t fit in (1GiB) memory but will fit in (5GiB) SSD cache.

It might be interesting to use a cache device much smaller than the working set. These tests laid out two 2GiB files then used a Zipf distribution to randomly read 10% of each (so two sets of ~205MiB data) while the cache devices were 4 or 5GiB in size. It is difficult to guess what the hot spots would be on a production file system, but a 90/10 or 80/20 distribution does seem reasonable.

What may be less reasonable is starting off with such a small data set. What would it be like to try to cache 1TiB with 10GiB of SSD? I’d rather not write out 1TiB of data to check, but I could simulate by writing out 51GiB (5%) and having a cache device that is only 512MiB. On a 90/10 distribution it would still want to cache ~5.2GiB and only have 512MiB to put it in.
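
A sketch of how that simulation might be set up, with the sizes being purely illustrative:

# Re-create the cache data LV at 512MiB instead of 4GiB...
sudo lvcreate -L512m -n lv_cache cachedvg /dev/xvdc
# ...then lay out ~51GiB of job files by raising size= in the fio job
# shown earlier (e.g. size=25g with numjobs=2) and re-run the workload.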

Conclusions

SSDs are lovely but they’re still really expensive. Although they are fun for turning out impressive benchmarks, in the real world you only need things to be fast enough. If you want consistently great performance and you have the money, buy some decent SSDs. If you don’t have the money then there is a whole world of trade-offs.

bcache, from what I can measure, provides really impressive performance, but I am concerned about relying upon it. The existing implementation is no longer the focus of the main developer's attention, as the developer has now moved on to bcachefs, a complete filesystem in its own right that builds upon the bcache concept.

I’m really conservative when it comes to filesystems and I know it can take years, even decades, to get to feature parity with all the nasty bugs hammered out. I also worry about continued maintenance of a separate filesystem.

lvmcache on the other hand is providing only a very modest performance boost. In many cases it won’t be worth it. I do like its simplicity though and I’m already very familiar with LVM.

The way that any logical volume can have a cache attached without needing to start over with the filesystem makes lvmcache very flexible; it also means that some volumes can be cached and others not. Furthermore with a slow PV and a fast PV both in the volume group, if necessary an individual logical volume can be allocated entirely from the fast PV.
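
That last point is just normal LVM allocation: name the fast PV when creating the volume, as the setup transcript above already does for the cache LVs. A sketch with an illustrative LV name and size:

# Allocate a small, uncached volume entirely from the SSD PV.
sudo lvcreate -L5g -n lv_fast cachedvg /dev/xvdc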

I may experiment with bcache some more in situations where I can quickly restore the data in event of calamity, or fairly easily move to a different solution if bcache becomes abandonware. I may experiment with lvmcache in places where a modest speedup would be welcome.


Steve Kemp: bind9 considered harmful

Mon, 10/07/2017 - 22:00

Recently there was another bind9 security update released by the Debian Security Team. I thought that was odd, so I've scanned my mailbox:

  • 11 January 2017
    • DSA-3758 - bind9
  • 26 February 2017
    • DSA-3795-1 - bind9
  • 14 May 2017
    • DSA-3854-1 - bind9
  • 8 July 2017
    • DSA-3904-1 - bind9

So in the year to date there have been 7 months, in 3 of them nothing happened, but in 4 of them we had bind9 updates. If these trends continue we'll have another 2.5 updates before the end of the year.

I don't run a nameserver. The only reason I have bind-packages on my system is for the dig utility.

Rewriting a compatible version of dig in Perl should be trivial, thanks to the Net::DNS::Resolver module:

These are about the only commands I ever run:

dig -t a steve.fi +short
dig -t aaaa steve.fi +short
dig -t a steve.fi @8.8.8.8

I should do that then. Yes.


Steve Kemp: I've never been more proud

Tue, 04/07/2017 - 22:00

This morning I remembered I had a beefy virtual server set up to run some kernel builds upon (when I was playing with Linux security modules), and I figured before I shut it down I should use the power to run some fuzzing.

As I was writing some code in Emacs at the time I figured "why not fuzz emacs?"

After a few hours this was the result:

deagol ~ $ perl -e 'print "`" x ( 1024 * 1024 * 12);' > t.el
deagol ~ $ /usr/bin/emacs --batch --script ./t.el
..
..
Segmentation fault (core dumped)

Yup, evaluating a Lisp file caused a segfault, due to a stack-overflow (no security implications). I've never been more proud, even though I have contributed code to GNU Emacs in the past.


Debian Bits: New Debian Developers and Maintainers (May and June 2017)

Sun, 02/07/2017 - 13:30

The following contributors got their Debian Developer accounts in the last two months:

  • Alex Muntada (alexm)
  • Ilias Tsitsimpis (iliastsi)
  • Daniel Lenharo de Souza (lenharo)
  • Shih-Yuan Lee (fourdollars)
  • Roger Shimizu (rosh)

The following contributors were added as Debian Maintainers in the last two months:

  • James Valleroy
  • Ryan Tandy
  • Martin Kepplinger
  • Jean Baptiste Favre
  • Ana Cristina Custura
  • Unit 193

Congratulations!


Steve Kemp: Yet more linux security module craziness ..

Wed, 28/06/2017 - 22:00

I've recently been looking at linux security modules. My first two experiments helped me learn:

My First module - whitelist_lsm.c

This looked for the presence of an xattr, and if present allowed execution of binaries.

I learned about the Kernel build-system, and how to write a simple LSM.
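
Setting such a flag from user-space is just an xattr write; a sketch with a hypothetical attribute name (the module's real name may differ):

# Mark a binary as allowed, then read the attributes back to confirm.
sudo setfattr -n user.whitelisted -v 1 /usr/bin/uptime
getfattr -d /usr/bin/uptime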

My second module - hashcheck_lsm.c

This looked for the presence of a "known-good" SHA1 hash xattr, and if it matched the actual hash of the file on-disk allowed execution.

I learned how to hash the contents of a file, from kernel-space.
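
Populating the known-good hash would look something like this; again the attribute name is hypothetical:

# Store the current SHA1 of the binary as its expected hash.
hash=$(sha1sum /usr/bin/uptime | awk '{ print $1 }')
sudo setfattr -n user.sha1 -v "$hash" /usr/bin/uptime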

Both allowed me to learn things, but both were a little pointless. They were not fine-grained enough to allow different things to be done by different users. (i.e. If you allowed "alice" to run "wget" you'd also allow www-data to do the same.)

So, assuming you wanted to do your security job more neatly what would you want? You'd want to allow/deny execution of commands based upon:

  • The user who was invoking them.
  • The path of the binary itself.

So your local users could run "bad" commands, but "www-data" (post-compromise) couldn't.

Obviously you don't want to have to recompile your kernel to change the rules of who can execute what. So you think to yourself "I'll write those rules down in a file". But of course reading a file from kernel-space is tricky. And parsing any list of rules, in a file, from kernel-space would be prone to buffer-related problems.

So I had a crazy idea:

  • When a user attempts to execute a program.
  • Call back to user-space to see if that should be permitted.
    • Give the user-space binary the UID of the invoker, and the path to the command they're trying to execute.

Calling userspace? Every time a command is to be executed? Crazy. But it just might work.

One problem I had with this approach is that userspace might not even be available when you're booting. So I set up a flag to enable this stuff:

# echo 1 >/proc/sys/kernel/can-exec/enabled
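As an aside, a toggle like that is exactly what the kernel's sysctl interface is for. The snippet below is only an illustrative sketch of how such a knob could be registered - the names are invented and this is not the author's code:

#include <linux/init.h>
#include <linux/printk.h>
#include <linux/sysctl.h>

/* 0 = permit everything (safe default during boot), 1 = consult user-space. */
static int can_exec_enabled;

static struct ctl_table can_exec_table[] = {
    {
        .procname     = "enabled",
        .data         = &can_exec_enabled,
        .maxlen       = sizeof(int),
        .mode         = 0644,
        .proc_handler = proc_dointvec,
    },
    { }
};

static struct ctl_table_header *can_exec_sysctl;

/* Called from the module's init path; creates /proc/sys/kernel/can-exec/enabled. */
static void __init can_exec_sysctl_init(void)
{
    can_exec_sysctl = register_sysctl("kernel/can-exec", can_exec_table);
    if (!can_exec_sysctl)
        pr_warn("can-exec: failed to register sysctl\n");
}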

Now the kernel will invoke the following on every command:

/sbin/can-exec $UID $PATH

Because the kernel waits for this command to complete - as it reads the exit-code - you cannot execute any child-processes from it as you'd end up in recursive hell, but you can certainly read files, write to syslog, etc. My initial implementation was as basic as this:

int main( int argc, char *argv[] )
{
    ...

    // Get the UID + Program
    int uid = atoi( argv[1] );
    char *prg = argv[2];

    // syslog
    openlog ("can-exec", LOG_CONS | LOG_PID | LOG_NDELAY, LOG_LOCAL1);
    syslog (LOG_NOTICE, "UID:%d CMD:%s", uid, prg );

    // root can do all.
    if ( uid == 0 )
        return 0;

    // nobody
    if ( uid == 65534 )
    {
        if ( ( strcmp( prg , "/bin/sh" ) == 0 ) ||
             ( strcmp( prg , "/usr/bin/id" ) == 0 ) )
        {
            fprintf(stderr, "Allowing 'nobody' access to shell/id\n" );
            return 0;
        }
    }

    fprintf(stderr, "Denied\n" );
    return -1;
}

Although the UIDs are hard-coded it actually worked! Yay!
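For context, the kernel side of that call-out can be built on call_usermodehelper(), which runs a user-space program and - with UMH_WAIT_PROC - waits for its exit status. The snippet below is only an illustrative sketch of that pattern, not the author's module:

#include <linux/binfmts.h>
#include <linux/cred.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/kmod.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

/* Ask /sbin/can-exec whether the current user may run bprm->filename. */
static int can_exec_ask_userspace(struct linux_binprm *bprm)
{
    char uid_str[16];
    char *argv[] = { "/sbin/can-exec", uid_str, (char *)bprm->filename, NULL };
    char *envp[] = { "HOME=/", "PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };
    int ret;

    snprintf(uid_str, sizeof(uid_str), "%d",
             from_kuid(&init_user_ns, current_uid()));

    /* UMH_WAIT_PROC blocks until the helper exits; an exit code of zero
     * (returned here as 0) means "allow", anything else means "deny". */
    ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);

    return ret == 0 ? 0 : -EPERM;
}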

I updated the code to convert the UID to a username, then check executables via the file /etc/can-exec/$USERNAME.conf, and this also worked.
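Expressed as code, that per-user lookup in the helper is only a handful of lines. This sketch assumes a format of one absolute path per line in /etc/can-exec/$USERNAME.conf; the function name and details are illustrative, not the author's implementation:

#include <pwd.h>
#include <stdio.h>
#include <string.h>

/* Return 0 if prg is listed in /etc/can-exec/$USERNAME.conf, -1 otherwise. */
int allowed_for_uid(int uid, const char *prg)
{
    struct passwd *pw = getpwuid(uid);
    char path[256], line[1024];
    FILE *fh;
    int ok = -1;

    if (pw == NULL)
        return -1;

    snprintf(path, sizeof(path), "/etc/can-exec/%s.conf", pw->pw_name);

    fh = fopen(path, "r");
    if (fh == NULL)
        return -1;

    while (fgets(line, sizeof(line), fh) != NULL)
    {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the trailing newline */
        if (strcmp(line, prg) == 0)
        {
            ok = 0;
            break;
        }
    }

    fclose(fh);
    return ok;
}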

I don't expect anybody to actually use this code, but I do think I've reached a point where I can pretend I've written a useful (or non-pointless) LSM at last. That means I can stop.

Categories: LUG Community Blogs

Steve Kemp: Linux security modules, round two.

Sat, 24/06/2017 - 22:00

So recently I wrote a Linux Security Module (LSM) which would deny execution of commands, unless an extended attribute existed upon the filesystem belonging to the executables.

The whitelist-LSM worked well, but it soon became apparent that it was a little pointless. Most security changes are pointless unless you define what you're defending against - your "threat model".

In my case it was written largely as a learning experience, but also because I figured it could be useful. However it wasn't actually that useful, because you soon realize that you have to whitelist too much:

  • The redis-server binary must be executable by the redis user, otherwise it won't run.
  • /usr/bin/git must be executable by the git user.

In short there comes a point where user alice must run executable blah. If alice can run it, then so can mallory. At which point you realize the exercise is not so useful.

Taking a step back I realized that what I wanted to prevent was the execution of unknown, unexpected, and malicious binaries. How do you identify known-good binaries? Well, hashes & checksums are good. So for my second attempt I figured I'd not look for a mere "flag" on a binary, but instead look for a valid hash.

Now my second LSM is invoked for every binary that is executed by a user:

  • When a binary is executed the SHA1 hash of the file's contents is calculated.
  • If that matches the value stored in an extended attribute the execution is permitted.
    • If the extended-attribute is missing, or the checksum doesn't match, then the execution is denied.

In practice this is the same behaviour as the previous LSM - a binary is either executable, because there is a good hash, or it is not, because the hash is missing or bogus. If somebody deploys a binary rootkit this will definitely stop it from executing, but of course there is a huge hole - scripting-languages:

  • If /usr/bin/perl is whitelisted then /usr/bin/perl /tmp/exploit.pl will succeed.
  • If /usr/bin/python is whitelisted then the same applies.

Despite that, the project was worthwhile: I can clearly describe what it is designed to achieve ("Deny the execution of unknown binaries", and "Deny binaries that have been modified"), and I learned how to hash a file from kernel-space - which was surprisingly simple.
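For the curious, hashing a file from kernel-space boils down to the crypto shash API plus kernel_read(). The sketch below is only an illustration of that pattern and is not the author's module; note that the kernel_read() call shown here uses the modern four-argument form, which differs on older kernels:

#include <crypto/hash.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/slab.h>

/* Compute the SHA1 digest (20 bytes) of an open file's contents. */
static int sha1_of_file(struct file *file, u8 digest[20])
{
    struct crypto_shash *tfm;
    struct shash_desc *desc;
    void *buf;
    loff_t pos = 0;
    ssize_t n = 0;
    int ret = -ENOMEM;

    tfm = crypto_alloc_shash("sha1", 0, 0);
    if (IS_ERR(tfm))
        return PTR_ERR(tfm);

    desc = kzalloc(sizeof(*desc) + crypto_shash_descsize(tfm), GFP_KERNEL);
    buf  = kmalloc(PAGE_SIZE, GFP_KERNEL);
    if (!desc || !buf)
        goto out;

    desc->tfm = tfm;
    ret = crypto_shash_init(desc);

    /* Feed the file to the hash a page at a time. */
    while (ret == 0 && (n = kernel_read(file, buf, PAGE_SIZE, &pos)) > 0)
        ret = crypto_shash_update(desc, buf, n);

    if (ret == 0 && n < 0)
        ret = n;                        /* propagate a read error */
    if (ret == 0)
        ret = crypto_shash_final(desc, digest);
out:
    kfree(buf);
    kfree(desc);
    crypto_free_shash(tfm);
    return ret;
}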

(Yes I know about IMA and EVM - this was a simple project for learning purposes. Public-key signatures will be something I'll look at next/soon/later. :)

Perhaps the only other thing to explore is the complexity in allowing/denying actions based on the user - in a human-readable fashion, not via UIDs. So www-data can execute some programs, alice can run a different set of binaries, and git can only run /usr/bin/git.

Of course down that path lies apparmour, selinux, and madness..

Categories: LUG Community Blogs

Debian Bits: Hewlett Packard Enterprise Platinum Sponsor of DebConf17

Fri, 23/06/2017 - 15:15

We are very pleased to announce that Hewlett Packard Enterprise (HPE) has committed support to DebConf17 as a Platinum sponsor.

"Hewlett Packard Enterprise is excited to support Debian's annual developer conference again this year", said Steve Geary, Senior Director R&D at Hewlett Packard Enterprise. "As Platinum sponsors and member of the Debian community, HPE is committed to supporting Debconf. The conference, community and open distribution are foundational to the development of The Machine research program and will our bring our Memory Driven Computing agenda to life."

HPE is one of the largest computer companies in the world, providing a wide range of products and services, such as servers, storage, networking, consulting and support, software, and financial services.

HPE is also a development partner of Debian, and provides hardware for port development, Debian mirrors, and other Debian services (hardware donations are listed in the Debian machines page).

With this additional commitment as Platinum Sponsor, HPE helps make our annual conference possible, and directly supports the progress of Debian and Free Software, helping to strengthen the community that continues to collaborate on Debian projects throughout the rest of the year.

Thank you very much Hewlett Packard Enterprise, for your support of DebConf17!

Become a sponsor too!

DebConf17 is still accepting sponsors. Interested companies and organizations may contact the DebConf team through sponsors@debconf.org, and visit the DebConf17 website at https://debconf17.debconf.org.

Categories: LUG Community Blogs

Debian Bits: Debian 9.0 Stretch has been released!

Sun, 18/06/2017 - 07:25

Let yourself be embraced by the purple rubber toy octopus! We're happy to announce the release of Debian 9.0, codenamed Stretch.

Want to install it? Choose your favourite installation media among Blu-ray Discs, DVDs, CDs and USB sticks. Then read the installation manual.

Already a happy Debian user and you only want to upgrade? You can easily upgrade from your current Debian 8 Jessie installation, please read the release notes.

Do you want to celebrate the release? Share the banner from this blog in your blog or your website!

Categories: LUG Community Blogs

Debian Bits: Upcoming Debian 9.0 Stretch!

Fri, 16/06/2017 - 23:30

The Debian Release Team, in coordination with several other teams, is preparing the last bits needed to release Debian 9 Stretch. Please, be patient! Lots of steps are involved and some of them take some time, such as building the images, propagating the release through the mirror network, and rebuilding the Debian website so that "stable" points to Debian 9.

Follow the live coverage of the release on https://micronews.debian.org or the @debian profile in your favorite social network! We'll spread the word about what's new in Debian 9, how the release process is progressing over the weekend, and facts about Debian and the wide community of volunteer contributors that make it possible.

Categories: LUG Community Blogs

Steve Kemp: Porting pfctl to Linux

Wed, 14/06/2017 - 22:00

If you have a bunch of machines running OpenBSD for firewalling purposes, which is pretty standard, you might start to use source-control to maintain the rulesets. You might go further, and use some kind of integration testing to deploy changes from your revision control system into production.

Of course before you deploy any pf.conf file you need to test that the file contents are valid/correct. If your integration system doesn't run on OpenBSD though you have a couple of choices:

  • Run a test-job that SSH's to the live systems, and tests syntax.
    • Via pfctl -n -f /path/to/rules/pf.conf.
  • Write a tool on your Linux hosts to parse and validate the rules.

I looked at this last year and got pretty far, but then got distracted. So the other day I picked it up again. It turns out that if you're patient it's not hard to use bison to generate some C code, then glue it together such that you can validate your firewall rules on a Linux system.

deagol ~/pf.ctl $ ./pfctl ./pf.conf
./pf.conf:298: macro 'undefined_variable' not defined
./pf.conf:298: syntax error

Unfortunately I had to remove quite a lot of code to get the tool to compile, which means that while some failures, like the one above, are caught, others are missed. The example above reads:

vlans="{vlan1,vlan2}"
..
pass out on $vlans proto udp from $undefined_variable

Unfortunately the following line does not raise an error:

pass out on vlan12 inet proto tcp from <unknown> to $http_server port {80,443}

That comes about because looking up the value of the table named unknown just silently fails. In slowly removing more and more code to make it compile I lost the ability to keep track of table definitions - both their names and their values. Thus the fetching of a table by name has become a NOP, and a bogus name results in no error.

Now it is possible, with more care, that you could use a hashtable library, or similar, to simulate these things. But I kinda stalled, again.

(Similar things happen with fetching a proto by name, I just hardcoded inet, gre, icmp, icmp6, etc. Things that I'd actually use.)

Might be a fun project for somebody with some time anyway! Download the OpenBSD source, e.g. from a github mirror - yeah, yeah, but still. CVS? No thanks! - Then poke around beneath sbin/pfctl/. The main file you'll want to grab is parse.y, although you'll need to setup a bunch of headers too, and write yourself a Makefile. Here's a hint:

deagol ~/pf.ctl $ tree
.
├── inc
│   ├── net
│   │   └── pfvar.h
│   ├── queue.h
│   └── sys
│       ├── _null.h
│       ├── refcnt.h
│       └── tree.h
├── Makefile
├── parse.y
├── pf.conf
├── pfctl.h
├── pfctl_parser.h
└── y.tab.c

3 directories, 11 files

Categories: LUG Community Blogs

Steve Kemp: So I accidentally wrote a linux security module

Thu, 01/06/2017 - 22:00

Tonight I read this week's LWN quotes-page a little later than usual because I was busy at work for most of the day. Anyway as always LWN's content was awesome, and this particular list led to an interesting discussion about a new Linux-Security-Module (LSM).

That read weirdly; what I was trying to say was that every Thursday morning I like to read LWN at work. Tonight was the first chance I had to get round to it.

One of the later replies in the thread was particularly interesting as it said:

Suggestion: Create an security module that looks for the attribute security.WHITELISTED on things being executed/mmapped and denys it if the attribute isn't present. Create a program (whitelistd) that reads /etc/whitelist.conf and scans the system to ensure that only things on the list have the attribute.

So I figured that was a simple idea, and it didn't seem too hard even for myself as a non-kernel non-developer. There are several linux security modules included in the kernel-releases, beneath the top-level security/ directory, so I assumed I could copy & paste code around them to get something working.

During the course of all this work, which took about 90 minutes from start to Finnish (that pun never gets old), this online documentation was enormously useful:

Brief attr primer

If you're not familiar with the attr tool it's pretty simple. You can assign values to arbitrary labels on files. The only annoying thing is you have to use extra-flags to commands like rsync, tar, cp, etc, to preserve the damn things.

Set three attributes on the file named moi:

$ touch moi
$ attr -s forename -V "Steve" moi
$ attr -s surname -V "Kemp" moi
$ attr -s name -V "Steve Kemp" moi

Now list the attributes present:

$ attr -l moi
Attribute "name" has a 10 byte value for moi
Attribute "forename" has a 5 byte value for moi
Attribute "surname" has a 4 byte value for moi

And retrieve one?

$ attr -q -g name moi
Steve Kemp

LSM Skeleton

My initial starting point was to create "steve_lsm.c", with the following contents:

#include <linux/lsm_hooks.h>

/*
 * Log things for the moment.
 */
static int steve_bprm_check_security(struct linux_binprm *bprm)
{
    printk(KERN_INFO "STEVE LSM check of %s\n", bprm->filename);
    return 0;
}

/*
 * Only check exec().
 */
static struct security_hook_list steve_hooks[] = {
    LSM_HOOK_INIT(bprm_check_security, steve_bprm_check_security),
};

/*
 * Somebody set us up the bomb.
 */
static void __init steve_init(void)
{
    security_add_hooks(steve_hooks, ARRAY_SIZE(steve_hooks), "steve");
    printk(KERN_INFO "STEVE LSM initialized\n");
}

With that in place I had to modify the various KBuild files beneath security/ to make sure this could be selected as an LSM, and add in a Makefile to the new directory security/steve/.

With the boiler-plate done though, and the host machine rebooted into my new kernel it was simple to test things out.

Obviously the first step, post-boot, is to make sure that the module is active, which can be done in two ways, looking at the output of dmesg, and explicitly listing the modules available:

~# dmesg | grep STEVE | head -n2
STEVE LSM initialized
STEVE LSM check of /init

$ echo $(cat /sys/kernel/security/lsm)
capability,steve

Making the LSM functional

The next step was to make the module do more than mere logging. In short this is what we want:

  • If a binary is invoked by root - allow it.
    • Although note that this might leave a hole, if the user can enter a new namespace where their UID is 0..
  • If a binary is invoked by a non-root user look for an extended attribute on the target-file named security.WHITELISTED.
    • If this is present we allow the execution.
    • If this is missing we deny the execution.

NOTE we don't care what the content of the extended attribute is, we just care whether it exists or not.

Reading the extended attribute is pretty simple, using the __vfs_getxattr function. All in all our module becomes this:

#include <linux/xattr.h>
#include <linux/binfmts.h>
#include <linux/lsm_hooks.h>
#include <linux/sysctl.h>
#include <linux/ptrace.h>
#include <linux/prctl.h>
#include <linux/ratelimit.h>
#include <linux/workqueue.h>
#include <linux/string_helpers.h>
#include <linux/task_work.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/lsm_hooks.h>

/*
 * Perform a check of a program execution/map.
 *
 * Return 0 if it should be allowed, -EPERM on block.
 */
static int steve_bprm_check_security(struct linux_binprm *bprm)
{
    // The current task & the UID it is running as.
    const struct task_struct *task = current;
    kuid_t uid = task->cred->uid;

    // The target we're checking
    struct dentry *dentry = bprm->file->f_path.dentry;
    struct inode *inode = d_backing_inode(dentry);

    // The size of the label-value (if any).
    int size = 0;

    // Root can access everything.
    if ( uid.val == 0 )
        return 0;

    size = __vfs_getxattr(dentry, inode, "user.whitelisted", NULL, 0);
    if ( size >= 0 )
    {
        printk(KERN_INFO "STEVE LSM check of %s resulted in %d bytes from 'user.whitelisted' - permitting access for UID %d\n",
               bprm->filename, size, uid.val );
        return 0;
    }

    printk(KERN_INFO "STEVE LSM check of %s denying access for UID %d [ERRO:%d] \n",
           bprm->filename, uid.val, size );
    return -EPERM;
}

/*
 * The hooks we wish to be installed.
 */
static struct security_hook_list steve_hooks[] = {
    LSM_HOOK_INIT(bprm_check_security, steve_bprm_check_security),
};

/*
 * Initialize our module.
 */
void __init steve_add_hooks(void)
{
    /* register ourselves with the security framework */
    security_add_hooks(steve_hooks, ARRAY_SIZE(steve_hooks), "steve");
    printk(KERN_INFO "STEVE LSM initialized\n");
}

Once again we reboot with this new kernel, and we test that the LSM is active. After the basic testing, as before, we can now test real functionality. By default no binaries will have the attribute we look for present - so we'd expect ALL commands to fail, unless executed by root. Let us test that:

~# su - nobody -s /bin/sh
No directory, logging in with HOME=/
Cannot execute /bin/sh: Operation not permitted

That looks like it worked. Let us allow users to run /bin/sh:

~# attr -s whitelisted -V 1 /bin/sh

Unfortunately that fails, because symlinks are weird, but repeating the test with /bin/dash works as expected:

~# su - nobody -s /bin/dash
No directory, logging in with HOME=/
Cannot execute /bin/dash: Operation not permitted

~# attr -s whitelisted -V 1 /bin/dash
~# attr -s whitelisted -V 1 /usr/bin/id

~# su - nobody -s /bin/dash
No directory, logging in with HOME=/
$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
$ uptime
-su: 2: uptime: Operation not permitted

And our logging shows the useful results as we'd expect:

STEVE LSM check of /usr/bin/id resulted in 1 bytes from 'user.WHITELISTED' - permitting access for UID 65534
STEVE LSM check of /usr/bin/uptime denying access for UID 65534 [ERRO:-95]

Surprises

If you were paying careful attention you'll see that we changed what we did part-way through this guide.

  • The initial suggestion said to look for security.WHITELISTED.
  • But in the kernel module I look for user.whitelisted.
    • And when setting the attribute I only set whitelisted.

Not sure what is going on there, but it was very confusing. It appears to be the case that when you set an attribute a secret user. prefix is added to the name.

Could be worth some research by somebody with more time on their hands than I have.
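One quick way to check what actually gets stored is to list the raw attribute names with listxattr(2). Here's a minimal user-space sketch (not from the original post) that prints them; run against the moi file from the primer above it would show user.forename, user.surname and user.name:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(int argc, char *argv[])
{
    char buf[1024];
    ssize_t len, i;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        return 1;
    }

    len = listxattr(argv[1], buf, sizeof(buf));
    if (len < 0)
    {
        perror("listxattr");
        return 1;
    }

    /* The buffer holds a sequence of NUL-terminated names, e.g.
     * "user.whitelisted" for an attribute set with attr -s whitelisted. */
    for (i = 0; i < len; i += strlen(&buf[i]) + 1)
        printf("%s\n", &buf[i]);

    return 0;
}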

Anyway I don't expect this is a terribly useful module, but it was my first, and I think it should be pretty stable. Feedback on my code certainly welcome!

Categories: LUG Community Blogs

Steve Kemp: Security is hard ..

Mon, 29/05/2017 - 22:00
3D-Printing

I continue to be impressed with the local vendors found on 3dhubs. I've had several more things printed out, including an "internet button", and some card-holders for Settlers of Catan.

The most recent print I had made was a collection of display cases, for holding an OLED display, as well as an ESP8266 device.

Unfortunately at the same time as I was falling in love with the service I discovered a glaring XSS attack against the site itself.

Anybody who viewed my profile page could have arbitrary javascript executed, which in some cases would actually disclose their private details - such as:

  • Their forename & surname.
  • Their email-address.
  • Their telephone number.
  • Their GeoIP details.

Discovering this took minutes, writing it up took an hour, and a month later it is still unfixed.

I've deleted my account.

Categories: LUG Community Blogs

Steve Kemp: Getting ready for Stretch

Wed, 24/05/2017 - 22:00

I run about 17 servers. Of those about six are very personal and the rest are a small cluster which is used for a single website. (Partly because the code is old and in some ways a bit badly designed, partly because "clustering!", "high availability!", "learning!", "fun!" - seriously I had a lot of fun putting together a fault-tolerant deployment with haproxy, ucarp, etc, etc. If I were paying for it the site would be both retired and static!)

I've started the process of upgrading to stretch by picking a bunch of hosts that do things I could live without for a few days - in case there were big problems, or I needed to restore from backups.

So far I've upgraded:

  • master.steve
    • This is a puppet-master, so while it is important killing it wouldn't be too bad - after all my nodes are currently setup properly, right?
    • Upgrading this host changed the puppet-server from 3.x to 4.x.
    • That meant I had to upgrade all my client-systems, because puppet 3.x won't talk to a 4.x master.
    • Happily jessie-backports contains a recent puppet-client.
    • It also meant I had to rework a lot of my recipes, in small ways.
  • builder.steve
    • This is a host I use to build packages upon, via pbuilder.
    • I have chroots setup for wheezy, jessie, and stretch, each in i386 and amd64 flavours.
  • git.steve
    • This is a host which stores my git-repositories, via gitbucket.
    • While it is an important host in terms of functionality, the software it needs is very basic: nginx proxies to a java application which runs on localhost:XXXX, with some caching magic happening to deal with abusive clients.
    • I do keep considering using gitlab, because I like its runners, etc. But that is pretty resource intensive.
    • On the other hand, if I did switch I could drop my builder.steve host, which might mean I'd come out ahead in terms of resources used.
  • leave.steve
    • Torrent-box.
    • Upgrading was painless, I only run rtorrent, and a simple object storage system of my own devising.

All upgrades were painless, with only one real surprise - the attic-backup software was removed from Debian.

Although I do intend to retry using Lars' excellent obnam in the near future, pragmatically I wanted to stick with what I'm familiar with. Borg backup is a fork of attic I've been aware of for a long time, but I never quite had a reason to try it out. Setting it up pretty much just meant editing my backup-script:

s/attic/borg/g

Once I did that, and created some new destinations, all was good:

borg@rsync.io ~ $ borg init /backups/git.steve.org.uk.borg/
borg@rsync.io ~ $ borg init /backups/master.steve.org.uk.borg/
borg@rsync.io ~ $ ..

Upgrading other hosts, for example my website(s), and my email-box, will be more complex and fiddly. On that basis they will definitely wait for the formal stretch release.

But having a couple of hosts running the frozen distribution is good for testing, and to let me see what is new.

Categories: LUG Community Blogs

Steve Kemp: Some minor updates ..

Sun, 14/05/2017 - 22:00

The past few weeks have been randomly busy, nothing huge has happened, but several minor diversions.

Coding

I made a new release of my console-based mail-client, with integrated Lua scripting; this is available for download over at https://lumail.org/

I've also given a talk (!!) on using a literate/markdown configuration for GNU Emacs. In brief I created two files:

~/.emacs/init.md

This contains both my configuration of GNU Emacs and documentation for the same. Neat.

~/.emacs/init.el

This parses the previous file, specifically looking for "code blocks", which are then extracted and evaluated.

This system is easy to maintain, and I'm quite happy with it :)

Fuzzing

Somebody nice took the time to report a couple of bugs against my simple bytecode-interpreting virtual-machine project - all found via fuzzing.

I've done some fun fuzzing of my own in the past, so this was nice to see. I've now resolved those bugs, and updated the README.md file to include instructions on fuzzing it. (Which I started doing myself, after receiving the first of the reports.)

Finally I have more personal news too: I had a pair of CT-scans carried out recently, and apparently here in sunny Finland (that's me being ironic, it was snowing in the first week of May) when you undergo a CT-scan you can pay to obtain your data on CD-ROM.

I'm 100% definitely going to get a copy of my brain-scan data. I'll be able to view a 3d-rendered model of my own brain on my desktop. (Once upon a time I worked for a company that produced software, sold to doctors/surgeons, for creating 3d-rendered volumes from individual slices. I confirmed with the radiologist that handled my tests that they do indeed use the standard DICOM format. Small world.)

Categories: LUG Community Blogs

Debian Bits: New Debian Developers and Maintainers (March and April 2017)

Sun, 14/05/2017 - 13:39

The following contributors got their Debian Developer accounts in the last two months:

  • Guilhem Moulin (guilhem)
  • Lisa Baron (jeffity)
  • Punit Agrawal (punit)

The following contributors were added as Debian Maintainers in the last two months:

  • Sebastien Jodogne
  • Félix Lechner
  • Uli Scholler
  • Aurélien Couderc
  • Ondřej Kobližek
  • Patricio Paez

Congratulations!

Categories: LUG Community Blogs

Debian Bits: Bursary applications for DebConf17 are closing in 48 hours!

Mon, 08/05/2017 - 21:30

This is a final reminder: if you intend to apply for a DebConf17 bursary and have not yet done so, please proceed as soon as possible.

Bursary applications for DebConf17 will be accepted until May 10th at 23:59 UTC. Applications submitted after this deadline will not be considered.

You can apply for a bursary when you register for the conference.

Remember that giving a talk counts towards your bursary; if you have a submission to make, submit it even if it is only sketched out. You will be able to detail it later.

Please make sure to double-check your accommodation choices (dates and venue). Details about accommodation arrangements can be found on the wiki.

Note: For DebCamp we only have on-site accommodation available. The option chosen in the registration system will only be for the DebConf period (August 5 to 12).

See you in Montréal!

Categories: LUG Community Blogs

Steve Kemp: 3d-Printing is cool

Wed, 19/04/2017 - 22:00

I've heard about 3d-printing a lot in the past, although the hype seems to have mostly died down. My view has always been "That seems cool", coupled with "Everybody says making the models is very hard", and "the process itself is fiddly & time-consuming".

I've been sporadically working on a project for a few months now which displays tram-departure times; this is part of my drive to "hardware" things with Arduino/ESP8266 devices. Most visitors to our flat have commented on it, at least once, and over time it has become gradually more and more user-friendly. Initially it was just a toy-project for myself, so everything was hard-coded in the source, but over time that changed - which I mentioned here (specifically the Access-point setup):

  • When it boots up, unconfigured, it starts as an access-point.
    • So you can connect and configure the WiFi network it should join.
  • Once it's up and running you can point a web-browser at it.
    • This lets you toggle the backlight, change the timezone, and the tram-stop.
    • These values are persisted to flash so reboots will remember everything.

I've now wired up an input-button to the device too, experimenting with the different ways that a single button can carry out multiple actions:

  • Press & release - toggle the backlight.
  • Press & release twice - a double-click if you like - show a message.
  • Press, hold for 1 second, then release - re-sync the date/time & tram-data.

Anyway the software is neat, and I can't think of anything obvious to change. So lets move onto the real topic of this post: 3D Printing.

I randomly remembered that I'd heard about an online site holding 3D-models, and on a whim I searched for "4x20 LCD". That led me to this design, which is exactly what I was looking for. Just like open-source software we're now living in a world where you can get open-source hardware! How cool is that?

I had to trust the dimensions of the model, and obviously I was going to mount my new button into the box, rather than the knob shown. But having a model was great. I could download it, for free, and I could view it online at viewstl.com.

But with a model obtained the next step was getting it printed. I found a bunch of commercial companies, here in Europe, who would print a model and ship it to me, but when I uploaded the model they priced it at €90+. Too much. I'd almost lost interest when I stumbled across a site which provides a gateway into a series of individuals/companies who will print things for you, on-demand: 3dhubs.

Once again I uploaded my model, and this time I was able to select a guy in the same city as me. He printed my model for 1/3-1/4 of the price of the companies I'd found, and sent me fun pictures of the object while it was in the process of being printed.

To recap I started like this:

Then I boxed it in cardboard which looked better than nothing, but still not terribly great:

Now I've found an online case-design for free, got it printed cheaply by a volunteer (feels like the wrong word, after all I did pay him), and I have something which looks significantly more professional:

Inside it looks as neat as you would expect:

Of course the case still cost 5 times as much as the actual hardware involved (button: €0.05, processor-board €2.00 and LCD I2C display €3.00). But I've gone from being somebody who had zero experience with hardware-based projects 4 months ago, to somebody who has built a project which is functional and "pretty".

The internet really is a glorious thing. Using it for learning, and coding is good, using it for building actual physical parts too? That's something I never could have predicted a few years ago and I can see myself doing it more in the future.

Sure the case is a little rough around the edges, but I suspect it is now only a matter of time until I learn how to design my own models. An obvious extension is to add a status-LED above the switch, for example. How hard can it be to add a new hole to a model? (Hell I could just drill it!)

Categories: LUG Community Blogs

Debian Bits: Call for Proposals for DebConf17 Open Day

Tue, 18/04/2017 - 08:00

The DebConf team would like to call for proposals for the DebConf17 Open Day, a whole day dedicated to sessions about Debian and Free Software, and aimed at the general public. Open Day will precede DebConf17 and will be held in Montreal, Canada, on August 5th 2017.

DebConf Open Day will be a great opportunity for users, developers and people simply curious about our work to meet and learn about the Debian Project, Free Software in general and related topics.

Submit your proposal

We welcome submissions of workshops, presentations or any other activity which involves Debian and Free Software. Activities in both English and French are accepted.

Here are some ideas about content we'd love to offer during Open Day. This list is not exhaustive, feel free to propose other ideas!

  • An introduction to various aspects of the Debian Project
  • Talks about Debian and Free Software in art, education and/or research
  • A primer on contributing to Free Software projects
  • Free software & Privacy/Surveillance
  • An introduction to programming and/or hardware tinkering
  • A workshop about your favorite piece of Free Software
  • A presentation about your favorite Free Software-related project (user group, advocacy group, etc.)

To submit your proposal, please fill the form at https://debconf17.debconf.org/talks/new/

Volunteer

We need volunteers to help ensure Open Day is a success! We are specifically looking for people familiar with the Debian installer to attend the Debian installfest, as resources for people seeking help to install Debian on their devices. If you're interested, please add your name to our wiki: https://wiki.debconf.org/wiki/DebConf17/OpenDay#Installfest

Attend

Participation in Open Day is free and no registration is required.

The schedule for Open Day will be announced in June 2017.

Categories: LUG Community Blogs

Debian Bits: DPL elections 2017, congratulations Chris Lamb!

Sun, 16/04/2017 - 17:40

The Debian Project Leader elections finished yesterday and the winner is Chris Lamb!

Of a total of 1062 developers, 322 developers voted using the Condorcet method.

More information about the result is available in the Debian Project Leader Elections 2017 page.

The current Debian Project Leader, Mehdi Dogguy, congratulated Chris Lamb in his Final bits from the (outgoing) DPL message. Thanks, Mehdi, for your service as DPL during these last twelve months!

The new term for the project leader starts on April 17th and expires on April 16th 2018.

Categories: LUG Community Blogs

Debian Bits: Unknown parallel universe uses Debian

Sat, 01/04/2017 - 14:30

This post was an April Fools' Day joke.

The space agencies running the International Space Station (ISS) reported that a laptop accidentally thrown into space as waste from the ISS in 2013 may have connected with a parallel universe. The laptop was running Debian 6 and the ISS engineers managed to track its travels through outer space. In early January the laptop's signal was lost, but it was recovered two weeks later in the same place. ISS engineers suspect that the laptop may have met and crossed a wormhole, arriving in a parallel universe from where "somebody" sent it back later.

Eventually the laptop was recovered, and in a first analysis the ISS engineers found that it has a dual boot: one partition running the Debian installation made by them, and a second partition running what seems to be a Debian fork or derivative totally unknown until now.

The engineers have been in contact with the Debian Project over the last weeks, and a Debian group formed with delegates from different Debian teams has begun to study this new Debian derivative system. From the early results of this research, we can proudly say that somebody (or a group of beings) in a parallel universe understands Earth computers, and Debian, well enough to:

  • Clone the existing Debian system in a new partition and provide a dual boot using Grub.
  • Change the desktop wallpaper from the previous Spacefun theme to one in rainbow colors.
  • Fork all the packages whose source code was present in the initial Debian system, patch multiple bugs in those packages, and add further patches for some tricky security problems.
  • Add ten new language locales that do not correspond to any language spoken on Earth, with full translations for four of them.
  • A copy of the Debian website repository, migrated to the git version control system and perfectly running, has been found in the /home/earth0/Documents folder. This new repo includes code to show the Debian micronews on the home page and many other improvements, keeping the style of not needing JavaScript and providing a nice control of up-to-date/outdated translations, similar to the one existing in Debian.

The work towards getting to know this new universe better, and finding a way to communicate with its inhabitants, has just begun; all Debian users and contributors are invited to join the effort to study the operating system that was found. We want to prepare our community and our universe to live and work peacefully and respectfully with the parallel universe's communities, in the true spirit of Free Software.

In the following weeks a General Resolution will be proposed for updating our motto to "the multiversal operating system".

Categories: LUG Community Blogs