You can read other Terninger posts which outline my progress building the Fortuna CPRNG, or see the source code.
So far, I’ve put Terninger into production in makemeapassword.ligos.net. And persistent state is working in the core generator.
But I’d really like to use persistent state in entropy sources for fun and profit! In particular, PingStatsSource has a limitation which can be overcome using persistent state: there is a hard coded list of servers it pings, which needs regular maintenance.
So, the plan is to improve PingStatsSource to track which servers are working using persistent state, and to discover new servers by randomly scanning the Internet.
The convention is: any IEntropySource which also implements IPersistentStateSource will have persistent state loaded & saved automatically by PooledEntropyCprngGenerator.
There is already support to initialise objects from state on loading, so this should be pretty easy! Simply call this method during initialisation:
private void InitialiseEntropySourcesFromPersistentState(PersistentItemCollection persistentState)
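A minimal sketch of what that method might do, assuming the SourceAndMetadata wrapper described below (this is illustrative, not the exact Terninger implementation):

```csharp
// Sketch: hand the loaded state to each source wrapper; each wrapper decides
// whether its source implements IPersistentStateSource and initialises it once.
private void InitialiseEntropySourcesFromPersistentState(PersistentItemCollection persistentState)
{
    if (persistentState == null)
        return;
    foreach (var sourceAndMetadata in _EntropySources)
    {
        sourceAndMetadata.InitialiseFromExternalState(persistentState);
    }
}
```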
That’s easy!
Except, you can add additional IEntropySource objects to the generator after it starts. And these should also be initialised. Exactly once.
That’s a bit more tricky.
Fortunately, there was already SourceAndMetadata, which combines an entropy source with some additional data (eg: has it thrown exceptions, does it complete synchronously or asynchronously). So I added an IsExternalStateInitialised field, and wrapped the initialisation into InitialiseFromExternalState(). That gets called every time we poll the source, and will return early if it’s already been initialised.
public void InitialiseFromExternalState(PersistentItemCollection state)
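A rough sketch of how that guard might look (the namespace lookup and the Source property are assumptions for illustration):

```csharp
// Sketch only: sources added after the generator starts still get initialised exactly once.
public void InitialiseFromExternalState(PersistentItemCollection state)
{
    if (IsExternalStateInitialised)
        return;
    IsExternalStateInitialised = true;

    if (state != null && Source is IPersistentStateSource persistentSource)
    {
        // Each source reads from its own namespace; the namespace name here is an assumption.
        var items = state.Get(Source.GetType().Name);   // accessor name assumed
        if (items != null)
            persistentSource.Initialise(items);
    }
}
```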
To save state, I added GetPersistentStateOrNull(PersistentEventType eventType) to SourceAndMetadata. And simply called it when saving all the other persistent state, remembering to put each source into a namespace:
foreach (SourceAndMetadata sm in _EntropySources)
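The body of that loop might look something like this (the method for setting a whole namespace on PersistentItemCollection, and the Source property, are assumptions):

```csharp
// Sketch: copy each source's state into the accumulator collection, one namespace per source.
foreach (SourceAndMetadata sm in _EntropySources)
{
    var sourceState = sm.GetPersistentStateOrNull(eventType);
    if (sourceState != null)
        persistentState.SetNamespace(sm.Source.GetType().Name, sourceState);   // method name assumed
}
```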
And with that, all IEntropySources are wired up and ready to persist some state!
PingStatsSource sends ICMP echo requests (aka, pings) to a hard coded list of servers. There is a way to configure a custom list of servers, but I personally don’t use it.
Unfortunately, even though I chose DNS servers as the list, which shouldn’t change very often, they do change occasionally. And so, I need to update the list every now and then.
Well, now we have persistent state, we can keep the list of servers there. If a server disappears off the internet, we simply remove it from the list and move on with life. The internal list becomes an initial seed, rather than the canonical list for all time.
OK, so how does this work?
First thing, we need to wire up IPersistentStateSource to load and save. I won’t show the code, because it’s very similar to last time where we create an array-like structure. This is what ends up being saved:
PingStatsSource <TAB> TargetCount <TAB> Utf8Text <TAB> 1024
The core internal state is a list of PingTarget. We’ll add and remove to this over the lifetime of the object, including via IPersistentStateSource. (And, I’ll come back to why the abstract PingTarget rather than a simple IPAddress).
private List<PingTarget> _Targets = new List<PingTarget>();
The first time we get entropy, we check if there are any targets. If not, we load up the internal seed list to get started.
if (_Targets.Count == 0)
When we gather entropy, we track which targets fail (timeout, network error, etc). Normally, we ping a target 6 times. Well, if we get 6 failures, we assume the target is offline, and remove it.
List<PingAndStopwatch> targetsToSample = ...;
That’s it! Targets which fail all 6 pings will be removed, never to be seen again.
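A sketch of the removal logic (the PingAndStopwatch members used here, Target and IsFailure, are assumptions for this example; it is not the actual Terninger code):

```csharp
// Sketch: after a round of pinging, drop any target where every attempt failed.
private void RemoveFailedTargets(IReadOnlyCollection<PingAndStopwatch> results)
{
    var failedTargets = results
        .GroupBy(x => x.Target)
        .Where(g => g.All(x => x.IsFailure))
        .Select(g => g.Key)
        .ToList();
    foreach (var target in failedTargets)
        _Targets.Remove(target);
}
```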
Unfortunately, given enough time, we’ll remove all the targets. Better do something about that!
In order to discover new targets, we need to find a valid IP address, and then send some pings to confirm it works. Fortunately, the IPv4 address space is so full of reachable targets, that any random 32 bit number has a good chance of being valid.
private Task DiscoverTargets(int targetCount)
We want to add n new targets (where n is 8 by default). So we generate n random IP addresses, excluding several ranges which we know are invalid up front. We then send up to 3 pings to each target. So long as any one ping succeeds, we add it to the list.
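A sketch of what discovery might look like (the random-address helper and the target factory are placeholders; the real code also excludes private and reserved ranges up front):

```csharp
// Sketch: pick random public IPv4 addresses and keep any that answer a ping.
// Uses System.Net and System.Net.NetworkInformation.
private async Task DiscoverTargets(int targetCount)
{
    using var ping = new Ping();
    for (int i = 0; i < targetCount; i++)
    {
        IPAddress candidate = GetRandomPublicIPv4Address();      // hypothetical helper: a random 32 bit number, skipping invalid ranges
        for (int attempt = 0; attempt < 3; attempt++)
        {
            var reply = await ping.SendPingAsync(candidate, 5000);
            if (reply.Status == IPStatus.Success)
            {
                _Targets.Add(CreateTargetFor(candidate));        // hypothetical factory; the target types are covered below
                break;
            }
        }
    }
}
```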
The IPv6 address space is, for all practical purposes, empty. So randomly picking 128 bit numbers isn’t going to work. For now, I’m putting IPv6 discovery in the too-hard-basket and marking it with a great big TODO.
Moving on, let’s add a few properties to the configuration so users have the option to turn off discovery (if they want) and to control the desired number of targets (TargetsPerSample was already there):
public class Configuration
Finally, we wire this new method up to GetInternalEntropyAsync(), after we gather entropy:
if (_EnableTargetDiscovery && _Targets.Count < _DesiredTargetCount)
Now we remove targets which don’t work, and automatically discover new targets which do. And anything in _Targets is persisted, so we don’t start from scratch next time around.
One last feature: TCP ping.
ICMP echo request is the technical name for what we refer to as ping. But routers and firewalls can be configured to ignore pings (which might provide some tiny improvement in security by pretending you aren’t on the Internet).
But there are lots of web servers out there, listening on ports 80 and 443. They cannot ignore requests to those ports, because that is what web servers do. Which means there are potentially more targets out there to be discovered.
Instead of an ICMP echo request, we will do the initial 3 way TCP handshake, which establishes a new TCP connection, and then immediately drop said connection. This is roughly equivalent to a regular ICMP ping, just using TCP instead.
The code to achieve this is quite simple:
public async Task<(bool isSuccess, object error)> TcpPing(IPAddress address, int port, TimeSpan timeout)
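Here’s a sketch of how such a method could be implemented with TcpClient (the timeout handling and error shape are simplified for illustration, and this is not necessarily the actual Terninger code):

```csharp
// Sketch: a "TCP ping" is just the 3 way handshake; disposing the TcpClient drops the connection.
// Uses System.Net and System.Net.Sockets.
public async Task<(bool isSuccess, object error)> TcpPing(IPAddress address, int port, TimeSpan timeout)
{
    using (var client = new TcpClient(address.AddressFamily))
    {
        try
        {
            var connectTask = client.ConnectAsync(address, port);
            var completed = await Task.WhenAny(connectTask, Task.Delay(timeout));
            if (completed != connectTask)
                return (false, "timeout");
            await connectTask;      // observe any connection failure as an exception
            return (true, null);
        }
        catch (Exception ex)        // connection refused, network unreachable, reset, etc.
        {
            return (false, ex);
        }
    }
}
```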
This works wonderfully, but we now have two kinds of PingTarget. There is IcmpTarget and TcpTarget. The ICMP target only needs an IP address to function, but the TCP target needs a port as well.
Actually, there are three kinds! The third one, IpAddressTarget, is just an IP address; we don’t know if it’s ICMP or which TCP port to try. But, we can run the discovery process on this address to convert it into IcmpTargets and TcpTargets. These “naked” IP addresses are the seed list.
Here’s some simple inheritance to model this:
abstract class PingTarget {
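A sketch of the hierarchy being described (simplified; the real classes also carry parsing, serialisation and ping behaviour, which are discussed below):

```csharp
using System.Net;

abstract class PingTarget
{
    public IPAddress IPAddress { get; }
    protected PingTarget(IPAddress address) => IPAddress = address;
}

// A "naked" IP address from the seed list: we don't yet know if ICMP works or which TCP port to try.
sealed class IpAddressTarget : PingTarget
{
    public IpAddressTarget(IPAddress address) : base(address) { }
}

// Ping via ICMP echo request.
sealed class IcmpTarget : PingTarget
{
    public IcmpTarget(IPAddress address) : base(address) { }
}

// Ping via a TCP handshake to a specific port (typically 80 or 443).
sealed class TcpTarget : PingTarget
{
    public int Port { get; }
    public TcpTarget(IPAddress address, int port) : base(address) => Port = port;
}
```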
How will we serialise and parse these?
The easiest is IpAddressTarget, it’s interchangeable with a regular IPAddress:
1.1.1.1
Now we need a way to represent either a port number, or an ICMP ping. There’s a standard way to encode a TCP Endpoint, and we’ll use something similar for ICMP:
1.1.1.1:80
These can all be unambiguously parsed and serialised. IPv4 is easy to split on the :, and IPv6 is a bit more complex because we need to find matching square brackets. For simplicity, I use .ToString() for serialisation (that’s not always a good idea, but good enough in my case). And the parser is a static method following the Try...() pattern common in C#; the out parameter will be one of IcmpTarget, TcpTarget or IpAddressTarget.
class PingTarget {
We also need a Ping() method. I’m not usually a fan of inheritance, but in this case it’s very effective because each type can implement ICMP or TCP ping, as required:
class PingTarget {
The final piece of the puzzle is how to convert from the seed list containing IpAddressTarget into IcmpTarget and TcpTarget? Well, as part of the main GetInternalEntropyAsync() method, we pick a few IpAddressTargets, and run discovery on them:
var forDiscovery = _Targets.OfType<IpAddressTarget>().Take(_TargetsPerSample).ToList();
We now have persistent state wired up to any entropy source which needs it. And made meaningful improvements to PingStatsSource so it requires less of my attention. It automatically removes servers when they go offline, and discovers new ones.
You can see the actual Terninger code in GitHub. And the main NuGet package.
After a long time developing Terninger on and off, I’m going to stop posting about it. Because the core functionality is all done!
There’s been a short (well, long) delay getting this post up. Sorry about that. 🙁
You can read other Terninger posts which outline my progress building the Fortuna CPRNG, or see the source code.
So far, I’ve put Terninger into production in makemeapassword.ligos.net.
And we’re up to the third and final part of persistent state:
We have the interfaces and implementations to load and save state from disk, and get and set that state on in-memory objects.
We now need to wire up the main PooledEntropyCprngGenerator loop to load state on start up, and save when required.
Here are the requirements in detail:
Terninger has a main worker loop. It is a single thread which keeps running for the lifetime of a Terninger PooledEntropyCprngGenerator to gather entropy. Basically, a giant while() loop.
Well, now it needs a beginning and an end. To load and save state at the start and finish.
It will look roughly like:
LoadState();
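In outline, something like this (the method names are placeholders for the real internals, just to show the shape):

```csharp
// A simplified sketch of the worker loop shape.
private void WorkerLoop()
{
    LoadState();                        // once, at start up
    while (!_ShouldStop)
    {
        PollEntropySources();
        MaybeReseed();
        SaveStateIfRequired();          // every now and then (see below)
        WaitForNextPollingInterval();
    }
    SaveState();                        // final save when stopping
}
```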
Let’s look at the load, save and loop changes in a bit more detail.
Before the worker loop starts, we load persistent state from disk, and initialise related objects.
At the top level, there isn’t much exciting going on. The only notable thing is .GetAwaiter().GetResult(), because we are running in a top level thread and can’t do await.
var persistentState = TryLoadPersistentState().GetAwaiter().GetResult();
TryLoadPersistentState() does the actual load, wrapped in an exception handler in case of errors. Failures can be safely ignored and the generator will act as if it was a brand new instance.
private async Task<PersistentItemCollection> TryLoadPersistentState()
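Something along these lines (the reader’s method name and the logging call are assumptions for illustration):

```csharp
// Sketch: load via the configured reader, but never let a failure stop the generator.
private async Task<PersistentItemCollection> TryLoadPersistentState()
{
    if (_PersistentStateReader == null)
        return null;
    try
    {
        return await _PersistentStateReader.ReadAsync();    // method name assumed
    }
    catch (Exception ex)
    {
        // A missing or corrupt file just means we start as a brand new instance.
        Logger.WarnException("Unable to load persistent state.", ex);
        return null;
    }
}
```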
Once we have loaded our collection of items, we need to set related objects.
private void InitialiseInternalObjectsFromPersistentState(PersistentItemCollection persistentState)
Setting simply involves casting each object to IPersistentStateSource and calling Initialise(). With a little namespacing going on to keep nested objects separate.
Although I am not initialising any IEntropySource objects yet, I have removed any internal state relating to PooledEntropyCprngGenerator, so that any future IEntropySource objects can’t peek at potentially sensitive data. Safety first!
The call to ResetPoolZero() is important. We’ll return to it shortly.
Moving on to the getting / saving process which runs when Terninger is stopping. There’s just one method at the top level:
GatherAndWritePeristentStateIfRequired(PersistentEventType.Stopping).GetAwaiter().GetResult();
Yeah, gotta look inside that method to see what it does:
private async Task GatherAndWritePeristentStateIfRequired(PersistentEventType eventType)
This combines gathering all the state, and the actual save, into a single method.
We’ll come back to ShouldWritePersistentState() later. But it’s safe to assume when we call with PersistentEventType.Stopping, it returns true.
We gather all the state by creating an empty PersistentItemCollection as an accumulator, then casting related objects to IPersistentStateSource and calling GetCurrentState(). Again, there’s a bit of namespacing going on to keep separate data separate.
And, there’s a reminder to myself that the internal state is accumulated last, so IEntropySources can’t do anything naughty.
Finally, there’s a similar WriteAsync() call wrapped in an exception handler.
Terninger instances can last for a long time (makemeapassword.ligos.net runs for a month before a reboot; and it could run for much longer if needed). And there’s no guarantee it will be stopped cleanly (maybe the server crashes, or maybe the programmer simply doesn’t Dispose() the object). So, every now and then, the worker loop will save state.
Here are the relevant lines of WorkerLoop():
this.PollSources(syncSources, asyncSources).GetAwaiter().GetResult();
We reuse GatherAndWritePeristentStateIfRequired(), but pass a different PersistentEventType depending on whether we reseed the generator or not.
Time to return to ShouldWritePersistentState()! It returns true when we Reseed, because that’s when entropy pools will be updated. But it only returns true for Periodic if a certain duration has elapsed since we last saved (5 minutes by default).
So, we save state whenever the generator reseeds (which may take a while if the generator isn’t being used), or every 5 minutes.
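A sketch of that check (the field names for the last save time and the save interval are assumptions):

```csharp
// Sketch: always save when stopping or after a reseed; rate limit periodic saves.
private bool ShouldWritePersistentState(PersistentEventType eventType)
{
    if (eventType == PersistentEventType.Stopping || eventType == PersistentEventType.Reseed)
        return true;
    if (eventType == PersistentEventType.Periodic)
        return DateTime.UtcNow - _LastPersistentStateSaveUtc >= _PersistentStateSaveInterval;   // 5 minutes by default
    return false;
}
```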
OK, time to deal, once and for all, with the remaining security problem that external state raises:
That second problem is a tricky one to solve. While it’s very desirable to retain the state of entropy pools (it’s a core function of Fortuna), it also allows an attacker a way to influence the internal state of the generator.
And that could completely subvert the generator.
And there’s no way for Terninger to know whether the persistent state is genuine or malicious. Even if you encrypt, or sign, or hash the persistent state file, the Terninger code needs to be able to verify that crypto without human intervention. That means an attacker could read crypto keys out of Terninger code, and write a poisoned file. Or, if the end user could configure those keys in the config file, then the attacker would just read the keys from the config file! Or, the attacker could use Terninger’s own code to sign their own malicious file.
Cryptography does not solve this problem.
However, there are two, relatively simple approaches to mitigate this problem.
We haven’t looked into the implementation of IPersistentStateSource.Initialise() for all objects. In particular, EntropyPool adds some additional entropy when loading:
void IPersistentStateSource.Initialise(IDictionary<string, NamespacedPersistentItem> state)
We don’t just accumulate the saved hash, we also add additional entropy.
Now, PortableEntropy.Get32() isn’t a very good source of entropy. It will only add 16-32 bits of real entropy. That’s better than nothing, but not good enough on its own.
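The idea, in sketch form (the key name, the pool’s Add() method, and the exact types involved are assumptions for this example):

```csharp
// Sketch: treat the saved pool state as just another piece of entropy, and mix in a little fresh entropy too.
void IPersistentStateSource.Initialise(IDictionary<string, NamespacedPersistentItem> state)
{
    if (state.TryGetValue("PoolState", out var savedState))   // key name assumed
        Add(savedState.Value);                                 // accumulate the saved hash into the pool

    var extraEntropy = PortableEntropy.Get32();                // a small amount of fresh entropy (return type assumed)
    Add(extraEntropy);
}
```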
After we load all external state, there’s a call to EntropyAccumulator.ResetPoolZero(). That puts the generator into high priority mode to gather enough entropy for a reseed. That usually completes within two seconds. With the default settings, that accumulates 384 bits of new entropy in pool zero. And should also gather a similar amount of entropy for all other pools. (384 bits is the best case; an attacker might reasonably predict some of those bits, but it’s very difficult to predict all of them).
So, we initialise each pool based on prior state. But then add more entropy on top of that prior state.
Essentially, the external state becomes just one source of entropy out of many.
Now an attacker also needs to deal with at least 128 bits of entropy (probably much more) on top of whatever is loaded from disk.
And, within a few minutes, it’s likely there will be another reseed. So the window of opportunity for an attacker to make use of a poisoned file is quite low.
We now have persistent state implemented! This is the last major piece of functionality in Terninger to fully implement Fortuna! And, because I’ve been very slack with my blog, it’s been in production for ~18 months!
You can see the actual Terninger code in GitHub. And the main NuGet package.
Next up: a major improvement to PingStatsSource which uses persisted state.
There’s been a short (well, long) delay getting this post up. Sorry about that. 🙁
You can read other Terninger posts which outline my progress building the Fortuna CPRNG, or see the source code.
So far, I’ve put Terninger into production in makemeapassword.ligos.net.
And we’re up to the second part of persistent state:
Now that we are able to load and save state to a file on disk, we need a way to gather that state from in-memory objects (before we save), and to set state on in-memory objects (after we load).
Essentially, we need a way to get and set properties / fields on objects based on the file.
Here are the requirements in detail:
Get and set state on PooledEntropyCprngGenerator and related classes, such that entropy state is persisted across server restarts. Unlike loading and saving, getting and setting will always be implemented on the same objects, so it makes sense to only have one interface. The simplest thing that could possibly work is:
public interface IPersistentStateSource {
The simplest thing for a getter and setter is… well… a getter and a setter!
Initialise() will only be called once, when the implementing object is loading, after reading from an IPersistentStateReader implementation. While GetCurrentState() will be called regularly over the lifetime of the generator, as state will change over time. The result of GetCurrentState() can be passed to an implementation of IPersistentStateWriter to save.
A short digression is in order at this point.
There are two use cases I have in mind for persistent state:
1. The main PooledEntropyCprngGenerator, such that we accumulate entropy across machine restarts.
2. Individual entropy sources (IEntropySource).
Because the state in a PooledEntropyCprngGenerator is always changing as entropy accumulates, the simple interface will work fine.
But entropy sources may not update their persistent state very often. Many sources execute on a period measured in minutes or hours, so nothing may have changed since the last save.
To support slightly more efficient operation with entropy sources, I add a HasUpdates property.
public interface IPersistentStateSource {
This allows any object which supports persistent state to communicate if calling GetCurrentState() will return something different since last time. And means that entropy sources which rarely change state can stay dormant for longer.
Finally, there is a context enum passed to GetCurrentState():
public enum PersistentEventType
This gives the object some context about what is happening, and hints at what state it should return. In particular, Stopping is the last opportunity to save state before the generator stops - so you should return something in at least that case!
This enum is similar to EntropyPriority, which allows an entropy source to decide how aggressively it returns entropy, depending on the needs of the generator.
However, after completing all the persistent state functionality, I can’t think why GetCurrentState() would not return every piece of state every time it is called. HasUpdates is a better way to signal “I have nothing new”.
Oh well, it’s in the API now.
Here is the final interface:
public interface IPersistentStateSource {
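Based on the signatures shown throughout this series, the shape is roughly this (a sketch, not copied from the Terninger source):

```csharp
public interface IPersistentStateSource
{
    // Does the object have anything new to save since GetCurrentState() was last called?
    bool HasUpdates { get; }

    // Called once, after state has been read by an IPersistentStateReader.
    void Initialise(IDictionary<string, NamespacedPersistentItem> state);

    // Called regularly over the lifetime of the generator; the result is passed to an IPersistentStateWriter.
    IEnumerable<NamespacedPersistentItem> GetCurrentState(PersistentEventType eventType);
}
```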
Just defining an interface is the easy part. We also need to implement it on required classes!
The main PooledEntropyCprngGenerator has two persistent fields. There are a number of other fields, but they are contained in other classes.
1. UniqueId, which is a Guid.
2. The BytesRequested counter, which is an Int128.
The implementation goes like so:
void Initialise(IDictionary<string, NamespacedPersistentItem> state)
Initialising is a repetitive parsing process. Try to read a field from the state dictionary, try to parse the value, and if everything succeeds, set the appropriate property. Other classes have more state, but similar repetitive defensive code. If anything cannot be parsed correctly, it is simply ignored.
HasUpdates always returns true. That’s a bit of a lie, but who cares about efficiency with just two fields.
GetCurrentState simply returns a collection of each field, either as a byte[] or string.
The PooledEntropyCprngGenerator has an EntropyAccumulator member, which is the really important part of the generator. However, the accumulator is made up of many EntropyPool objects. This is a nested array of complex objects, so how do we serialise it?
We do a bunch of copying and create some “array like” keys. Which is a bit hacky, but it’s the only place we need to deal with nested arrays.
We save values to define the pool counts and then array like keys for nested data:
Pooled...Generator.Accumulator <TAB> LinearPoolCount <TAB> Utf8Text <TAB> 20
Each EntropyPool returns its 3 fields directly:
IEnumerable<NamespacedPersistentItem> GetCurrentState(PersistentEventType eventType)
And the EntropyAccumulator copies that data into array like keys:
IEnumerable<NamespacedPersistentItem> GetCurrentState(PersistentEventType eventType)
It’s a bit of work, but effective. If there were more nested arrays or objects, I’d consider a more robust approach.
OK, time to deal with the security problems that external state raises:
In simpler terms: reading or writing persistent state is working with untrusted and potentially tainted data.
The first problem is dealt with in EntropyPool.GetCurrentState(): we don’t save the current pool (hash) state; instead we save a hash of the hash. This is fine when we re-read the hash into the pool, because the hash of the hash is just as random as the original hash. But it hides the internal state of the generator because a hash function cannot be (easily) reversed.
IEnumerable<NamespacedPersistentItem> GetCurrentState(PersistentEventType eventType)
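A sketch of the idea (the internal field names and the CreateBinary() helper are placeholders, not the actual Terninger members):

```csharp
// Sketch: persist a hash of the current pool hash, not the raw pool state.
IEnumerable<NamespacedPersistentItem> GetCurrentState(PersistentEventType eventType)
{
    var hashOfHash = _HashAlgorithm.ComputeHash(_CurrentPoolHash);                 // hides the real internal state
    yield return NamespacedPersistentItem.CreateBinary("PoolState", hashOfHash);   // helper name assumed
    // ...plus the pool's counters (eg: total entropy bytes accumulated).
}
```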
I’ll address the second problem in the next post.
But I will note one thing which won’t work: file security bits / ACLs. While using some kind of file system based security might mitigate the problem, it can’t be relied upon - perhaps the state is being stored somewhere with no security. Or, perhaps the attacker has the same (or higher) security context as Terninger, so they can write to the file anyway.
Potential points for improvement:
Implement IPersistentStateSource for an IEntropySource. I have a particular use case in mind.
We now have a way to get and set persistent state on relevant objects.
You can see the actual Terninger code in GitHub. And the main NuGet package.
Next up: we’ll wire up the various IPersistentStateSources and the IPersistentStateReader / IPersistentStateWriter in PooledEntropyCprngGenerator to load state as part of initialisation, and periodically save it.
You can read other Terninger posts which outline my progress building the Fortuna CPRNG, or see the source code.
So far, I’ve put Terninger into production in makemeapassword.ligos.net.
It’s been a long while since my last Terninger post, but it’s been working well enough and my time has been spent in other places.
There is one major feature of Fortuna which I never implemented: persistent state.
That is, the ability for PooledEntropyCprngGenerator to save its internal state to disk. This state would include digests of all pools which have gathered entropy (plus various other information).
Without this feature, every time I restart makemeapassword.ligos.net, Terninger needs to start reading entropy from scratch. As I reboot my servers each week to avoid memory leaks and other random badness, that means Terninger can only accumulate entropy for 7 days before it has to start again.
With this feature, the accumulated entropy should increase forever (as long as the file on disk remains). And I’m serious about the forever part - each pool accumulates using SHA512, and 2^512 is a really big number.
The reason persistent state took so long to implement (other than me getting distracted with other projects) is that it has a number of moving parts. Rather than one giant post, I’ll split this up into smaller ones:
Drilling into point 1 in a bit more detail, here’s what I want to achieve:
Terninger targets netstandard 1.3, and that rules out JSON and XML.
Data always survives longer than code. So I think long and hard about the on-disk and in-memory format of any kind of persistent state.
Once I’ve worked out what the data looks like, other code and interfaces become relatively obvious.
Creating a namespaced key value pair is easy enough:
public readonly struct NamespacedPersistentItem { |
Saving that to a text file is easy: pick a delimiter (tab works well), base64 encode the Value, and store each item on a separate line. Eg:
ANamespace <TAB> AKey <TAB> WusWRBaOzm7zX3KQzdNhVpS+6aJHvpCXO8P1yJq3Zi0=
However, I found pretty quickly that it wasn’t just binary data that needed to be stored. There were plenty of numbers (some Int64s and also Int128s), guids and strings which don’t need to be base64 encoded at all (so long as they don’t contain the delimiter). Base64 encoding everything makes the file really hard for a human to read.
If I can’t understand the content of the persistent state file, I’m probably going to get it wrong. So I added a way to encode the binary value in different ways:
public readonly struct NamespacedPersistentItem { |
This allows easier to understand string encodings of binary values, particularly for strings or numbers. For example, all these encode the value 42, and the last one is easiest for a human to read:
Terninger <TAB> BytesRequestedAsBinary <TAB> Base64 <TAB> KgAAAA==
Note the ValueEncoding doesn’t affect the content of Value in memory. It’s more of a recommendation of how to save that Value in a way humans can read it (relatively) easily.
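Putting that together, the struct is roughly this shape (a sketch based on the description above; the exact encoding enum members and constructor are assumptions):

```csharp
// The value is always bytes in memory; ValueEncoding is a hint for how to write it to the text file.
public enum PersistentItemValueEncoding { Base64, Utf8Text, Hex }

public readonly struct NamespacedPersistentItem
{
    public string Namespace { get; }
    public string Key { get; }
    public PersistentItemValueEncoding ValueEncoding { get; }
    public byte[] Value { get; }

    public NamespacedPersistentItem(string itemNamespace, string key, PersistentItemValueEncoding valueEncoding, byte[] value)
    {
        Namespace = itemNamespace;
        Key = key;
        ValueEncoding = valueEncoding;
        Value = value;
    }
}
```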
The last part of any file format is a header, because storing a big tab separated file with no context or metadata is likely to cause problems in future. The Terninger file header is a single, tab delimited line with the following fields:
A magic string, TngrData. Which also happens to fit in a UInt64.
A version number, currently 1!
An example header:
TngrData <TAB> 1 <TAB> UDpxL5ZiKhda8ok3/asKFbmdaihfvAzJmVhxzBP/SaI= <TAB> 3
This represents a simple to read and write data format capable of storing all the state Terninger requires. It is also extendable (via namespaces) to be used by IEntropySource implementations, if they need to store persistent state.
Here’s an example file from a unit test:
TngrData <TAB> 1 <TAB> DBvlW8Nt/XTVKr/aMGWZd8N6KQ9nb8d+BNBWbfzSs8A= <TAB> 6
When in memory, the persistent state is represented as a PersistentItemCollection. It allows getting, setting and removing single items or whole namespaces of items. Internally, it is a dictionary of namespace > items, and within each namespace a dictionary of key > value. When getting a whole namespace, it will return an IDictionary<string, NamespacedPersistentItem>, which is the structure used by consumers of the collection.
public class PersistentItemCollection {
There are two interfaces to read and write the in-memory data:
public interface IPersistentStateReader {
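The pair might look something like this (the reader’s method name is an assumption; WriteAsync() is mentioned later in this series):

```csharp
public interface IPersistentStateReader
{
    // Read an entire persistent state collection (eg: from a file or stream).
    Task<PersistentItemCollection> ReadAsync();
}

public interface IPersistentStateWriter
{
    // Write an entire persistent state collection.
    Task WriteAsync(PersistentItemCollection items);
}
```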
I don’t think it gets simpler. We have a collection of NamespacedPersistentItems in a PersistentItemCollection, and can read the content of a whole file into memory, and then write an entire collection to file. Might not be the most efficient algorithm, but we aren’t going to be reading / writing very often, nor will we be writing MBs of data.
There are two implementations of these interfaces:
TextStreamReader and TextStreamWriter, which are able to read / write the Terninger file format to a Stream.
TextFileReaderWriter, which uses the stream reader / writer implementations and writes to a file on disk.
As we have simple interfaces, anyone can implement a reader / writer that works differently. For example, you may want to store persistent state in a database, or a web service, or in an encrypted file, etc. In all cases, the implementation is relatively easy, and you can then pass your reader & writer to any Terninger instance.
If you happen to be reading / writing a Stream and are happy with the delimited format, then you can use TextStreamReader and TextStreamWriter to look after that part.
The primary reason data is stored in namespaces is to isolate different parts of Terninger from each other. The EntropyAccumulator is a security sensitive area of Terninger, because if you can observe the pool of entropy, it is possible you can predict future random numbers - which kinda breaks everything! And if you can write a MaliciousEntropySource which spies on other persistent state, that’s bad.
So any one component of Terninger can only see data for its namespace, and not other components. The main PooledEntropyCprngGenerator class will ensure a component can only see its own key-value-pair list of data. This isolation mitigates the security risk. It also makes it easier to implement persistence within each component, as it only needs to worry about its own data.
Persistent state represents a huge security risk.
For now, I’m just going to acknowledge the risk. I’ll discuss mitigation in a future post.
There are some alternative implementations I didn’t go with.
Because arbitrary nesting is harder than a fixed two level hierarchy. And, even after implementing everything, I’ve only found one use case where nesting would have been helpful, and there was a simple (if tedious) work around.
Because it won’t help.
Terninger itself needs to read the file, and if the file is encrypted then Terninger needs to know the key. If you have a hard coded key baked into Terninger, any malicious attacker can reverse engineer Terninger to find the key (or just find the key on GitHub).
Perhaps you could store the key somewhere else, and that keeps the key out of the hands of our malicious attacker. That might help, but the attacker could still find the key, and then game over. Also, that’s something else for the user of Terninger to manage - a persistent state file & a separate key.
Maybe you encrypt the key, which encrypts the persistent state. Oh dear! We’re now in infinite recursion!
Getting that kind of encryption right (and actually ensuring it provides meaningful benefits) is really hard. And there are other mitigations I will describe in future posts.
Anyone can implement their own IPersistentState[Reader|Writer] if they really want this feature.
After writing the above interfaces and code, I found that separating the reader and writer as separate interfaces makes Terninger slightly difficult to configure. Because you have to pass the same instance twice:
var readerWriter = new TextFileReaderWriter("/some/path/terninger.txt");
I’m not sure if I’ll ever bother to change this, but it was a bit annoying that I couldn’t do this:
var terninger = PooledEntropyCprngGenerator.Create(
We have the data structure to keep persistent state on disk. And the code required to load and save it. And the API meets my non-functional requirements.
You can see the actual Terninger code in GitHub. And the main NuGet package.
Define an interface to get and set state from components. That is, how we add / remove from the PersistentItemCollection.
I just finished an extended series on Long Term Backups and Archives. A major shift in my personal and professional backup strategy is optical media, in particular, BluRay disks.
In 2009, I said goodbye to my DVD based backup strategy, because it was taking multiple disks to do a weekly snapshot of my documents and data. And photos were already overflowing many DVDs.
In 2017 my backup strategy involved:
Five years later, in 2022, my strategy has changed to:
Discuss the reasons why I’ve moved back to optical disks as a key part of my backup strategy. With particular attention on the reliability of optical disks - BluRay disks and M-Discs.
My oldest backups are from the end of the year 2000, and are now on DVD+Rs. Originally, they were burned to CD-Rs, but I migrated all my CDs to DVDs at some point and discarded the original CDs. The oldest CD I can find is from 2003, containing a snapshot of all my documents at that time. It’s hard to tell with the oldest DVDs, but I think 2004 or 2005 was when I moved from CDs to DVDs for backups.
Finally, the DVD era came to an end in 2009, giving way to HDDs and the cloud.
What I didn’t realise at the time, was that all these CDs and DVDs would become a grand experiment of reliability and longevity. When I read data from these backup disks in 2021, I had a 100% success rate!
That’s a perfect record after being stored for 12-18 years, in semi-controlled conditions (darkness, but no temperature or humidity control) and zero maintenance!
This high reliability is what drove me back to optical disks in 2021 - BluRay disks in particular.
The 45+ year storage requirement for church compliance data made me re-think how my backups would survive in the long term. I wasn’t comfortable with HDDs surviving that long, nor anything stored in the cloud. Tapes are cheap, but their drives are expensive and my only experience with them is from 2002. It was only when I tested these CDs and DVDs that I realised optical was a contender! And further research showed that BluRay disks were readily available at a reasonable price.
So why choose optical disks over the alternatives (tapes, HDDs, cloud)?
As I’ve mentioned, having hard data of their long term reliability was a big factor. Getting any kind of real world reliability data of storage mediums is really, really hard. Backblaze releases HDD stats, which is the only public information I’m aware of. Otherwise, you have to trust the manufacturer’s “mean time between failure” figure. And, when the time scales you’re looking at are 45+ years, there is no real world data because no consumer digital storage technology has been around for that long (tape has been around longer, but it’s not aimed at consumers).
So, having a few hundred optical disks of 10+ year age that I could test is a huge plus. Real world data always trumps theory.
Optical disks are write once. When it comes to storing compliance data, or long term backups, that’s a big plus. Because the only way data can be tampered with is by replacing an entire disk (not impossible, but tricky). While most storage mediums have some kind of “write protect” switch, write once optical media physically cannot be written to multiple times.
Optical disks have fewer moving parts than HDDs, and thus less that can break. A HDD contains the physical media (disk platters), electronics to read said media, and software to make it all work. If any one of those parts fails, it can be difficult, expensive or impossible to recover data. Optical disks are just the physical media - the electronics and software are in a separate package (the reader). If your reader breaks, you buy a new one for $200 and move on. And if your media fails, well, you’re no worse off than with HDDs.
HDDs, especially NAS disks, have an “always on” assumption - the disks are always online. Indeed, the Backblaze data is all about disks that run 24/7. On one hand, that’s great because the NAS can scrub disks to automatically detect and correct errors. On the other hand, that costs electricity. And if you ever wanted to take disks offline and store them on a shelf, you don’t really know how long they’ll survive - unless you plug them in every now and then.
Optical disks, by definition, are “always offline”. Once burned, they must remain stable without any scrubbing, error checking or automation. They will be stored in a jewel case or a spindle, and will rarely (possibly never) be read. And yet, the expectation is, that you will be able to read the disks without problem - even with zero maintenance. Indeed, that was the outcome of my ~15 year experiment!
The write once and offline properties combine for another benefit: optical disks are ransomware proof. As long as your backups are connected to a network and writable, it’s possible they could be encrypted and held to ransom (or simply deleted). That includes NAS servers, and the cloud. But, because optical disks are offline and immutable, no remote hacker or malware can touch them - the only way they could be held to ransom is via physical theft (very possible, but not the current strategy of Internet Bad Guys™).
Finally, BluRay disks improved the failure modes compared to CDs and DVDs. Their physical spec includes improvements such as a hard coating to reduce scratches, non-organic substrates, improved error correction, and improved track addressing. See references below for several white papers on BluRay physical specifications. (And I note these are theoretical improvements, only time will tell if they yield greater longevity).
There are certainly problems with optical disks though:
Their capacity isn’t great compared to HDDs (or tape). In the physical space of two x 4 TB HDDs, you might be able to fit 10 x 12cm optical disks. The highest capacity BluRays are 128GB per disk, which is around 1.3TB in the same space. For my purposes, I’m not generating enough data for this to be a problem. But if you’re dealing with 1080p or 4k video, you’ll be filling many, many spindles of optical disks each year.
Optical disks are slow to read. Their sequential read and random read speeds are 10-100x worse than even the slowest, cheapest HDD. My experience is that as optical disks age they become harder to read, which makes them slower still. So you don’t want to be reading from them frequently, or doing a restore with the clock ticking. Given they’re designed as long term media, this isn’t a big problem - but something to be aware of.
A big risk with optical disks is they are becoming a niche technology. That is, they aren’t as mainstream as they used to be in the 2000s. Most laptops don’t come with optical drives any more, and no one really misses them. Software and content is delivered by streaming rather than disks. So, it’s entirely possible they will go the way of floppy disks and become obsolete and difficult to purchase. As of 2022, it is possible to buy brand new BluRay readers and media - although I note eBay is your friend if you want to buy a wide variety of media.
Because optical disks are offline media, you really need to index or catalogue their content. That is, without some kind of catalogue you can browse or search, it’s somewhere between hard and impossible to find what you need. And putting 100 disks into a reader, one at a time, and slowly searching each of them really sucks (I tried). In the 2000s, I never bothered with this, but I’ve become more disciplined this time around and am using WinCatalog to catalogue all optical media.
Finally, optical disks are more expensive than HDDs - at least in cost per GB. A 4TB NAS branded HDD costs ~AU$160, which is ~4c/GB. My last BluRay purchase was for 3 x 50 spindles of 25GB disks costing AU$330, which works out to be ~9c/GB. Obviously, you need a computer for that HDD, a reader for the BluRays, and factor in things like electricity and maintenance - a full total-cost-of-ownership comparison is more complex. But in raw capacity, HDDs are cheaper. Note that M-Disc BluRays are around 4x more expensive than regular BluRays, costing ~33c/GB.
Given the primary reason to choose optical media over HDDs is long term reliability, I decided I should put them to the test. I tested a DVD, 3 brands of BluRays (BD-R disks), and a BluRay M-Disc to destruction.
There are four things that will destroy any kind of media: light, heat, moisture and time.
I’ve already tried time (at least for ~15 years) and found CDs and DVDs are pretty resilient! So I moved on to light and heat (I didn’t test against moisture).
My light test consisted of placing the disk in an east facing window that would receive ~4 hours of direct sunlight each day. While this isn’t entirely scientific because I wasn’t testing all the disks at the same time (some were tested in Summer and others in Winter), it’s still a place to start. I tested all disks this way.
My heat test is placing disks in a) my car (which is parked such that it has ~4 hours per day of sunlight) which acts as a greenhouse, b) my ceiling cavity (which is not insulated and can reach over 50℃ in Summer), and c) my freezer (which should be around -18℃). I only tested the BluRay M-Discs for heat.
Updated in January 2024: Note that heat tests expose disks to some indirect light; while the cold test is stored in a dark freezer - I suspect this means the freezer disk will end up lasting longer. Only time will tell.
The TL;DR results: keep optical disks out of direct sunlight and you should be good for a long time.
Results for direct sunlight:
Disk | Days Before Failure | Failure Mode |
---|---|---|
DVD | < 90 | Completely unreadable; computer reports no disk when inserted. I didn’t check very diligently, so not sure exactly when it failed. |
BluRay (Ritek) | 38 | Some sectors have errors; disk partially readable. |
BluRay (Verbatim) | 38 | Some sectors have errors; disk partially readable. |
BluRay (Verbatim M-Disc) | 260 | Some sectors have errors; disk readable after multiple attempts. |
Direct sunlight is definitely something to avoid. Keeping your optical media in darkness is your number one priority in storage.
Comparing longevity of regular vs M-Disc BluRay media, there’s a factor of ~6x difference. M-Disc marketing claims they should last for “at least 100 years”. If we assume regular BluRays will last for the same 15 years as my CDs and DVDs, then an M-Disc should last ~90 years. That’s not quite what the manufacturer claims, but close enough - and confirms M-Disc media lasts longer.
Others have done similar tests to destruction for M-Disc media which support my results.
Tests for heat / cold started in May 2021, and remain ongoing without failure (the 2021-2022 Summer was nowhere near as hot as the previous year, so I suspect this test will continue for another year at least. Temperature statistics for Sydney).
Last updated January 2024:
Location | Has it Failed? | Days Before Failure | Failure Mode |
---|---|---|---|
Freezer (cold) | No | 977+ | N/a |
Car (heat) | No | 977+ | N/a |
Ceiling Cavity (heat) | No | 977+ | N/a |
I’ll update this table from time to time, as I check the disks.
The takeaway is: keep optical disks out of direct sunlight; even better, in total darkness. Heat / cold seems to be less critical.
I’m promoting optical media, and particularly BluRay M-Disc media, as a zero maintenance solution for long term data storage. However, given enough time, every form of digital media will eventually fail.
As long as we a) have multiple copies of the data, and b) can make new copies faster than failures, all is well. That means we must have some kind of maintenance schedule in place to detect failures and make new copies.
Data stored on a NAS server has a big advantage here: any NAS will automatically check for errors, and notify if problems are found. TrueNAS (via ZFS) will automatically correct errors.
But checking optical media for errors cannot be automated (unless you can afford a robot / jukebox) because the disks are stored separately to your computer.
Because I’m confident optical media, when stored away from direct light, will survive for 10 years, I’m going to check them every 5 years. At least until I get some failures, so I have some idea of when failures are likely to happen.
Optical media, and BluRay M-Discs in particular, are the most reliable way to do long term, offline data storage. CDs and DVDs have lasted for 10-20 years and can still be read successfully. BluRay media offers disk capacity of 25-128GB, and should have similar longevity. The claims of special M-Disc media lasting 100+ years seem plausible - unfortunately, it will take another 99 years before we can confirm it!
Anyone who wants an offline, ransomware proof, 20+ year backup should consider BluRay optical media.
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
So far, we have considered the problem and overall strategy, and got to my chosen implementation.
The last point I’ll consider is: how do we organise our files and data so we can find stuff in 10, 20, 50 or even 100 years time?
Develop a structure, guidelines and processes to organise files / data such that specific data can be found in reasonable time.
This structure needs to be self-discoverable, as the original creator of the structure will not be available in 45 years.
This structure can apply to digital files and data, or physical documents. There are advantages to digital data storage, but we’ve considered a number of risks as well. Any structure should work with the physical as well as digital with minimal changes.
Before we go any further, we should remind ourselves that backups and archives are infrequently accessed. We should always optimise for the common case, which is adding data to the archive. Retrieving is far less common, so it’s acceptable if it takes a bit longer. (Caveat: always check with stakeholders how long is acceptable).
With that out of the way, the way we organise data is dictated by how we need to access it.
If the access pattern is “restore everything”, then the structure should reflect how the data appears in our regular systems. Any additional structure just gets in the way. However, “restore everything” is just one possible access pattern.
There are also certain queries we might want to ask our archives. For example: find all the work photos from 2010-2016, or find that funny video of my kids from their first day of school, or find all the documents that refer to Mr Bloggs when he led youth group.
Each of those queries has an explicit or implicit time dimension (Mr Bloggs only led youth group from 2018-2023), plus various other parameters (file type, file content, and category). While unlikely, it is possible the time range is “forever”, in which case we just need to trawl everything - that will suck, but there’s not much we can do about it.
Queries usually have some kind of category or context to them. In the above examples, “work photos” or “kids first day at school” or “youth group”. These are the kinds of categories that can be incorporated into file structures to make things easier to find. For example, we might decide to keep all youth group documents together, and all work photos in one place, and keep personal videos separate from work related ones.
There are additional levels of categorisation as well. Perhaps work photos are also categorised by job number or location. Personal videos might be kept by event. And the “youth group” documents are sub-categorised into “permission forms”, “lesson plans”, “attendance” and “general resources”.
The categorisation I’ve mentioned can be augmented by tagging. Most systems (particularly the “physical document” system) can only put a file or document into one category. That is, all your “attendance” documents can only physically exist in one place, the youth group attendance 2020 folder. But you can tag folders, documents or files with additional keywords. Perhaps the youth leaders’ names are recorded on the front of the folder containing all the attendance documents. Many photo apps support tagging people (often automatically via facial recognition) and geo-location. And you can label a document with important keywords (either manually, or using an automated algorithm). Tags can make it much quicker to find data of interest, without reading the entire document.
File type is usually straightforward - most computer file types are trivially identifiable from the end of their name (eg: jpg or docx or mp4). And if not, the beginning of the file usually has a particular fingerprint (often referred to as magic bytes).
Finally, most computer systems support security rights, so that only authorised persons have access to particular files. The simplest way to apply these rights is at a top level, so everyone involved with work job 41354 has access to all the job data, or everyone involved in youth ministry has access to all youth related data. While it is possible to grant or revoke access at a more granular level, that brings additional complexity that I won’t consider too deeply here. Access to physical documents can be controlled in a similar way: different keys give access to different storage rooms or filing cabinets.
With those observations, we are ready to create a simple but effective structure for personal and church data. This structure will form a primary index, or a way to locate specific files. Additional secondary indexes will be listed as well, however they will always point to files that need to be retrieved via the primary index.
Here are the principles I follow for the primary index, or how files are physically organised on disk:
The top level provides very broad categories. Pictures, Music, Documents, etc. And then various sub-categories within there. Often by person.
Pictures is the most structured area: 12 months after my first digital camera, I was already struggling to organise photos. And it hasn’t got any easier. I quickly adopted a strict time series approach to storing photos, and created scripts to automate the process of getting photos from my camera (and more recently phone) into the Library folder. As all my cameras are actually WiFi connected phones these days, the process is fully automated: once I connect to local WiFi, photos are automatically synchronised, post-processed and copied to the right folder. I’ve used a number of secondary indexes for photos down the years - they’ve all ended up obsolete for one reason or another. Currently, I just click through photos month by month in Windows Explorer.
Music has always been managed by media players. Rip content from CDs directly to an album, and let the media player index and organise it for me. There’s never been enough content to warrant time series here.
Videos have never had enough files for structured time series. I generally have folders with a rough category / description + year. Since COVID there’s been lots more material added here, but not enough that I care to re-arrange it.
Documents are again pretty ad-hoc. Particularly for other family members.
- Pictures
Most of our church data is being stored on OneDrive, because of its combination of ease of use, price and functionality. Even data that isn’t primarily on OneDrive (on various other cloud based systems) gets exported and stored on OneDrive. It’s our single source of truth.
The top level category is based on security roles. That is, different people are granted access to different areas as required for their ministry at church.
Then, there are more specific ministry categories. For example, within the broad “children’s ministry” category, we have “kids for Jesus” (our Sunday School) and “play time” (preschool).
Then, there is our time series structure. A folder for each year which contains lesson plans, meeting documents, attendance rolls, etc.
In some cases, there are additional folders for multi-year resources, or other buckets for files.
Finally, we need to take the extra time to name files descriptively. Well, we try to, it doesn’t always happen.
OneDrive
I’m running several secondary indexes against both personal and church data.
The first are the “manifest files”, generated by my ManifestMaker app. These are simple tab separated text files which list the contents of each disk burned. They include filename, size, created and modified dates, plus a content hash. And can be read by Excel.
While their primary purpose is integrity, they also provide a very crude way to search disk content without access to physical disks.
The more featured index is WinCatalog. This app shows a graphical view of disk content (and live / working files), with the same core attributes as manifest files (name, size, dates, hash). In addition, it takes thumbnails of PDFs, Word documents and images, so you have basic visibility into file content. And will index some file specific information, such as EXIF details from photos, ID3 metadata from audio and video media, and metadata from e-books. It also allows you to tag disks, folders and files with arbitrary tags and user defined fields (although data entry for these is a manual process).
WinCatalog has a reasonably powerful search function. Allowing you to search by file date, size, name, type, location in catalog. It also lets you search for duplicates by name, size and hash.
As an aside, I don’t mind duplicate files. If the same file ends up on multiple disks, that’s extra redundancy! And, having indexes by file hash means you can instantly determine if the file is a true byte-for-byte duplicate, or just a file with the same name.
One search function I found WinCatalog lacks is to find files which do not appear on backup disks. So, if I index all the “live data” and compare it to all backup disks, I would like a list of everything in “live” which is not in “backups”. That is, data that needs to be backed up!
While WinCatalog is a proprietary application, the underlying data is stored in an SQLite database. And I’ve created a CatalogQuerier app to implement my “find data that isn’t backed up anywhere” search. I find this invaluable to ensure absolutely everything gets backed up.
The final search function lacking in WinCatalog is full-text search. It does not (as of 2022) let you search for text within Word documents or PDFs, etc. That would be a killer feature, particularly for church document searches!
I’ve used various databases for photos, video and audio down the years. All have eventually become obsolete or I’ve just run out of time to manage them.
These have various features like tags, facial recognition, geo-location.
Windows has a full-text search feature. This works very well to find content within a file, as long as the documents are on your local computer. Fortunately, documents / PDFs are small enough that it is feasible to keep them all.
Apple and other vendors have their own search functions as well.
The Sydney Anglican Diocese has some good content about structuring data and retaining records. I have issues with their over reliance of cloud based systems, but otherwise very good information.
You are keeping backups and archives so you can retrieve data from them in the future. Possibly the very far future. You need a structure in place to make it reasonably easy to find particular files. Even if you have inherited the archive from someone else, who inherited it from their predecessor.
Three or four levels of categorisation works quite well. At least one of those levels must be time series. And files descriptively named.
If possible, it is highly recommended to keep one or more external secondary indexes. This provides a centralised search functionality that can see the entire archive, even as it is broken into many disks. And the ability to search using other criteria (eg: file content, thumbnails, and others).
This is the last part in my long term archiving series. I may report back in a few years about how my archives are going.
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
So far, we have considered the problem and overall strategy, and got to my chosen implementation.
One key point to consider is: what file formats will stand the test of time?
Decide on preferred file formats when saving data onto long term archives. Such formats should have a high likelihood of being read in 45-100 years.
The longevity of any given file format sits on a spectrum:
Short Term <------------------------> Long Term
On the left, there are proprietary formats that require expensive applications (and even specific hardware) to work with. These are undocumented (sometimes even within the creating company), involve restrictions to make public or open implementations difficult (patents, non-disclosure agreements), are only used in narrow domains, and can be very complex.
Industrial and medical applications often fall into this category: highly specific, backed by expensive R&D (which means patents to protect that investment), and unavailable outside one product or company.
On the right, there are simple, open formats. These have a public specification, no legal impediments to making sense of the data, and are in common use by millions or billions of people across many different domains.
UTF8 plain text, PDF documents, and PNG images are examples of highly open formats.
And there are plenty of formats in the middle. Word documents, H.264 encoded videos, and even HTML have dangers as very long term formats.
Here’s the criteria I see as important for long term file formats:
There’s plenty of criteria to evaluate there, but I have a very simple rule of thumb:
If web browsers can view the file format (without extra plugins), it’s likely to be safe.
That is, if you can drag the file onto Chrome, Firefox or another web browser, and it Just Works™, it’s likely to be supported into the future. Web browsers are about the most ubiquitous software available, and an excellent lowest common denominator.
Now, let’s consider common file formats and how safe they are in the long term. Only formats rated 4 or 5 will be used in my archives.
Plain text files have no formatting. They are about the simplest form of data you can store on a computer.
File Type | Rating | Comments |
---|---|---|
ASCII Plain Text (txt) | 5/5 | There’s nothing simpler than ASCII text, as long as you only speak English. |
UTF8 Plain Text (txt) | 5/5 | UTF8 covers 98% of plain text data on the Internet. All languages are covered. This data should be easily readable in 45+ years. As long as you don’t need formatting, all is well. |
Other encoding Plain Text (txt) | 2/5 | Yes, there are other text encodings. Best not to bother with unusual standards, they just make it harder to read. And, because plain text files are not self-describing, it can be difficult to know the correct encoding. |
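If you do inherit text in an unusual encoding, converting it to UTF8 up front saves future readers the guessing game. Here is a minimal C# sketch, assuming you know (or can guess) the source encoding; the code page and file names are examples only.

```csharp
using System.IO;
using System.Text;

// Legacy code pages need this provider on modern .NET (the System.Text.Encoding.CodePages package).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Assumed source encoding - windows-1252 is only an example; you need to know (or guess) it.
var legacy = Encoding.GetEncoding("windows-1252");
string text = File.ReadAllText("notes-legacy.txt", legacy);

// Re-save as UTF8 so the archived copy is in a widely readable encoding.
File.WriteAllText("notes-utf8.txt", text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
```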
Structured data is designed to be readable by both computers and humans - although with a priority to computers.
File Type | Rating | Comments |
---|---|---|
JSON | 5/5 | JavaScript Object Notation is mostly human readable in any text editor, and widely readable by computers. A schema is optional and rarely used (which means you usually need to reverse engineer an unfamiliar file). Most software development apps and advanced text editors can “pretty print” JSON. |
XML | 5/5 | The Extensible Markup Language is more complicated than JSON, but otherwise very similar in terms of outcomes. Schemas are more common. And apps are widely available too. |
CSV / TSV | 5/5 | While JSON and XML are document orientated, tab and comma separated files are tables of data. Again, they don’t have a schema built in, but it’s usually pretty obvious what the data means. Most spreadsheet apps can read CSV or TSV files. |
When people think about storing “data” they are usually thinking of documents with text, formatting, images, etc. I’m including spreadsheets and presentations here too - so the core office productivity apps.
File Type | Rating | Comments |
---|---|---|
DOCX / XLSX / PPTX | 4/5 | Microsoft’s core Office formats are open, have multiple implementations and are widely used. Deduct one point because they can’t be natively displayed in a web browser, and they do slowly evolve and change. While Microsoft has published a spec, I don’t view these as truly open formats. On the other hand, they are used pervasively. |
ODT / ODS / ODP | 5/5 | OpenDocument file formats are… well… open. Personally, I use Microsoft’s formats, but would be perfectly happy keeping these ones instead. As they are explicitly open, they score one point more than Microsoft’s formats, although it’s worth noting they are much less widely used. |
PDF | 5/5 | The Portable Document Format is the gold standard for printable documents. As they are (usually) read-only, they are a great way to keep snapshots at a point in time. |
HTML | 4/5 | While the whole Internet is built on HTML, it doesn’t get used very much for offline or editable documents (minus 1 point). Web browsers speak HTML natively, of course. It also isn’t well designed to save a document as a single file. |
RTF | 4/5 | Rich Text Format is like DOCX and ODT, but simpler, and it hasn’t changed in years. It’s slightly more likely to be readable in the far future. However, it is a proprietary Microsoft standard. |
Family photos make up the majority of my personal data. Many businesses will scan documents as still images or PDFs.
File Type | Rating | Comments |
---|---|---|
JPEG | 5/5 | JPEG images are the gold standard for lossy stills. While there are alternative digital negative formats that professional photographers may use, JPEG is readable pretty much everywhere, and has been since the mid 1990s. |
PNG | 5/5 | Portable Network Graphics are loss-less images. They are ubiquitous on the Internet and viewable everywhere. |
TIFF | 5/5 | Tagged Image File Format is associated with scanners. It’s a bit more obscure than the above formats, but has been around longer. It’s very stable and widely readable. |
WebP | 4/5 | The WebP format is aiming to be a PNG successor. Version 1 was published in 2010, making it much younger than other formats (so minus one point). Modern web browsers support it, but its usage is minimal compared to JPEG and PNG. |
SVG | 4/5 | Scalable Vector Graphics is the most open vector format around. All the others listed are bitmaps. Vector graphics are great for icons, fonts and logos that need to grow and shrink. Web Browsers can view SVGs, but they are not as widely supported as the bitmap formats. Various Office apps can export graphics as SVGs, and it is a good long term format for computer aided design files. |
Music and recordings are important to keep into the far future. Partly because we love music. And also because recordings may be of important events (eg: office meetings, police recordings, etc). At church, we keep audio recordings of each Sunday’s Bible talk.
File Type | Rating | Comments |
---|---|---|
MP3 | 5/5 | MP3s are the gold standard of lossy audio compression. They have been playable since the mid 1990s in many, many apps. A very safe choice for long term storage. |
WMA | 3/5 | Windows Media Audio is a Microsoft specific technology which improves on MP3. While it’s widely supported, it’s proprietary and not recommended for new recordings. |
OGG | 4/5 | While less common than MP3, Ogg allows storage of lossy and loss-less audio that is generally of higher quality for the same file size. Unlike WMA, it’s an open standard. It’s widely supported, but not as widely as MP3. No patents. |
AAC | 4/5 | Advanced Audio Coding has a similar intent to OGG - improvements over MP3. Although less popular, it is commonly used in mobile devices. No patents. |
WAV | 5/5 | Uncompressed audio is wonderfully simple and easy to understand in the future. Unfortunately, WAV files are several times larger than the equivalent MP3. Definitely readable; but not practical, and loss-less alternatives exist. |
Full motion video with audio is everywhere these days. Family videos are important to keep. And businesses care as well, as they may record teaching material, meetings, etc. Since 2020 at church, we keep video recordings of each Sunday’s Bible talk.
I’m dividing these into two sub-categories: containers and codecs. Containers are usually the file extension, but they just say how the audio and video is packaged. Codecs are the way you decode and display the video.
Container | Rating | Comments |
---|---|---|
MP4 | 5/5 | MP4s are the most common video container at the moment. And are widely supported. |
AVI | 5/5 | AVIs are more common on Windows and are an older container. |
MKV | 4/5 | Matroska files are a bit less common, and frequently found in live streaming applications because the file is still readable even if it is stopped unexpectedly (eg: crash or interruption). |
MOV | 3/5 | MOV files are common in the Apple world, based on QuickTime. While readable by many applications, it is not an open format (so minus points). |
Note that modern video applications are capable of playing all the above containers. This was not always the case in the 1990s and 2000s.
Codec | Rating | Comments |
---|---|---|
MPEG2 | 5/5 | MPEG2 is the codec used on DVDs and video CDs from the 1990s, and still used in lower quality over the air digital TV broadcasts. Due to its age, it is readable pretty much everywhere. While it was patented, those have now expired. Not recommended for new content as there are better options. |
H.264 | 5/5 | Advanced Video Coding (AVC) is a more advanced codec and used on Blu-Ray disks, OTA TV and streaming services. This produces smaller files than MPEG2, but at a higher quality. All modern devices can play H.264 encoded videos, and it’s a great choice for long term archival. Royalties are not payable for non-commercial use. |
H.265 | 4/5 | High Efficiency Video Coding (HEVC) is superior again. It’s the codec for 4K and 8K broadcasts and many streaming services. It’s relatively new and has patents that cause legal issues (minus one point). |
AV1 | 4/5 | AV1 is an open, patent free codec that competes with H.265. Technically, the two are quite similar; AV1’s big plus is you don’t need to pay royalties to use it. However, it’s not as widely supported as H.265 (minus one point). |
It’s worth noting that the video encoding space has evolved faster than still images or audio files. This is because the tech behind still images and audio files, invented in the 1990s and 2000s, is more than good enough - the quality is acceptable, and the file sizes are small. Video, on the other hand, has gone from low definition to standard def, high def, 4K and 8K - and the tech has needed to improve to keep file sizes manageable.
What that means is it’s quite likely there will be a new (and superior) video codec invented in the next 10-20 years. There have been a number of new still image and audio formats invented over the last 20 years, but none were so much better than existing tech that they took over - so a newcomer there is much less likely.
There’s a stack of data tied up in email. All kinds of communication happens via email and it’s often important to capture for the long term. Personally, I prefer to save important emails (or email chains) as a PDF if they need to be kept for the long term. And I don’t tend to pay as much attention to the file formats used by my email apps.
File Format | Rating | Comments |
---|---|---|
EML | 5/5 | EML files are used by many apps for individual emails. |
MSG | 4/5 | MSG files are a Microsoft thing used for individual emails by MS Outlook. Minus one point for proprietary, although most modern email apps will read them. |
PST | 4/5 | A PST file is what MS Outlook uses to store a whole mail box (many emails). While there are various apps to read a PST file, it’s still rather proprietary. |
MBOX | 5/5 | MBOX files are how mail boxes were stored on older UNIX systems. They have carried on into various non-Microsoft email apps. The format is simple and open, so good for reading in 45+ years. |
There are a stack of database technologies out there. And an even wider range of implementations such as MySQL, SQLite, MongoDB, LevelDB, and many others.
The data in these systems is used by all manner of apps in personal and business contexts. Our church keeps some records relating to Safe Ministry in a MySQL backed web application. So keeping this data available in the long term is really important.
Unfortunately, the only reliable way of doing this is to keep upgrading your database system every few years. There’s considerable research and development in the database field to improve performance and reduce storage requirements. Basically, it’s in the interests of large companies to improve their data processing. And that means file formats are constantly evolving.
Generally, it’s not too hard to upgrade from version 1 to version 2. Things get more complicated going from v2 to v5 though - many systems only support upgrades across one or two versions (so v2 -> v3 or v2 -> v4 would be OK, but not v2 -> v5). Instead, you need to do a multi-step upgrade like v2 -> v4 -> v5.
For this reason, if you want to keep a snapshot of your database available into the far future, the best approach is to export to one of the structured formats above (JSON, XML, CSV or TSV).
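As an illustration, a snapshot export can be a few lines of code. This is a minimal sketch using the Microsoft.Data.Sqlite provider against a hypothetical table; any ADO.NET provider (MySQL, SQL Server, etc) follows the same pattern. Note that real CSV output needs proper quoting and escaping, which is omitted here.

```csharp
using System.IO;
using System.Linq;
using Microsoft.Data.Sqlite;   // assumed provider; any ADO.NET provider works the same way

using var connection = new SqliteConnection("Data Source=records.db");   // hypothetical database
connection.Open();

using var command = connection.CreateCommand();
command.CommandText = "SELECT * FROM SafeMinistryRecords";                // hypothetical table name

using var reader = command.ExecuteReader();
using var csv = new StreamWriter("SafeMinistryRecords-snapshot.csv");

// Header row from the column names, then one line per data row.
csv.WriteLine(string.Join(",", Enumerable.Range(0, reader.FieldCount).Select(reader.GetName)));
while (reader.Read())
{
    csv.WriteLine(string.Join(",", Enumerable.Range(0, reader.FieldCount).Select(i => reader.GetValue(i))));
}
```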
While many database systems allow you to make backups, these backups are often very closely related to their main file formats, and come with similar restrictions to the upgrading process (eg: v5 can only restore v4 and v3, but not v2).
The only other option is to maintain your database system and keep it current. While that’s usually a desirable thing, it doesn’t always work with compliance requirements like “what did your data look like on the 34th of Smarch 2312”.
The above lists cover off my requirements. But many other apps are out there and used for mission critical business scenarios. I’m not going to make recommendations here; there are simply too many options.
In general, the database recommendation of taking regular snapshots is the best approach. And sometimes that means big exports, or lots of PDFs.
The National Archives of Australia have good file format recommendations for digital formats. They also have details about analogue formats, which isn’t my focus here, but may be of interest.
It’s no good to keep your data for 45-100 years, only to find there is no app to read and process it. Wisely choosing file formats is an important part of your archiving strategy.
Fortunately, the ubiquity of audio, video, still image and documents in our digital lives mean that common files are very likely to be readable in the far future.
Next up: In the last part of this series, I will discuss how to organise files on archival disks so they are easy (well, less difficult) to find.
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
So far, we have considered the problem and overall strategy, possible failure modes, how we will capture the required data, likely access patterns of the backups, and finally, listed possible options for backups and archives.
With all the due diligence out of the way, it’s time to describe the implementations chosen.
Describe the implementation of my chosen long term archiving strategy (45+ years) for personal and church data.
My family’s personal data is split across four areas:
The TrueNAS is a frankenstein computer of parts from down the ages (oldest is ~10 years). It has a 2 core AMD CPU, 16GB RAM and 6TB usable storage (mirrored disks). It is powered via a small UPS, which is designed to protect against a 5 minute outage and allow a safe shutdown (electricity is extremely reliable in Sydney, but thunderstorms happen in summer). Despite the low end (and second hand) hardware, it is one of the most reliable computers I’ve come across.
I consider OneDrive and Google very reliable cloud providers. But I take a weekly snapshot of OneDrive using RClone. GMail is more problematic to back up automatically, so I’m content to download a snapshot every couple of years.
Data on the TrueNAS is backed up to BackBlaze B2 cloud storage. TrueNAS has a web front end to RClone that makes it much easier to understand and use. Cost for B2 is ~AUD $6 / month with my current usage of ~700GB.
BluRays are used for offline backups and archives. So far, I’m sticking to single layer 25GB BDR disks as they are cheapest per GB and simplest (read: fewest ways for them to fail), though I’m experimenting with larger capacity disks as well. All important data is stored with triple redundancy (3 copies of each disk), and two copies are stored off-site. I’m also using 3 different brands of disk, in case there’s a systematic failure from a factory. BluRay disks are the cheapest offline backup system for consumers (tape is out of my price range). And optical media has the highest longevity I’m aware of in consumer hardware, which is good for archiving data for at least 20 years using standard disks.
Data on TrueNAS and BluRays is indexed using WinCatalog. This gives an explorer-like view across all disks, and facilitates searches and finding duplicates. Unfortunately, it doesn’t have a “find files that are NOT on a BluRay disk” feature - but the underlying database is SQLite, so I have written my own utility to find missing files. I also have written a console app to generate hashes of each file on a BluRay disk (the disk manifest), and that gets signed using PGP and KeyBase keys - which gives high confidence of reading data correctly. The WinCatalog index & manifest files are stored separately on TrueNAS.
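The console app itself isn’t published here, but the heart of it is small. This is a minimal sketch (not the actual tool, and assuming .NET 5 or later) that hashes every file under a disc root with SHA384 and writes one line per file; signing the resulting manifest with PGP happens as a separate step.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

var discRoot = @"D:\";                                                // root of the disc being catalogued
using var manifest = new StreamWriter("disc-0042.manifest.txt");     // hypothetical manifest file name
using var sha384 = SHA384.Create();

foreach (var path in Directory.EnumerateFiles(discRoot, "*", SearchOption.AllDirectories))
{
    using var stream = File.OpenRead(path);
    var hash = sha384.ComputeHash(stream);

    // One line per file: hex hash, two spaces, then the path relative to the disc root.
    manifest.WriteLine($"{Convert.ToHexString(hash)}  {Path.GetRelativePath(discRoot, path)}");
}
```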
Finally, every year or two, I manually gather up all data from cloud services and local storage and burn BluRays of them all.This gives an occasional snapshot of all documents, email, etc.I also make a snapshot on a hard disk of photos, videos, etc (ie: larger data that is also on BluRays).
Every 5-10 years, I get dissatisfied with some aspect of my backup system, so I re-visit and re-work it.(This post outlines my latest iteration; previously I’ve used DVDs, HDDs, and cloud based systems).This is an informal “review” process to evaluate if I should change due to hardware / software obsolescence (and unfortunately involves data migrations).
This strategy satisfies the 3-2-1 backup rule:
BluRay disks are expected to be readable in 20 years, likely more. And they provide an off-line, air-gapped, off-site archive.
Processes for church data are slower in taking effect; however, we’ve done all the planning (and I’ve tested pretty much everything in a personal context anyway).
The main difference is church data is primarily in the cloud, to facilitate sharing. Systems we’re using include:
The stronger use of cloud systems is because church members need to share data with each other. We certainly have many computers on-site, but most members are volunteers who do their work from home, and sometimes need to access that content on-site (eg: for presentations, printing or post-processing).
Backup systems:
We view OneDrive as a system to store data we use on a day-to-day basis, as well as a backup system. It provides features to assist sharing documents, and also retaining them in the longer term. It’s more reliable than anything I could build on a limited budget for backups. It’s also quite simple to use, which is a big plus for church volunteers who might not be very technologically savvy.
Data on OneDrive is mirrored to a local server (in case OneDrive disappears for some reason). Currently, that server is an even older frankenstein than my home TrueNAS box. It was cobbled together at short notice (to replace a failed server) from very old parts. It’s running Ubuntu Server and has no web UI like TrueNAS does, so all admin is via SSH - which makes simple admin tasks more complex than they need to be. It is using ZFS to ensure data integrity. There are plans to migrate to TrueNAS (possibly even first party TrueNAS hardware).
There are no additional backups to other cloud systems (eg: BackBlaze or AWS). Due to our limited budget, and to keep things simple, we’re classifying OneDrive as both our day-to-day storage and a cloud backup system.
We take periodic snapshots from all systems.Some are automated (where possible) and others are manual.
BluRays are also used for offline backups and archives, in a very similar way to my personal backups.The main difference is church BluRays will use M-Discs - these are archive grade media designed to survive for “hundreds of years”.We’re also planning to store an additional copy (ie: 4 in total) at the Sydney Diocesan Archives - which have a better environment for storing disks long term.
We’re planning on using WinCatalog & manifests to index disks.No changes from personal strategy here.
One big difference from personal backups is a much more structured approach to procedures and reviews.Because there are legal compliance requirements we need to meet (particular data must be available for at least 45 years), we need to regularly check we are actually meeting those requirements.So there are template reviews drafted that will be done annually, and report back to our church’s board of directors to ensure compliance.These reviews include people focused questions - are people using the systems we’ve provided, is the data we need being stored. As well as technical questions - are backups working, can I read the media successfully, is the technology still viable.And even the manual processes - so we remember to do them!
This strategy satisfies the 3-2-1 backup rule:
M-Disc BluRay disks are expected to be readable in 45+ years, possibly over 100 years (if the advertising proves correct). And they provide an off-line, air-gapped, off-site archive.
Very long term backups need to have as few single points of failure as possible.If there is a single link in the chain that can break and cause loss of ALL data, that is entirely unacceptable.
The biggest single risk is encryption.
If your backup is encrypted, it is impossible to restore unless you have the password / encryption key.Of course, you want your backups encrypted because there’s likely to be sensitive data in them.
There is a fundamental tension here:
My approach is: when making backups, I only encrypt cloud backups.
That is, data on the public cloud is encrypted (and RClone makes that easy).But offline backups / archives are not encrypted.That is, anyone who gets their hands on my BluRay disks can read everything.
Which is by design.
Because archives a) are often old enough that the sensitive data has lost its value, b) are designed to be the last resort when restoring, so need to be easily accessible, and c) are more likely to be read by someone after I’m dead (eg: grand kids, archaeologists, etc).
I can manage security of BluRay disks by controlling physical access to them. But if someone gets access to them in 100 years time, I’d prefer they can see their content rather than be thwarted by a password.
Aside: there are ways of keeping a backup password safe by distributing it to many people. Shamir’s secret sharing algorithm is a way to do this such that a quorum of people is required to recover a password. Or a “dumb” approach: have a long passphrase and give parts of it to different people. Both likely introduce a delay if you’re going to the backup of last resort, as you need to contact several people.
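To make that concrete, here is a minimal sketch of the simplest possible split - an XOR based n-of-n scheme rather than Shamir’s threshold scheme (and assuming .NET 6 or later). Every share is required to recover the passphrase, and any single share on its own reveals nothing.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Split a passphrase into 3 shares; ALL 3 are needed to recover it, and any 1 or 2 reveal nothing.
byte[] secret = Encoding.UTF8.GetBytes("correct horse battery staple");   // example passphrase only

byte[] share1 = RandomNumberGenerator.GetBytes(secret.Length);
byte[] share2 = RandomNumberGenerator.GetBytes(secret.Length);
byte[] share3 = new byte[secret.Length];
for (int i = 0; i < secret.Length; i++)
    share3[i] = (byte)(secret[i] ^ share1[i] ^ share2[i]);

// Recovery: XOR all the shares back together.
byte[] recovered = new byte[secret.Length];
for (int i = 0; i < secret.Length; i++)
    recovered[i] = (byte)(share1[i] ^ share2[i] ^ share3[i]);

Console.WriteLine(Encoding.UTF8.GetString(recovered));   // prints the original passphrase
```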
Other than encryption, overly complex recovery processes are the most likely way reading data would fail.
There are three ways to mitigate:
Items 1 and 2 are in place for church archives via compliance reports. Less so for personal archives, but I still occasionally test that the disks are readable.
Item 3 is my main focus here: keep things simple.
My BluRay disks are, as much as possible, just a bunch of files burned to a disk.
If you put them in any BluRay drive connected to a laptop / desktop computer, you can browse them using your favourite app, and open them using whatever apps are available.There’s no requirement for Windows, or Microsoft products (although that’s where much of the data originates).The disks could be read on a Mac, or a Linux machine (or some new OS that comes out in 50 years time, as long as it can talk to a BluRay reader and supports UDF).And the files should be readable using many applications (JPEG photos, MP4 videos, MP3 music, DOCX documents, PDF documents, XLSX spreadsheets, etc).
In particular, I avoid compressing data.The logic being: a single error has a higher chance of doing extensive damage to compressed data - but would only break a single file if not compressed.And, most large files I deal with (video, audio, photos) are already highly compressed; documents and spreadsheets are small enough that it doesn’t matter.
That is, the requirements to read my archive disks are a) the disks themselves, b) a BluRay drive, and c) a computer.
Special backup software should NEVER be required for long term archives.It adds a layer of complexity that may cause difficulty when trying to restore data.And you don’t know what the scenario is when the disk is read (it might be after your house burned down and you have absolutely nothing beyond an off-site backup, or it might be your great grand kids in 100 years time, or it might be an archaeologist in 500+ years time).
If you only have one disk (because all the rest were damaged beyond repair somehow), you should be able to read everything from that one disk without dependencies on others.
The biggest layer of complexity I’m happy to add is for large files to span multiple disks. This is pretty rare as I don’t often work with files over 25GB. But full disk images are the one exception.
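Spanning doesn’t need special backup software either: a large file can be split into disc-sized chunks and rejoined later by simple concatenation. A minimal sketch (chunk size and file names are examples only):

```csharp
using System;
using System.IO;

const long ChunkSize = 24L * 1024 * 1024 * 1024;          // ~24GB chunks fit on 25GB BD-R disks
var source = "server-backup.img";                          // hypothetical large disk image

using var input = File.OpenRead(source);
var buffer = new byte[4 * 1024 * 1024];
int part = 0;

while (input.Position < input.Length)
{
    using var output = File.Create($"{source}.part{part++:D3}");
    long written = 0;
    int read;
    while (written < ChunkSize &&
           (read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, ChunkSize - written))) > 0)
    {
        output.Write(buffer, 0, read);
        written += read;
    }
}
// Rejoining is plain concatenation of the parts, in order (eg: cat on Linux, or copy /b on Windows).
```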
Having said all that, I’m happy to include extra data on each disk.For example, the manifest file is not required to read anything on the disk - although it forms an index that may help someone find what they’re looking for more quickly, and provides a hash to verify the file integrity.I usually include MultiPar parity data - that includes additional checksums to verify integrity, and may help recover a damaged disk.
However, none of that “extra data” is required to read the content on disk.
Also, I’d like to ensure data cannot be tampered with undetected. That is, if someone edits or replaces a file (or an entire disk), you should be able to clearly tell something has changed.
First off, all BluRay disks I use are write-once. So it is technically impossible to accidentally or maliciously modify data on a disk. However, a bad guy could make a copy of the disk with changes and replace the original with the copy. Unless they are very careful, this would leave different date stamps or different media brands, which could be noticed.
The WinCatalog index includes an SHA256 hash of each file, and the manifest files include SHA384 hashes. Both are stored separately from the disks, so even if someone replaced a disk with a new one (with dodgy data), that could be detected. The bad guy would need to a) replace all disks in all physical locations, and b) update the index & manifest files which are stored separately to the disks.
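Verification is just manifest creation in reverse: recompute each hash and compare. A minimal sketch, assuming the one-line-per-file manifest format from the earlier sketch:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

var discRoot = @"D:\";
using var sha384 = SHA384.Create();

// Each manifest line: "<hex hash>  <relative path>" (two spaces between the fields).
foreach (var line in File.ReadLines("disc-0042.manifest.txt"))
{
    int split = line.IndexOf(' ');
    string expectedHex = line[..split];
    string relativePath = line[(split + 2)..];

    using var stream = File.OpenRead(Path.Combine(discRoot, relativePath));
    string actualHex = Convert.ToHexString(sha384.ComputeHash(stream));

    if (!actualHex.Equals(expectedHex, StringComparison.OrdinalIgnoreCase))
        Console.WriteLine($"DAMAGED OR TAMPERED: {relativePath}");
}
```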
I’m also signing the manifest files, so the signature would no longer be valid if a bad guy tampers with things. The bad guy could generate new PGP / KeyBase keys that look like mine, but are not. However, KeyBase keys are public by default, so that should be very difficult. (PGP keys can be published as well, but there is no central authority, so there is nothing stopping an attacker doing exactly the same thing.)
If I was implementing this in a larger corporate environment, I might ask many people to observe the process to create disks, inspect the disk contents and then ALL sign the manifest.That is, you might have 2 or 3 or more people attesting to the correctness of a disk.If the private keys for this process were stored on a hardware token (eg: Yubikey), then the difficulty for an attacker to modify data without detection becomes extreme.
If I was really concerned about bad guys trying to alter data in deep archives, I could publish the original manifest files to a public location (like a blockchain), when the disks are created.As blockchains are effectively append-only databases, an attacker would need to re-create the whole blockchain to change hashes.
For my use case, write-once media + hashes + signatures is more than enough.
It took 5 posts and about 12 months of thinking to come to a reasonably simple (if overly redundant) backup strategy that can meet the 45+ year requirement.
By using on-prem (TrueNAS), cloud (BackBlaze / OneDrive) and offline storage (BluRays).And keeping copies off-site.And using two external indexing systems.And keeping signed hashes of all files.
I am very confident my data will survive well into the future.Even confident it will survive to my 45 year goal!
(And yes, I realise most of the 10+ year part is met via M-Disc BluRays. And 20+ years is met via “review backup technology and migrate if required”.Insert something about the journey being more important than the destination).
Next up: We aren’t finished yet! I will discuss which file formats are suitable for long term archiving.
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
So far, we have considered the problem and overall strategy, possible failure modes, how we will capture the required data, and likely access patterns of the backups.
Now we’re up to the fun part!Time to research the options available to do backups and consider how well they meet our criteria.
List common backup platforms or technologies, and evaluate them based on the criteria we’ve identified over the last few posts.
Remember, I’m planning to backup personal data and church data (not-for-profit organisation).These evaluations mostly apply to a small business (20 or less employees), but less so for medium or large organisations - they will be processing orders of magnitude more data.
Disclaimer: Some of the criteria are pretty arbitrary and subjective. Others will be based on other studies or maths.As always, do your own evaluations to determine if any service or technology is suitable for you.
The Cloud is a fantastic place for backups and archives.It enables individuals and small businesses to access the same scale of storage as multi-national corporations.
Remember, the cloud is a euphemism for “renting someone else’s computer”.It is relatively cheap and highly reliable - essentially, the cloud provider takes responsibility for all the boring aspects of storing data.But in accessing those features, you give up ultimate control of your data.
So this advice applies to all cloud based backups: have an off-line copy as well.
AWS S3 popularised “cloud storage”. It works by storing key-value pairs: some kind of name, and a blob of data. It has conventions for creating a filesystem-like view. And it adds permissions, storage tiers, and various other features. You can “put” data into a “bucket”, and then retrieve it later by its name.
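For example, here is a minimal sketch of that put / get round trip using the AWS SDK for .NET. The bucket and key names are made up; S3 compatible services like Backblaze B2 accept the same calls pointed at a different endpoint.

```csharp
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();   // credentials and region come from the usual AWS configuration

// "Put" a local file into a bucket under a key (the name you retrieve it by later).
await client.PutObjectAsync(new PutObjectRequest
{
    BucketName = "example-backup-bucket",               // hypothetical bucket
    Key = "2021/photos/2021-07-camping.zip",            // hypothetical key; slashes give the folder-like view
    FilePath = @"C:\Backups\2021-07-camping.zip"
});

// "Get" it back later by the same name.
var response = await client.GetObjectAsync("example-backup-bucket", "2021/photos/2021-07-camping.zip");
await response.WriteResponseStreamToFileAsync(@"C:\Restore\2021-07-camping.zip", append: false, default);
```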
Alternatives: Azure Blob Storage, Backblaze B2.
Capital Cost: $0
Ongoing Cost / GB / Year (AUD): 30c (AWS), 25c (Azure), 6c (Backblaze). Plus network costs, API usage, and who knows what else.
Calculating costs of cloud storage is incredibly difficult; there are any number of pricing tiers, levels of redundancy and additional costs beyond raw data storage (eg: network uploads / downloads, API usage).The cloud promised “only pay for what you use”, but delivered “our pricing model is so complex, until you actually use our service, you have no idea what it will cost”.You should use the “pricing calculators” provided by each cloud service to get a rough estimate of cost.
Cost to Store 1TB for 1 year (AUD): ~$310 (AWS) (plus network / API / etc)
Reliability: All cloud providers use redundant storage within individual servers, data centres and can even replicate data between different geographical regions. While there are occasional outages due to network issues, your raw data is incredibly safe.Even if there are internal errors or failures (and there will be), the provider has automated systems to detect and correct them.
Effectively, you can safely assume you will never see a failure when using the cloud.This is by far the biggest advantage of the cloud: it is really expensive to achieve similar reliability by rolling your own.
Longevity: All cloud providers have long term archive options (eg: S3 Glacier, Azure Archive), and the same principles behind their high reliability mean your data is safe over 10+ years. So long as the provider itself remains in business, it is safe to assume your data is available.
Access: Is generally limited by your own Internet connection - as with any cloud solution, if your Internet connection is poor, the cloud will perform badly.S3 is the industry standard protocol for cloud object storage, and there are many apps available to upload / download / browse your data.Many backup solutions have built in support for S3 storage.
Scale: Object storage allows for Petabyte level storage (1 PB is 1 million GB).For personal usage, small or medium business, you can assume there are no technical limits.The first thing that will break is your credit card!
Simplicity: Any nerd or technically minded individual will have little trouble using object storage. However, cloud providers pitch this technology at technical people; your mom-and-pop users are going to struggle signing up for these cloud providers, let alone configuring their devices.
Automated: Cloud providers are available 24/7, and were primarily designed to be accessed by machines rather than humans.Their support for automation is excellent.Low level APIs are available (if you’re a programmer), graphic clients are available (for interactive access), command line clients are available (for automation via scripts).
Security: It is a vested interest of cloud providers to ensure privacy of your data, and security via access permissions & user authentication.Having said that, most cloud providers can peek at your data if they choose (although have strict policies prohibiting that) - you should configure any backup software to encrypt your data.And it is common for permissions to be accidentally set to “public” and allow anyone to download your data.
Recommendations: Object storage is an excellent candidate for backups and long term archiving.The only caveats are, 1) you need a nerd to get started, and 2) you have to trust they won’t go out of business in the next 50 years.
Criteria | Rating |
---|---|
Capital Costs | 5/5 |
Ongoing Costs | 3/5 |
Reliability | 5/5 |
Longevity | 4/5 |
Access | 5/5 |
Scale | 5/5 |
Simplicity | 3/5 |
Automation | 5/5 |
Security | 4/5 |
Overall Suitability for Backups | 5/5 |
Overall Suitability for Archives | 3/5 |
DropBox is the original cloud sync service.With similar services provided by OneDrive, Google Drive, Sync.com and others.
It is by far the simplest way of backing up data from your devices. You keep your files in designated folders, and the synchronisation service magically copies files to the cloud. When another device makes changes, they are magically copied to your device. Indeed, it’s so simple that “backups” in Windows 10 are “keep your files on OneDrive” - all the older backup features like File History or Backup and Restore are second class citizens.
Most cloud sync providers are able to see the contents of your data.Several providers make a point of confidentiality, by encrypting data on your computer before it is uploaded (zero knowledge cloud storage).This may be a desirable characteristic when making backups.Providers include: pCloud, Tresorit, and SpiderOak
Capital Cost: $0
Ongoing Cost / Year (AUD): $150-$200 for at least 1TB of storage.
Pricing above is for personal accounts; most services offer a business or professional level account which is more expensive and has more business orientated features.At the end of the day, if you want to backup data, it doesn’t matter; personal, professional or business is all the same.If you want to share files with other people, the professional accounts may be of interest.
Reliability: All cloud providers use redundant storage within individual servers and data centres, and can even replicate data between data centres. While there are occasional outages due to network issues, your actual data is incredibly safe. Even if there are internal errors or failures, the provider has automated systems to detect and correct these.
Effectively, you can safely assume you will never see a failure when using the cloud.This is by far the biggest advantage of the cloud: it is really expensive to achieve similar reliability by rolling your own.
Longevity: So long as the provider itself remains in business, it is safe to assume your data is available.Note that these consumer orientated cloud services don’t have the same guarantees about long term storage - that is, AWS S3 offers tiers specifically for retaining data for 10+ years for compliance purposes; none of the consumer services make such claims.
Access: Is generally limited by your own Internet connection - as with any cloud solution, if your Internet connection is poor, the cloud will perform badly.All these services have an app you need to install for the best connectivity, most (all?) offer a web interface as well.Most apps support iOS, Android, Windows, and MacOS. Linux is more hit and miss.
Scale: Most consumer cloud storage tops out around 5TB.Google Drive offers up to 30TB.If you want more storage, you’ll need to sign up for another account.This level of scale is fine for documents or photos, but if you’re recording 4k video you will hit the 5TB limit pretty quickly.
Simplicity: These services are aimed at every-day users.They are usable by pretty much anyone.
Automated: Cloud providers are available 24/7. But these consumer services are designed for humans rather than computers. At the least, you will need a device with a person logged into it (so they won’t work on headless servers). Having said that, there is software available which allows automation via scripts.
Security: It is a vested interest of cloud providers to ensure privacy of your data, and security via access permissions & user authentication.Most cloud providers can peek at your data if they choose (although have strict policies prohibiting that).It is difficult to encrypt data when using cloud sync apps.Fortunately, access permissions are private by default.
Recommendations: Cloud Sync based storage is a very good candidate for backups, particularly for everyday users.But not as good for long term archiving.And, as with any cloud provider, you need to trust they won’t go out of business.
Criteria | Rating |
---|---|
Capital Costs | 5/5 |
Ongoing Costs | 4/5 |
Reliability | 5/5 |
Longevity | 3/5 |
Access | 5/5 |
Scale | 4/5 |
Simplicity | 5/5 |
Automation | 4/5 |
Security | 4/5 |
Overall Suitability for Backups | 5/5 |
Overall Suitability for Archives | 3/5 |
Hybrid systems allow many of the advantages of cloud storage, but you host the service on your own servers.Essentially, a cloud-like system, but using your own disks and hardware for storage.
If there is data that you can’t store in the public cloud (perhaps it’s too sensitive or you are prohibited by law) but you still want a cloud-like interface to access it, then hybrid is the way to go. You retain ultimate control over your data, but need to take responsibility for maintaining the systems hosting said data.
There are a number of Cloud Sync services that can be self-hosted.
OwnCloud / NextCloud are very similar services that behave like DropBox.SyncThing / Resilio are more like a writable version of BitTorrent.
All can be used as a backup, as long as you provide your own hardware.
Capital Cost (AUD): All need a server of some kind. Some need more powerful servers than others.
Ongoing Cost / Year (AUD):
All services listed have free options, although that may be limited for personal use only.Most have business / enterprise pricing per user per month. You’re looking at $400 - $1000 per year for 5 users, depending on the service.
Fortunately, because these companies are selling you a product, their pricing is much easier to understand than AWS S3 or Azure Blob Storage!
Reliability: Because these are self-hosted, their reliability depends on the hardware you purchase and Internet connection available.The entry level costs (above) are NOT going to give you high reliability; cheapest is not best if you want reliability.Purchasing 3 of everything is a great way to improve reliability!But that means your capital costs just tripled.I’ll discuss reliability of hard disks in a NAS below in more detail.
SyncThing and Resilio are designed to scale out as you add more devices; OwnCloud and NextCloud not so much.
If you’re only using these devices at home or at business, your LAN may be plenty reliable for your needs.But, I’m assuming the “hybrid” part means you will want to access data or devices remotely, so a reliable Internet connection is important.
In Sydney, Australia, I’ve found personal Internet via Internode more than reliable enough to host my own website. However, this may not be true in all parts of the world (or even all parts of Sydney)!
Longevity: Again, I’ll discuss how long you can expect your hard disks to last for below.Your server(s) will last as long as you maintain / replace them on failure.
Access: These services require their own apps to run, which generally makes them easy to use.Otherwise, access to data is similar to other cloud sync providers.But with one important difference: you can always connect to the server directly if you need the data and the app isn’t working right.
Scale: I’m not aware of inbuilt limits for these services.OwnCloud / NextCloud will scale up to the size of your server.SyncThing / Resilio are distributed, so you can store more and more data as you add more and more servers.
Simplicity: “Self-hosted” means you need at least a computer nerd to get you started, possibly an IT professional.These services are moderately difficult to install, and pretty easy to use, but are certainly not aimed at mom-and-pop users.
Automated: All services can be automated within their own apps - generally this assumes a human logged onto a computer.Outside their apps, there is good scope for scripting and automation - “self-hosted” allows a high degree of flexibility in this department, if you have the expertise available.
Security: Data is in your own hands, so the security and privacy of your hybrid solutions are equally in your hands.All software listed have built in security and encryption - so the main point of failure is human: incorrect configuration or simply forgetting to revoke access to ex-employees.Also, make sure you keep software up to date - bugs and security vulnerabilities are found frequently, updates are key.
Recommendations: Hybrid Cloud Sync storage is a good candidate for backups and long term archiving (because you control the underlying hardware).Even if the parent company goes out of business, you’ll have whatever you last installed.Perhaps their best use case is to bridge between the public cloud and your own servers; which makes them a really good fit in the business world.
Criteria | Rating |
---|---|
Capital Costs | 3/5 |
Ongoing Costs | 4/5 |
Reliability | 4/5 |
Longevity | 4/5 |
Access | 4/5 |
Scale | 4/5 |
Simplicity | 2/5 |
Automation | 4/5 |
Security | 4/5 |
Overall Suitability for Backups | 4/5 |
Overall Suitability for Archives | 3/5 |
There are a number of “S3 compatible” services available; the two most popular are MinIO and Ceph, but there are plenty of others out there. Because they are “S3 compatible”, anything that can back up to AWS S3 can be configured to back up to these services. They need to be self-hosted.
Although not S3 compatible, the InterPlanetary File System (IPFS) is a promising distributed system, which can use public providers, or self-hosted servers. The big feature of IPFS is “immutable content based addressing”, which is a fancy way of saying “you can’t ever change something you upload to IPFS”. When archiving data for 45+ years, that is a very good property. On the other hand, it is relatively new and somewhat experimental. And the big gotcha is: everything is public on IPFS, which is a very bad property when keeping sensitive or confidential data - encryption is a must.
Capital Cost (AUD): All need a server of some kind. See above for starting costs.
Ongoing Cost / Year (AUD):
All services listed have free (open source) options.MinIO has commercial licensing options.
Generally, your ongoing costs are going to be related to the hardware more than software.As these are distributed solutions, they work best on many servers.At some point, if you install enough servers, you’ll have a data centre like AWS and Azure operate!
IPFS has a public cloud that lets you “pin” content on other servers - the rough equivalent of uploading your data. Costs range from ~$1-2 per GB per year (significantly higher than AWS / Azure).
Reliability: Because these are self-hosted, their reliability depends on the hardware you purchase and Internet connection available.The entry level costs (above) are NOT going to give you high reliability; cheapest is not best if you want reliability.I’ll discuss reliability of hard disks in a NAS below in more detail.
All these services are distributed and designed to scale out as you add more devices.And distributed systems mean you should probably have 5 or 7 of everything (or more).
Longevity: Again, I’ll discuss how long you can expect your hard disks to last for below.Your server(s) will last as long as you maintain / replace them on failure.
Access: MinIO and Ceph are S3 compatible, so it’s no harder than AWS to access data. IPFS runs its own service and provides command line, web based and virtual file system access. Because they are distributed services, the raw data on disk is not easy to read - data is split and copied between servers automatically. So direct access to servers is less useful.
Scale: I’m not aware of inbuilt limits for these services; because they are distributed, they are designed to scale up as you add more servers.MinIO and Ceph are designed for 10TB and up.IPFS is designed for effectively unlimited storage (though its relative immaturity means that hasn’t been extensively tested).
Simplicity: These services are even harder to use than “bring your own server, install this service, off you go”. Public IPFS is close to that level (if a bit experimental). MinIO and Ceph are designed to be integrated as part of other server infrastructure. It is possible to create your own private IPFS network, but that is quite technical. However, once your IT department looks after all the technical stuff, scripted backups should be nice and simple.
Automated: As with the “real” AWS S3, these services have excellent APIs and support for automation. MinIO and Ceph should work with any S3 compatible backup software. IPFS has command line scripting support.
Security: Data is in your own hands, so the security and privacy of your hybrid solutions are equally in your hands.All software listed have built in security and encryption - so the main point of failure is human: incorrect configuration or simply forgetting to revoke access to ex-employees.Also, make sure you keep software up to date - bugs and security vulnerabilities are found frequently, updates are key.
Recommendations: Creating your own S3 Compatible object store is the ultimate hybrid cloud - having all the features of S3 but on servers you control.This is the kind of setup that medium or large business may find attractive, but it’s going to be out of reach of individuals and small business.
IPFS feels like it could be a fantastic solution for long term archiving. But it’s quite complex and expensive compared to other options.
Criteria | Rating |
---|---|
Capital Costs | 3/5 |
Ongoing Costs | 4/5 |
Reliability | 5/5 |
Longevity | 5/5 |
Access | 4/5 |
Scale | 5/5 |
Simplicity | 1/5 |
Automation | 4/5 |
Security | 4/5 |
Overall Suitability for Backups | 4/5 |
Overall Suitability for Archives | 4/5 |
The traditional way to do backups and archives is to do it yourself.
Unlike the cloud, we can’t take advantage of economies of scale, nor the ultra high reliability.But we do retain ultimate control of our data - there is no external 3rd party who can cut us off from our precious data.No account that might be hacked, or locked.And no cloud provider that might go out of business.
We have ultimate control and ultimate responsibility with on-prem backups.
Pretty much everything in IT runs on servers with disks.
Whether it’s the largest cloud provider or a tiny website, the service you access needs to run on real hardware. There might be many layers of virtual machines and services between the website and the hardware, but make no mistake, everything runs on servers with disks eventually.
For backups, we’re interested in many cheap disks.And the simplest way to achieve that is Network Attached Storage.
A NAS device is a small server that optimises for lots of disks (as opposed to CPU power).The ones we’re interested in have multiple disks, to allow redundant storage.So if one disk fails, your data remains intact.
Key players include Synology, QNAP, and TrueNAS.TrueNAS is the one I use because it uses ZFS for storage, but it’s more expensive than other brands.I don’t have direct experience with Synology or QNAP.
Capital Cost (AUD):
The cheapest NAS supporting 2 disks start around $400. And 4 disk models from $500.
You need to add disks for the NAS to be useful. 1TB disks are ~$100ea. 4TB looks to be the best value for money at ~$160ea. 8TB jumps to ~$350ea.
So, a basic NAS with 2 x 1TB will cost ~$600. A decent NAS with 4 x 4TB disks is ~$1200. Or a high end model with 8 x 8TB disks is ~$5000.
The TrueNAS software is available for free, but you need to supply your own hardware.My estimate is $1500-$2000 if you want to DIY with quality parts and 2 x 4TB disks.Genuine TrueNAS hardware starts in a similar range (and Australian buyers pay a premium for shipping, unfortunately).
The estimated life time of your NAS is 5-10 years.
Ongoing Cost / Year (AUD): Once you have purchased your NAS there are two main ongoing costs: electricity and network access. And don’t forget to add a maintenance allowance.
My electricity costs ~21c / kWh in Sydney. Your NAS will be running 24/7, and will consume 60-120W (depending on size). A 60W NAS uses 60W × 24 hours × 365 days ≈ 526 kWh per year, which at 21c / kWh is an annual cost of ~$110; a 120W NAS is double that, ~$220.
I’m assuming you want Internet access to your NAS (perhaps to mirror its content off-site).I pay $110 / month for 100/40Mbps Internet with a static IP in Sydney.Obviously, I use that for more than just my NAS, but it means I’m paying $1,320 per year to ensure it is online.The static IP and upgrade to 40Mbps upload is $20 per month, so let’s say that’s the special “NAS” part of my Internet, which is $240 / year.
Finally, maintenance.Disks do fail, and you need to allow a budget to replace them (the cloud providers do).I’m going with 7.5% per year of the original purchase price, which should be enough to buy a replacement disk after a few years.That’s $90 / year for our $1200 NAS.
A quick comparison with AWS shows a NAS is similar in cost once you include ongoing costs:
Reliability: Backblaze publishes the best public statistics on HDD failure rates.
There’s a 1-2% chance of any hard disk failing each year (assuming data centre conditions; assume worse environmental conditions for your NAS).So it’s quite likely the disks in your NAS will survive 10 or more years.
On top of that, all NAS devices employ some kind of technology to detect and correct failures on a regular basis, and notify you when that failure happens.That means there is an automated system checking if your disks are working or not, so there should be a very short time between an actual failure and when you can take corrective action.
All this means, disks in a server are very, very reliable.Not quite as reliable as the cloud, but still very good.
Longevity: The NAS itself should last 5-10 years, at which point you’ll need to migrate data to a new device.
Disks should last forever, so long as you can afford timely replacements.That is, the automated monitoring built into NAS devices is really important at keeping your data safe.
The underlying technology of a NAS is Ethernet + various file transfer protocols. While they may become obsolete in 10-20 years, I don’t see them disappearing entirely in that time frame. Every time you buy a new NAS (say every 10 years) you are automatically upgrading this core tech.
Access: NAS devices support various file transfer protocols for Windows, Mac and Linux devices, so no problems accessing.Mobile device support is not as good, because mobile devices are “cloud first” platforms.
Access outside your local network is dependent on your Internet connection. While my residential connection might have a few minutes of down time each month (which I rarely notice), it’s nowhere near as good as the cloud providers.
Scale: NAS devices support a fixed number of disks.Once you install all those disks, your choices are a) buy a new (bigger) NAS to scale up, b) buy a second NAS to scale out, c) get into clustered file systems - which are expensive and require IT experts.There’s only so many disks you can fit in a single server.
For personal and small business use, ~70TB is a reasonable upper limit for an 8 disk NAS with 12TB disks.Larger NAS devices are available supporting 16 disks (plus another 16 disk expansion), which gives ~320TB.
Scaling out and buying more NAS devices also works.But then you need to think of a way to split your storage up between each device.
Simplicity: Running your own hardware is always more complex than using “the cloud”; you need a higher degree of technical knowledge to get it right.Having said that, NAS devices are the easiest way to add reliable on-prem storage.Most consumer orientated devices will have wizards and walk-throughs to get you started.
And, if you’re a business that needs to store more than 50TB of data, you’ll likely have professional help available.
Automated: NAS devices run 24/7 and should be always accessible on your local network.Combined with a wide variety of storage protocols, pretty much any non-mobile device should be able to automate backups with your NAS.
That gets more complex if you need connections from outside your local network, depending on your Internet connection.
Security: Data is in your own hands, so the security and privacy of on-prem solutions are equally in your hands.All software listed have built in security and encryption - so the main point of failure is human: incorrect configuration or simply forgetting to revoke access to ex-employees.Also, make sure you keep your NAS up to date - bugs and security vulnerabilities are found frequently, updates are key.
Recommendations: NAS devices are a great way to store backups.They have good reliability and longevity, plus are competitive with the cloud on cost, and pretty easy to configure.If you need global access to your data, they might not be as good, depending on your Internet connection.
Criteria | Rating |
---|---|
Capital Costs | 3/5 |
Ongoing Costs | 4/5 |
Reliability | 4/5 |
Longevity | 4/5 |
Access | 4/5 |
Scale | 3/5 |
Simplicity | 3/5 |
Automation | 4/5 |
Security | 4/5 |
Overall Suitability for Backups | 5/5 |
Overall Suitability for Archives | 3/5 |
Disks (either hard disks or solid state drives) can be purchased in an external enclosure with USB connection and stored in a safe place (possibly an actual safe).
In many ways, this is simpler than NAS devices.Buy a disk, copy data on it, stick it in a safe, done.
Capital Cost (AUD): $80ea (1TB), $100ea (2TB), $160ea (4TB).
You absolutely 100% must without exception buy multiple disks for redundancy.Data should be copied onto at least 2, preferably 3 disks.And then stored in different locations.
If you want to store them in a real safe, you might need to buy one. Costs start at $500 and can reach $3,000 for larger fire proof safes.
If you don’t care for a safe, storing disks on a bookshelf is nice and cheap (if not very fire resistant).
Ongoing Cost / GB (AUD): ~5c (2TB drive).
There’s no electricity being used, and no Internet required, so no ongoing costs for existing media.
OK, we should allow some maintenance because these disks will fail.However, we’ve already factored 2x or 3x redundancy in capital costs.
Cost per raw GB is 5c.You need to multiply that by your desired level of redundancy.
Reliability: While Backblaze publishes HDD failure rates, these do not apply to disks stored offline.
In my failure modes article, I looked for good statistics about the reliability and longevity of disks stored offline. There’s nothing remotely comparable to Backblaze’s data.
My anecdotal data: I used external disks for backups for ~5 years.The biggest source of failures was me dropping them accidentally.You can get rugged external disks which can mitigate this risk, but the “physical factor” is much more important when you’re physically moving disks around.
Longevity: As with reliability, there’s minimal data in this area.
Checking the table on my failure modes article, 5 years looks very safe, 10 years is possible, and 20 years is the upper limit.Solid state disks have a shorter life time (and we have even less data about them).
The advantage a NAS has in this area (automated reliability checks and notifications) doesn’t apply. You need to manually pull disks out of your safe on a regular basis, and test for correct operation.
USB should be around for another 10-20 years in some form, so that’s relatively safe.
Access: Offline devices are harder to access by definition.You need to manually retrieve the device, and connect it to a computer to read data.
An external catalogue of disk contents (or at least a good labelling system for the physical disks) is highly recommended.If you need to check every file on every disk, it might take a long time to find what you’re looking for.
Scale: Boxes of external disks scale up really easily: just keep buying more disks (and boxes).This assumes you can divide your data up logically (eg: by year or month).
Kinda interesting that a NAS has an upper limit because all the disks need to be running in the same device at once.While if you’re happy for your data to sit offline, the only limit to scale is your wallet and size of warehouse.
Simplicity: On one hand, “just copy data to disks and stick them in a safe” is about as simple as you can get. But retrieving that data can be extremely painful if you don’t have a catalogue or index of your disks.
Automated: By definition, offline / physical operations cannot be entirely automated. They can certainly be supported by scripts to copy data, reminders to move disks to the safe, and maintenance schedules. But any process that can’t be 100% automated can be forgotten, or done inconsistently.
The biggest risk is testing old disks. We have minimal data about how long we can leave a hard disk powered down and still be able to read data from it. So those tests are incredibly important. And also the most likely thing to be neglected or forgotten.
Security: Data is in your own hands, so the security and privacy of on-prem solutions are equally in your hands. Offline devices require physical access, which is much easier to understand - no key to the safe means no access. No hacker from the other side of the world can touch them. And (with the exception of when disks are attached to a computer) they cannot be wiped or encrypted by malware like Cryptolocker.
Recommendations: External disks are a reasonable offline storage mechanism. However, NAS devices are better for backups (because disk maintenance and backups can be 100% automated). And there are better options for long term archives (see below).
In spite of my negative recommendation, if the other options are unsuitable for your scenario, don’t make perfect the enemy of good. External disks are ∞% better than no disks at all.
Criteria | Rating |
---|---|
Capital Costs | 4/5 |
Ongoing Costs | 5/5 |
Reliability | 3/5 |
Longevity | 3/5 |
Access | 4/5 |
Scale | 4/5 |
Simplicity | 5/5 |
Automation | 3/5 |
Security | 5/5 |
Overall Suitability for Backups | 3/5 |
Overall Suitability for Archives | 3/5 |
Writable CDs and DVDs are the most common forms of optical media. But I’m only going to consider Blu-ray disks here (because CDs and DVDs simply don’t have the capacity needed in 2021). Blu-ray capacity ranges from 25GB to 128GB.
The technological development of optical media has been left behind due to NAS devices and high speed Internet connections. But the Archival Disc is a Blu-ray successor designed explicitly for a 50 year lifetime. (It also costs over $10,000 for drives, so out of reach for personal and small business scenarios).
Capital Cost (AUD): ~$200 for Blu-ray burner.
Assumption: you have a computer available to plug it into. Internal SATA and external USB burners are available.
As with external hard disks, you should buy multiple burners for redundancy. And you may need a safe, bookshelf or small warehouse for storage.
Ongoing Cost / GB (AUD): ~9c.
There’s no electricity being used, and no Internet required.
Single layer Blu-ray disks store 25GB and cost ~$2.15ea (on average). That works out to ~9c per GB. As with external hard disks, you need to factor in your desired level of redundancy (minimum 2x, recommended 3x).
Note that I found Blu-ray media a little hard to find via Australian vendors. I resorted to eBay to import direct from the US or Japan, with good results.
This is more expensive than external hard disks, but quite competitive with the cloud.
Reliability: Once burned and verified, I’ve found optical disks have very high reliability. Unfortunately, that’s based on my experience, not published data.
In my failure modes article, I outlined my anecdotal evidence for CD and DVD based backups still being accessible after 10-20 years, even when there was no maintenance or regular tests done on the disks. This was a giant experiment that I didn’t realise I was running! But it shows a 99.9% success rate for optical media.
I’ve also done some “test to destruction” tests for Blu-ray disks: the real killer is direct sunlight. Every disk exposed to extended sunlight showed failures within 1 month. Heat and cold are less of a problem.
Scratches are a concern. Blu-ray has made improvements to disk coatings to mitigate scratches. But care when handling disks is still important.
Note that some Blu-ray drives support surface error scanning which can estimate if a disk is degrading and will fail soon. Apparently mine doesn’t (and ones that do are hard to come by). I found a reasonable proxy is the read speed: if a disk reads at high speed, it’s probably OK; if it reads very slowly and has a number of retries, it’s likely to fail soon.
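One way to approximate that read-speed check is simply to read everything off the disc and time it. A minimal sketch (the drive letter is a hypothetical example; very low throughput or read errors suggest the disc is on its way out):

```powershell
# Read every file on the disc (D:) and report overall throughput plus any read failures.
$bytes = 0
$failures = @()
$timer = [System.Diagnostics.Stopwatch]::StartNew()
foreach ($file in Get-ChildItem -Path "D:\" -Recurse -File) {
    try {
        # Reading the whole file into memory is crude but fine for a spot check.
        $bytes += ([System.IO.File]::ReadAllBytes($file.FullName)).Length
    } catch {
        $failures += $file.FullName   # a failed read is a strong hint the disc is dying
    }
}
$timer.Stop()
$mbPerSecond = ($bytes / 1MB) / $timer.Elapsed.TotalSeconds
"Read {0:N0} MB at {1:N1} MB/s with {2} failures" -f ($bytes / 1MB), $mbPerSecond, $failures.Count
```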
Longevity: As with reliability, there’s minimal data in this area.
I consider optical disks a better way to store data offline, as compared to hard disks. Optical disks have separate reader and media: as long as your disks are OK, you can always buy another reader. Modern hard disks integrate the physical media and reading interface: so the data might be OK, but if the disk firmware or motor fails, it’s very expensive to read the data.
Blu-ray disks also use non-organic material. The organic dyes used with writable CDs and DVDs were a big concern (although I never observed failures). With Blu-rays, this isn’t an issue any more. Hard disks use a magnetic basis for storing data; this will decay over time if the drive isn’t powered on (although it’s unclear how quickly).
Finally, there is a Blu-ray M-disc technology which claims “a projected lifetime of several hundred years”. As far as I’m aware, there is no other consumer technology that makes such a claim. (And my test-to-destruction tests of M-disc Blu-rays have yet to cause a failure; 6 months and counting)!
As with external disks, you need to manually pull your Blu-rays out of your safe on a regular basis, and test for correct operation.
As long as you can purchase a new Blu-ray reader, you should be able to read data from the disks. Given that CD readers have been available for ~30 years and are still sold today, we’re reasonably safe here.
Access: Offline devices are harder to access by definition. You need to manually retrieve the device, and connect it to a computer to read data.
An external catalogue of disk contents (or at least a good labelling system for the physical disks) is highly recommended - and even more so for optical media as it is significantly slower than hard disks, and has lower capacity (so more disks). If you need to check every file on every disk, it might take a very long time to find what you’re looking for.
Scale: Boxes of Blu-ray disks scale up really easily: just keep buying more disks (and boxes). This assumes you can divide your data up logically (eg: by year or month).
Blu-ray disk capacity starts at 25GB for single layer disks. 50GB dual layer, 100GB triple layer and 128GB quad layer disks are available. Be aware that the 100GB and 128GB disks use a slightly different technique when burning, which makes them incompatible with older readers.
There are even disk library systems available (for $call) which store up to 50TB of data.
Simplicity: Optical disks are more difficult to write than external hard disks. Modern operating systems generally make this straight forward, but it’s more involved than “just copy data to disks”. Remember that retrieving data can be extremely painful if you don’t have a catalogue or index of your disks.
Automated: By definition, offline / physical operations cannot be entirely automated. They can certainly be supported by scripts to copy data, reminders to move disks to the safe, and maintenance schedules. But any process that can’t be 100% automated can be forgotten, or done inconsistently.
The biggest risk is testing old disks. Although I’m more confident about the longevity of optical media as compared to external hard disks, we still don’t have much data on the topic. So those tests are incredibly important. And also the most likely thing to be neglected or forgotten.
Security: Data is in your own hands, so the security and privacy of on-prem solutions are equally in your hands. Offline devices require physical access, which is much easier to understand - no key to the safe means no access. No hacker from the other side of the world can touch them. And write-once optical media cannot ever be wiped or encrypted by malware like Cryptolocker.
Recommendations: Optical media is an excellent offline storage mechanism for TB scales of data. The best use case is for long term offline archives. NAS devices are better for short term backups (because they are easier to automate).
Criteria | Rating |
---|---|
Capital Costs | 5/5 |
Ongoing Costs | 5/5 |
Reliability | 5/5 |
Longevity | 5/5 |
Access | 3/5 |
Scale | 4/5 |
Simplicity | 4/5 |
Automation | 3/5 |
Security | 5/5 |
Overall Suitability for Backups | 3/5 |
Overall Suitability for Archives | 5/5 |
Magnetic tape has been around longer than hard disks, and is very well understood as a long term data storage medium. Its capacities are significantly higher than optical media, and similar to external hard disks (1.5TB for LTO-5 tape).
While there are many standards for magnetic tape, Linear Tape-Open is the most common.
Note: my personal experience with tape is very limited (I used it for business client backups in ~2004).
Capital Cost (AUD): $1,500 to $7,000. And sometimes $call.
Lenovo, HP and Dell all sell new LTO-6, LTO-7 and LTO-8 tape drives. However, many don’t publish prices on the Internet. Finding the drives via other Australian vendors is also an exercise in futility.
These devices are available second-hand on eBay for $300-$2000, although they are usually older (LTO-4, LTO-5, LTO-6).
Ongoing Cost / GB (AUD): ~1c.
There’s no electricity being used, and no Internet required.
Tape cartridges are slightly easier to find pricing for, and are available for $100-200ea. Interestingly, there isn’t a significant premium for newer cartridges; LTO-6, LTO-7 and LTO-8 are priced within $50 of each other. And when the capacities of those are 2.5TB, 6TB and 12TB respectively, the cost per GB is really good!
As with external hard disks, you need to factor in your desired level of redundancy (minimum 2x, recommended 3x).
Reliability: As I’ve had no recent experience with tapes, it’s hard to know how reliable they are.
The published reliability of magnetic data tape suggests it is very good. And given that tape (indeed any offline storage) is the last line of defence, high reliability is very important.
Given the low cost of tape cartridges, it would seem very silly to only have one copy. The usual 2x or 3x redundant copies should apply to tape to ensure reliability.
Longevity: LTO tape is designed for 15-30 years of archival storage.
As with external disks, you need to manually pull your tapes out of your safe on a regular basis, and test for correct operation.
As an individual consumer, finding tape drives is quite difficult. I assume if I were a medium or large business, I’d have a direct line to a large vendor who would make this process very easy. And given that there are many LTO manufacturers, I’m assuming this is a relatively safe technology.
One thing I noticed was that a drive only supports the current generation, and the previous two. So an LTO-8 drive can read/write LTO-7 and LTO-8 media, and read LTO-6 media, but can’t touch LTO-5 and earlier. That’s not a great property.
Access: Offline devices are harder to access by definition. You need to manually retrieve the device, and connect it to a computer to read data.
An external catalogue of tape contents (or at least a good labelling system for the physical tapes) is highly recommended - and even more so for tape media as it is significantly slower than hard disks. If you need to check every file on every tape, it might take a very long time to find what you’re looking for.
Scale: Boxes of tapes scale up really easily: just keep buying more tapes (and boxes). This assumes you can divide your data up logically (eg: by year or month). There are even tape libraries that make it easy to work with many tapes (just don’t expect to be able to afford one in your home).
Simplicity: Tapes are even more complex and unusual than optical media. Because I haven’t had any recent experience with tapes, “not simple” is all I can say here.
Automated: By definition, offline / physical operations cannot be entirely automated. They can certainly be supported by scripts to copy data, reminders to move tapes to the safe, and maintenance schedules. But any process that can’t be 100% automated can be forgotten, or done inconsistently.
Security: Data is in your own hands, so the security and privacy of on-prem solutions are equally in your hands. Offline devices require physical access, which is much easier to understand - no key to the safe means no access. No hacker from the other side of the world can touch them. There are write-once LTO tapes (although I understand that’s based on tape firmware rather than a physical property of the tape cartridge), and write-once media cannot ever be wiped or encrypted by malware like Cryptolocker.
Recommendations: Magnetic tape is an excellent offline storage mechanism for multi-TB scales of data. The best use case is for long term offline archives. NAS devices are better for short term backups (because they are easier to automate).
Criteria | Rating |
---|---|
Capital Costs | 2/5 |
Ongoing Costs | 5/5 |
Reliability | 5/5 |
Longevity | 5/5 |
Access | 3/5 |
Scale | 5/5 |
Simplicity | 3/5 |
Automation | 3/5 |
Security | 5/5 |
Overall Suitability for Backups | 4/5 |
Overall Suitability for Archives | 5/5 |
I’ll make brief mention of some useful software for doing backups or archiving.
RClone is a command line app which can copy and synchronise data between many different cloud storage providers. In short, it can be used to mirror data from your NAS to AWS S3, or between AWS and Azure, etc. It’s rather difficult to configure at first, but once working, it’s a fantastic way to ensure you have backups on both the cloud and an on-prem NAS. You do your backups to either the cloud OR your NAS, then use RClone to mirror to the other.
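Once configured, the mirroring itself is a single command. A sketch, assuming a remote named s3backup has already been set up via rclone config (the remote name, bucket and NAS path are invented placeholders):

```powershell
# Mirror the NAS backup share to an S3 bucket configured as the "s3backup" remote.
# Run this from a scheduled task after the regular backup to the NAS completes.
rclone sync "\\truenas\backups" "s3backup:my-backup-bucket/backups" --progress
```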
Cyberduck, and its more powerful cousin Mountain Duck, are my go-to tool for GUI / interactive use of cloud storage.
Cyberduck is similar to an FTP client, letting you explore your data on most cloud providers via a powerful interface.
Mountain Duck lets you mount your cloud data as if it were a local disk drive. So you can explore and work with data using the same tools you use for disks or NAS data.
Whenever I’ve discussed offline storage (hard disks, optical disks, tapes), I’ve recommended some kind of catalog or index, so you don’t need to inspect all your disks to find what you’re looking for. WinCatalog is such software. Its interface feels a bit dated, but it is extremely effective at keeping a searchable catalogue of your external media. And that’s a huge improvement over “hmm… maybe what I’m looking for is on this disk… nope, let’s try the next one”.
This one is Windows only, and costs AUD $30 (although there are frequent discounts).
One risk when archiving data is the disk will only be partially readable, so certain files can’t be recovered. MultiPar lets you add redundant parity data to a disk to mitigate this risk.
While I use MultiPar to ensure the integrity of files (via hash / checksum), my primary way to mitigate partial disk failures is to make multiple redundant disks! External hard disks, optical media and tapes are relatively cheap - if you care about your data, just make 2 (or more) copies.
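If you only want the hash / checksum part, plain PowerShell can do it without extra tools. A minimal sketch (the drive letter and manifest path are hypothetical examples):

```powershell
# Build a SHA256 manifest for everything on a disk, stored alongside the data.
Get-ChildItem -Path "E:\" -Recurse -File |
    Get-FileHash -Algorithm SHA256 |
    Export-Csv -Path "E:\manifest.csv" -NoTypeInformation

# Later: re-hash the files and report anything that no longer matches the manifest.
$manifest = Import-Csv -Path "E:\manifest.csv"
foreach ($entry in $manifest) {
    $current = Get-FileHash -Path $entry.Path -Algorithm SHA256 -ErrorAction SilentlyContinue
    if ($null -eq $current -or $current.Hash -ne $entry.Hash) {
        Write-Warning "Changed or unreadable: $($entry.Path)"
    }
}
```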
There’s lots of text, so here’s the TL;DR:
If you care about your data, you will have a copy on the cloud AND on-premises.
Cloud:
On-Prem:
Next up: I will outline my own choices of technology for personal and church backups (which you can probably guess based on my conclusions)!
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
Last time, we listed the failure modes possible when making long term backups and archives. Also remember the broad strategy.
The last thing we will consider before we get into the how of backups & archives is how we might need to access said backups & archives.
On one hand, this will require a certain amount of guess work. On the other hand, it’s very educated guess work. And there are some strategies which will help even if we guess wrong.
List the likely ways I need to access the backups & archives. How often that might happen. And how that influences my choice of technology.
I’ll start with some personal observations:
And some implications:
Let’s think about these in more detail.
The only way to structure long term backups is as time-series data. That is, data must be grouped by year (which works well for financial transactions), or by when it was created or last modified. That is, you store all the files, documents and records for 2020 on one disk, all the data for 2021 on another, 2022 on another, and so on.
Nothing else works. Nothing else scales. Particularly when you have 45+ years of data to retain.
The good thing is there’s only so much data you can create or modify in a given time period. And unless you’re Google or Facebook or Twitter, you can always backup everything that changed in the last year / month / week / day (choose whichever works best). If you end up with a particularly large year / month / week / day, you can usually break it up into smaller chunks. Or, in the worst case, split into multiple chunks (eg: A..K and L..Z, or first 100GB, second 100GB, etc). That is, when I say “store all data on a disk”, that may be “a set of disks” (2020 might only be 1 disk, but 2021 might be 2: January to June, and July to December).
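To make the time-series idea concrete, here’s a minimal sketch that stages files into per-year folders based on their last-modified date, ready to copy or burn to that year’s media (the source and destination paths are hypothetical examples; the same idea works for months or weeks):

```powershell
# Copy files into one folder per year, based on when each file was last modified.
$source = "\\truenas\photos"
$destination = "D:\Staging"
foreach ($file in Get-ChildItem -Path $source -Recurse -File) {
    $year = $file.LastWriteTime.Year
    $targetFolder = Join-Path $destination $year
    # Create the year folder if it doesn't exist yet, then copy the file into it.
    New-Item -ItemType Directory -Path $targetFolder -Force | Out-Null
    Copy-Item -Path $file.FullName -Destination $targetFolder
}
```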
Your backup media needs to scale with time as well. And time is big; 45 years is a long time. Some media does this better than others: hard disks in a server will eventually run out of physical space in the server, while external hard disks can just pile up until your warehouse is full (and you can get very big warehouses). Even if you decide to convert your warehouse into a data center with lots of servers, it will be much more expensive than the raw media - the servers themselves, people to maintain them, electricity to run them: they all cost money. The cloud is very good at scaling up, if you choose the right storage product: “object storage” is effectively limitless, “block storage” has an upper limit.
(Aside: time also impacts media longevity, as we discussed in part 2’s failure modes. I’ll consider that in more detail in the next post).
There are five access scenarios to consider:
Scenarios 1 and 2 are the regular operations of creating and maintaining backups. Scenarios 4 and 5 are pretty much the same. So, when you need to restore from backups, we’re down to 1) everything, and 2) a few things.
If the everything scenario happens, you’re going to grab all your backups and restore everything from them in sequence (or parallel if you can load multiple disks at once). There’s no worry about “do we need this or not?” - you need everything so the restore is done in bulk. The access pattern is sequential, and all media is really good at sequential.
If the few things scenario happens, you need to be more targeted in which backup disks you restore from. You need some way of identifying which disk(s) are of interest. So, at minimum, you should keep a list of the files on each disk separately. Even better, an index or table of contents that you can look at without loading every disk. Also, some backup technologies are much better at random access than others - HDDs, optical disks and the cloud are all good at reading one thing; tapes not so much.
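Assuming you keep a per-disk catalogue as CSV files (the folder, column names and search term below are invented for illustration), finding which disk holds a particular file becomes a one-liner rather than a trawl:

```powershell
# Search every per-disk catalogue for a file name, and report which physical disk to fetch.
Get-ChildItem -Path "C:\Catalogue\*.csv" |
    ForEach-Object { Import-Csv $_.FullName } |
    Where-Object { $_.FullName -like "*tax-return-2021*" } |
    Select-Object Disk, FullName
```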
Backups and archives are accessed, by nature, infrequently. Here are the operations you perform on backups, in order of frequency (most frequent first):
Always optimise for your common operations. There’s no point making sure you can restore an individual file in under 5 seconds if it takes a week to back it up in the first place. Making backups & archives needs to be quick & painless (and automated whenever possible). Verifying should also be straight forward.
Generally, people assume that pulling data from a backup doesn’t happen instantly. So if it takes an hour or even a day to complete a restore, that’s OK. (Of course, always ensure your users understand and agree to any time frames).
I’ve just said it’s OK if restores take time. Well, this is a special case where it’s not OK.
If your one and only server crashes, every minute longer the restore takes is a minute of lost productivity multiplied by every user (in a business, you can easily put a dollar figure on this; it gets big very quickly).
If you need disaster recovery, you need it fast! So, you should a) plan your backups so a restore can happen fast, b) do practice runs so you understand exactly what needs to happen, and c) optimise & automate so that it happens faster!
Fortunately, not everyone needs disaster recovery - if my personal TrueNAS server fails and I can’t get it running again in under 24 hours, it will be a headache for me, but it’s not like I’ll lose a million dollars or get fired.
For the data Wenty Anglican needs to retain for Safe Ministry purposes, there is one additional access scenario: a legal request.
I’d expect it to go something like:
“B Bloggs has allegations of <insert terrible crime here> made against him / her. As the police / prosecution / defence team, we require all relevant ministry documents from Wenty Anglican pertaining to B Bloggs in the ministry role of youth and children’s leader from January 2027 to December 2030.”
I’m hoping that will never happen, but history shows that some people, given power over another, will abuse it some of the time (Christians call that “sin”). So I’m expecting it will happen one day.
And that day will suck if I’m still in charge of church backups & archives.
From a data access point of view, I have a date range, so I can get any disks that have data for that period of time easily (time series data).
There are two additional criteria: the person, and their role. Ideally, I want some way to identify documents or data based on those criteria. So that I don’t need to trawl through 3 years of everything.
For now, I won’t answer that question. Part 8 will look into how to structure data within backups & archives, and how to create good indexes to find things within offline media. But it’s something to keep in mind.
And you should consider if there are special access scenarios you need to optimise for in your particular situation. Otherwise, you get to trawl everything.
In this post, we’ve established all backups & archives need to be time-series data, broken down by year or month. We’ve identified our core access scenarios: everything & a few things, and know we will need some kind of index for the few things scenario. And we’ve identified how frequently we need to access our backups: very rarely - so we should optimise for creating & verifying rather than restoring (unless you have a special case that demands otherwise).
Now that we’ve covered failure modes, identified what needs to be on the backup, and the ways we need to access the data, we’re in a position to make intelligent decisions about what backup technology to use!
Next up: I will list different technology options for backups & archives. And discuss pros and cons of each, based on the criteria I’ve listed.
I’ve used LetsEncrypt to generate publicly trusted certificates for any websites I’m running. And used InstantSSL to generate similar S/MIME certificates for my email. These are all free services, which is fantastic.
But there are limitations to them: LetsEncrypt requires a level of automation for maintenance - you can’t install a certificate and forget about it. And it works best if you have shell / console access to the machine you want the certificate on, and that machine has public Internet access.
There are other places I’d like certificates, like internal only websites, or routers - they are using plain HTTP, and browsers get irritated at this “non-HTTPS” thing these days. And there’s more you can use certificates for than just HTTPS: I’d like to have a go at EAP WiFi using certificates, due to an increasing list of security gotchas and issues with WPA2 and WPA3 (EAP is the enterprise equivalent, and seems to have held up better security-wise).
For internal use, I could mint Self Signed Certificates, but they aren’t trusted by devices - they encrypt your data but don’t provide any clear identity for the service you’re connecting to. And if you have to click through all the security warnings, you’re teaching your users the wrong thing. If I had one root certificate to sign the certs installed on my services, I could trust that one certificate to rule them all and my devices would be happy!
And this is exactly what a Certificate Authority (aka, the companies who sell you SSL certificates) does! They have a root certificate, trusted by your browser, operating system or device, and then follow special rules to make sure they only mint certificates for the right people.
If I could be my own Certificate Authority (CA), I could make whatever certificates I wanted! Of course, they’d only be trusted by my own computers and devices, but I can live with that.
Indeed, there’s a sense in which creating my own certificates is more secure than paying someone else to. After all, the magic certificates and keys never leave my network.
I’d always thought creating my own certificates would be just too hard. Then there was a work project that… well… encouraged me to just do it.
Turns out a few PowerShell commands is all I need.
Be my own Certificate Authority. That is:
Before we get to certificates, we start with asymmetric cryptography. This is a bunch of magic maths which lets you encrypt and decrypt data - but only in one direction. “Asymmetric” comes about because the key has two parts: public and private. The public half is available to all and sundry, and lets you encrypt data or verify signatures. The private half is secret to the owner only, and lets you decrypt data and create signatures. The public half can never decrypt or sign, and the private half can never encrypt or verify, so they’re a bit like one-way mirrors.
Data -> Public Key -> Encrypted / Signature |
Asymmetric cryptography is used in a number of computing applications and contexts. The best known is SSL / TLS and HTTPS. But it’s also used by SSH, PGP and the infamous Bitcoin.
While asymmetric cryptography is wonderful, it’s just maths. And maths can be used for lots of things, not all of which are useful. So, we need to impose rules on what different key pairs can do, when they are valid, what contexts they are valid in, and so on.
In particular, the maths allows us to be very confident of a secret conversation with another party - that’s wonderful and a big part of what makes HTTPS “secure”. However, on its own, it doesn’t help identify the other party - so we might be having a very secure conversation with the Bad Guys™, because we couldn’t confirm their identity.
Enter X.509.
“SSL Certificates” are actually X.509 certificates. These are horribly complicated things which define a bunch of properties and rules on top of your public / private key pair. In the context of HTTPS, they enable reasonably high confidence in the identity of the other computer.
One of the rules is “what servers is this certificate valid for” - which corresponds to the name you type into your browser’s address bar. My blog is blog.ligos.net, so the certificate must also be valid for blog.ligos.net for web browsers to accept it.
So, the question becomes: how do you get a certificate for blog.ligos.net? Or more specifically, how can someone else validate Murray is really the owner of blog.ligos.net? Or, in the negative, how does the validation process prevent the Bad Guys™ getting a certificate for blog.ligos.net?
There’s a standard for that. If you want to be a Certificate Authority, there are processes you need to follow to check identities before issuing certificates.
There are two common ways, and a third complex one:
1. The CA sends a validation code for the domain (eg: via email or a DNS record) - if I control ligos.net then I can get access to that code.
2. The CA asks for a specific file to be published on the web server - if I control blog.ligos.net then I can create that file.
3. The CA verifies the real-world identity of the person or company requesting the certificate.

The first two ways simply validate someone (or something) controls the domain name or web server. The third way is a stricter validation of the actual person (or company) identity.
And in practice, all three ways can be faked if you try hard enough. None are foolproof, but they present enough difficulty to the Bad Guys™ that the system works most of the time.
One thing I didn’t explain is how the Certificate Authority communicates to end users that it successfully validated the blog.ligos.net certificate. That is, if every person who visits blog.ligos.net needs to send me an email to verify I own that domain, the whole internet would break very quickly!
The Certificate Authority signs the blog.ligos.net certificate to say “yes, this is valid”. As long as you trust the CA, you trust anything the CA has signed, so you trust blog.ligos.net.
The Certificate Authority has a root certificate, which is the thing your web browser knows about. That certificate might chain to zero or more intermediate certificates. Before finally blog.ligos.net is signed at the very bottom.
This “chaining” allows a small number of trusted root certificates to scale out to the whole Internet.
OK, enough theory, let’s make certificates!
First up, we need to create a root certificate. This is what will pretend to be our very own Certificate Authority.
PS> New-SelfSignedCertificate |
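To give a sense of the shape of the command, here’s a sketch of a full invocation - the subject, organisation and friendly name below are placeholders I’ve invented for illustration, so substitute your own:

```powershell
# Create a self-signed root certificate in the current user's Personal store.
New-SelfSignedCertificate `
    -Subject "CN=My Root CA, OU=Home, O=Example, DC=example, DC=net, S=NSW, C=AU" `
    -FriendlyName "My Root CA" `
    -NotAfter (Get-Date).AddYears(50) `
    -KeyUsage CertSign, CRLSign, DigitalSignature `
    -TextExtension @("2.5.29.19={text}CA=true") `  # basic constraints: mark this as a CA certificate
    -KeyAlgorithm RSA `
    -KeyLength 4096 `
    -HashAlgorithm SHA384 `
    -KeyExportPolicy Exportable `
    -CertStoreLocation "Cert:\CurrentUser\My" `
    -Type Custom
```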
There are many options here, let’s walk through them all:
- Subject: the official name of the entity / person. It is a list of key-value pairs, where the most specific is on the left, and the least specific on the right. C = country, S = state, DC are parts of domain names (ligos.net in my case), O = organisation, OU = organisation unit, and CN = common name. Given we’re inventing a CA, you can put whatever you like here!
- FriendlyName: is what most browsers display to the user. Best to make it the same as “common name” (CN).
- NotAfter: indicates when the certificate expires. I’ve set mine to expire in 50 years, because I only want to create one root certificate (and I’m not expecting to be issuing certs in 50 years time).
- KeyUsage: a list of things the certificate is allowed to do, all variations of “signing”.
- TextExtension: some magic which says “this is a root certificate”. This is essential for all browsers to trust your certificate as a true certificate authority.
- KeyAlgorithm: RSA is the most common, and oldest.
- KeyLength: the RSA key size. 4096 is the largest, which is best practice for the root certificate.
- HashAlgorithm: SHA384 is higher than the usual 256 bit version. Again, biggest is usually better for root certificates.
- KeyExportPolicy: tells Windows we are allowed to export (and backup) the private key. Yes, you need to backup your certificate key!
- CertStoreLocation: tells Windows to save the generated certificate in your “Personal” store. More about that below.
- Type: there are pre-defined types of certificates. Root certificates are not one of them.

After you run the command, Powershell will tell you the thumbprint for your brand new root certificate. Make a note of this, because you will need it when issuing certificates.
Thumbprint Subject |
Your private key is currently accessible to any application you run. Which means, if you get malware on your computer, the Bad Guys™ could create their own certificate that your computer trusts. Potentially letting them impersonate any website (eg: your bank).
To stop this, you should export the certificate including the private key (which goes somewhere very safe as a backup). Then re-import it with certificate protection. This requires a password to be entered each time you create a new certificate using your root.
Steps to Export
Search for “Manage User Certificates” to open Certificate Manager. Expand “Personal” > “Certificates”.
Right click your new certificate > All Tasks > Export. Make sure you “export the private key”. And tick “Export all extended properties”.
Give your certificate a password and save it.
Finally, delete the certificate from Certificate Manager!
Steps to Import
Double click the file you saved. Import for “Current User”.
Ensure “Enable strong private key protection” is ticked. And “Mark this key as exportable” is unticked.
Each time you create a new certificate using your root CA, you will be prompted for its password. (And you should make 200% sure you have that certificate file backed up; because if you lose it, you have to start again).
You need to load your root certificate into your operating system certificate store. Only then will it trust it.
First, repeat the above process to export your certificate without the private key:
This file can (and should) be redistributed publicly. Anyone who installs it will trust certificates you create. The onus is on them to verify your identity and decide to trust you (or not).
Import the root certificate into the “Trusted Root Certificate Authorities” store by double clicking and then “Install Certificate”. Be sure to place the certificate in the “Trusted Root Certificate Authorities” store:
You will need to repeat this process on every device that you own.
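On Windows machines this can be scripted rather than clicked through, which helps when there are a lot of devices. A minimal sketch (the file path is a hypothetical example, and an elevated prompt is needed for the machine-wide store):

```powershell
# Install the public half of the root certificate into the machine's trusted root store.
Import-Certificate -FilePath "C:\Certs\MyRootCA.cer" -CertStoreLocation "Cert:\LocalMachine\Root"
```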
You may also need to load the certificate into application specific stores, for example, Firefox has its own certificate store that you can find in Settings.
Now, your device & applications should trust any certificates issued by your brand new Certificate Authority! Let’s make one:
PS> New-SelfSignedCertificate |
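For illustration, a full invocation might look something like the sketch below - the DNS names, IP address and thumbprint are invented placeholders, not values from my network:

```powershell
# Create an HTTPS certificate for a server, signed by the root CA created earlier.
New-SelfSignedCertificate `
    -DnsName "truenas.internal.example.net", "nas.example.net", "10.0.0.10" `
    -Type SSLServerAuthentication `
    -Signer (Get-Item "Cert:\CurrentUser\My\0123456789ABCDEF0123456789ABCDEF01234567") `  # root CA thumbprint
    -NotAfter (Get-Date).AddYears(10) `
    -CertStoreLocation "Cert:\CurrentUser\My"
```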
I’ll outline the major differences:
- DnsName: this is a special case of “subject”. We use a powershell array to list all DNS names we might access this server by. In this example, there’s an internal DNS name, a public name, and an IP address. The first name becomes the “common name”, others are known as “alternate names”.
- Type: unlike root certificates, there’s a well known type for HTTPS.
- Signer: this is the thumbprint of your root certificate.
- NotAfter: 10 year expiry. I expect my server will be replaced before then. Be careful setting a longer lifetime than your root certificate.

When you run this command, Windows prompts you for the root certificate password (hopefully, making it difficult for Bad Guys™ to get their hands on your precious root cert):
Thumbprint Subject |
Once again, your new certificate will be accessible in Certificate Manager. I’m not as paranoid about backing up HTTPS certificates I create. They cost me 10 minutes of my time - if I lose one or muck it up, I can just create another.
(But just to remind everyone, your root certificate MUST, without fail or exception, be backed up)!
After deploying my new certificate, Firefox now trusts my connection to my TrueNAS server! (Even if it has a small disclaimer).
The final type of certificate is a “code signing certificate”. Developers may be interested in this to do code signing of executables and installers.
PS> New-SelfSignedCertificate |
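Again for illustration only, a sketch with invented placeholder values:

```powershell
# Create a code signing certificate, signed by the root CA created earlier.
New-SelfSignedCertificate `
    -Subject "CN=My Code Signing, O=Example, C=AU" `
    -FriendlyName "My Code Signing" `
    -Type CodeSigningCert `
    -Signer (Get-Item "Cert:\CurrentUser\My\0123456789ABCDEF0123456789ABCDEF01234567") `  # root CA thumbprint
    -CertStoreLocation "Cert:\CurrentUser\My"
```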
There are not many differences:
- Subject and FriendlyName: we’re back to the convention used in the root certificate.
- Type: there’s a well known type for code signing.

I’ve outlined the process to export a certificate using Certificate Manager from the Windows Certificate Store. When you include the private key, you will get a pfx file.
Different servers use the key pairs and certificates in different formats. Some can use pfx with a password, others require a pem file with no password. They’re all a bit different.
So we need to convert the pfx into other formats. Unfortunately, I’m not aware of a powershell command for this, so we resort to using openssl:
openssl pkcs12 -in certificate.pfx -out private_key_with_password.key |
The first command extracts the private key and certificate from a pfx file, and saves it in a password protected file.
The second command reads from an encrypted pem file, and saves the private key with no password.
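As a rough sketch of that second step (re-using the file name from the first command; the output file name is my own invention, and openssl will prompt for the pass phrase):

```powershell
# Strip the pass phrase from the private key - guard the resulting file carefully.
openssl rsa -in private_key_with_password.key -out private_key_no_password.key
```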
You may need to open the files produced by openssl, and copy+paste the contents (to get the exact certificate / key you’re interested in), but all the data is available.
Because OpenSSL is too complicated!
I originally set out to write this article using OpenSSL on a Linux server. And was confronted by this document outlining how to do certificates using OpenSSL.
If you thought this post is long, that link has 7 chapters and about 4400 words of “how to configure openssl” (and very little about how certificates work)!
Quite simply, I don’t need revocation servers and serial numbers and all the rest. I want just enough certificate to make browsers happy when connecting to my TrueNAS server or SyncThing or Mikrotik router.
You are now your very own Certificate Authority! And can create certificates trusted by… well… whoever you can convince to install your root certificate.
For use within a household, family or small business, this is fine. And a darn sight cheaper than “real” certificates.
Web browsers will stop nagging you about untrusted and unsecure connections.
(Have I mentioned you need to backup your root certificate enough yet)?
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
Last time, we listed the failure modes possible when making long term backups and archives. Also remember the broad strategy.
Before we consider the how of backups & archives, we need to ensure we can get our hands on the data we need! After all, it’s rather pointless to have a robust strategy for keeping 45+ years of data safe, if we forget to include crucial files or documents.
List the data I need to store on backups & archives. Then ensure I have access to said data.
Make a list of everything you need to backup.
The simplest list is “everything” - that way you won’t forget! Often “everything” ends up being too big and you have to choose, but we can cross that bridge later.
It might be too abstract to work out a meaningful list of “data”. So, you could check all devices you own / control and inspect the files on them. You could list applications used and the files they use. You could list all your cloud accounts to check for data in the cloud. And don’t forget hard copies.
Now you have a list of data (files, photos, videos, recordings, databases, financials, records, etc). Figure out what devices they reside on. It’s possible you have a centralised server (or servers), or they could be stored on each device, or perhaps in the cloud. Write down how you can access them. Write down how large each category is (MB, GB, TB, etc) and how much it grows each year - often one or two categories will make up 90% or more of the total data size. And finally, how you might include them in backups & archives (preferably via an automated process).
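Measuring how large a category is doesn’t need anything fancy. For example (the path is a hypothetical one):

```powershell
# Report the total size of one data category in GB.
$total = (Get-ChildItem -Path "\\truenas\photos" -Recurse -File | Measure-Object -Property Length -Sum).Sum
"{0:N1} GB" -f ($total / 1GB)
```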
If you want, you can make the data list in priority order, and ensure the most important things are backed up first. The relative size of each category might mean those are backed up less frequently.
You could also define a lifetime - for my purposes, the lifetime is 45+ years for everything. But it’s possible some data only needs to be retained for a few months or years - that might indicate different backup strategies are required.
Enough abstract principles, let’s make some lists!
The list of personal data I want backed up:
The list of personal devices:
I have passwords and admin rights to everything! So no problem with access. And all devices have OneDrive and / or Syncthing to automatically copy data to well known locations.
How does the data on each device get to a backup?
The rule is: if I want it backed up, it should end up on my TrueNAS server. Otherwise, it should be in the Microsoft or Google clouds. I can manage backups from all these locations, either via automated or manual processes.
Assumption: things stored in the cloud are pretty safe; I’m happy to do a manual bi-annual export. I had an automated GMail backup to local files, but it broke years ago and I never fixed it.
A note about cloud data: Google, Microsoft and Facebook all have an export all your data function. The format it is exported in is often mediocre, but it’s better than nothing. For other cloud services, you will need to search for an “export” function. If one is not available, that’s a big risk - if the provider goes out of business you will likely lose your data.
My photos and videos category is both the largest and fastest growing. It’s also my highest priority to survive any disaster or data loss event.
Church is considerably more complex. The main reason is data is stored on various devices owned by volunteers; centralised digital storage is a relatively new thing.
The list of church data I need to backup:
The list of devices I’ll need to get data from:
How does the data on each device get to a backup?
This is the main complexity of our church environment. I need to provide a way (probably via OneDrive) for people to store / submit data to church controlled systems. That’s a change to how people conduct their regular church ministry / work, so it’s not trivial - I need to provide processes, documents and technical support to assist non-technical people in this transition.
Key to this strategy is to move more ministry related data to cloud storage. The more data on servers / services I can access without asking, the easier I can automate backups.
One option I am toying with is Nextcloud, which is an “on-site DropBox”. Basically, something like OneDrive, but on our own hardware. The main reason is to increase our control over data with personally identifiable or sensitive information. It just so happens we have an existing Linux server with a few hundred GB of storage, which should be plenty for storing small documents.
The single largest category is church meeting recordings. Since November 2020, we’ve been live streaming and doing video recordings of all Sunday meetings (plus various other events), which is ~1GB per meeting. Previously, it was audio only recordings, which weighed in at 50MB per meeting. These video recordings dwarf all other categories of data, so they’ll need special treatment. However, in terms of surviving 45+ years, they are only of historical importance - compliance data is what we really need to keep long term.
We’ve identified the categories of data needed to be backed up, where they are stored and how we can get this data to a backup (at least at a very high level). Essentially, we’ve identified how to get access or control of any data we need to backup.
Which boils down to: what devices do I need access to? And: how can I export from my cloud service providers?
So make your lists and check them twice!
Next up: how will we need to access data? That is, access patterns will drive the storage technology chosen.
You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
So far, we have a broad strategy for making long term backups and archives.
To implement a viable technical solution, we need to be aware of why it won’t work. That is, we need to think of all the ways backups might fail over 100 years. That is, we need to know exactly how robust we need to be.
That is, failure modes.
Define likely (and unlikely) failure modes for data storage over 45 - 100 years. Remembering that over 100 years, even very unlikely failures become possible or even common.
I’ll discuss the various failure modes below, and give some examples.
The first group I call “insta-fail”. Which means, your backup was never viable in the first place.
Many other failure modes involve time: they become more likely over time, or your data degrades over time. Insta-fails are instant - your data is gone in the blink of an eye!
Examples:
Your backup didn’t actually work. Perhaps the backup disk wasn’t plugged in. Perhaps you didn’t run a manual process. Perhaps you don’t even have a backup!
If you never had a backup to start with well… you’ll have nothing tomorrow, let alone in 45 years. Insta-fail!
A variation: your backup didn’t include the files you need to restore. Perhaps you didn’t configure your backup correctly (missing includes, wrong excludes). Perhaps some files couldn’t be copied because they were in use - remember that important databases and financial records are often in use 24/7.
If you never backed up the files you need well… you’ll have nothing to restore tomorrow, let alone in 45 years. Insta-fail!
Is your backup encrypted? Make sure you never lose the password / encryption key! Modern encryption is built so that you need the exact password to decrypt your data - one character wrong is the same as everything wrong. If you forget the password to your backup, or lose the paper you wrote it down on, or can’t access your password manager - your backups are gone. Well, the data might be perfectly preserved, but you’ll never be able to read it. Insta-fail!
This brings up a tricky question when storing data for 45+ years: should you encrypt it or not? On one hand, there is almost certainly personally identifiable information in your backup, so you should encrypt it. On the other hand, how do you backup the password to your backups? Clearly you can’t use your normal backups for the password, but how do you make sure the password survives 45+ years? I’ll discuss that in more detail in a future post.
There’s a joke that goes: “backups never fail, but restores do”. That’s a jaded way of saying “restoring data is the important thing, backups are an incidental process along the way”. Don’t forget to test you can restore from your backups on a regular (if infrequent) basis.
Media failure is the most common thing people think of when storing data for a long time.Your hard disks, or CDs, or tapes, or whatever slowly degrade over time to the point where they can no longer be read reliably.
However, your backup needs to survive a long time before the media itself cannot be read! There are a few variations of this one, so let’s think about examples:
Your backups are lost. Perhaps your backups are on some USB hard disks, and you misplace them in some “safe” place. Or you move house once, or twice, or thrice and they disappear (maybe into the trash, maybe into… well… somewhere). Or they are filed into some system which makes no sense and they end up in some giant warehouse with the wrong label and no hope of finding them without inspecting all 1,000,000 items.
Your backups are stolen. A variation on “lost” - you are robbed and your precious backup on a USB disk that is connected to your laptop is pilfered along with your computer. Remember that thieves don’t discriminate: computer gear is computer gear is computer gear, and backups look just the same as any other computer gear.
Your backups are destroyed. This is “lost” into tiny little bits. Fire, flood, earthquake, tornado, and so on. Note that it doesn’t need to be a catastrophic event - a car crash while taking your backup hard disk home might be just as destructive as a fire. I’ve had a number of USB disks fail simply because I dropped them once too often. And if you are taking your backups home or off-site (which is a good thing) accidents become more likely.
In all these cases, if your backups are on physical media, you need to have that media in your hands to get the data off it.If it’s lost, stolen or destroyed - you have no backup.
Your backups survive for years, but simply degrade over time. OK, now we’re into the 45+ year realm! Nothing bad happened, but given enough time, even the best media will fail.
It’s an open question how long this will take, and depends on lots of environmental factors. But hard disks are rated for, say, 100,000 hours of use - which is around 11 ½ years. How might that change if the disks are in cold storage and never powered up? What about solid state disks? Or tapes? Or optical media?
I’ve put together a basic table of different media and approximate lifetime, based on the Internet. The “Refresh Interval” is the frequency you’d need to power up media and “scrub” for errors to achieve the “Life Time” reliably. Note that I found it quite difficult to find hard data on long term media lifetime; most is speculation and guess work, with the occasional anecdote. The best source is Backblaze’s hard drive report, but that is for running and active drives, not cold storage.
This reflects a cold, hard reality: no consumer media has survived 45 years, because none of this media was available 45 years ago.
Media | Life Time | Refresh Interval | Sources |
---|---|---|---|
Hard Disk | 8-20 years | 1-2 years | Source 1, Source 2, Source 3, Source 4, Source 5, Source 6, Source 7 |
Solid State Disk / SD Card | 5-10 years | 6-18 months | Source 1, Source 2, Source 3 |
Optical (CD / DVD / BluRay) | 7-30 years | None | Source 1, Source 2, personal experience |
Magnetic Tape | 15-50 years | ??? | Source 1, Source 2 |
A short story about the “personal experience” for optical media: I made backups from 2000-2010 on CDs and DVDs (stopping when my weekly backups exceeded the capacity of single layer DVDs), burning data and leaving the media on spindles. There was zero maintenance - disks went on spindles each week and were left in a cupboard. Occasionally, I made two copies and stored the other copy at my parents’ house. I dug these disks out recently and was able to read every disk, except one from 1999! There was definitely some degradation of older disks (reading was very slow), but only one hard failure.
So, ~50 CDs and DVDs tested out of ~200 burned. Age: 10-20 years. No maintenance. No special environmental control: I kept them away from direct light and water, but temperature would range from 10°C to an only-in-Australian-Summer 40°C.
And 99.9% success!
Cloud providers like AWS, Azure and Backblaze claim crazy reliable data availability of 99.9999% or more. And when they spend billions of dollars each year, they can probably do a better job than I can on a budget of $500. But there are a few failure modes that can catch you unaware.
Your Internet is down. Pretty obvious that you can’t access the cloud when the Internet is down.
Your Provider has a Temporary Outage. It’s possible the provider has a serious network outage - though this is much less likely because they have multiple redundant connections. What is more likely is an application issue, or an authorisation problem, or some other transient outage. These usually only last a few hours, and only happen once or twice a year, but they do happen.
Your Cloud Provider Disappears Forever. Yes, the cloud can disappear. And much faster than you think! Companies go out of business all the time, or decide cloud backups aren’t a profitable business model. While it is unlikely Amazon or Google or Microsoft will go out of business, over a 45-100 year time line who knows what might happen! Remember, “the cloud” is a trendy way of saying “renting someone else’s server” - rental agreements last a few years at most, not 45+ years.
Your Cloud Account is Unavailable. Personally, I think this is the scariest thing about using the cloud for long term archiving. If you forget or lose your account password, your data is gone. If the provider decides to block access to your account, your data is gone. If a government takes legal action against a cloud provider, your data may be seized and unavailable.
Basically, there are things completely outside of your control that could block access to your data in the cloud.
Technology gets old very fast. And the new and shiny quickly replaces last year’s amazing storage tech. When you’re thinking about a 45+ year time scale, whatever you use to store data today is definitely, 100%, without a doubt going to be obsolete when you really need to read it.
Floppy disks are a nice example. I haven’t touched a floppy disk since… I can’t remember! And I don’t own a computer capable of reading one any more. If my backups are on floppy disks, I’m in trouble.
I have backups on optical media (CDs and DVDs). Optical drives are not as popular as they once were, but some (not all) of my computers still have optical drives. Perhaps optical drives will go the way of floppy disks in 10 years.
SATA is the standard interface for consumer hard disk drives. 20 years ago it was IDE with 40 pin ribbon cables. 35 years ago there were ST506 controllers and MFM drives - and yes, I remember using them when I was a kid. NVMe is becoming more popular; perhaps it will surpass SATA in the next 40-ish years, rendering all today’s HDDs unreadable?
USB is everywhere today. But will it be in 40 years? 60 years? 100 years? You need a laptop or desktop device to read a USB disk; but mobile phones and tablets are more popular, yet cannot read a USB disk. If pocket computers completely replace desktops and laptops, how will you read your precious backups?
Using a NAS appliance with an Ethernet UTP cable to store your backups? I remember attending LAN parties in the mid-90’s with 10BASE2 coax cable. Wired ethernet seems to have stagnated in the consumer space recently; perhaps your next NAS will be WiFi only (I hope not, but who knows)!
Pretty much every storage technology in common use today didn’t exist 45 years ago. If you’re storing data for 45+ years, be ready to migrate from old to new technology.
Fortunately, all my doom-saying isn’t all that bad. Almost all the tech I’ve mentioned above is still available, it just might require some eBay purchases to acquire niche equipment.
Data can be available in open or proprietary formats. Open formats like PDF, RTF, JPG or MP3 are readable by many applications. Proprietary files are only readable by one application. If you can’t use that app any more, the data is also gone.
This kind of thing is very common in medical or industrial settings, less so for every day documents, pictures and videos. So this is more applicable to businesses using niche or specialised software.
I looked through some old backups from the late 90’s and found pm6 files. For various reasons, most of my work in high school was done using Adobe PageMaker. The last update for PageMaker was in 2001; I don’t have the disks any more, and even if I did, Wikipedia says it doesn’t work on Windows 10. So I have no way of reading those files - that data is gone.
An insidious form of this is proprietary backups. Imagine you purchase “Acme Backup”, which saves your data in acme files that only it can read. One day, Acme goes bust and your backup product is no longer supported. No problem, you continue using Acme because “not supported” doesn’t mean “it stops working”. But eventually, after a few computer upgrades it does stop working: Acme Backup is not compatible with Windows 2040. Even if your acme backup files are available, you lack the software to restore from them.
Perhaps the proprietary application is still available, but you just lost the license code required to use it. Best case, you need to buy a new license. Worst case, the application is not sold any more and you’re stuck.
This is a variation on Application Unavailable, but 100x worse. Not only is the application obsolete, the data format is also obsolete. That is, nothing out there can read your files.
Let me say, this is really, really, REALLY unlikely to happen. Applications almost never remove support for such core functionality.
But imagine if someone came up with something better than JPEG - images could be stored without any loss of quality, in just a few kB, encoded and decoded with minimal CPU usage. It is such a good format that everyone stops using JPEG. Cameras, phones, even the Internet all switch to this magical new image format. Eventually, application developers decide supporting JPEG is too difficult, too time consuming, and brings no benefit. So they remove JPEG support. And all those JPEG family photos from the early 2000s are unreadable.
(Note that we already tried to come up with a better JPEG - it didn’t take off).
Some of the core standard file formats include: JPEG, MP3, ZIP, PDF, UTF8 text. And let’s be honest, none of them are going to be obsolete any time soon. But 100 years is a long time.
Perhaps the file system on your disk isn’t supported any more. NTFS, ZFS, UFS, Ext4 are all in common use - and again, support for these isn’t likely to disappear. But 100 years is a long time.
One real world example of an obsolete standard (albeit not a file format) is SSL. The thing everyone calls SSL is actually TLS - Transport Layer Security. The current version of TLS is 1.3. Version 1.2 is also in common use. And 1.0 and 1.1 are considered a security problem, so many servers are disabling them. Poor old SSL is even older, being deprecated in 2015 as a security hazard.
So, if you’re using an old version of Netscape Navigator from the late 90’s, you cannot access most of the Internet. And if you’re sticking to Android 4, Windows XP, or any version of Internet Explorer before 11 (so 15-ish years ago), you’re in the same boat. All those backups in the cloud are inaccessible!
People are a key point of failure in any organisation.
It could be as simple as someone leaving the organisation and not leaving the passwords required to access backups. Or that person was responsible for the backup procedures, and never bothered to train a successor. Or that person physically has the backup media. If the person’s gone, the backup may have gone with them.
Perhaps backups are still available, but their content was organised in a very unusual way. Without the “librarian” who knows how it all works, content is lost in a maze of twisty backup disks, all alike. Or there’s a “computer guy” who just knows how the backups work - only he’s gone.
On a long enough time line, the survival rate of everyone drops to zero. People die. Sometimes suddenly, sometimes with lots of warning. Either way, any knowledge about backups solely in their head is gone (eg: passwords, procedures, places).
In 45 years, I don’t expect to be maintaining backups at Wenty Anglican. In 100 years, I expect to be with the Lord.
Without key people, backups may be totally useless. They need to pass their knowledge, expertise and passwords onto a successor.
Finally, there might be fundamental changes to undermine long term backups and archives.Things that break our assumptions about how the world works.
The English language will change.Probably not so much that we can’t understand today’s documents in 2121, but probably enough that they will be confusing or ambiguous.A few hundred years and it’s quite possible the English of 2021 won’t be recognisable or understandable.English might end up as a dead language.
Perhaps the Internet will change radically.Maybe someone will undermine how HTTPS works and “the cloud” will no longer be a secure place to store data.Maybe the global Internet will break into multiple Internets that can’t access each other - China is already trying pretty hard to segregate itself.It would suck if your cloud backups were in the other Internet, or behind a great firewall.
Perhaps digital storage isn’t a thing any more.It could be due to a shortage of materials and chips, or a lack of rare earth materials used in high-tech devices, or simply storage stops getting cheaper.Maybe a significant global event (pandemic anyone?) makes digital devices a luxury item and we can’t afford to use them for archives.
Electricity is rather fundamental to digital storage - heck, even hard copies rely on printers, copiers and lighting. No electricity, no digital anything, and definitely no backups. One hundred and fifty years ago, in 1871, electricity was well understood from scientific and engineering points of view, but not widely available to the general population. One hundred years ago, in 1921, electricity was a luxury available only to the upper classes. It seems unlikely that the power would go off permanently, but we need to remember it’s a relatively recent invention when trying to store data for 100 years.
COVID has reminded us that disasters, natural or otherwise, can cause significant social and economic disruption which may impact long term backups. Few will maintain archives or keep passwords if they’re in fear of their lives! COVID has turned into a long disaster, lasting several years (even with vaccines being deployed at breakneck speed). While an earthquake or flood has an immediate impact, a pandemic is longer and more drawn out. And requires a different approach to ensure archives survive.
(In case you’re wondering, I’m not going to consider how to mitigate these fundamental changes. I’m just listing them to illustrate how hard long term data storage is).
I’ve focused on digital media all through this article.But it’s worth thinking how the failure cases apply to physical hard copies of documents (ie: paper).
Many failure cases are specific to digital data, and just don’t apply to hard copies:
Some failures apply equally to both:
Hard copies are affected by some even more:
The biggest disadvantage of hard copies is: they are hard to copy. Computers are really good at making perfect copies over and over, really quickly. That’s why the solution for digital archiving is to just make lots of copies and compare them every now and then. Hard copies are physically bigger, harder to copy and trickier to compare. So although you can apply the same principles, it’s 100x more difficult in practice.
Well, there certainly are a lot of ways data can be lost!(And I’m not even claiming this is an exhaustive list).
I haven’t really discussed how to stop these events, but that will come in the future.And I expect many readers will already have answers in mind.
For now, let’s just admit many things could go wrong.
Some are entirely within our control (so letting them go wrong is just dumb), others are predictable and preventable with appropriate maintenance, others are outside our control and we need to take special steps to mitigate them.And some are really tricky to deal with - indeed, so hard that I simply can’t address them on my $500 annual budget.
Next up: what data I’m interested in collecting (and what I’m not), and how I’ll collect it.
]]>You can read the full series of Long Term Archiving posts which discusses the strategy for personal and church data archival for between 45 and 100 years.
In mid 2020, right as our church was working through what needed to happen to be COVID Safe and resume face-to-face meetings, we got a nasty surprise:
We have to keep records relating to “Safe Ministry” forever. That is, any records or documents that might be needed for a court case involving sexual abuse cannot be deleted. Ever.
Reliable and comprehensive Safe Ministry Records will be an important part of building a case against an alleged abuser of children in our churches, so it is vital that the correct information is recorded in a manner that is able to be kept indefinitely – in other words no Safe Ministry Record information can ever be deleted or thrown away. Source
After some reading of the Royal Commission into Child Sexual Abuse I found a recommendation for storing records for a minimum of 45 years:
We also recommend that institutions that engage in child-related work retain, for at least 45 years, records relating to child sexual abuse that has occurred or is alleged to have occurred. This is to allow for delayed disclosure of abuse by victims and to take account of limitation periods for civil actions for child sexual abuse (see Recommendations 8.1 to 8.3).
I asked: “is there any government or diocesan assistance?” And found the answer is “No”.
My initial response was: “Are. You. Serious??!?!?There is no way this is possible!”
And the problem was parked until we had more breathing room post-COVID.
Well, in Sydney, we’re doing pretty well with COVID at the moment, so time to deal with this storing-data-forever problem.
My mission (which I have no choice but to accept - yay for government compliance) is to develop a long term data archival strategy for Wenty Anglican Church. The data must be readable in 45 years, and desirably readable in 100 years (the approximate lifetime of a person).
This must be accomplished with off-the-shelf technology, implemented by myself in my spare time, be supported by non-technical volunteer users, and has a maximum budget of a few hundred dollars per year.
Bonus points if we are able to search the data and find relevant information in any way other than “trawling through everything year-by-year”.
Sub-goal: accomplish the same aim for my own family. If I can adopt a strategy that works for me, I have some hope of the church doing the same.
Aside:
Although I’m focusing this series on the technical requirement of “long term data archival”, it’s important to note that in a church context this requirement is part of wider policies and procedures to ensure the safety of everyone who comes on our property.That includes church staff, volunteer workers, regular members, occasional visitors, one-off guests, contractors, and anyone else who might walk through our front door (or back gate).It addresses physical, emotional, and spiritual safety.It is particularly geared to protect minorities and vulnerable people (who have been terribly abused in church contexts in the past).
That is, this is not a box ticking exercise for government compliance.It is part of our church’s desire to keep people safe, as we seek to share the good news of Jesus.
My initial reaction to this requirement of 45+ year data retention was: this isn’t possible!
The government is asking volunteer organisations (not just churches) to collect data in a systematic way, store it securely (as many records will identify people; thus raising privacy issues), and ensure it is still available in at least 45 years.
As so much data is digital these days, we need to come up with a digital solution. Only thing is, 45 years ago (1976) the personal computer was not a thing. The cutting edge of digital storage was the cassette tape, which could store perhaps 100kB.
In other words, we’re being asked to do something that has literally never been done before, because the technology has not existed long enough yet!
However, I’m not one to be dissuaded by “impossible” goals.
While the digital technology has not existed to retain records for 45-100 years, the analog technology certainly has.
My church has paper records going back to 1919 (when the building was completed).Governments have records going back hundreds of years.And archaeology has been able to recover documents - OK clay tablets - from thousands of years ago.
At church, we see the Bible as the supreme authority in matters of salvation.It also happens to be a collection of documents that have been handed down over many generations - so a fitting yardstick for my current project!
The New Testament was collected from various sources into its final form in 325AD at the Council of Nicaea.And while there is plenty of debate how old the original source material is, the New Testament can be no younger than 1700 years, and is likely closer to 1900 years old (the latest material written in ~120AD).The Old Testament is messier (mostly because it’s older) but consensus is it was essentially what we have today in 132BC when the Greek Septuagint translation was finalised.And the original sources must be older (just how old is a subject of much debate that isn’t relevant for my data storage project).
The point is: the Bible is a written document, originally created in an oral culture, written on materials that naturally decay, and often propagated and copied by volunteers.Yet it has survived remarkably well for around two thousand years.
So storing records for 100 years is certainly not an easy task, but it’s far from impossible.
I’ll leave the details of long term archiving for future posts.This is my overall strategy:
Point 1: the data I store today will outlive me.
In 45 years time I’m not likely to be maintaining records at Wenty Anglican.I might not even be a member there.Heck, I might not be alive.
So I MUST, without fail, be able to hand data on to a successor.I need one (or more) people in-training who can take over after I stop looking after the data.
The data itself (however and wherever it’s stored) must be documented enough that someone could pick up archiving even if I’m not around. That is, storage needs to be simple, and self documenting.
If someone randomly comes across one piece of the archive (say a hard disk, DVD or cloud backup), they should be able to find their way to other parts of the archive.That is, even if I disappear without handing on to a successor, the poor archivist who has to take over can piece things together from any one part of the archive.
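To make that concrete, here’s the sort of thing I have in mind - a short note dropped at the root of every copy. This is a sketch only; all the details below are placeholders, not real locations:

```
# A sketch: a README at the root of every disk, DVD and cloud folder,
# so whoever finds one piece of the archive can locate the rest.
cat > README.txt <<'EOF'
This is ONE copy of the Wenty Anglican long term archive.
Other copies live at: <cloud provider>, <NAS at the church office>, <offsite BluRay set>.
Current maintainer / successor: <name>, <contact details>.
Last verified: <date>.
EOF
```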
Point 2: whatever choices I make now will be wrong in 45 years.
The technical details of backups and archiving will change over time.And 45 years is a long time.The decisions I make today will become obsolete, or wrong, or be superseded.
In 2000, my first backups were on burned CDs.Later I moved to DVDs.Then to hard disks and network attached storage.And eventually the cloud.Most recently, I’ve started using BluRay disks.
So I MUST, without fail, take a big step back and review my backups & archives every 10 years.I need to be prepared to migrate before technologies become obsolete.I need to look for new and better ways of storing data.Worst of all, I need to migrate from old file formats to new ones (and I’m really not looking forward to that).
In other words, the technical details will definitely change over time.
Point 3: as long as one copy is readable, all is well.
Ultimately, long term archives are a distributed data problem. And that has a well known solution:
1. Make lots of copies, stored in different places.
2. Check those copies every now and then.
3. Replace any copy that has failed.
As long as one copy can be read, the data has survived.
Step 2 is the weak point, because it implies maintenance.If maintenance is not automated (or at least scheduled) it won’t happen.And, given a long enough time line, all backups & archives will be lost - there is no media that will reliably survive 100+ years (even the 45 year minimum is a stretch).
So I MUST, without fail, have some kind of maintenance program to detect failures and replace them BEFORE all copies fail.
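As a sketch of what that maintenance could look like in practice: keep a checksum manifest with every copy, and periodically verify each copy against it. The paths below are examples only.

```
# Build a manifest once, and store it alongside every copy of the archive.
cd /archive/wenty-2021
find . -type f ! -name 'MANIFEST.sha256' -exec sha256sum {} + > MANIFEST.sha256

# Later, on each copy, check that nothing has silently rotted.
sha256sum -c --quiet MANIFEST.sha256 || echo "This copy has failed - replace it from a good one!"
```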
Incidentally, this is how the Bible - in particular the New Testament - survived so well. People just kept making more and more copies of it. Even though the originals were lost, the copies (of copies, of copies…) survived.
OK, you’ve come here not to read about some hand-waving high-level strategy, but concrete technical plans to achieve 45+ year data storage.In the context of my own personal data, and for Wenty Anglican church (and definitely NOT for some big corporate organisation).
Here’s what I plan to discuss in coming posts:
Long term archiving of data for 45+ years is tricky.It’s a goal that is longer than I’ve been alive!But it is not impossible if you have many copies, maintain them, and are prepared to change (possibly radically).
I’m signing up for the long haul. Assuming it works, my grandkids will be reading this in 2121!
Next up: a discussion of what things can go wrong with backups over 100 years.In other words - failure modes.
]]>COVID has forced churches around the world online. While we previously were happy meeting in person, we were suddenly forced (by law) to provide content for Wenty Anglican online.
So, phone calls, Zoom meetings, pre-recorded sermons and live streams became normal.All in the space of a few weeks in 2020.
Now the COVID threat is slowly dissipating, we’ve decided to continue live streaming church.(Zoom, on the other hand, is significantly less popular and no one is rushing to keep it)!
I’m pretty technical and learned “the new normal” quickly. And, being the go-to technical guy at Wenty Anglican, I had to implement live streaming in late 2020. I’ve learned a lot about OBS Studio, how unreliable WiFi can be, YouTube copyright and video cameras in a very short time!
Now, I’m trying to train others to run our church live streams to a reasonable level of quality.
My main goal is to train new people how to do A/V work in general, and also to train existing people how to run our Sunday meeting live streams.
The way we’ve always done this kind of training is ad-hoc and “on the job”.That is, the person who knows how it’s done (me) tells the people rostered on particular roles how to do their job on Sunday morning.Usually that means there’s one day they’ll watch me do it, then next time (which might be 2-4 weeks later) I put them in the driving seat while I supervise.
This has a number of drawbacks, including a) there’s a limited time for people to prepare for our church meeting (30-45 minutes), which doesn’t allow much time for training, and b) most people want to have some level of training beforehand, so they know what they’re up against.
Some changes we made to our meetings for COVID purposes meant our morning meetings were “tech heavy” - many people were already trained for A/V duties. While our evening meetings were “tech light” - only a handful of people were trained and had to be rostered on pretty much every week.We want to transfer those skills around so more people can do more A/V roles.
The first step was to work out very clearly in my mind what needed to be done, and how best to do it.This involved things like configuring OBS Studio for the very simple scenes we needed.And then acquiring and installing the required hardware (a camera with accessories and HDMI to USB converters).
We’ve been streaming since November 2020, and I wasn’t doing all the work myself along the way.There was plenty of on-the-job training.But only recently I completed all the changes I wanted for the minimum level of quality I was aiming for.
To get training to people in bulk, I recorded a number of training videos and screen casts.These demonstrated what people needed to do each Sunday.I won’t comment about those videos here, you can watch them yourself if you want.
Then we conducted a “training day”, which was basically a few hours where people could practise, experiment and do what they need to do on a Sunday.Some of that was ad-hoc experimenting and learning.Some was more structured - following a runsheet for a regular Sunday church meeting.
Just not on a Sunday.And in an environment where there was no pressure to get it right.
OK, it wasn’t just practice, there was a little bit of theory as well:
The equipment and software I used for creating these videos was pretty much the same as the start of COVID.
Training people for A/V takes time.And it works best if you teach in different ways - theory, demonstrations and practical.
Most of all, you need to be clear what you are teaching. Otherwise people will learn nothing.
My aim is that our church will have a good number of technically trained people, so we can live stream at a reasonable quality on into the future.
]]>Each year my family and I attend a Christian missionary convention, CMS Summer School. The focus of the conference is to hear a series of in-depth Bible talks (5 x 45 min), receive updates from CMS missionaries serving around the world, and to support said missionaries in prayer and financially. It is attended by ~4,000 people over 6 days.
In short, it’s the biggest church event I attend in a year.
A few years ago I volunteered for their “tech team”, which does major work setting up the various infrastructure required for the conference.This ranges from power and lighting, to audio and visual (and many other things in-between).
The “tech team” is the go-to team for troubleshooting any vaguely technical issue, plus operating cameras, sound desks, and making recordings.
As per other church meetings, it’s effectively a big live event. Indeed, the biggest live event I have responsibilities at.
For the last two years, I’ve been responsible for implementing networking (among other things).
To provide networking infrastructure for the conference. This includes:
Other than in-ear communications and raw video from cameras, pretty much everything runs over ethernet.
Let’s drill into a few of those in more detail.
Several team members work for Audinate, which created the Dante audio protocol.This is a high quality, low latency protocol to deliver uncompressed digital audio over IP and ethernet.It achieves extremely tight latency between devices: ~300 microsecond latency is normal.And is commonly used in the A/V industry.
From a networking point of view, it needs very low latency gigabit switches.
While the mixing desks and amplifiers use standard ethernet, we make heavy use of Avios, which are analog-to-Dante audio adapters. Most Avios require PoE switches.
I’m not an audiophile by any means, but I can understand the technical side of Dante. It’s basically software controlled audio (similar to the usual software controlled things I’m used to in my day job).
2020 was the year of COVID and the year of virtual everything. CMS Summer School runs in January, and due to a number of Sydney COVID cases in late December, the conference had to pivot, with a two week warning, from ~500 people in-person to a live streamed conference with only essential persons on-site (max of 200).
We always knew the live stream would be where our primary audience was this year, but something like 95% of our audience ended up being virtual.
So high quality live streaming was very important.
Streaming was done via Vimeo using a Teradek Vidiu Go hardware device.Obviously, high speed broadband internet is required.
The network needed Internet access.The KCC Conference Centre, which hosts CMS Summer School, already has Internet access.We just need to tap into it.And provide WiFi APs so that wireless devices can connect to the network (there are apps which act as simplified mixing desks for Dante audio).
Nothing special here.
Video at CMS Summer School is delivered via 3 HD cameras over SDI to a BlackMagic ATEM Video Switcher.A matching video controller is used for live vision control.
Although the raw video does not run over ethernet, the control channel to the video switcher does.
We tried various VLANs, trunking and other solutions to create isolated networks for Internet vs audio vs other data. In the end, the best solution was 3 PoE switches in a flat config. The most complexity was some bridging to create an isolated secondary network (to satisfy a Dante audio requirement).
The border router does NAT and some shaping using simple queues.It has the complex firewall rules.We thought we might need to do other complex things on this device (running HDMI over ethernet through it) but didn’t need to.It is also a WiFi AP, but only because of physical proximity to some of our equipment.
The videoland switch is connected to the border router.Videoland is where the video switcher lives and the live streaming happens, plus a few minor sub-title / graphics roles (powered by laptops with HDMI outputs).There are no firewall rules on the switch; our goal is purely hardware switching for minimal latency.Usually with Mikrotik devices, I use VLAN interfaces, but we found they are implemented in software and introduce additional latency that Dante could detect.
Between the border router & videoland switch, we have 24 ports of PoE ethernet + WiFi.
Live streaming had a dedicated NBN connection (100Mb down / 40Mb up), plus a 4G / LTE backup.These were patched by the owners of the auditorium into a network switch; we just needed to run patch leads to the Teradek Vidiu devices.The 4G backup functioned via WiFi: Teradek > WiFi > switch > external 4G modems.Why? The Teradeks do an automatic fail over from ethernet to WiFi; so if the NBN were to fail, the stream would fail over automatically to 4G via WiFi.
In the end, the NBN never failed and the 4G was never used in anger (although we did manage to crash the Teradek due to a particular visual we used at one point).
The foldback land switch patches from video land. Foldback land is behind the stage and controls audio so the band can hear themselves. It’s also where we have all the wireless microphone receivers, amps, etc. And there’s an X32 mixing desk which is considered our master device. This switch is where DHCP runs from; so it is closest to our master mixing desk.
Dante audio requires dual, redundant and independent networks to function correctly. And you can’t fool it by simply connecting the secondary interface to your main switch, or “forgetting” to connect the secondary interface. However, you can fool it by creating two separate networks with no bridge between them. So we do that for the ~4 devices which require it.
Once again, there’s 24 PoE ethernet ports + WiFi.
Note we only run 5GHz WiFi on the APs. And even then, it’s configured on narrow 20MHz channels. We’re in an auditorium which has Ubiquiti APs all over the place with guest networks - the 2.4GHz spectrum is completely full and useless for us. And we aren’t trying to push bulk data over WiFi, just ~100kB/sec of control data. So it’s multiple APs on narrow, non-overlapping 5GHz channels.
Finally, our front of house switch patches from foldback land for both primary and secondary networks.Other than the secondary network, there’s nothing extra here.
When in action, we see up to 60Mbps of traffic and 30kpps:
But that can vary, depending on device:
And around 40 devices on DHCP:
[admin@sw01-foldback-poe] /ip dhcp-server lease> print |
A couple of interesting things to point out:
We couldn’t do any VLAN trunking over single cables, because the software VLAN implementation caused Dante audio latency problems. Fortunately, we didn’t need to - we had enough network outlets to patch two flat networks. However, if we want this feature, we’d need to work out how to implement it using hardware switch rules. There was even talk of converting the switches to SwOS, to minimise the chance of using features implemented in software - but everyone on the team is so familiar with RouterOS we’re very hesitant.
We may move DHCP off the foldback land switch to the border router.It’s something in software on latency critical switches, and it really doesn’t need to be there.The leases are deliberately long enough to cover our live sessions, but short enough to expire before the next session begins.
Although not required this year, we’ve run HDMI over ethernet to get video to other parts of the property - around 150m away from video land (where the signal originates). This is painful because a) we don’t have enough cable runs to where the video needs to get to, b) it’s video only; we have to run Dante audio separately, and c) it consumes too much bandwidth. Given we had plenty of success with streaming via RTMP this year, I’m considering using it to replace HDMI over ethernet. Rather than streaming out to YouTube or Facebook or wherever, and then back in again (with the 30 second delay and external bandwidth hit), we could run an internal RTMP service via nginx which can be consumed by any location with ethernet (even WiFi) on-site. The main advantage is it runs combined audio & video at high quality using 8.5Mbps, and solves most of our HDMI over ethernet problems. The disadvantage is we need a smart device (Raspberry Pi, Android device or Android TV) to play the RTMP stream. That, and I haven’t tested it.
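For the record, here’s roughly what that internal RTMP service could look like - a sketch only (untested, as I said), assuming Debian’s libnginx-mod-rtmp package; the server name and stream key are placeholders:

```
# Install the RTMP module for nginx (Debian package name is an assumption on my part).
sudo apt install nginx libnginx-mod-rtmp

# The rtmp{} block sits at the top level of nginx.conf, alongside http{}.
sudo tee -a /etc/nginx/nginx.conf > /dev/null <<'EOF'
rtmp {
    server {
        listen 1935;
        application live {
            live on;        # accept a stream pushed from OBS / the Teradek
            record off;     # don't keep recordings on the server
        }
    }
}
EOF
sudo nginx -t && sudo systemctl reload nginx

# Publish to rtmp://<server>/live/<stream-key> from the encoder, then play it
# anywhere on site, eg: on a Raspberry Pi:
ffplay rtmp://<server>/live/<stream-key>
```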
We have a bunch of analog comms (think video director talking to camera operators), which I think is the very last bit of analog gear we use.I’d like to get rid of that and run digital comms over Ethernet.No idea what this involves though.
CMS Summer School is a reasonably large live event.And pretty much everything A/V runs over ethernet at live events.Mikrotik devices are cheap, powerful and meet our requirements for pushing ~100Mbps around for Dante audio.Along with our more minor networking needs, Mikrotik has us covered.
And that means speakers and missionaries can get on with their thing: sharing Jesus.
]]>It’s been a while since my last blog post.Which is because I’ve not had enough time this year.
Because I’ve been working to get Wenty Anglican Church COVID Safe.
OK, you don’t really need much background to COVID-19.It’s been the dominant event of 2020.
In my part of the world in Sydney, Australia, we got off pretty lightly. We had one lockdown from March through to May, which was moderately severe, but not as hard as in other parts of Australia or the world. And since then, Sydney has slowly been rolling back restrictions. And managed to avoid a second wave. (Our friends in Melbourne were not so fortunate).
From June 2020, our church started working toward being COVID Safe.
Places of worship were the subject of several COVID clusters early in the pandemic, so there were some pretty strict conditions required to re-open.
Wenty Anglican had been doing online meetings since the March lockdown came into effect. After a month or so, we had got into something of a new rhythm: pre-recorded talks, YouTube songs, Zoom meetings. Although it was horribly impersonal, it was as good as we could do. And it meant we could continue encouraging each other to follow Jesus even when we could not meet in person.
By June, the NSW government had rolled back restrictions to the point where churches and places of worship could re-open for up to 50 people, if they had a COVID Safe plan in place.At this point we decided to wait - the online meetings were going OK and there was a lot of work required to be COVID Safe.And, the restrictions meant a) no singing and b) limited mingling.Which felt like face-to-face meetings would almost be worse than online ones.
In July, our regular Parish Council meeting spent considerable time working out what going back to face-to-face meetings would look like under the COVID Safe requirements.As a warden, my responsibility was to ensure compliance with these requirements, and the safety of people coming on our property.And we also started reaching out to church members to gauge how many people wanted to return to in-person meetings.
This was the start of a crazy busy time for me. There were regular meetings between church wardens to discuss what COVID Safe would look like. And many pages of draft policy documents written.
By August things had changed significantly.In NSW, COVID was relatively under control.But the local Wentworthville area was a COVID hot-spot (our local government area had ~50 active cases and was one of the worst areas in Sydney), and the beginning of the second wave was hitting Melbourne.This gave us pause.There was no immediate need to return physical meetings, and we had to consider the possibility that Sydney could have a second wave just like Melbourne.
While we did not commit to a date to return, the wardens continued working hard to be COVID Safe.
In September, it was clear Sydney wasn’t heading to another lock down.So we committed to going back before Christmas - with an early November soft launch, followed by a few weeks online to debrief and make any changes, before a public re-launch in late November.
And that’s when the work stepped up a gear! We put together a Trello board listing all the things we needed to do before relaunch day. There were 32 specific items on that list, ranging from moving the wooden pews to be socially distant, to putting policies together addressing all the government requirements. The biggest single item in my orbit was training people in all the new processes.
At that point, we worked backwards from the due date.Training needed to happen in the week prior to relaunch.All the physical work needed to be completed before training (because any on-site training was effectively testing all our COVID Safe policies) - signage, moving of furniture, purchasing equipment, etc.And we needed to decide all our policies before everything.So September was mostly policy discussions and making final decisions.
In October, everyone was crazy busy implementing everything we decided for the early November relaunch date.While this is one line in a blog, remember that all the wardens are volunteers, so we were working most evenings and weekends to make it all happen.Crazy busy was an understatement!
In November, we actually launched! Our first in-person church meeting was on the 1st of November. There were a lot of nerves as we ran our first live meeting since March. And plenty of awkwardness with social distancing. And way too much paperwork to ensure our compliance.
Then there were three weeks where we went back to online meetings. That was time where I could relax a little.
There was also some debriefing, which led to a few tweaks and improvements to our processes and policies, which took up some time.
Finally, on 29th of November, we went back to face-to-face church meetings permanently, much to everyone’s delight! (Assuming no further COVID outbreaks occur in Sydney).
By December, we were getting back into the groove of in-person meetings.The NSW government unexpectedly relaxed the restrictions for places of worship.While this did not affect us significantly, it did require us to update our paperwork and policies.The process of these minor updates is now pretty straight forward, so the amount of work is minimal.
Which brings us to mid-December, where I have finally got enough time to write an article!
Only to have a new infection cluster emerge after 6 weeks of zero cases, and see restrictions become stricter once again!
While all places of worship in NSW had to follow the same set of Covid Safe rules, we had some freedom of how to implement them.The following list is what we did (taken from our COVID Safe training slides).
Those are the dot points. Plus a few more items I’ll go into in more detail.
Contact tracing has been a big part of the success of NSW Health’s containment of COVID - whenever a case appeared, lots of effort goes into working out where that person has been when potentially infectious and aggressively testing anyone they may have had contact with.While this isn’t so effective when there are hundreds or thousands of active cases, NSW rarely got beyond 100 active cases.
At church, we have to keep contact tracing records for 28 days.Our church database for the directory we publish for members is in MS Access.It’s not the most advanced technology, but it’s effective enough.The biggest change was to ensure the database is available on cloud storage so many people could access it, while also being secured.
We built a simple MS Access Report to produce a church roll for regular members - tick a box to indicate the person is here. Plus an A5 sheet to capture details of any visitors. These details are retained in our church safe (and hopefully never needed).
This was built based on my experience managing elections in Australia, which is entirely based on paper rolls and ballots (and rightly so IMO). Quickly identify and mark off the majority of people, and have a mechanism for everyone else.
This was the most annoying requirement we had, because it went completely against what church meetings are about: we want to talk to people.Be they people who are regular members, who we can encourage as they walk with the Lord Jesus.Or if they are irregulars, who we want to re-connect with.Or visitors, who we want to welcome and extend the news of salvation in Jesus.Church is about people.And the no mingling rule made that really hard.
Our instructions:
Following the Premier’s advice, we should ensure that members of our congregations do not mingle before, during or after the service. Where morning tea is served after the service, provision should be made for seating persons 1.5m apart, discouraging people from mingling or walking around.
We delayed going back to face-to-face meetings because we knew this would be a) unpopular, and b) very difficult to enforce.
Our policy on this was to enforce 1.5m distancing, and to encourage people to stay in their seats before and after the meeting.
However, in practice, there are a number of people who need to be moving around (ushers, musos, leaders, techs, etc). So it becomes very difficult to enforce “stay in your seats” when half the people present have a reason (or perhaps “excuse” is a better word) to be moving around. And while the number of cases was low, the risk factors meant enforcing this would do more harm than good. (This is changing since the cases appearing in mid-December).
The government requires churches to have an online option, so every church has (by rule of law) become tele-evangelists!!
I have been using OBS Studio quite a bit through 2020 to record training material at work.So it was my choice of streaming software.And we were doing online church via our Wenty Anglican YouTube channel, so YouTube was our broadcast platform.We did a number of tests in the lead up to our first meeting - including streaming our training material.
As of December, it’s usable, if a little unprofessional (with the powerpoint slides appearing as part of the frame). IMO the most important part is the audio feed; if the video isn’t perfect it’s no big deal, but if you can’t hear then you might as well not bother streaming at all.
The main technical changes were:
We have a few simple scenes: three static slides, plus the live stream itself.And there are the Start and Stop buttons.And that’s it - it’s designed to be simple.
The operators click “Start”, then just before our meeting begins they click “LIVE STREAM”.At the end they click “Thanks”, then wait a few minutes before clicking “Stop”.
Our first lesson was to both watch and listen to the live stream during the meeting to ensure it’s working as expected.There were several times when there was no audio (because of incorrect OBS config), or no stream (because the YouTube stream somehow was marked “private”).
The trickiest technical things were:
The main improvements we’re planning are:
Cleaning was one of the big COVID requirements: we need to clean any regularly touched surface after every meeting to remove the COVID virus.Again, simplicity is key to making sure this happens and is effective.
Many churches have taken to issuing alcohol wipes to each person so they can clean their seat after the meeting.We took a slightly different route: Glen 20.
After the meeting, we ask a few people to spray down all the wooden pew seats with Glen 20.It takes about 5 minutes and ensures more uniform cleaning as trained people are doing it.And a few others are responsible for other parts of the building.
There is a long list of other things we need to clean as well (tables, lectern, door knobs, benches, light switches, etc), but Glen 20 is effective on almost all of them.Electrical equipment is our biggest problem - as spraying 60% alcohol into electronic devices is bound to break things.
For everything else, we quarantine for 96 hours.
My biggest time sink in the lead up to face-to-face meetings was training.We needed to get all our core members up to speed quickly, and as many other regulars as well.Although we were used to doing “COVID things” in other public places and our own homes, church counts as a “business”, so we need to be consistent and meet a higher standard.
There were two sides to this: written material, which contained extra details for specific cases and ministries. And a recording for general training. And then a summary recording so people could see exactly what to expect as they arrive.
The written material is available on our church’s website.We ended up calling it our COVID Safe Playbook, which contains all our policies and procedures required.It makes for pretty dry reading (as is the case for most compliance documents).
For a less boring approach, I recorded a screen cast summarising the playbook using powerpoint slides based on the playbook.This is similar to what I’ve done several times at work.It went for 60 minutes, and was still pretty boring (only slightly less bad than the written version).I also did the same training material as a live stream on two occasions, to give people maximum chance of hearing it - in both live stream cases, it was done to an empty auditorium!
Finally, we recorded a short what to expect video, focused on what someone just walking in the door should know.That one is under 3 minutes!
After we went back to in-person meetings, we found we needed more people trained on the sound desk and computers. So I’ve recorded a few tech training videos walking people through the minimum requirements for making church audible and broadcast in 2020. These ones were done via my phone and edited using the Windows Photos app (which is just barely suitable for the task at hand).
COVID compliance: This is where my last six months has gone.
I’ve learned many new technical things (live streaming, video editing, YouTube).I’ve applied my previous knowledge of creating policy documents.I’ve found how alcohol kills viruses.I’ve used paper to record attendance.And tried to train people how to be COVID Safe.
Hopefully, it will actually stop at least one person getting COVID, and maybe even save a life.
(But at the moment, it feels like fifty+ hours of my time wasted on bureaucracy and paperwork).
]]>In my part of Sydney, the NBN Internet is connected via HFC.That is, the last mile connection to the Internet is via a copper cable.
Due to a strange set of circumstances when the NBN was deployed in my block of 6 units, I’m sharing my connection with 2 other units in our complex.So there’s ~150m of ethernet cable running across the roof of our block.That is, I’m playing ISP for my neighbours.
Unfortunately, copper cables conduct electricity.Even more unfortunately, lightning is made of electricity.
On 12/July/2020, there was a severe thunderstorm in my area.I wasn’t at home at the time (my family was at a friend’s house for lunch), but even there the storm was pretty bad.
I got a message from Uptime Robot that said my Internet connection was down, and I assumed there was a power outage from the storm.So when I got home, I checked the circuit breakers and lights.Only to find power was normal - everything was working, but no Internet.
I checked the HFC modem and noticed it didn’t seem to have power. So I power-cycled it. When that didn’t have any effect, I power-cycled the UPS it’s plugged into.
Again, nothing.
I checked other equipment and found my hEX router had also failed.(And later on, found an ethernet port on a server was also dead).
Here are some pictures of the hEX and HFC modem.The hEX has visible damage, while I couldn’t pick anything obviously broken about the HFC modem.
At the time I was annoyed because both the modem and router were protected by separate UPSs, and the UPSs were fine.Even the plug-packs which powered the devices were OK.
You never call NBN Co directly, instead you need to register the fault with your ISP.
So I gave Internode a call. As usual, their service was fantastic - the conversation went something along the lines of: “My Internet isn’t working. There’s no power lights on my NBN modem. And there was a thunderstorm a few hours ago. I think you can see where I’m going with this”.
The tech didn’t even bother troubleshooting anything with me, and helpfully arranged an appointment for an NBN tech to replace the HFC modem the next day.
Long ago, my very first Mikrotik device was a RB2011.It was the router which converted me to Mikrotik and meant I’ve never bought networking gear from another vendor since.
The hEX had replaced the RB2011 as my router ~18 months earlier, in preparation for the NBN becoming available.Since then, the RB2011 was sitting on my desk as a smart switch in my “home office” (aka garage).
Well, it was time to push the RB2011 back into service as a real router! Although its hardware isn’t as powerful as the hEX or even the more recent hAP ac², the software is identical. So I was confident any feature I used in the hEX would also work with the RB2011.
I pulled my latest backup script from the hEX, and started making appropriate changes to the RB2011.Two things came out of this, 1) my backup was a few months old, so it wasn’t perfect, and 2) I’m glad I take both a system backup and a configuration export. The backup works for a like-for-like restore, but doesn’t work when devices and configuration need to change - the export lets you restore parts of the configuration as required.
Finally, I installed it and physically connected all the cables, as required:
I had everything plugged in and ready to go when the NBN tech arrived the following day. (Incidentally, it was the same tech who did my original installation around 12 months earlier).
After he tested the upstream HFC connection was still OK, he installed and connected a new HFC modem.(He left the old broken one with me, as per the photos above).
Within 10 minutes, the RB2011 had established a connection to my ISP and I started getting notifications from Uptime Robot that I was back online!
An hour or so later, I connected up my neighbours again and enabled the PPPoE server.An accidental misconfiguration meant one of my neighbours wasn’t online until the next morning.
For a lightning strike which damaged equipment, I was back up in under 24 hours!And neighbours were up within 36!
That’s a pretty good outcome!
(And even my neighbours are particularly happy with the level of support from their “ISP”)!
I thought it was pretty likely there was a nearby lightning strike which caused the damage.This was confirmed by neighbours who were at home at the time - they heard an extremely loud “bang”, which caused a short power outage, and they also had equipment damaged (NBN modems and networking gear).
But right from the beginning, I knew there wasn’t a direct strike on our building. Lightning strikes involve a lot of energy: thousands of amperes at tens of thousands of volts, which works out to be at least a billion watts, discharged over milliseconds, for even a baby lightning bolt.
That much energy won’t damage equipment, it will vaporise it!(Or, if you’re lucky, just set fire to it).
Kenneth Schneider has some information about lightning strikes, and Littelfuse has more technical detail (which is completely over my head, but someone with an electrical background should get it).
Schneider has a fantastically understated quote:
Lightning induced surges usually alter the electrical characteristics of semiconductor devices so that they no longer function effectively.
Err… yes, that’s definitely what I experienced - “altered electrical characteristics” and my devices were “no longer functioning effectively”.
Basically, lightning induced power surges can still damage and destroy equipment, even when the strike doesn’t directly hit said equipment.Usually, the strike is against the grounding wire above power-lines, but even that is enough to induce a surge in exposed electrical wiring.
150m of CAT6 cable and similar lengths of coaxial copper cable definitely qualify as “exposed wiring” when we’re talking 100kV+.
This picture is by W8JI, and posted on the HAM Radio StackExchange:
At least 3 units in our complex lost HFC modems, and my router was destroyed, but there was no damage to UPSs protecting the modem and router.So my guess is the induced current occurred in either the ethernet cables connecting our units, or the coaxial cable from the street to units - maybe both.
My network had the Internet router at its core.Literally, every device had to go through my hEX router to talk to something else.Even though I have a separate WiFi access point, and other switches on my network.
This is pretty normal for most households - one device to rule them all.Aka, single point of failure.
When my hEX was destroyed, not only did I (and my neighbours) lose the Internet, I also lost my LAN.I should know better than this; most of my professional life has been about mitigating the impact of technical failures.
So, when I get a replacement hAP ac², it will be my Internet gateway / border device. And my RB2011 will be my DHCP server and run my internal network.That way, another lightning strike may kill the hAP, but my LAN will (hopefully) continue functioning.
Oh, and Ubiquiti sells a gigabit rated Ethernet surge protector for AUD $30. Even if I can’t protect the HFC side, at least I can protect the Ethernet cables running between units.
Downtime makes me sad.And this downtime was particularly bad.
Again.
Unfortunately, there’s not much I can do to protect against lightning strikes during thunderstorms.Fortunately, they don’t happen too often - this is the first one I’ve had first hand experience with (although my father tells me there was a similar incident at his house many years ago).
In the end, a quick response time from NBN Co and having my old trusty RB2011 available at quick notice meant I was back online within 24 hours.I’ll improve my internal network structure, and add some extra surge protectors in the hope of less damage next time.
]]>MakeMeAPassword.ligos.net is hosted on a laptop.This is deliberate - the site barely requires any CPU and a laptop has a built in UPS.
However, the other day, it went down.And I wasn’t notified.
So, after ~14 hours of downtime, some polite users contacted me via email. These arrived an hour after I went to sleep, so I didn’t notice for another 8 hours. For a grand total of 22 hours when no passwords could be generated from MakeMeAPassword.
Overall, this made me rather sad.
A few weeks before this downtime, I’d transferred all my web hosting to a new server… err… laptop… err… server laptop.This was to get a clean Debian 10 install.
As part of this migration, I installed two SATA disks in a mirrored zfs pool to hold all the critical data for the webserver (which is basically all the content I host, plus log files).Unfortunately, those two disks used both available SATA ports (and I had to remove the DVD drive to make room for the second).I used a USB flash drive as the root disk.
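For anyone playing along at home, creating that kind of mirrored pool is a one-liner. A sketch only; the pool name and device names below are examples, not necessarily what I used:

```
# Two-way mirror: either disk can die and the data survives.
sudo zpool create webdata mirror /dev/sda /dev/sdb
sudo zpool status webdata
```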
Turns out the flash drive I chose was rather cheap.I ran the new laptop for a few months, and 2 days before I was ready to put it into production, the USB failed and went read-only.
I migrated the root partition to a HDD and booted the laptop from an old USB HDD enclosure.This worked well enough, and the new laptop was put into service.
When I woke up and noticed the emails saying MakeMeAPassword was down, I visited the website on my mobile phone (which worked) and tried to generate a password (which timed out).
So I SSH-ed to the server to investigate further. No problem connecting. Started htop and didn’t notice any obvious issues. Even tried atop (because it includes IO stats), and nothing jumped out at me.
I tried to become root via sudo -i, so I could inspect log files. And sudo simply didn’t run. (Because it’s trying to write to /var/log/auth.log, and the disk isn’t working).
This meant I a) couldn’t inspect log files, and b) couldn’t reboot using shutdown.
At this point I had several terminal windows open waiting for sudo to complete. And atop was telling me /dev/sdc IO was taking 10+ seconds.
I walked over to the laptop and checked the console.Took a photo in case I needed the exact error message later on.
And hit the power button.
Several tense seconds later, the machine started its boot sequence.And ~30 seconds after the reboot, it was back up and asking for a login.
I checked MakeMeAPassword was working (both the static content and API).Replied to the emails people had sent, and posted an issue on GitHub.
Then I went to work.(Well, its COVID19, so “work” and “troubleshooting MakeMeAPassword” are conducted from the same physical location).
During my lunch break, I dug into the log files to see if I could gather any more details about what happened, when it happened, and exactly how long the down time was for.(Note that all dates and times are AEST UTC+10).
The first place I looked was /var/log/syslog; there was a disturbing gap at 9:04:
Jun 23 09:00:01 obiwan CRON[11070]: (root) CMD ( PATH="$PATH:/usr/local/bin/" pihole updatechecker local) |
That’s around 22 hours of dead silence.Which is very unusual.
Unfortunately, there’s no indication of what actually went wrong.Some time after 9:04, the disk couldn’t be written to.
I took a look at /var/log/daemon.log as well (where systemd logs to), and there was a similar “gap”, and no indication why:
Jun 23 09:03:56 obiwan systemd[1]: Stopping Network Time Synchronization... |
I’ve configured all the nginx sites to log to the zfs mount point, rather than the default of /var/log. This originally was because I didn’t want to be writing to a USB memory stick too much, but it has the nice side effect that logs kept being written there. Here’s part of the HTTP access log for MakeMeAPassword around 9:04 (IP addresses have been changed):
2001:1234:1:10::1 - - [23/Jun/2020:08:48:00 +1000] "GET /api/v1/readablepassphrase/json?s=RandomForever&pc=1&sp=n&whenNum=EndOfWord&nums=2&whenUp=StartOfWord&ups=999&maxCh=63 HTTP/1.1" 200 111 "https://makemeapassword.ligos.net/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36" |
(You’ll need to scroll to the right to see more details). The request at 23/Jun/2020:09:07:55 +1000 is the first indication anything is wrong: it has a 504 “gateway timeout” error. And that’s followed by some requests with error 499, which is Nginx specific.
However, requests to /keepass_plugins.version.txt continue to succeed (because that file is hosted on the zfs pool). As are requests to non-API end points like /generate/alphanumeric. And even requests to some API end points like /api/v1/alphanumeric/combinations.
But anything which tries to generate a password is failing.
The site error logs show lots of errors like this:
2020/06/23 09:07:55 [error] 1270#1270: *21693 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 1.2.3.4, server: makemeapassword.ligos.net, request: "GET /api/v1/readablepassphrase/json?s=RandomShort&pc=1&sp=y HTTP/1.1", upstream: "http://[::1]:5001/api/v1/readablepassphrase/json?s=RandomShort&pc=1&sp=y", host: "makemeapassword.ligos.net", referrer: "https://makemeapassword.ligos.net/" |
And later on:
2020/06/23 18:10:28 [crit] 1270#1270: *26812 mkdir() "/var/lib/nginx/proxy/7/18" failed (30: Read-only file system) while reading upstream, client: 1.2.3.4, server: makemeapassword.ligos.net, request: "GET /api/v1/readablepassphrase/dictionary HTTP/1.1", upstream: "http://[::1]:5001/api/v1/readablepassphrase/dictionary", host: "makemeapassword.ligos.net", referrer: "https://makemeapassword.ligos.net/faq" |
Seems that Nginx creates files (I’m guessing pipes for cross process communication, or perhaps temporary files to buffer the response) when it does its reverse proxy thing.And here’s confirmation that my root filesystem has gone read-only.
My random number generator, Terninger, logs pretty frequently when it re-seeds itself based on external entropy.It goes silent from 9:01.
2020-06-23 08:33:32.5018|INFO|MurrayGrant.Terninger.Random.PooledEntropyCprngGenerator||5-|Re-seeded Generator using 128 bytes of entropy from 2 accumulator pool(s). |
After reboot, it springs back into life.But adds nothing to what we know.
2020-06-24 07:12:31.7333|INFO|MurrayGrant.Terninger.Random.PooledEntropyCprngGenerator||1-|Starting Terninger pooling loop for generator 08db99c3-fdf2-413d-a621-94db1d38288f. |
Finally, I record statistics about each password generated.Nothing identifiable, but enough so I have a very basic idea of what types of passwords people are requesting, how much randomness I need to serve those requests, and how long it takes to generate them.
2020-06-2308:48:05.876+10:00ReadablePassphrase1810.1773810.1773InterNetworkV6 |
Once again, there’s a conspicuous gap from 9:05 onwards.
The evidence points to some kind of hardware failure between 9:05 and 9:08 local time.This eventually caused the root filesystem to become read-only.Which in turn caused some things to stop, but others to continue without problem.
Because the reboot worked, my best guess is there was a USB error which caused the USB HDD enclosure to stop working.
One very bad thing was I didn’t find out about the problem until 22 hours after it started.
I use Uptime Robot to alert me if any of my websites or computers go down.It works by sending an HTTP HEAD or GET request to the website and expecting an HTTP 200 response.
The problem was the request was directed to https://makemeapassword.ligos.net not https://makemeapassword.ligos.net/api/v1/passphrase/json.The only part of the site which had failed was the part which generated passwords.Every other endpoint was still responding normally.
I even have SSH based monitoring for the server, and it was still working!
So I’ve told Uptime Robot to also monitor one of my API endpoints, so I find out if it breaks again.
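If you want a similar belt-and-braces check of your own, something along these lines run from cron would exercise the part of the site that actually failed. A sketch only; the alerting is a placeholder:

```
#!/bin/sh
# Hit the password-generating API, not just the home page.
URL="https://makemeapassword.ligos.net/api/v1/passphrase/json"
if ! curl --silent --fail --max-time 30 "$URL" > /dev/null; then
    echo "MakeMeAPassword API check failed at $(date)" >&2
    # send an email / push notification / whatever actually gets your attention
fi
```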
IO errors usually mean a HDD is about to fail.They’re a special kind of bad.The sort that makes sys-admins break out in a cold sweat.
A reboot is a good start, but I did a full check of the disk using badblocks.This was only the read-only test, but it gives me some confidence the HDD itself isn’t about to die.
$ sudo badblocks -v /dev/sdc |
That leaves the USB subsystem of the laptop, or the USB HDD enclosure as the most likely offenders.
In the week since the downtime happened, the server hasn’t had any further issues.
If I was using a desktop, there would usually be 4 or 6 SATA ports and I wouldn’t be bothering with USB anything.But the laptop only has 2 SATA ports, both in use by my mirrored zpool.There isn’t a physical SATA port to connect the root disk to.
There is an internal expansion slot for an mSATA device, however. Page 82 of the ThinkPad T530 Hardware Maintenance Manual confirms I could install a 60mm mSATA solid state drive. Which would perform much better than a USB2 connected HDD, and is likely to be more reliable.
I shopped around my usual Australian online computer parts stores. Only to find that mSATA barely rates a mention. Seems that M.2 form factor SSDs are all the rage and no one cares for old slow mSATA in 2020. There are eBay stores that sell mSATA devices, however even that page has plenty of M.2 devices. Looks like for AUD $40-$80, I can get a 64GB mSATA device.
I’ll keep it in mind if I get repeated failures.
Downtime makes me sad.And this downtime was particularly bad.
My apologies to all users of MakeMeAPassword.
The additional monitoring should prevent extended downtime.And I’ll be keeping a close eye on the server to ensure it’s very reliable.
If worst comes to worst, I’ll be spending money on an mSATA drive.
]]>Debian 10 “Buster” is available.
Actually, it’s been available for ages. But I was very slack publishing this article.
Here are my notes compared to Debian 9.
As in Debian 9, sudo isn’t installed by default. However, the group to be a “sudoer” is now sudo instead of sudoers.
$ su |
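In practice, granting sudo rights now looks something like this (a sketch; run as root, and the username is a placeholder):

```
# Add the user to the 'sudo' group, then have them log out and back in
# for the new group membership to take effect.
usermod -aG sudo murray
```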
The biggest challenge was the laptop I installed Debian 10 on wasn’t blanking the screen any more.
It used to do that in Debian 9 automagically.And I was a bit disappointed Debian 10 wasn’t co-operating.
My first attempt was to use setterm.It lets you configure a timeout to blank the laptop screen.
$ su |
This worked OK when I had used su to become root. It didn’t work via sudo (which was rather surprising), and I couldn’t make it work as a systemd service either (despite running as root).
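For reference, the manual approach is only a couple of commands (a sketch, assuming the standard util-linux setterm flags; they need to run as root on the console itself):

```
setterm --blank 10        # blank the screen after 10 minutes of inactivity
setterm --powersave on    # and drop the display into powersave mode
```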
A solution that involves me manually running a command isn’t going to work.So I looked for other options.
The other approach is to tell the kernel to blank the screen. There is a Linux kernel option called consoleblank, which does what I want. It blanks the console after N seconds (default = 600, or 10 minutes).
Seems the out-of-the-box Debian kernel sets this to 0, which disables console blanking.
I’ve never set a kernel option.Heck, I never knew you could pass options to the kernel - although I should have known better, the kernel at least needs to know what device to boot from.So, StackOverflow, how do you set a kernel boot parameter?
Apparently, you need to modify the grub configuration, update the bootloader and then reboot.
$ cat /etc/default/grub |
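In short, the change is something like this (a sketch; the existing GRUB_CMDLINE_LINUX_DEFAULT contents will vary per machine):

```
# /etc/default/grub - add consoleblank (in seconds) to the kernel command line:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet consoleblank=600"

# Then regenerate the grub config and reboot.
sudo update-grub
sudo reboot
```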
After the reboot, you can verify the kernel parameters via /proc/cmdline:
$ cat /proc/cmdline |
And lo! There’s consoleblank, configured for 10 minutes.
Finally, wait for 10 minutes and the screen does indeed go blank. Success!
There’s not much difference between Debian 9 and 10.I consider that a feature.
Thanks for not moving much of my cheese, Debian developers :-)
]]>