Webservers that fail on startup make me sad.
After migrating all my hosting to Debian 9 + Nginx on an old laptop, everything was going really nicely.
Until the weekly scheduled reboot failed.
And I got a message from Uptime Robot that all my hosting was down. At 5am.
I fixed the immediate problem by doing a
sudo service nginx start, but I don’t like servers that won’t reboot cleanly.
They’re a recipe for disaster, and I don’t like 5am disasters.
When Nginx started up, there was an error when parsing configuration which caused it to fail:
2019/03/15 04:37:36 [emerg] 658#658: host not found in upstream
The config for this line is:
loki.ligos.net is a reverse proxy site of my old Windows based web server.
But Nginx couldn’t do a host name lookup for it, and failed.
Where failed = crashed hard and stopped serving all web requests.
So how to solve this?
Before I did any research at all (that is, at 6am when I restarted Nginx as started getting ready for work) I naively thought I could just add
/etc/hosts and the problem would go away.
Because Nginx doesn’t look at the hosts file, only a real DNS server.
OK, if the really obvious didn’t work, I’ll try the collective knowledge of the Internet!
There were several posts on blogs and Stack Exchange that all proposed essentially the same solution: Nginx doesn’t cope with a failed DNS lookup when it parses its config, so make sure it has a valid DNS server!
That is, install dnsmasq (
apt install dnsmasq) and change your config like so:
# Tell Nginx to use local dnsmasq server
Unfortunately, my webserver is also my pi-hole server. I can’t run dnsmasq because pi-hole is already running a caching DNS server!
But why not just try the variable thing anyway. So I modified my config.
And it didn’t work: same error.
Darn. If the Internet can’t magically fix my problem, what’s a man to do!?!?!
So I got out my old troubleshooting hat and started poking around in log files.
/var/log/syslog was my first (and last) place to look.
Mar 15 04:37:34 cadbane kernel: [ 11.552726] Adding 3004412k swap on /dev/sda5. Priority:-1 extents:1 across:3004412k FS
OK, that looks like systemd starting up the pi-hole DNS server (
I decided to confirm that really is the DNS server using
sudo netstat -n -p -l, which gives me a list of addresses, ports and programs which are listening for network connections.
Active Internet connections (only servers)
pihole-FTL is definitely a DNS server.
Let’s continue with
Mar 15 04:37:34 cadbane kernel: [ 13.813397] Process accounting resumed
A bit further down and
pihole-FTL logs that it isn’t running.
Well, that’s not entirely helpful.
Mar 15 04:37:35 cadbane systemd: Starting Permit User Sessions...
That line with Starting A high performance web server and a reverse proxy server is where Nginx starts. Of course, whoever configured Nginx’s systemd unit made really sure it wouldn’t log “Nginx” anywhere on startup, because that would make troubleshooting too easy. Sigh.
However, this started me thinking: pihole-FTL has started but isn’t running yet (and presumably can’t do DNS lookups) and Nginx is starting to load. Perhaps Nginx really can’t do DNS lookups because DNS isn’t ready yet.
Mar 15 04:37:40 cadbane systemd: Started User Manager for UID 999.
There’s the Nginx failure, a few seconds further into the log.
And nothing more logged from
And the very next log message:
Mar 15 04:37:43 cadbane pihole-FTL: FTL started!
So, it appears my DNS server takes 2 seconds too long to start. And Nginx can’t cope.
On a hunch, I kept looking:
Mar 15 04:37:44 cadbane dhclient: DHCPDISCOVER on enp19s0 to 255.255.255.255 port 67 interval 16
OK, my server doesn’t even have an IP address until 3 seconds after Nginx fails.
Which explains why DNS isn’t working - it has to do a real network call over to my router where my
*.ligos.local entries are kept.
So even if I could magically make
pihole-FTL load faster, it wouldn’t help because I don’t have a network yet.
It looks like the root cause of my problems is the Nginx is starting too early.
If I can convince it to load a bit later (like after the network is up and after
pihole-FTL is working) it might be happier.
Sigh. Time to wrap my head around systemd. Strangely, the Arch Linux wiki page for systemd gives a concise list of what systemd does and what commands you can give it (the list assumes you already know what half the commands do, but its a place to start). Digital Ocean had a reasonable reference as well.
After some digging, I found that
sudo systemctl status nginx give me a hint of where to find the unit definition for Nginx.
Otherwise known as the startup config file.
After directive under
[Unit] determines when Nginx is started: after the network target.
That is, when the network is up.
Of course, “when the network is up” doesn’t mean “when then network is ready to send & receive traffic” (aka “online”), just enough to bind to an IP address / port.
If I can convince Nginx to load after
pihole-FTL, that might fix things.
After some digging, the “systemd” way to edit config files is:
sudo systemctl edit nginx.
Which dumps you into
nano with a blank unit file (not entirely obvious what needs to happen next).
But apparently you can “override” the stock definition with your own.
I gave it a go:
# Force Nginx to load after our DNS comes up, which can take a few seconds.
I couldn’t tell the difference between
Requires, but Arch Linux’s example used both, so I did the same!
After a bit of reading about unit files, it seems you can tell systemd what to do if a service fails. That makes sense, given a good part of systemd’s job is managing system wide services (not too different from the Windows Services I’m familiar with).
sudo systemctl edit nginx and add the following:
# Restart Nginx if it crashes.
And, after all my changes, apparently I need to
sudo systemctl enable nginx.
All of them!!
Why put all my eggs in one basket? I switched to an Nginx variable, so the DNS name can change. I fixed the service dependencies so Nginx starts after pihole-FTL. And, I set the Nginx service to restart after 15 seconds if it crashes.
Hopefully, at least one of those will keep it running!
At this point I decided to reboot the server and see if its working.
Nginx loaded without a problem.
Mar 17 22:28:33 cadbane pihole-FTL: FTL started!
pihole-FTL completes its loading sequence, and Nginx start up right afterwards.
Most importantly, it doesn’t crash!
Any server I manage must restart cleanly and without any human intervention. In the case of my new Debian web server, that means changing the start order of services so that DNS will be up and running before Nginx. Now I have a basic understanding of systemd unit files, and have a more reliable server.