Even Longer Way To a Stable Cloud
It is important to control your own data on your own servers. A «cloud» is nothing but your data running on someone else's computer, and thus they control your data, not you. Encryption may help a little, but the best solution is to fully own your data and infrastructure. This is also the philosophy of Pacta Plc: we do not store data on hardware that we do not own. We may use external clouds or servers, but not where customer data is involved. We protect your data.
Running a full, stable infrastructure is not a simple task, and there is much to learn. So here is the story of my adventures in learning and setting up a full cloud infrastructure, as it is currently also used by Pacta Plc.
1995 – 2010 Outdated Workstations as Server
I have been running my own server for my websites, e-mail, chat and much more since the nineties. It always ran on a single computer with many services in a shared bare-metal environment. In the beginning, I ran the server on my workstation, later on an old Siemens workstation that I inherited from my employer.
2011 Root Server and HP ProLiant Micro Server
Later I rented a root server, and in 2011 I bought my first dedicated HP ProLiant MicroServer. By now, I have three of them. On those machines, processes were started directly, with no virtualization.
In 2015, I ran my first experiments with Docker and began migrating my services from the standalone bare-metal servers to Docker containers. For this, I built dozens of images.
Constantly Extending Hard Disks
Over time, the HP ProLiant server got brothers and sisters, so that I currently have three of them, plus two huge servers, storing more and more data. Whenever a hard disk runs out of space, Linux lets you extend it or replace a drive with no more downtime than a single reboot.
2017 – 2018 From Kubernetes to Docker Swarm
Then in 2017, Kubernetes came into my focus. But it is totally overcomplicated. Docker Swarm is a much simpler and much more stable solution. There is no need for Kubernetes or OpenShift, unless you want to lose your time. So in 2018 I set up a Docker Swarm on some cheap PC Engines mini workers.
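Once the swarm is initialized (`docker swarm init` on the manager, `docker swarm join` on each worker), services are described in a compose file and deployed as a stack. A minimal, illustrative sketch (service name and image are placeholders, not one of my actual services):

```yaml
# docker-stack.yml — minimal Swarm stack (illustrative)
version: "3.8"
services:
  web:
    image: nginx:stable
    deploy:
      replicas: 3              # Swarm spreads the replicas over the workers
      restart_policy:
        condition: on-failure  # restart a replica if its container dies
    ports:
      - "80:80"
```

Deployed with `docker stack deploy -c docker-stack.yml web`, the swarm keeps the three replicas running even when a worker node fails.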
But a swarm solution needs a distributed cluster filesystem, so I came across GlusterFS, which turned out to be a complete disaster. In the beginning, it was very promising, but later, when filled with terabytes and terabytes of data, it became slow and very unstable.
So I started researching, which pointed me to LizardFS. The result is much more stable than GlusterFS, but still slow. Unlike the GlusterFS team, the LizardFS development team was really helpful and assisted me in getting it up relatively fast and stable. But especially the masters tend to require huge amounts of memory. That's why I bought a large HP and a large Dell server as master and backup master. The whole LizardFS now holds 90TB of data.
Since about 2020, I have been experimenting with CephFS, which is my currently recommended cluster file system. You can run it on PC Engines APU hosts with 4GB RAM. For the OSDs, put in an mSATA SSD of 1TB or 2TB. Choose at least three OSDs and three manager nodes. You cannot run OSDs and manager nodes on the same device, because 4GB RAM is not enough, but you can run the MON, MGR and MDS servers on the same node.
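The three-OSD recommendation corresponds to a replication factor of three. As an illustrative sketch, the relevant `ceph.conf` defaults could look like this (the fsid and monitor addresses are placeholders for your own cluster):

```ini
# /etc/ceph/ceph.conf — minimal sketch for a 3-OSD, 3-monitor cluster
[global]
fsid = <cluster-uuid>
mon_host = 10.0.0.11,10.0.0.12,10.0.0.13

# keep three copies of each object, one per OSD host
osd_pool_default_size = 3
# stay writable as long as two copies are available
osd_pool_default_min_size = 2
```

With `min_size = 2`, the filesystem survives one failed OSD without going read-only, which is exactly the margin that matters in the disaster stories below.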
2021 CephFS OSD Disaster
Wacky Initial Setup
My CephFS initially ran on three PC Engines boxes with 4GB RAM each; all had a 1TB mSATA SSD and were meant to run an OSD providing the disk to the cluster. In addition, one was set up as the MGR, MON and MDS metadata and monitoring server, but for these combined duties, 4GB RAM is not enough. So I initially worked around the problem by restarting the OSD on the management host every hour. This way, it was more or less stable. Later in 2020, I stopped the OSD on that host, losing one of three redundancies, but getting a stable system. I then bought four more devices, three with a 2TB mSATA SSD from AliExpress each, and one as a separate monitoring server.
First OSD Fails
Unfortunately, before I had time to add the new nodes to the network, there was a power failure the UPS could not bridge, and after rebooting, the BlueStore of one of the remaining two OSDs was corrupt. With only one OSD left, the whole filesystem degraded and went offline. So I added the third OSD back; in this constellation, the recovery started on Monday and finished on Thursday. But all data was back, and the filesystem was up and running again.
Second OSD Fails
But when I then tried to add the new hosts, I learned that they were incompatible: the old nodes ran Ubuntu 18.04, which comes with Ceph 12 Luminous, the new ones already ran Ubuntu 20.04 with Ceph 15 Octopus. So they could not talk to each other. As a first step, I upgraded the old Ceph installations to 14 Nautilus, but they were still incompatible. Unfortunately, the upgrade corrupted one of the remaining two OSDs: one succeeded, one failed. And I was in the same position as one week earlier, with only one OSD left. So I downgraded the new hosts to Ubuntu 18.04, upgraded them to Ceph 14 and added one of the new OSDs to get back to a replication factor of two. The recovery once again started on Monday and finished on Saturday. During the week, I added the remaining hosts to get full redundancy.
Currently, one OSD node is still broken, so five of six OSDs are running. In addition, I bought two more hosts to also run the management servers redundantly. Now the system is up, running and stable, with enough redundancy.
The learnings from this: never run fewer than three OSDs, and run the monitors on a separate device, or even better on three devices. Replace a failed OSD immediately, before the whole system degrades.
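Replacing a failed OSD immediately requires noticing the failure immediately. A small monitoring sketch in Python, assuming the rough JSON shape produced by `ceph status --format json` (the exact field names vary between Ceph releases, so treat them as an assumption):

```python
import json

def osds_down(status: dict) -> int:
    """Return how many OSDs are not up, given parsed `ceph status` JSON."""
    osdmap = status.get("osdmap", {})
    # some releases nest the OSD map one level deeper
    osdmap = osdmap.get("osdmap", osdmap)
    return osdmap.get("num_osds", 0) - osdmap.get("num_up_osds", 0)

def check(raw: str) -> str:
    """Turn raw `ceph status` JSON output into an OK/ALERT message."""
    status = json.loads(raw)
    down = osds_down(status)
    health = status.get("health", {}).get("status", "UNKNOWN")
    if down > 0:
        return f"ALERT: {down} OSD(s) down, health={health}"
    return f"OK: health={health}"

if __name__ == "__main__":
    # In production, feed in the output of: ceph status --format json
    sample = ('{"health": {"status": "HEALTH_WARN"},'
              ' "osdmap": {"num_osds": 6, "num_up_osds": 5}}')
    print(check(sample))  # ALERT: 1 OSD(s) down, health=HEALTH_WARN
```

Hooked into a cron job that mails the ALERT lines, this catches a degraded cluster hours instead of days after the fact.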
2021 CephFS MDS: Five Months to Recover
Update on 2021/10/22: Since Wednesday, the services are finally fully back, after a downtime of five months due to a recovery of the CephFS metadata server.

Our mail server had been hosted at an external provider for years. This was still a classical setup, no dockerisation, no cloud, just an installation of Postfix and Dovecot on a dedicated server with encrypted hard disks. That worked well, but then, after a power outage, the hard disk could no longer be decrypted. So I decided to restore the backup into the CephFS. Unfortunately, 4GB RAM per machine was just enough for the normal workload. When I unpacked a backup of about 50GB of mails, the filesystem got completely stuck. The MDS then went into recovering state. That used too much memory on the OSDs and on the MDS, so they crashed and restarted every couple of minutes. I tried to recover that way, but it didn't finish after a week, nor after two, nor after three.

Since I could not add more RAM to the PC Engines boxes, I bought four HPE MicroServer Gen10 Plus servers with a simple fast USB stick for the Ubuntu operating system: three as OSDs with 8GB RAM and a Western Digital 10TB hard disk each, and one as MON, MGR and MDS with 64GB RAM. They were delivered at the beginning of September. The OSDs are fine, and the recovery of the MDS succeeded, but the rejoin state again used too much memory: every four minutes the server crashed. Unfortunately, the HPE MicroServer has only two slots of max. 16GB RAM each. Finally, I abused my LizardFS master server, which has 160GB RAM, as an additional MDS server. With that configuration, recovery completed over night!
CephFS is stable enough to recover from this kind of disaster, but add enough RAM! The cheap PC Engines boxes with 4GB RAM are not enough. 8GB RAM seems to be enough for an OSD with a 10TB hard disk. 64GB RAM is normally enough for MON, MGR and MDS in normal operation, but not necessarily during disaster recovery. Since only one MDS server is active and the others are standby, adding just one MDS with lots of memory during disaster recovery solves the problem. For the MDS and all other services, set a memory limit slightly below your system memory, e.g. MemoryLimit=55G in the [Service] section of the systemd unit.
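As a sketch, on a 64GB host the limit can go into a systemd drop-in for the MDS unit (the drop-in file name is my choice; adjust the limit to your machine):

```ini
# /etc/systemd/system/ceph-mds@.service.d/memory.conf
[Service]
# keep the MDS from eating the whole machine during recovery
MemoryLimit=55G
```

After creating the file, reload and restart with `systemctl daemon-reload` and `systemctl restart ceph-mds@<id>`. This way the daemon gets killed and restarted by systemd instead of taking the whole host down with it.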
Later I’ll add one or two more manager nodes and eventually three more OSDs to expand the storage.
Current Status and Recommendation
So my current stable cloud runs on ten dedicated Docker Swarm nodes, one manager and nine workers, and is backed by a five-node CephFS, two managers and three OSDs. For setting up all these nodes, I use Ansible. All in all, there are 19 PC Engines nodes, 3 HP ProLiant, 4 HPE MicroServer Gen10 Plus, a large HP and a Dell. Only my huge amount of multimedia data, terabytes of scanned documents, family photos and home videos, is stored in LizardFS; all other data is now stored in CephFS.
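The node roles above map directly onto an Ansible inventory. A minimal sketch with made-up hostnames (my real inventory is larger and the names differ):

```ini
# inventory.ini — illustrative hostnames only
[swarm_manager]
swarm-mgr-1

[swarm_worker]
swarm-wrk-[1:9]

[ceph_mgr]
ceph-mgr-[1:2]

[ceph_osd]
ceph-osd-[1:3]
```

With groups like these, one playbook per role (`swarm_manager`, `ceph_osd`, …) rebuilds any failed node from scratch, which is what makes the cheap hardware acceptable.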