Long Way To a Stable Cloud
It is important to control your own data on your own servers. A «cloud» is nothing but running your data on someone else’s computer, and thus they control your data, not you. Encryption may help a little bit, but the best solution is to fully own your data and infrastructure. This is also the philosophy of Pacta Plc: We do not store data on hardware that we do not own. We may use external clouds or servers, but not where customer data is involved. We protect your data.
Running a full stable infrastructure is not a simple task, and there is much to learn. So here you find the story of my adventures when learning and setting up a full cloud infrastructure, as it is also currently used by Pacta Plc.
1995 – 2010 Outdated Workstations as Server
I have been running my own server for my websites, e-mail, chat and much more since the nineties. It was always a single computer running many services in a shared bare-metal environment. In the beginning, I ran the server on my workstation, then on an old Siemens workstation that I inherited from my employer.
2011 Root Server and HP ProLiant Micro Server
Later I rented a root server, and in 2011 I bought my first dedicated HP ProLiant micro server. By now, I have three of them. On those machines, processes were started directly, with no virtualization.
Constantly Extending Hard Disks
Over time the HP ProLiant server got brothers and sisters, so that I currently have three of them, plus two huge servers, storing more and more data. Whenever a hard disk runs out of space, Linux lets you easily extend it or replace a drive with almost no downtime, at most a single reboot.
2017 – 2018 From Kubernetes to Docker Swarm
Then in 2017, Kubernetes came into my focus. But that’s totally overcomplicated. With Docker Swarm there is a much simpler and much more stable solution available. There’s no need for Kubernetes or OpenShift, unless you want to lose your time. So in 2018 I set up a Docker Swarm on some cheap PC Engines mini workers.
But a swarm solution needs a distributed cluster filesystem, so I came across GlusterFS, which turned out to be a complete disaster. At the beginning, it was very promising, but later, when filled with terabytes and terabytes of data, it became slow and very unstable.
So I did some research, which pointed me to LizardFS. The result is much more stable than GlusterFS, but still slow. Unlike with GlusterFS, the LizardFS development team was really helpful and assisted me in getting it reasonably fast and stable. But especially the masters tend to require huge amounts of memory. That’s why I bought a large HP and a large Dell server as master and backup master server. The whole LizardFS now holds 90 TB of data.
Since about 2020, I have been experimenting with CephFS, which is the cluster file system I currently recommend. You can run it on PC Engines APU hosts with 4 GB RAM. For the OSDs, put in an mSATA SSD of 1 TB or 2 TB. Choose at least three OSDs and three manager nodes. You cannot run OSDs and management daemons on the same device, because 4 GB RAM is not enough, but you can run the MON, MGR and MDS servers on the same node.
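The sizing rules above can be written down as a small planning check. This is only a minimal sketch: the `Host` type, the host names and the `check_layout` function are made up for illustration and are not part of Ceph or of the setup described here. It encodes just the three rules from the text: at least three OSD hosts, at least three management hosts, and no 4 GB host that mixes OSD and management roles.

```python
from dataclasses import dataclass

MIN_OSDS = 3        # "at least three OSDs" from the text
MIN_MANAGERS = 3    # "three manager nodes" from the text
MANAGEMENT_ROLES = {"mon", "mgr", "mds"}


@dataclass(frozen=True)
class Host:
    name: str          # hypothetical host name
    roles: frozenset   # subset of {"osd", "mon", "mgr", "mds"}


def check_layout(hosts):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    osd_hosts = [h for h in hosts if "osd" in h.roles]
    mgmt_hosts = [h for h in hosts if h.roles & MANAGEMENT_ROLES]

    if len(osd_hosts) < MIN_OSDS:
        problems.append(f"only {len(osd_hosts)} OSD host(s), need {MIN_OSDS}")
    if len(mgmt_hosts) < MIN_MANAGERS:
        problems.append(f"only {len(mgmt_hosts)} management host(s), need {MIN_MANAGERS}")
    for h in hosts:
        # MON, MGR and MDS may share a node, but not together with an OSD.
        if "osd" in h.roles and h.roles & MANAGEMENT_ROLES:
            problems.append(f"{h.name} mixes OSD and management roles")
    return problems


# Illustrative plan: three OSD hosts plus three combined management hosts.
plan = [Host(f"osd{i}", frozenset({"osd"})) for i in range(1, 4)]
plan += [Host(f"mgmt{i}", frozenset({"mon", "mgr", "mds"})) for i in range(1, 4)]

for problem in check_layout(plan):
    print(problem)
```

Running such a check before buying or wiring up hardware is cheap, and it would have flagged the initial three-node setup described in the next section.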
2021 CephFS OSD Disaster
Wacky Initial Setup
My CephFS initially ran on three PC Engines hosts with 4 GB RAM each. All had a 1 TB mSATA SSD and were meant to run as OSDs that provide their disk to the cluster. In addition, one was set up as MGR, MON and MDS server, that is, as management, monitoring and metadata server, but 4 GB RAM is not enough for both duties. So I initially worked around the problem by restarting the OSD on the management host every hour. This way, it was more or less stable. Later in 2020, I stopped the OSD on that host, lost one of three redundancies, but got a stable system. I then bought four more devices, three of them each with a 2 TB mSATA SSD from AliExpress, and one as a separate monitoring server.
First OSD Fails
Unfortunately, before I had the time to add the new nodes to the network, there was a power failure that the UPS could not bridge, and after rebooting, the BlueStore in one of the remaining two OSDs was corrupt. With only one OSD left, the whole filesystem was degraded and went offline. So I added the third OSD again, and the recovery in this constellation started on Monday and finished on Thursday. But all data was back, and the filesystem was up and running again.
Second OSD Fails
But when I then tried to add the new hosts, I learned that they were incompatible: The old nodes ran on Ubuntu 18.04, which comes with Ceph 12 Luminous, the new ones already on Ubuntu 20.04 with Ceph 15 Octopus. So they could not talk to each other. As a first step, I upgraded the old Ceph installations to 14 Nautilus, but they were still incompatible. Unfortunately, during the upgrade, one of the remaining two OSDs got corrupted: one upgrade succeeded, one failed. And I was in the same position as one week earlier, with only one OSD left. So I downgraded the new hosts to Ubuntu 18.04, upgraded them to Ceph 14 and added one of the new OSDs to get back to a factor of two. The recovery started once again on Monday and finished on Saturday. During the week, I added the remaining hosts to get full redundancy.
Currently, one OSD node is still broken, so five of six OSDs are now running. In addition, I bought two more hosts to also run the management servers redundantly. Now the system is up and running and stable, with enough redundancy.
The lessons learned from this: Never run fewer than three OSDs, and run the monitors on a separate device, or even better on three devices. Replace a failed OSD immediately, before the whole system degrades.
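A version mismatch like this can be spotted before joining new hosts. The following is a minimal sketch under assumptions: the host names and exact version strings are illustrative, not taken from the real cluster, and `mixed_releases` is a hypothetical helper. The release names follow the upstream Ceph naming (12 Luminous, 13 Mimic, 14 Nautilus, 15 Octopus).

```python
# Upstream Ceph release names by major version.
CEPH_RELEASES = {12: "Luminous", 13: "Mimic", 14: "Nautilus", 15: "Octopus"}


def release_name(major):
    """Map a Ceph major version number to its release name."""
    return CEPH_RELEASES.get(major, f"unknown ({major})")


def mixed_releases(host_versions):
    """Group hosts by Ceph major version.

    host_versions maps a host name to a version string such as "12.2.13".
    More than one key in the result means the cluster mixes releases.
    """
    groups = {}
    for host, version in host_versions.items():
        major = int(version.split(".")[0])
        groups.setdefault(major, []).append(host)
    return groups


# Illustrative inventory: two old hosts and one new host (assumed names).
versions = {
    "old-osd1": "12.2.13",
    "old-osd2": "12.2.13",
    "new-osd1": "15.2.1",
}

groups = mixed_releases(versions)
if len(groups) > 1:
    for major in sorted(groups):
        print(f"Ceph {major} {release_name(major)}: {', '.join(groups[major])}")
```

In practice the version strings would have to be collected from each host, for example from the installed package version; how that is done is left out here.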
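Why three OSDs matter can be illustrated with a little arithmetic. This is a deliberately simplified sketch, not Ceph's real placement logic: it assumes one replica per OSD and a replicated pool with `size=3` and `min_size=2`. These values are assumptions for illustration, since the actual pool settings are not stated here. Under these assumptions, I/O is blocked as soon as fewer than `min_size` OSDs are up.

```python
def pool_state(osds_up, size=3, min_size=2):
    """Classify a replicated pool by the number of OSDs that are up.

    Simplified model: every object has one replica per OSD, up to `size`.
    - "healthy":  all `size` replicas are available
    - "degraded": still serving I/O, but with reduced redundancy
    - "blocked":  fewer than `min_size` replicas, I/O stops
    """
    if osds_up < 0:
        raise ValueError("osds_up must not be negative")
    if osds_up >= size:
        return "healthy"
    if osds_up >= min_size:
        return "degraded"
    return "blocked"


# Walk through the failure sequence: three, two, then one OSD left.
for up in (3, 2, 1):
    print(up, pool_state(up))
```

With only two OSDs there is no safety margin left: a single further failure takes the pool offline, which is what happened twice in the story above.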
Current Status and Recommendation
So my current stable cloud runs 10 dedicated Docker Swarm nodes, one manager and nine workers, and is backed by a nine-node CephFS, three managers and six OSDs. For setting up all these nodes, I use Ansible. All in all, there are 19 PC Engines nodes, 3 HP ProLiant, a large HP and a Dell. Only my huge amount of multimedia data, terabytes of scanned documents, family photos and home videos, is stored in LizardFS; all other data is now stored in CephFS.
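To give an idea of how such a node list can be fed to Ansible, here is a minimal sketch that renders the 19 small nodes as an INI-style Ansible inventory. The group names and host names are invented for illustration; only the counts (one Swarm manager, nine Swarm workers, three Ceph managers, six OSDs) come from the text.

```python
def numbered(prefix, count):
    """Return host names like prefix01, prefix02, ... (hypothetical naming)."""
    return [f"{prefix}{i:02d}" for i in range(1, count + 1)]


def build_inventory(groups):
    """Render {group: [hosts]} as INI-style Ansible inventory text."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")
        lines.extend(hosts)
        lines.append("")  # blank line between groups
    return "\n".join(lines)


# Counts from the text; names are assumptions.
GROUPS = {
    "swarm_manager": numbered("swarm-manager", 1),
    "swarm_worker": numbered("swarm-worker", 9),
    "ceph_manager": numbered("ceph-manager", 3),
    "ceph_osd": numbered("ceph-osd", 6),
}

total = sum(len(hosts) for hosts in GROUPS.values())
print(f"{total} nodes")
print(build_inventory(GROUPS))
```

Grouping hosts by role this way lets a playbook target, say, only the OSD hosts, while the node counts stay in one place.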