Tonight I found myself in quite the predicament, one that was rather unexpected (more on this later) to say the least. I was in the middle of something (life outside a computer), in a rather loud environment and a place where taking your phone out is discouraged. Some gut feeling (or something) made me check my phone anyway, and no, it wasn't the vibration, because that was turned off. I had a new email from the Craftyn Forums: a concerned player reported that the server was acting strange and kicking people off, and wanted to let me know in case I wasn't already aware. That's where this long journey begins, around 7:41pm.
Once I got the email, I SSH'd into the server and looked at the Minecraft server console to see what was up. Something major had happened, so I forced the server to restart, and I almost didn't watch it come back up, but I did anyway. During startup I saw errors where I normally don't see them, and one caught my eye: "not enough disk space". I thought to myself, "But we have gigs upon gigs of free space!" So I excused myself from where I was and started looking into it. One of the first things I noticed was that one of the partitions on the primary hard drive was 100% full, with 0 bytes free. That was extremely worrisome, and it explained why the Minecraft server was throwing those errors on startup. The bad part was that it wasn't the /home/ directory but the root partition /, which contains every folder not mounted somewhere else. Thinking it through, the only thing that could plausibly be using all that space was the MySQL databases, the huge ones we have holding all the storage hostage. Currently our three largest databases are:
- Hawkeye
- Prism
- Prism Test
The reason these three databases are large, up to 80GB, is that they record pretty much every action that happens on the server. We rarely use Hawkeye anymore, Prism is actively being used (and purged), and Prism Test is dedicated to testing. These tools let us roll back griefs people do to others, and help us look up information if we are asked to look into a harassment report.
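If you need to confirm this kind of diagnosis yourself, the usual tools are `df` for per-partition usage and `du` for finding the biggest directories. A quick sketch (the paths here are generic examples, not specific to our server):

```shell
# Per-filesystem usage; look for the partition sitting at 100%.
df -h

# Largest directories on the root filesystem (-x stays on one filesystem).
du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 15

# Per-database size in GB (requires MySQL credentials):
# mysql -e "SELECT table_schema, ROUND(SUM(data_length+index_length)/1024/1024/1024,1) AS gb
#           FROM information_schema.tables GROUP BY table_schema ORDER BY gb DESC;"
```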
But that aside, I quickly tried to free up some storage by doing the following:
- Removed a ton of files under the /tmp directory
- Ran a SQL query to remove quite a bit of data from our Prism Test database
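The /tmp cleanup boils down to a single `find` invocation. Here's the same pattern demonstrated on a scratch directory so nothing real is at risk (the 7-day cutoff is an example, not what I actually used):

```shell
# Scratch directory standing in for /tmp.
scratch=$(mktemp -d)
touch -d '10 days ago' "$scratch/old.log"   # stale file
touch "$scratch/new.log"                    # recent file

# Delete regular files not modified within the last 7 days.
find "$scratch" -type f -mtime +7 -delete

ls "$scratch"   # only new.log remains
```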
After verifying those steps had freed up a little space, only 374MB to be exact, I felt better. That's when I posted on our shoutbox, around 8:27pm, that the server would stay down until I could look into it from a computer. Then I went back to what I was doing; I got a few strange looks since I had been gone for quite a while (maybe they thought I had an extremely upset stomach or something?).
Fast forward to 12:00am, midnight, and I was finally somewhere I could get on my laptop and tether internet from my phone. One bad thing, though: I was in an area with horrid signal. At best I was tethering one bar of 3G; at worst I had no internet connection at all. While waiting for signal, I thought through a few things. You see, our server has the following hard drive and partition setup:
- Primary hard drive (regular SATA) – 2TB in a RAID 1 configuration
  - Partition 1: /boot – 1.4GB (not sure what we were thinking when we set this up; it should only have been around 50MB)
  - Partition 2: / – 118GB
  - Partition 3: swap (didn't grab the size, sorry)
  - Partition 4: /home – 1.6TB
- SSD – 120GB in a RAID 1 configuration
  - Partition 1: /speed – 120GB
- SSHD – 2TB
  - Partition 1: /data – 2TB
- RAMDISK – 62GB (only ~13GB is used)
This means the MySQL data was stored on Partition 2, which had hit its 118GB limit, so I needed to move all that data to either another partition or another hard drive altogether. Moving it to Partition 4 didn't make sense because that mount point is /home, and we don't want to store it in anyone's home directory. So I decided to move it all to the SSHD (the hybrid drive), since it was only at 13% usage.
After waiting like ten minutes, I finally had internet back. I then made sure everything which was connecting to MySQL was stopped and then stopped the MySQL service:
sudo service mysql stop
After doing that, I made a backup of the configuration file for MySQL:
sudo cp /etc/mysql/my.cnf /etc/mysql/my.cnf.backup
I then made the folder for the MySQL in the data mount:
sudo mkdir /data/mysql
Then I proceeded to move all the data to the new directory; this took forever because of how much data there was (note the trailing /. so the contents land directly in /data/mysql rather than in a nested mysql folder):
sudo cp -a /var/lib/mysql/. /data/mysql
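Before trusting a copy this large, it's worth checking that the two trees actually match. A self-contained sketch of the pattern, run on scratch directories (the file name is just a stand-in for real MySQL files):

```shell
src=$(mktemp -d); dst=$(mktemp -d)
echo "sample data" > "$src/ibdata1"

# -a preserves permissions, ownership, and timestamps; the trailing /.
# copies the *contents* of src rather than nesting src inside dst.
cp -a "$src/." "$dst/"

# Recursive compare exits 0 only if the trees are identical.
diff -r "$src" "$dst" && echo "copies match"
```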
Once that was completed, I made sure the MySQL user and group owned that folder (on a default install, the MySQL user and group are probably both mysql):
sudo chown -hR mysqluser /data/mysql
sudo chgrp -hR mysqlgroup /data/mysql
Another change I wanted to make was pointing the MySQL tmp directory at the SSD; that way it might operate a bit faster, even if only by a few milliseconds:
sudo mkdir /speed/mysqltmp
sudo chmod 777 /speed/mysqltmp
Now the only thing left to do was change the two settings in the MySQL configuration file:
datadir = /data/mysql
tmpdir = /speed/mysqltmp
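For context, both settings belong under the `[mysqld]` section of the config file; on a default install the relevant fragment would look roughly like this (surrounding options omitted):

```ini
[mysqld]
datadir = /data/mysql
tmpdir  = /speed/mysqltmp
```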
After changing those values, I started the MySQL service back up:
sudo service mysql start
Once it started successfully, I verified it by running a few queries and stored procedures. The next thing to do, before removing all the data on the full partition, was to make a full backup of it. So I ran the following command:
tar -czvf /data/backup/mysql-2015-01-31.tar.gz /var/lib/mysql
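Since this archive was about to become one of only two copies of the data, it's worth confirming it's readable before deleting anything; `tar -tzf` lists the archive's contents without extracting it. A scratch-directory sketch of the same pattern:

```shell
workdir=$(mktemp -d)
mkdir "$workdir/mysql"
echo "sample" > "$workdir/mysql/ibdata1"

# Create the archive the same way, then list its contents to confirm
# it can be read end to end.
tar -czf "$workdir/backup.tar.gz" -C "$workdir" mysql
tar -tzf "$workdir/backup.tar.gz"
```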
By the time this completed successfully, it was around 1:45am. I decided to go ahead and remove all the MySQL data on the full partition, since we essentially had two backups: the folder now serving the live server, and the archive I had just completed. Here is how I did it, verifying first that I was in the correct folder:
# cd /var/lib/mysql
# pwd
# rm -r *
Once that completed, which took around five to ten minutes, I verified that the full partition was no longer full:
Thankfully it wasn't anymore, now at only ~6% usage. By this time I was ready to get home, so I enabled the whitelist on the Minecraft server and brought it back up before my forty-five minute drive. I figured that leaving it running that long with no one on it would make it easy to see if anything went wrong.
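The check itself is a one-liner; `df -h` against the mount point reports the usage percentage directly:

```shell
# Usage of whatever filesystem / lives on.
df -h /
```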
Once I made it home, I verified that no errors had occurred, and I've been monitoring it while typing up this post. I feel it is stable so far and ready to be un-whitelisted. Here is a combined screenshot from right before I deleted the data from the full partition; you can see a drastic difference.
Now, at the start I mentioned something about this being unexpected. That was a half-truth. You see, I knew the databases were growing very large, but I didn't realize how fast. For the past year, since we implemented the new server, I've put off getting a monitoring service that watches for growing disk usage, insane CPU usage, heavy disk I/O, massive RAM usage, and so on. So tonight, before typing this post, I spent a good while searching for a simple and hopefully free monitoring solution, while keeping my options open for a paid one. I found New Relic, and after three hours of it monitoring the server and the MySQL server, I've been impressed. I've set up alerts for when disk usage on any partition reaches a certain percentage, plus a few other alerts. Here are a few screenshots from it:
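Even without a full monitoring service, a minimal cron-able check can cover the disk-usage case that bit us tonight. A sketch, where the 90% threshold and the alert action are placeholders you'd adapt to your own setup:

```shell
#!/bin/sh
# Warn when root-partition usage crosses a threshold (90% here is arbitrary).
threshold=90
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')

if [ "$usage" -ge "$threshold" ]; then
    # Placeholder action: swap in mail, a webhook, etc. for real alerting.
    echo "WARNING: / is at ${usage}% usage"
fi
```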
Hopefully someone found this post somewhat useful; I know I for one will find it useful if I ever have to move MySQL data again.
Thanks for reading!