====== Router VM Problems ======
===== PROBLEM: Router VM Unable to Start =====
==== Symptoms ====
- Normally, this is first noticed when newly created or stopped VMs are unable to start after a long wait (usually after one or more CloudStack hosts have been rebooted or reset).
- In the CloudStack UI, go to Infrastructure --> Virtual Routers. In a healthy state, all 4 router VMs (1 router per VM subnet) are up and running, shown in green.
- If any of the virtual routers above stays in the Stopped state, or loops between the Starting and Stopped states, the router VM is probably down or stuck in a boot loop. This does not affect CloudStack customer VMs directly, but new or stopped VMs may be unable to start (if the VM uses an IP address within the range covered by that router VM).
- Click on the router VM having the issue, and take note of the following details: \\ **Guest IP Address**: The public IP address of the router VM. This is within the range of the network it is associated with. \\ **Link Local IP Address**: The internal IP address of the router VM. This IP is used to connect to the router VM from the host it runs on. \\ **Host**: The host on which the router VM is currently running. \\
- SSH to the host found above and verify that the router VM is running there:-
virsh list
Sample output:-
   Id    Name                           State
  ----------------------------------------------------
   1     r-689-VM                       running
   2     v-697-VM                       running
   3     s-696-VM                       running
   5     r-710-VM                       running
   11    r-717-VM                       running    <---- this is the router VM corresponding to the example above
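On a host with many VMs, it can help to reduce the listing to router VMs only. The following is a minimal sketch, assuming router VM names always follow the ''r-<id>-VM'' pattern seen in the sample output; the function name ''filter_router_vms'' is made up for illustration. It reads ''virsh list --all'' output on stdin, so stopped routers are included as well.

```shell
#!/bin/sh
# Hypothetical helper: keep only router VMs (names like r-717-VM) from
# `virsh list --all` output read on stdin, printing "<name> <state>".
# Assumption: router VM names always match r-<number>-VM.
filter_router_vms() {
    awk '$2 ~ /^r-[0-9]+-VM$/ {
        # The state can be more than one word (e.g. "shut off"),
        # so join every field from the third one onwards.
        state = $3
        for (i = 4; i <= NF; i++) state = state " " $i
        print $2, state
    }'
}

# Intended use on the host (not executed here):
#   virsh list --all | filter_router_vms
```

A router VM that does not appear at all, or that shows a state other than ''running'', is the one to investigate first.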
- From the host, SSH to the internal interface (the Link Local IP Address noted earlier) of the problematic router VM, as follows:-
ssh -v -i .ssh/id_rsa.cloud -p 3922 169.254.1.201
Note that you may need to retry many times before the connection succeeds. This is because the router VM's initial boot process can take much longer than normal.
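The manual retries can be wrapped in a small loop. This is only a sketch: the function name ''retry'', the attempt count and the delay are arbitrary choices for illustration, and the key path, port and IP in the usage comment are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed" >&2
        # Only wait if another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Intended use on the host (not executed here; 30 attempts and a
# 10 second delay are assumed values, adjust as needed):
#   retry 30 10 ssh -i .ssh/id_rsa.cloud -p 3922 -o ConnectTimeout=5 169.254.1.201
```

Because ''ssh'' returns the exit status of the remote session, the loop also stops as soon as an interactive login ends normally.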
- Once inside the router VM, monitor the cloud router VM bootstrap logs:-
tail -100f /var/log/cloud.log
After following the log output closely, you should spot the loop during booting. In this example, the loop occurs here:-
2017-01-24 01:49:19,336 CsHelper.py execute:184 Executing: ip link show eth0 | grep 'state DOWN'
2017-01-24 01:49:19,342 CsHelper.py execute:184 Executing: arping -c 1 -I eth0 -A -U -s 10.x.x.x None
.....
2017-01-24 01:51:32,978 CsHelper.py execute:184 Executing: ip link show eth0 | grep 'state DOWN'
2017-01-24 01:51:32,983 CsHelper.py execute:184 Executing: arping -c 1 -I eth0 -A -U -s 10.x.x.x None
.....
2017-01-24 01:52:58,426 CsHelper.py execute:184 Executing: ip link show eth0 | grep 'state DOWN'
2017-01-24 01:52:58,430 CsHelper.py execute:184 Executing: arping -c 1 -I eth0 -A -U -s 10.x.x.x None
.....
In this example, the ''arping'' command hung for a while, then appeared to time out, and the whole sequence repeated.
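When the log is long, counting how often each command is executed can point to the loop faster than reading it line by line. The sketch below assumes every relevant log line contains the text ''Executing: '' followed by the command, as in the excerpt above; the function name ''top_repeated_commands'' is made up for illustration. A high count alone does not prove that a command hangs, so confirm against the timestamp gaps in the log.

```shell
#!/bin/sh
# Hypothetical helper: read cloud.log lines on stdin and print the most
# frequently executed commands as "<count> <command>", most repeated first.
# Optional first argument: how many commands to show (default 5).
top_repeated_commands() {
    sed -n 's/^.*Executing: //p' |   # keep only the command text
        sort |
        uniq -c |                    # count identical commands
        sort -rn |                   # highest count first
        head -n "${1:-5}"
}

# Intended use on the router VM (not executed here):
#   top_repeated_commands 5 < /var/log/cloud.log
```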
- Look for the hung / looping command in the cloud scripts located in ''/opt/cloud/bin/''. In this example:-
grep -ri 'arping' /opt/cloud/bin
Sample results:-
/opt/cloud/bin/vpc_ipassoc.sh: sudo arping -c 1 -I $ethDev -A -U -s $pubIp $pubIp
/opt/cloud/bin/vpc_ipassoc.sh: sudo arping -c 1 -I $ethDev -A -U -s $pubIp $pubIp
/opt/cloud/bin/ipassoc.sh: sudo arping -c 1 -I $ethDev -A -U -s $ipNoMask $ipNoMask;
/opt/cloud/bin/ipassoc.sh: sudo arping -c 1 -I $ethDev -A -U -s $ipNoMask $ipNoMask;
/opt/cloud/bin/ipassoc.sh: sudo arping -c 1 -I $ethDev -A -U -s $ipNoMask $ipNoMask;
/opt/cloud/bin/ipassoc.sh: sudo arping -c 1 -I $ethDev -A -U -s $ipNoMask $ipNoMask;
/opt/cloud/bin/vpc_guestnw.sh: sudo arping -c 3 -I $dev -A -U -s $ip $ip
/opt/cloud/bin/cs/CsRedundant.py: "arping_gateways.sh.templ"
/opt/cloud/bin/cs/CsAddress.py: cmd = "arping -c 1 -I %s -A -U -s %s %s" % ( <------- the looping script
Binary file /opt/cloud/bin/cs/CsAddress.pyc matches
Binary file /opt/cloud/bin/cs/CsRedundant.pyc matches
In this example, the looping script is ''/opt/cloud/bin/cs/CsAddress.py''. Edit that file and disable the looping command:-
vi /opt/cloud/bin/cs/CsAddress.py
Remove/comment out the looping command:-
  .....
      def arpPing(self):
          cmd = "arping -c 1 -I %s -A -U -s %s %s" % (
              self.dev, self.address['public_ip'], self.address['gateway'])
          #CsHelper.execute(cmd)    <----- comment out execution of the above looping arping command
  .....
Save the script (:wq) and continue monitoring via ''tail -f /var/log/cloud.log''. Wait for the boot process to rerun the script in its next loop. If all goes well, the boot process runs through to the end and the router VM boots into the ''Running'' state.
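The same edit can be made without an editor while keeping a backup of the original script. This is a sketch only: it assumes the ''arpPing'' function looks exactly like the excerpt above and that ''sed'' on the router VM supports in-place editing with a backup suffix (''-i.bak''). The function name ''disable_arping'' is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: disable_arping FILE
# Keeps a copy of FILE as FILE.bak, then comments out the first
# CsHelper.execute(cmd) call that follows "def arpPing" in FILE.
# Assumption: the function body matches the excerpt shown in this page.
disable_arping() {
    file=$1
    # The address range starts at "def arpPing" and ends at the first
    # execute call after it, so calls elsewhere in the file are untouched.
    sed -i.bak \
        '/def arpPing/,/CsHelper\.execute(cmd)/ s/^\([[:space:]]*\)CsHelper\.execute(cmd)/\1#CsHelper.execute(cmd)/' \
        "$file"
}

# Intended use on the router VM (not executed here):
#   disable_arping /opt/cloud/bin/cs/CsAddress.py
#   diff -u /opt/cloud/bin/cs/CsAddress.py.bak /opt/cloud/bin/cs/CsAddress.py
```

Reviewing the ''diff'' output before waiting for the next boot loop confirms that only the intended line changed; copying the ''.bak'' file back restores the original script.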