Nikoolayy1
Jun 27, 2021

Knowledge sharing: High CPU/Memory/Swap investigation/troubleshooting

I will share some basic knowledge about troubleshooting and resolving high data plane or control plane CPU. There is already a great article on the topic, so check it first:

 

 

 

 

1. If CPUs 0, 2, 4 (the even-numbered ones) are high, it is a data plane (TMM) issue; if CPUs 1, 3, 5 (the odd-numbered ones) are high, it is a control plane CPU issue. Please read:

 

 

 

2. If the control plane CPU is constantly high, run the Linux "top" command to see which process is causing the issue. Check for known bugs on Google, AskF5, the bug tracker, the release notes and iHealth, then restart the process. Be careful, as restarting critical processes may have an impact. If the process is not critical, like bigd, just restart it; I have seen bugs where bigd or snmpd leak memory and need a restart.
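
To make the "which process is causing it" step repeatable, a single snapshot from "top -b -n 1" can be parsed and sorted by CPU. This is only a minimal sketch: the function name and the sample text are made up for illustration, and it assumes the default top column layout with PID, %CPU and COMMAND columns.

```python
# Illustrative sample only; real output from "top -b -n 1" will differ.
SAMPLE = """\
top - 10:00:01 up 5 days,  1:02,  1 user,  load average: 1.20, 1.10, 0.90
Tasks: 3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4321 root      20   0  512000  90000   8000 S  85.0  1.1  10:00.00 bigd
 1234 root      20   0  256000  40000   6000 S   3.0  0.5   1:00.00 snmpd
 5678 root      20   0  128000  20000   4000 S   0.5  0.2   0:10.00 sshd
"""


def top_cpu_processes(top_output, limit=5):
    """Return (command, pid, cpu_percent) tuples for the busiest processes."""
    lines = top_output.splitlines()
    # Locate the column header row; the process table follows it.
    header_idx = None
    for i, line in enumerate(lines):
        fields = line.split()
        if fields[:1] == ["PID"] and "%CPU" in fields and "COMMAND" in fields:
            header_idx = i
            break
    if header_idx is None:
        return []
    columns = lines[header_idx].split()
    pid_col = columns.index("PID")
    cpu_col = columns.index("%CPU")
    cmd_col = columns.index("COMMAND")

    rows = []
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if len(fields) <= cmd_col:
            continue  # blank or truncated line
        try:
            pid = int(fields[pid_col])
            cpu = float(fields[cpu_col].replace(",", "."))
        except ValueError:
            continue  # not a process row
        rows.append((" ".join(fields[cmd_col:]), pid, cpu))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    for command, pid, cpu in top_cpu_processes(SAMPLE, limit=2):
        print(command, pid, cpu)
```

The script needs Python 3 and can run on a workstation against saved output, so nothing extra has to run on the box while it is already under load.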

 

 

 

3. If the control plane CPU jumps from time to time, the issue is harder to catch. If you see a pattern, check the logs, cron jobs and any REST API scripts that run at the same time as the CPU jumps. For example, the F5 ASM datasyncd process may cause periodic jumps, as mentioned in https://support.f5.com/csp/article/K02827102, or the ASM Policy Builder may be enabled and learning too many things, as mentioned in https://support.f5.com/csp/article/K58571155. Also, if top shows many "tmsh" processes, a REST API script is not closing its connections correctly, which leaves many tmsh sessions hanging and causes high CPU and memory usage (in this case configure the tmsh timeout, as it is not configured by default: https://support.f5.com/csp/article/K9908). To catch an issue that occurs at random times, you may need to follow the article below or run top in batch mode with arguments like "top -b -n 10 -d 10 >> /var/tmp/top.txt", which takes 10 snapshots at 10-second intervals (the -b option is needed when the output is redirected to a file):
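
Once a capture file such as /var/tmp/top.txt exists, the hanging tmsh sessions can be counted per snapshot to see whether they pile up over time. The sketch below is hedged: the function name and sample text are hypothetical, and it assumes the file was written in batch mode ("top -b ..."), so every snapshot starts with a "top - HH:MM:SS" summary line and the command name is the last column.

```python
# Illustrative two-snapshot capture; a real /var/tmp/top.txt will be longer.
SAMPLE_LOG = """\
top - 10:00:01 up 5 days,  1:02,  1 user,  load average: 1.20, 1.10, 0.90
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2001 admin     20   0  300000  50000   7000 S  12.0  0.6   0:30.00 tmsh
 2002 admin     20   0  300000  50000   7000 S  11.0  0.6   0:29.00 tmsh
 4321 root      20   0  512000  90000   8000 S   2.0  1.1  10:00.00 bigd

top - 10:00:11 up 5 days,  1:02,  1 user,  load average: 1.30, 1.10, 0.90
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2001 admin     20   0  300000  50000   7000 S  12.0  0.6   0:31.00 tmsh
 2002 admin     20   0  300000  50000   7000 S  11.0  0.6   0:30.00 tmsh
 2003 admin     20   0  300000  50000   7000 S  10.0  0.6   0:01.00 tmsh
 4321 root      20   0  512000  90000   8000 S   2.0  1.1  10:00.00 bigd
"""


def tmsh_count_per_snapshot(top_log):
    """Return a list of (snapshot_time, tmsh_process_count) tuples."""
    snapshots = []
    for line in top_log.splitlines():
        fields = line.split()
        if line.startswith("top - ") and len(fields) >= 3:
            # New snapshot: the third field of the summary line is HH:MM:SS.
            snapshots.append([fields[2], 0])
        elif snapshots and fields and fields[-1] == "tmsh":
            snapshots[-1][1] += 1
    return [(time, count) for time, count in snapshots]


if __name__ == "__main__":
    for time, count in tmsh_count_per_snapshot(SAMPLE_LOG):
        print(time, count)
```

To use it on a real capture, read the file with open("/var/tmp/top.txt").read() and pass the text in. A count that keeps growing between snapshots points to sessions that are never closed.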

 

 

4. If the CPU is high for the TMM process during peak working hours, your system may be overutilized. You may then need to increase the number of cores for virtual systems (if the license allows it) or for vCMP guests (if there are free cores), or buy another device. Log messages in /var/log/kern for the idle enforcer or "clock advanced" messages in /var/log/ltm may also indicate TMM CPU issues:
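
To check whether the "clock advanced" messages line up with peak hours, they can be counted per hour. This is a rough sketch, not an official tool: the function name is made up, the sample lines are illustrative rather than real log output, and it assumes the usual syslog prefix of month, day and HH:MM:SS at the start of each line.

```python
from collections import Counter

# Illustrative lines only; hostnames, PIDs and tick counts are invented.
SAMPLE_LINES = [
    "Jun 27 10:15:01 bigip1 warning tmm[12345]: 01010029:5: Clock advanced by 518 ticks",
    "Jun 27 10:47:33 bigip1 warning tmm1[12345]: 01010029:5: Clock advanced by 102 ticks",
    "Jun 27 11:02:10 bigip1 notice mcpd[5000]: some unrelated message",
    "Jun 27 14:05:09 bigip1 warning tmm[12345]: 01010029:5: Clock advanced by 230 ticks",
]


def clock_advanced_per_hour(log_lines):
    """Count lines containing 'clock advanced', grouped by 'Mon DD HH:00'."""
    counts = Counter()
    for line in log_lines:
        if "clock advanced" not in line.lower():
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # no syslog timestamp to group by
        month, day, clock = fields[0], fields[1], fields[2]
        counts["{} {} {}:00".format(month, day, clock[:2])] += 1
    return dict(counts)


if __name__ == "__main__":
    for hour, count in sorted(clock_advanced_per_hour(SAMPLE_LINES).items()):
        print(hour, count)
```

For a real check, pass the open file object for /var/log/ltm instead of the sample list. The same pattern works for the idle enforcer messages if the search string is changed.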

 

 

 

You may try some small optimizations, like upgrading to the latest version, checking /var/log/ltm and turning off any iRule logging, disabling forgotten TCP RST logging variables, resolving SSL handshake issues, removing orphaned configuration objects, and stopping any other debugs or modified system logging variables (it is better to have the F5 send logs to an external log server with HSL, which can also be done in an iRule with the "HSL::" commands).

 

 

 

5. For memory issues, don't forget that the "top" command shows memory per Linux process, while "show sys memory" shows the memory of the F5 TMM subsystems. For example, a bad iRule can cause the "tcl" subsystem memory to go high. Memory sweeper logs in /var/log/ltm are also a good indication: https://support.f5.com/csp/article/K13302777 and https://support.f5.com/csp/article/K15740. A DDoS attack may also cause high memory, so be careful. For control plane memory, if you see many tmsh sessions open in top, check your REST API scripts and automations and configure the tmsh timeout; I have seen this too many times to count. The memory for a vCMP guest is increased by adding more cores if needed, and for virtual editions it is much easier.

 

Note: A high Other Used memory usage on the BIG-IP system Dashboard may not indicate an issue, as Linux kernel allocates memory to buffers and disk caching that can be released as needed.

 

 

 

Example high control plane memory:

 

Examples for high tmm data plane memory:

  • K02620345
  • K13889
  • K09336400
  • K15245
  • ID633402
  • K44385170

 

 

 

6. For swap issues, you can configure top to show the process causing the issue, or just upload a qkview to iHealth and check from there:
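
As an alternative to the top view, swap usage per process can be read straight from the VmSwap field in /proc/<pid>/status. This is a different technique from the one in the linked article, shown only as a hedged sketch: the function names are made up, and it assumes the kernel exposes VmSwap in the status file.

```python
import os


def status_field(status_text, key):
    """Return the raw value of 'key:' from /proc/<pid>/status text, or ''."""
    prefix = key + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return ""


def parse_vmswap_kb(status_text):
    """Return the VmSwap value in kB, or 0 if the field is missing."""
    parts = status_field(status_text, "VmSwap").split()
    if parts and parts[0].isdigit():
        return int(parts[0])
    return 0


def swap_by_process(proc_root="/proc", limit=10):
    """Return (swap_kb, pid, name) tuples for the largest swap users."""
    usage = []
    if not os.path.isdir(proc_root):
        return usage
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        path = os.path.join(proc_root, entry, "status")
        try:
            with open(path, errors="replace") as handle:
                text = handle.read()
        except OSError:
            continue  # process exited or is not readable
        swap_kb = parse_vmswap_kb(text)
        if swap_kb > 0:
            usage.append((swap_kb, int(entry), status_field(text, "Name")))
    usage.sort(reverse=True)
    return usage[:limit]


if __name__ == "__main__":
    for swap_kb, pid, name in swap_by_process():
        print(swap_kb, "kB", pid, name)
```

Run it as root so that the status files of all processes are readable. Kernel threads have no VmSwap line and are skipped.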

 

 

Also don't forget to check the hard disk, as it can cause high CPU if the logs can't be written because of a full or faulty hard drive:
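
The "full disk" half of that check is easy to script. The sketch below flags file systems above a usage threshold; the function name, the 90% threshold and the path list are assumptions chosen for illustration, and it does not detect a faulty drive.

```python
import os
import shutil

# Assumed mount points to check; paths that do not exist are skipped.
DEFAULT_PATHS = ["/", "/var", "/var/log", "/shared"]


def full_filesystems(paths=DEFAULT_PATHS, threshold_pct=90.0):
    """Return (path, used_percent) for paths at or above the threshold."""
    flagged = []
    for path in paths:
        if not os.path.exists(path):
            continue
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # pseudo file system with no capacity
        # used / total; df may show a slightly different figure because
        # it accounts for blocks reserved for root.
        used_pct = 100.0 * usage.used / usage.total
        if used_pct >= threshold_pct:
            flagged.append((path, round(used_pct, 1)))
    return flagged


if __name__ == "__main__":
    for path, used_pct in full_filesystems():
        print(path, used_pct)
```

An empty result means none of the listed paths reached the threshold.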

 
