Forum Discussion

davidfisher
Jun 13, 2019

Script to run TCPDUMP when monitor goes down

Hi

We have the script from the article linked below, and I have created this alert in /config/user_alert.conf to trigger it, but it seems that isn't enough and something is amiss.

Any ideas please?

https://devcentral.f5.com/s/articles/run-tcpdump-on-event

alert endb_mon_down "01070638:5: Pool /Common/pool_one member /Common/10.1.62.61:0 monitor status down." {
    exec command="/config/var/tmp/Pool-tshoot-script.sh";
}

9 Replies

  • Seems about right. Just tested with these settings, and it works.

    [root@nielsvs-bigip:Active:Standalone] tmp # cat /config/user_alert.conf
    alert TEST2 "Non-existent pool member for pool /Common/demo.app/demo_adfs_pool_443" {
            exec command="/shared/bin/test.sh";
    }
    [root@nielsvs-bigip:Active:Standalone] tmp #

    And the script. Make sure it is executable (chmod +x <filename>). You might also want to consider putting your scripts somewhere under the /shared/ directory, since data on that partition is still available after an upgrade (a quick example follows the script below).

    [root@nielsvs-bigip:Active:Standalone] tmp # cat /shared/bin/test.sh
    #!/bin/bash
    TCPDUMP="/sbin/tcpdump"
    ${TCPDUMP} -nni 0.0:nnn -s0 -w /var/tmp/test-$$.pcap -c1
    [root@nielsvs-bigip:Active:Standalone] tmp #
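
    For instance, putting the script in place could look like this - a sketch only, since the exact paths are your choice, and the alertd restart is an assumption on my part (on some versions alertd only reads user_alert.conf at startup, so verify on yours):

        mkdir -p /shared/bin
        cp /config/var/tmp/Pool-tshoot-script.sh /shared/bin/test.sh
        chmod +x /shared/bin/test.sh
        # assumption: alertd reads user_alert.conf at startup,
        # so restart it after editing the file
        bigstart restart alertd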

    And to show that it works:

    [root@nielsvs-bigip:Active:Standalone] tmp # ls -ltra /var/tmp/test*
    -rwxr-xr-x. 1 root root  22 Jul 19  2017 /var/tmp/test.sh
    -rw-r--r--. 1 root root 427 Jun 13 15:27 /var/tmp/test-30157.pcap
    -rw-r--r--. 1 root root 436 Jun 13 15:27 /var/tmp/test-31298.pcap
    -rw-r--r--. 1 root root 432 Jun 13 15:27 /var/tmp/test-31293.pcap
    -rw-r--r--. 1 root root 425 Jun 13 15:28 /var/tmp/test-3310.pcap
    -rw-r--r--. 1 root root 425 Jun 13 15:28 /var/tmp/test-3314.pcap
    -rw-r--r--. 1 root root 431 Jun 13 15:29 /var/tmp/test-7948.pcap
    -rw-r--r--. 1 root root 431 Jun 13 15:29 /var/tmp/test-7943.pcap
    -rw-r--r--. 1 root root 426 Jun 13 15:30 /var/tmp/test-12551.pcap
    -rw-r--r--. 1 root root 436 Jun 13 15:30 /var/tmp/test-12555.pcap
    -rw-r--r--. 1 root root 432 Jun 13 15:31 /var/tmp/test-17454.pcap
    -rw-r--r--. 1 root root 432 Jun 13 15:31 /var/tmp/test-17458.pcap
    -rw-r--r--. 1 root root 424 Jun 13 15:32 /var/tmp/test-21889.pcap
    -rw-r--r--. 1 root root 426 Jun 13 15:32 /var/tmp/test-21884.pcap
    -rw-r--r--. 1 root root 426 Jun 13 15:33 /var/tmp/test-26257.pcap
    -rw-r--r--. 1 root root 426 Jun 13 15:33 /var/tmp/test-26253.pcap
    -rw-r--r--. 1 root root 432 Jun 13 15:34 /var/tmp/test-30619.pcap
    -rw-r--r--. 1 root root 432 Jun 13 15:34 /var/tmp/test-30614.pcap
    [root@nielsvs-bigip:Active:Standalone] tmp #
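
    If you don't want to wait for the message to show up in /var/log/ltm on its own, injecting a matching line with logger may also fire the alert - that's an assumption on my side, so verify it on your box:

        logger -p local0.notice "Non-existent pool member for pool /Common/demo.app/demo_adfs_pool_443"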
    • davidfisher

      Is the timer setting in seconds or minutes? What do you think?

      And I am trying to trigger it with this:

       logger -p local0.notice "Pool /Common/pool_one member /Common/10.1.62.61:0 monitor status down."

      Should I see an SNMP TRAP for this test logger command as well?

      Normally I see this when a pool mon fails:

      Jun 13 14:41:24 bigip2 notice mcpd[8183]: 01070638:5: Pool /Common/gateway-failsafe member /Common/10.1.62.61:0 monitor status down. [ /Common/gateway_icmp: down; last error: /Common/gateway_icmp: No successful responses received before deadline. @2019/06/13 14:41:24.  ]  [ was up for 0hr:0min:55sec ]
      Jun 13 14:41:25 bigip2 notice mcpd[8183]: 01070638:5: Pool /Common/auction-php-pool member /Common/10.1.62.61:80 monitor status down. [ /Common/http: down; last error: /Common/http: Unable to connect; No successful responses received before deadline. @2019/06/13 14:41:25.  ]  [ was up for 0hr:0min:56sec ]
      Jun 13 14:41:25 bigip2 notice mcpd[8183]: 01071682:5: SNMP_TRAP: Virtual /Common/auction-http-vs has become unavailable
      Jun 13 14:41:25 bigip2 notice mcpd[8183]: 01071682:5: SNMP_TRAP: Virtual /Common/auction-https has become unavailable

      How did you choose the message "Non-existent pool member for pool /Common/demo.app/demo_adfs_pool_443"?

      I think my message is not matching the alert I configured, but why?

      • Niels_van_Sluis

        I've just selected a message from /var/log/ltm. The error message I used is shown every minute in my log file, so it was an easy message to test with.
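
        To find a string worth matching, you can simply pull candidate lines out of the log, for example:

            grep "monitor status down" /var/log/ltm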

  • After some frustrating experiences, I found that you cannot run tcpdump out of the alertd execution context - SELinux gets in the way and prevents access to the network devices.

    And yes - it does work in some circumstances, but not reliably for all releases/platforms/situations.

    I had to build out a hardware- and version-compatible repro to demonstrate and solve this problem when I first ran into it.

    I solved it like this:

    Have a startup script that creates a named pipe, and waits on the named pipe to run the tcpdump.

    This is running in the root context and has permission to run tcpdump.

    /config/startup/monitor_down_dump.sh

    #!/bin/bash
     
    NP=/var/run/monitor_down_tcpdump.pipe
     
    if [ -e $NP ]; then
        echo "$NP already exists; is this script already running?"
        exit 1
    fi
     
    mkfifo $NP
    read x < $NP
    /bin/rm $NP
    logger -p local0.info "$x"
    # start a tcpdump
    # THIS count VALUE MAY NEED TESTING AND TUNING
    /sbin/tcpdump -nni 0.0:nnn -s0 -c 100 -w /var/tmp/`uname -n`_`date +%F_%H:%M`.pcap
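
    If you would rather have the script re-arm itself after each trigger instead of exiting, a minimal (untested) variation wraps the wait in a loop:

    #!/bin/bash
    # sketch: same logic as above, but loops so the pipe is re-created
    # after every trigger instead of the script exiting
    NP=/var/run/monitor_down_tcpdump.pipe
    while true; do
        [ -e $NP ] || mkfifo $NP
        read x < $NP
        /bin/rm $NP
        logger -p local0.info "$x"
        # count value may need testing and tuning, as noted above
        /sbin/tcpdump -nni 0.0:nnn -s0 -c 100 -w /var/tmp/`uname -n`_`date +%F_%H:%M`.pcap
    done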

    You also need a trigger script, run from your user_alert.conf, that pushes data into the named pipe.

    This runs in the alertd context and does not have permission to run tcpdump, but can push a message down the named pipe.

    /shared/monitor_down_trigger.sh

    #!/bin/bash
     
    NP=/var/run/monitor_down_tcpdump.pipe
    echo "debug_triggered" > $NP
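
    You can also test the capture half by hand by writing into the pipe yourself, which is all the trigger script does anyway:

        echo "manual_test" > /var/run/monitor_down_tcpdump.pipe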

    And your user_alert.conf snippet:

    alert endb_mon_down "01070638:5: Pool /Common/pool_one member /Common/10.1.62.61:0 monitor status down." {
        exec command="/shared/monitor_down_trigger.sh";
    }

    For my implementation, the customer also had a cron task that checked every 10 minutes whether the script was still running, and restarted it if it had triggered or stopped. This may or may not be required.
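
    As a sketch, such a cron entry might look like this (hypothetical; testing for the pipe is just one way to detect that the dump script is no longer waiting):

        */10 * * * * root [ -p /var/run/monitor_down_tcpdump.pipe ] || /config/startup/monitor_down_dump.sh &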

    • davidfisher

      I was trying the script on v12.1.

      Is this workaround required for all versions? Which version are you running?

      • Simon_Blakely

        I developed that solution on 12.1.2, and I expect it to be required for all later versions.

        It's complex, but it is reliable. Just trying to run tcpdump out of user_alert.conf may work (for example, it initially worked on my development 12.1.2 VE), but not in all cases (it didn't work on a physical 12.1.2 vCMP guest in the lab).

        The solution I documented above does provide results.

        However, it isn't instant - but using alertd introduces a delay anyhow.