Forum Discussion

Martin_Sharratt
Feb 24, 2015

Radius load balancing not balancing

We've got a pair of LTMs (running 11.5.0) and I've been trying to set up load balancing for our RADIUS servers, which authenticate our wireless traffic. We're using FreeRADIUS with MSCHAP as the authentication method. I used the f5.radius iApp template to set up the virtual server and, after some initial problems, I now have two servers in the pool and apparently working. They both happily respond to the health monitor authentication requests and to direct requests.

However, no matter what load balancing method I use for the pool, pretty much all the traffic goes to one or other of the servers. When I turn off the server getting all the traffic, the load goes to the other one and stays there when I turn the first one back on.

I've tried changing some of the settings for the UDP profile. I set Idle Timeout to indefinite and turned on Datagram LB. Both servers stopped authenticating, which makes sense as I guess that the challenge-response got broken. However, when I turned Datagram LB off, the 'idle' server got a proportion (about a third) of the traffic. It stayed like that for a couple of days until I had to restart the RADIUS services on that server, and now I'm back to one getting all the traffic and the other getting none.

I'm now at a bit of a loss. I guess there's some persistence issue somewhere but I don't know where to start looking. Has anyone successfully used the iApp template to get RADIUS with MSCHAP load balanced? If so, how?

 

18 Replies

  • Fred_Slater_856
    Historic F5 Account

    Martin- The iApp does not configure persistence, no matter what choices you make. As long as you are not attaching iRules, there should be no persistence taking place. The iApp configures the pool with slow-ramp (300 seconds by default), meaning that whatever changes you make could take several minutes to take effect, but that would not explain a permanent exclusive preference for 1 server when 2 are shown as active by the pool monitor. I will see what I can find out from one of our telecomm experts.

     

  • Hi Martin, Have you checked out the deployment guide associated with the iApp template? There may be some helpful information in there, such as this from the prerequisites (which you may have already done):

     

    The RADIUS server must be configured to accept connections from BIG-IP Self IP address. Consult your RADIUS documentation for specific instructions.

     

    In our example, we are using FreeRADIUS, so we add the BIG-IP address to the clients file, found in /etc/freeradius/clients with the following command syntax:

     

    client 192.0.2.230 {
        secret = testing123
        shortname = bigip0
    }

     

    There is also a section at the end that describes our testing in detail. Joe

     

  • Hi Joe

     

    Thanks for replying.

     

    I started with the guide. There's no problem with either of the servers responding to RADIUS requests. They both happily authenticate the health monitor requests, and one or other of them is carrying out several hundred thousand authentications from our wireless controllers per day. It's just that it's one or the other. My problem is that I don't seem to be able to have both at the same time.

     

  • Fred_Slater_856
    Historic F5 Account

    Martin- I think I have both an explanation and a solution. First, the explanation:

     

    ...excerpt from the post... however when I turned Datagram LB off the 'idle' server got a proportion (about a third) of the traffic. This stayed like that for a couple of days until I had to restart the Radius services on that server and now I'm back to one getting all the traffic and the other getting none.

     

    In connection-based UDP load balancing (meaning no datagram LB and no MBLB), if traffic comes from the same source port, it always hits the same connection entry, so it goes to the same server.
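    The difference between the two modes can be sketched with a toy Python simulation. This is not BIG-IP code: the pool member names, the client address and port, and the round-robin choice are all assumptions made for illustration, and connection entries here never expire (roughly what an indefinite idle timeout would give).

```python
from collections import Counter
from itertools import cycle

POOL = ["radius1", "radius2"]  # hypothetical pool member names


def connection_based(datagrams):
    """Pick a member once per flow (source IP, source port) and reuse it."""
    members = cycle(POOL)
    table = {}  # flow -> pool member, standing in for a connection entry
    hits = Counter()
    for flow in datagrams:
        if flow not in table:
            table[flow] = next(members)
        hits[table[flow]] += 1
    return hits


def per_datagram(datagrams):
    """Pick a member for every datagram, ignoring which flow it belongs to."""
    members = cycle(POOL)
    hits = Counter()
    for _ in datagrams:
        hits[next(members)] += 1
    return hits


# One wireless controller sending 1000 requests from a fixed source port
# (address and port are made up for the example).
traffic = [("10.0.0.5", 1645)] * 1000
conn_hits = connection_based(traffic)
dgram_hits = per_datagram(traffic)
print("connection-based:", dict(conn_hits))
print("per-datagram:    ", dict(dgram_hits))
```

    With only one or two controllers, each using a fixed source port, the connection-based mode has very few flows to distribute, which would be consistent with one server taking nearly all the load.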

     

    ...excerpt from the post... I set Idle Timeout to indefinite and turned on Datagram LB. Both servers stopped authenticating, which makes sense as I guess that the challenge-response got broken

     

    When datagram LB is turned on, BIG-IP picks a new pool member for every message. This may break "challenge-response" RADIUS traffic, because it may separate a pair of requests that should go to the same server. Typical persistence is not enough, as we need to extract the persistence key from the server's response. This situation can be addressed with an iRule. The idea is that we read the "challenge-response" from the server and set it as the persistence key. When an Access-Request comes from the client, we check whether it contains a "State" attribute; if it does, we use it as the persistence key.
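    The idea can be sketched in Python (this is not an iRule, and not how BIG-IP implements it). The packet layout and the State attribute number (24) come from RFC 2865; the class name, the pool member names, the user name and the round-robin fallback are assumptions for illustration only.

```python
import struct

STATE = 24          # RADIUS State attribute type (RFC 2865)
ACCESS_REQUEST = 1  # RADIUS packet codes (RFC 2865)
ACCESS_CHALLENGE = 11


def find_avp(packet: bytes, attr_type: int):
    """Return the value of the first attribute of attr_type, or None."""
    if len(packet) < 20:
        return None
    _code, _ident, length = struct.unpack("!BBH", packet[:4])
    end = min(length, len(packet))
    pos = 20  # skip code, identifier, length and the 16-byte authenticator
    while pos + 2 <= end:
        a_type, a_len = packet[pos], packet[pos + 1]
        if a_len < 2 or pos + a_len > end:
            return None  # malformed attribute
        if a_type == attr_type:
            return packet[pos + 2:pos + a_len]
        pos += a_len
    return None


def build_packet(code: int, ident: int, avps) -> bytes:
    """Build a test packet with a zeroed authenticator (demo helper only)."""
    body = b"".join(bytes([t, len(v) + 2]) + v for t, v in avps)
    return struct.pack("!BBH", code, ident, 20 + len(body)) + bytes(16) + body


class StatePersistence:
    """Hypothetical balancer: round robin unless a known State is present."""

    def __init__(self, pool):
        self.pool = pool
        self.next_index = 0
        self.by_state = {}  # State value -> pool member

    def on_server_response(self, member, packet):
        state = find_avp(packet, STATE)
        if state is not None:
            self.by_state[state] = member

    def on_client_request(self, packet):
        state = find_avp(packet, STATE)
        if state is not None and state in self.by_state:
            return self.by_state[state]
        member = self.pool[self.next_index % len(self.pool)]
        self.next_index += 1
        return member


lb = StatePersistence(["radius1", "radius2"])  # hypothetical member names
state = bytes.fromhex("5d5772375d556b2f478e8a00b3c8166c")
first = lb.on_client_request(build_packet(ACCESS_REQUEST, 162, [(1, b"alice")]))
lb.on_server_response(first, build_packet(ACCESS_CHALLENGE, 162, [(STATE, state)]))
follow_up = lb.on_client_request(
    build_packet(ACCESS_REQUEST, 12, [(1, b"alice"), (STATE, state)])
)
print(first, follow_up)
```

    Without the State lookup, plain round robin would have sent the follow-up request to the other member, which is the breakage described above.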

     

  • Fred_Slater_856
    Historic F5 Account

    There are at least 2 solutions. One is to build an iRule and attach it to your iApp. See https://devcentral.f5.com/wiki/iRules.RADIUS__avp.ashx. The other is to apply the radius profile with persist-avp (tmsh create ltm profile radius radiusLB persist-avp). I believe the latter is more straightforward, but unfortunately it is not implemented in the f5.radius iApp in 11.5.

     

    • Martin_Sharratt
      Thanks very much for this Fred. I'll hopefully be able to give this a try over the next few days. Will post back with results.
    • Fred_Slater_856
      Historic F5 Account
      Thanks Martin. I am especially interested in the result when you create the following attribute-value persistence profile and attach it to your radius virtual. ltm profile radius my_radiusLB { defaults-from radiusLB persist-avp 1 }
    • Fred_Slater_856
      Historic F5 Account
      Martin- I set up a pair of freeradius servers, and am successfully load balancing between them with datagram-load-balancing and no radiusLB profile using a simple radtest -t mschap test. Is there an easy way for me to reproduce the problem you are seeing?
  • Fred_Slater_856
    Historic F5 Account

    The wireless router's requests could each be unique, or they could be successive responses to the challenge. You will need to look deeper inside the packet to find out which. If they are unique requests, then the wireless router is not responding to the challenge. If they are successive requests (with embedded challenge response), then the server is not accepting the challenge response. You say this works when you bypass the F5?

     

    From the RFC:

     

    If all conditions are met and the RADIUS server wishes to issue a challenge to which the user must respond, the RADIUS server sends an "Access-Challenge" response.... The client then re-submits its original Access-Request with a new request ID, with the User-Password Attribute replaced by the response (encrypted), and including the State Attribute from the Access-Challenge, if any... The server can respond to this new Access-Request with either an Access-Accept, an Access-Reject, or another Access-Challenge.

     

  • Ok - done some deeper packet inspection and all seems to be in line with the RFC. One authentication dialogue below (afraid it doesn't format well):

     

    Packet  Type       State (AVP: l=18 t=State(24))     ID
    110     Request    -                                 162
    113     Challenge  5d5772375d556b2f478e8a00b3c8166c  162
    117     Request    5d5772375d556b2f478e8a00b3c8166c  12
    119     Challenge  5d5772375d556b2f478e8a00b3c8166c  12
    126     Request    5d5772375c546b2f478e8a00b3c8166c  52
    128     Challenge  5d5772375c546b2f478e8a00b3c8166c  52
    130     Request    5d5772375f536b2f478e8a00b3c8166c  50
    131     Accept     -                                 50

     

    Just to be clear, there's nothing wrong with the workings of radius either through the F5 or not - people are being authenticated ok - about 4 million individual authentications per day across our two wireless controllers. We want to load balance so we can add more capacity easily. The controllers will only point to one IP address so using the F5 seemed to be a perfect solution. As I say, the authentication bit is working fine, it's just that all the traffic goes to one or other of the radius servers.

     

    Clearly the State AVP is being used, and I know that if I turn on Datagram LB the requests seem to get distributed evenly, so I may try the RADIUS profile with persist-avp. I wanted to test this before applying it to a live service, but it seems that radtest does not work in the same way as the wireless controllers, so I may have to just go for it (at a suitably 'quiet' time).

     

    • Fred_Slater_856
      Historic F5 Account
      Does it seem strange to you that the controller issues 3 challenges and gets 3 responses before accepting? The responses all look identical. Why doesn't it accept the first one? Regardless, I think you are on the right track. An internal resource here suggested that the RadiusLB profile may work without the persist-avp parameter, but it will be easy enough for you to try both with and without. Too bad radtest is not a good proxy for your wireless controllers. Standing by.
  • Late breaking - just discovered eapol_test. Now if I can get it to work and script it I may still be able to replicate the problem ....

     

    Will post more - hopefully - tomorrow

     

    • Martin_Sharratt
      Latest: Unable to replicate the problem with eapol_test on the test system. It load balanced perfectly even under quite heavy load. When I tried datagram LB and persistence on AVP there, it generally didn't load balance or broke altogether. So, yesterday during a maintenance period, I activated datagram LB and the AVP persistence profile (using attribute 1) on the live service, and so far all seems good. The load is roughly equally balanced between the two RADIUS servers (< 5% difference on bits) and has been for the last 24 hours. I agree with you, however, that we may still have a problem, probably with our wireless controllers, as I don't think they should be sending varying numbers of requests. I suspect it may be load (we have considerably increased our number of access points over the past 6 months) or the number of separate devices in the chain (user device->wap->wireless controller->F5->radius server->AD controller). So, I think we need to look at that in more detail. But it looks like the initial problem is resolved. Thanks very much for your advice and support.