Minutes of April 10, 2008 ITAC-NI Meeting:
CALL TO ORDER:
This meeting was scheduled in CSE E507 at 1:00 pm on Thursday, April 10th and was made available via videoconference with live-streaming and recording for future playback. Prior announcement was made via the Net-Managers-L list (late afternoon of the day prior). The meeting was called to order by ITAC-NI chairman, Dan Miller, Network Coordinator of CNS Network Services.
ATTENDEES: Twenty people attended this meeting locally. There was one attendee via Polycom videoconference, but there are no records of how many may have listened in to the stream through the web interface.
Twelve members (or their proxy) were present: Charles Benjamin, Clint Collins, Dan Cromer, Tim Fitzpatrick, Craig Gorme, Stephen Kostewicz, Shawn Lander, Steve Lasley, Tom Livoti, Allan West (proxy for CLAS), Dan Miller, and Handsford (Ty) Tyler.
Viewing the recording
You may view the recording via the web at http://126.96.36.199:7734. Currently, you will need to click on the "Top-level folder" link, then the "watch" link next to the "ITAC-NI Meeting 4/10/08" item. This will likely be moved into the ITAC-NI folder shortly. Cross-platform access may not be available; on the Windows platform you will have to install the Codian codec.
An archive of audio from the meeting is available.
1) Approve prior minutes
No corrections or additions were offered and the minutes were approved without further comment.
2) Review IP allocation and reclamation plan
Dan Miller introduced Marcus Morgan who was on hand to discuss this topic.
2-1) IPv6 may not come to our rescue for quite some time
Marcus mentioned that this matter had been discussed previously at the ITAC-NI meeting last October. At that time Marcus had believed that we were in pretty good shape and would be able to reclaim numbers gradually over time until we were rescued by IPv6 in 2-3 years. It now appears that no rescue is coming; rather he thinks this may turn into a more long-term problem. Marcus believes it is now time to formulate policies and procedures by which we can reclaim space.
2-2) Movement to private IP
What has been done the last several years involves movement into private IP space. Over the last few years our private IP space has grown by about 3,000-4,000 a year. Supporting private IP requires using a facility called NAT (Network Address Translation) whereby a public IP is associated temporarily with a private number so that the host using the private number can communicate with the outside world. That facility works well and is a very efficient way to use public IP numbers. Currently in the 128.227 /16 address space we have about 26,000 active hosts and about 17,000 of those are in NAT space. Thus our NAT pool is the largest single group of public IP addresses within that subnet.
There are currently about 60,000 other hosts using private IP. On any given day, roughly 30,000 of those hosts utilize the NAT pool. That means that the move to private IP is certainly improving the efficiency with which we utilize our public IP allocation, as each public NAT address is supporting roughly 2 private addresses.
2-3) We can't get any new IPv4 space
Having 26,000 hosts in 128.227 /16 means our space is about 40% utilized overall because that subnet has a total of about 65,000 addresses (65,536) available. Removing the NAT and current free space and looking just at what is allocated currently to departmental use, utilization across the remaining 153 subnets averages only 23%. The problem is that we are never going to get any more public IP space in IPv4. The regional registries such as ARIN will run out of public IP space in 2-3 years; this means the global pool will be exhausted. Also, we don't meet the ARIN guidelines because we are only 40% utilized overall and they require something like 80% or greater in order to be considered. This means we have to live with what we have currently and we have to try to formulate policy and procedures for doing that.
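As a rough check of the utilization figure above, the following Python sketch recomputes it with the standard ipaddress module. The host count is the approximate number quoted in these minutes, the 80% threshold is the "something like 80%" mentioned above rather than an exact ARIN rule, and the variable names are illustrative only.

```python
import ipaddress

# Campus public space discussed in the minutes.
campus = ipaddress.ip_network("128.227.0.0/16")

ACTIVE_HOSTS = 26_000      # approximate figure quoted in the minutes
ASSUMED_THRESHOLD = 0.80   # "something like 80%"; an assumption, not an exact ARIN rule

total = campus.num_addresses          # every address in the /16
utilization = ACTIVE_HOSTS / total    # fraction of the space in active use

print(f"total addresses: {total}")
print(f"utilization: {utilization:.1%}")
print(f"meets assumed threshold: {utilization >= ASSUMED_THRESHOLD}")
```

With these inputs the utilization comes out just under 40%, consistent with the figure Marcus gave.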
2-4) Fortunately we still have free space with more on the way
The good news is that we currently have about a two year supply of free space (about 9,000 addresses) assuming that things continue as they have over the last 2-3 years. We also expect about 2,000 addresses to be returned from the Health Science Center, hopefully, this year. We also have an agreement with HSC that they will supply their own NAT needs with numbers from their 159.178 /16 subnet; thus we will not have to subsidize their use of our NAT space.
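The "two year supply" estimate above implies a consumption rate, which the short Python sketch below makes explicit. The rate is inferred from the two figures in the minutes (it is not a measured value), consumption is assumed to stay constant, and the function name is hypothetical.

```python
FREE_ADDRESSES = 9_000    # current free space, from the minutes
SUPPLY_YEARS = 2.0        # "about a two year supply"
EXPECTED_RETURN = 2_000   # addresses expected back from HSC

# Yearly consumption implied by the two figures above (an inference, not a measurement).
implied_rate = FREE_ADDRESSES / SUPPLY_YEARS


def runway_years(free, rate):
    """Years until free space runs out at a constant yearly consumption rate."""
    return free / rate


with_return = runway_years(FREE_ADDRESSES + EXPECTED_RETURN, implied_rate)
print(f"implied rate: {implied_rate:.0f} addresses/year")
print(f"runway including HSC return: {with_return:.1f} years")
```

Under these assumptions the HSC return would extend the supply by a little under half a year.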
2-5) More difficult recoveries are ahead
We have already located all the readily available free space, having retrieved space from IFAS and Housing among others. Several subnets were liberated internally by CNS recently by converting dial-ups to private IP. There is no remaining readily available free space. We are now faced with renumbering various units in order to reclaim further space. This is why we need to consider policy for reclamation and for allocation.
2-6) Is PAT an option?
Charles Benjamin had a couple of questions regarding NAT, mentioning that he believed Ryan Vaughn was the person who handled that facility. Charles asked what the current idle timer was set to and Marcus replied that this was at 10 minutes. It is set that low for efficiency; any shorter than that and you start interrupting things. Charles also asked if they had considered PAT (Port Address Translation). Marcus replied that they had looked at PAT but that it had associated problems. Marcus couldn't speak to the exact details, but it was his understanding that it breaks enough things that we would have to change our model of providing the full spectrum of services through NAT.
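To illustrate how an idle timer lets a small public pool serve many private hosts, here is a minimal Python sketch of one-to-one dynamic NAT with a 10-minute timeout. It is a toy model, not a description of the CNS equipment: the NatPool class, its methods, and the 192.0.2.x documentation addresses are all assumptions made for the example.

```python
IDLE_TIMEOUT = 10 * 60  # seconds; the 10-minute idle timer quoted in the minutes


class NatPool:
    """Toy one-to-one dynamic NAT: each active private host borrows one public address."""

    def __init__(self, public_addresses, idle_timeout=IDLE_TIMEOUT):
        self.free = list(public_addresses)
        self.idle_timeout = idle_timeout
        self.bindings = {}  # private address -> (public address, last-seen time)

    def expire(self, now):
        """Release public addresses whose bindings have been idle too long."""
        for private, (public, last_seen) in list(self.bindings.items()):
            if now - last_seen >= self.idle_timeout:
                del self.bindings[private]
                self.free.append(public)

    def translate(self, private, now):
        """Map a private address to a public one and refresh its idle timer."""
        self.expire(now)
        if private in self.bindings:
            public, _ = self.bindings[private]
        elif self.free:
            public = self.free.pop(0)
        else:
            raise RuntimeError("NAT pool exhausted")
        self.bindings[private] = (public, now)
        return public


# Hypothetical usage: one public address shared by two private hosts over time.
pool = NatPool(["192.0.2.1"])
first = pool.translate("10.0.0.1", now=0)     # first host takes the only address
second = pool.translate("10.0.0.2", now=600)  # first binding has idled out, address is reused
```

A shorter timeout frees addresses sooner but risks dropping bindings that are still in use, which matches the trade-off Marcus described. PAT would instead let many hosts share one address at the same time by distinguishing them by port.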
2-7) How are new allocations being handled?
Charles asked what the plan will be regarding new allocations. Marcus responded that they have been trying to use a small range of public IP addresses for departmental services which require public numbers and back that up with a generous sized private space for use by the majority of a department's hosts. This model has worked quite well for the last 2-3 years and quite a bit of our space has been done this way already.
2-8) Will increasing use of VoIP make things worse?
Dan Cromer asked about VoIP phones and how their use related to this topic. He had assumed that those utilized NAT when calling somewhere outside campus. Charles stated that in Housing they use a private IP address and then the single line is doing a form of trunking (802.1x); it carries both the IP address from your workstation and also from your phone. Whenever you go outside you go through a gateway, which avoids the issue. Dan mentioned that he was hoping that we might someday have some agreement with an outside company like Vonage in order to allow VoIP to be utilized outside without the concomitant long-distance charges. John Madey said that they have plans for trunking but that is currently a couple of years down the road.
The reason Dan raised the issue is that he expects doing that would create additional load on our NAT space. The response was that SIP isn't NATed in the traditional sense and would not cause the proposed ill-effect. Dan Miller didn't think we have to worry about this for two reasons. One is that phone utilization will remain a fairly small fraction of our overall port usage. The other is that they had looked at toll bypass and had backed away from it because of Communications Assistance for Law Enforcement Act (CALEA) concerns. Because of those, toll bypass probably isn't going to be big in the near future. There are other ways around that, but as far as the individual phones acting as hosts and getting a NAT translation in order to get out on the Internet, that is probably not going to happen any time soon. Thus, VoIP should not put additional pressure on our public IP space.
2-9) Prognosis for IPv6
Dan Cromer asked about the prognosis for IPv6. It seemed to him that there would be increasing pressure internationally to move in that direction due to the shortage of IPv4 space. Marcus does not believe that IPv6 will be swift in its arrival; he thinks the timeframe for that will be something like five years or longer. In the ARIN public policy list where they discuss these things, they are still waffling about what size allocations they should make and they don't really have a good handle on routing aggregation. The other aspect of this is that if there is not an audience for IPv6 in the outside world then there is little motivation (other than these space issues) to go there; that is why many are holding back. Basically, there is no place to go.
Marcus also mentioned that it appears a commodity trading market will be defined for IPv4 addresses; this might further serve to delay movement to IPv6.
2-10) Are we still handing out /24 subnets?
Charles asked if past practice has involved handing out /24 subnets for various departments. Marcus said that this hasn't occurred for the last several years. The size of the allocation historically really relates to the routing protocol capabilities we had at the time. In early times we allocated /23 subnets because we had a subnet mask of that size due to the lack of the necessary equipment to route otherwise. Then we moved that back to a /24 and stayed there for a long time until we had better routing capabilities. Now we run OSPF and border gateway protocols and we can have essentially any small-sized subnet that we wish. Currently we typically allocate a /28 for public and a /22-/24 for private depending on need. There will be many exceptions to those rules, but allocating a /24 for public space is very rare now.
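For reference, the prefix lengths mentioned above translate into address counts as follows. This Python sketch uses the standard ipaddress module; the 10.0.0.0 base network is only a placeholder, and the "usable" count assumes the conventional reservation of the network and broadcast addresses.

```python
import ipaddress

# Prefix lengths mentioned in the minutes; the 10.0.0.0 base is only a placeholder.
sizes = {}
for prefix in (22, 23, 24, 28):
    net = ipaddress.ip_network(f"10.0.0.0/{prefix}")
    sizes[prefix] = net.num_addresses
    usable = net.num_addresses - 2  # minus the network and broadcast addresses
    print(f"/{prefix}: {net.num_addresses} addresses, {usable} usable hosts")
```

A /28 holds 16 addresses while a /24 holds 256, so a public /28 backed by a private /22-/24 uses one sixteenth of the public space that a public /24 would.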
2-11) How will reclamation efforts relate to Wall-Plate roll out?
Dan Miller asked Marcus to discuss this project in relation to Wall-Plate roll outs. Marcus said that Wall-Plate roll out is a good time to consider this because you are already going to be looking at usage during that and in many cases rebuilding the network. There is some concern there with adding yet more change to the process. If a unit doesn't want to change their entire network and restructure their servers all at the same time, then Marcus understands that and can revisit at a later time. We will at least be asking the questions and prompting the thought at roll out time, however.
2-12) Action plan
Tim Fitzpatrick asked if Marcus had thought through the process by which he is going to talk to each department and evaluate their use of space according to some standard as well as the methods for requesting the return of space for those who may be way out-of-tune with those standards. Marcus said that he had and that we have some units for which this would be easy and some where it would be more difficult. Marcus believes we need to identify and notify the more difficult cases early on so that they would have more time to make the adjustments. In some cases there are subnets that have very little on them and a little bit of renumbering would free up considerable space; those are instances where units would not have to move to another subnet, but rather simply downsize their existing subnet.
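The downsizing case can be sketched in Python as below: given a host count, find the smallest subnet that still fits and see how much of the original allocation is freed. The function name and the 20-host example are hypothetical, and real allocations would also leave headroom for growth.

```python
def smallest_prefix(host_count):
    """Longest IPv4 prefix whose subnet holds host_count hosts
    plus the network and broadcast addresses."""
    needed = host_count + 2
    host_bits = max(2, (needed - 1).bit_length())  # smallest power of two >= needed
    return 32 - host_bits


def addresses_in(prefix):
    """Total number of addresses in a subnet with the given prefix length."""
    return 2 ** (32 - prefix)


# Hypothetical example: a unit holding a /24 with only 20 hosts on it.
current_prefix = 24
new_prefix = smallest_prefix(20)
freed = addresses_in(current_prefix) - addresses_in(new_prefix)
print(f"/{current_prefix} -> /{new_prefix}, freeing {freed} addresses")
```

In this example the unit would keep a /27 inside its existing range, which is the "downsize rather than move" situation described above.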
Tim said that if Marcus is going to be reaching out to many and asking them to help within the smaller spaces then it would help to have the guidelines written in advance. Tim suggested that the ITAC-NI committee would be a good sounding board for those.
Marcus will put together a guidelines document based on the existing practices which he believes is a good starting place. He will then bring that back to this group for further input.
2-13) Committee support for the renumbering plan
Dan Cromer commented that he feels this committee should make a statement of support in favor of the renumber scheme for recovering unutilized public IP space. Dan believes that the IP addresses at UF are UF property; consequently, there should be willingness on the part of units to work with CNS on numbering changes in order to negotiate the return of IP space.
Dan Miller appreciated the voice of support and said that he believed they do generally get cooperation on that. Marcus agreed and believed that getting the word out on what was needed and why would continue to help with that process.
3) Review Post Outage Analysis from 3-20-08
Dan Miller distributed a two-page handout containing a post-outage analysis of the network outage which occurred for a portion of the UF network on March 20th, 2008 between 9:15am and 11:11am.
Dan Miller related that our network had a fairly large outage two weeks ago which caused some wide-ranging issues but which did not bring down the core network. A switch failed in Stadium 411. CNS is still attempting to determine whether the cause was hardware or software. There are plans underway to replace that specific switch.
3-3) Looping caused serious traffic storm
When the switch malfunctioned it looped up the entire stadium network. Thanks to the benefit of modern network hardware those packets were able to storm as far as the routing and bridging protocols would let them go. This was a significant event involving a flood of about 400Mbps and 400,000 packets per second. The traffic itself was originally legitimate the first go-round within the stadium network and then, because of the loop, got blasted out everywhere over and over again.
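The two flood figures above together imply an average packet size, which this short Python sketch works out. It assumes both numbers describe the same traffic and that 1 Mbps means 10^6 bits per second; the variable names are illustrative.

```python
BITS_PER_SECOND = 400e6     # about 400 Mbps, from the minutes
PACKETS_PER_SECOND = 400e3  # about 400,000 packets per second, from the minutes

bits_per_packet = BITS_PER_SECOND / PACKETS_PER_SECOND
bytes_per_packet = bits_per_packet / 8

print(f"average packet size: {bytes_per_packet:.0f} bytes")
```

That works out to about 125 bytes per packet, so the storm consisted mostly of small packets, where the per-packet load on switches and routers matters more than the raw bandwidth.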
3-4) Problem isolated after approx. one hour
The degree of disruption depended on location within the network. The stadium was down for the duration, but most locations had their problems resolved by 10:22 when it was found that the stadium was the source and that link was shut down. That then localized the problem to the stadium itself.
3-5) Affected services
Most service-affecting issues were centered on the stadium network and Operations Analysis. There were a small number of VoIP phones in a few areas that were affected and SSRB wireless was affected as well. Some other networks fed off the SSRB core were affected by the volume of the traffic.
3-6) Remedial changes
CNS has already made some remedial changes to the management VLAN, which was where Network Services saw the greatest impact. It's a bad thing to have a storm of packets anywhere, but it is a really bad thing to have a storm of packets on the network where your switches and routers are actually trying to listen to traffic, because a significant portion of that traffic was broadcast as well.
3-7) Problem detection
Charles asked about how CNS first detected the problem. The problem began around 9:15am and was first noticed by Network Systems due to disruptions in the SSRB local workstation network. The monitoring systems detected that there was a problem there which created a problem ticket through operations that was dispatched to Network Services. At about that same time, Dan Miller imagines that people around CNS were noticing they couldn't get anywhere on the network.
3-8) Monitoring methods
Charles asked if CNS monitors traffic at various points throughout the backbone using MRTG or something similar. Dan Miller responded that, yes, they use similar tools and have a project underway to provide a dedicated partly out-of-band monitoring solution focused on the core network. Whenever there is a problem of this scale, the first thing they ask is how big is the problem and they start looking at the core to determine that. In these instances it is good to have that secondary view, but we don't have that currently.
3-9) Hardware specifics
Tim Fitzpatrick asked Dan Miller to confirm that the problem started with a Wall-Plate switch, which Dan did. Tim asked if we now knew what was wrong with that switch and Dan responded that we did not but were still investigating. At this point they are going to swap the switch (3550 PWR) out. CNS is not installing these switches any more in new installs, but does have a couple of these in the lab for use as emergency spares; consequently, they are going to replace the switch which caused the problem with another of the same model. Dan Miller mentioned that if they can isolate the problem to something specific with that model of switch then they will, obviously, have a new action item. So far, however, they have been unable to do that; it appears to be an isolated failure.
Craig asked if they had heard anything back from Cisco; he assumed CNS sent them the logs on this? Dan Miller responded that it was difficult to get the logs for various reasons and Dan was not sure if they had a Cisco case open on this matter. When CNS first arrived on the scene, the lights on the switch were "Christmas treed" and it was completely non-responsive; so there was no way to get local logs off the switch. Again, they do not know what caused the failure but it appeared to be a problem internal to the switch itself which caused a hard loop on all the interfaces of the switch. The ones which caused the biggest problem were the VLANs which led outside the building, including the management VLAN.
4) Wall-Plate status update
4-1) Current port status
Todd Hester jokingly introduced the topic by saying that as of April 1st we had 2,000,234 data ports. The actual numbers were 16,930 ports and 2,877 VoIP phones.
4-2) Port security enabling
Todd mentioned the March 31st notice which was sent out regarding the port security deployment date. There have been some responses already, including two groups asking for assistance to identify their problems and one group asking for an exception.
Brian Bartholomew suggested that CNS turn on their monitoring tools for a brief period of time on some specified day and not disable anything but simply send e-mails so people could see what things were triggering the port security. Dan Miller responded that this was an interesting idea, but they had discussed this internally and could not easily decouple the action from the notification.
Short of that, CNS has some other tools which they are using for monitoring and building up lists. Very shortly they will be sending out notifications to current Wall-Plate customers regarding where CNS believes their problems are. They will give units a "heads-up" on what they should be working on resolving. They also plan to send out a more detailed list of what events will cause a port security trigger so people will have a better understanding of that. There is a false positive caused by a particular brand of NIC; they will send out instructions on how to eliminate that problem.
4-3) Self-service network status web application planned
Dan Miller mentioned that by October 1st CNS plans to have a self-service network application deployed. This will provide an aggregation of all the statistics that they currently have available, on a public web page that will be convenient to the local administrators. Marcus responded that local admins would get access via Gatorlink credentials and then be able to see the status of the various resources on their network and be able to select which pieces to look at.
Dan Miller was not sure how granular the reports could be, at least by July 1st, citing difficulties in tying individuals to building locations. Marcus believed this could be done by individual VLANs and should be sufficiently granular for most purposes. It won't be perfect, however; if there are two groups that share a switch, both will be able to see that switch.
4-4) Capacity and growth predictions on target
Tim Fitzpatrick said that their capacity and growth prediction estimates had been placed at roughly 7,000 ports installed per year and 2,000 VoIP phones installed per year. Tim asked Todd how those figures matched what has been happening. Todd responded that we have had a growth of 4,900 ports since July 1st with about two more months to go. There were roughly 1,000 phones deployed during that time as well.
4-5) Procedures and schedule being tuned
Todd mentioned that they have been looking at the internal procedures they are using for roll out and are changing some of the processes, mostly focusing on giving customers a clearer estimate of their up-front costs and any continuing costs if there are any. They are also working to provide a clearer statement of exactly what work they are doing so that no misunderstandings arise over what is going to be converted and what is not. Also, we now have 10 months of data and Todd is now able to revise and refine the model which he had originally used to make a three-year schedule; consequently, you will likely see a revised schedule posted within the next couple weeks.
4-6) Central funding concerns
Craig asked if, with the situation at the Provost's office being "fluid", CNS had any idea what to expect regarding the funding of the Wall-Plate project. Tim responded that CNS gets money from very many different sources and he is a little bit concerned about the stability of all of those sources. Less than one-third of the total funding for CNS comes from the Provost. Tim views the Wall-Plate funding as program funding and their approach is to manage a five-year continuing cycle whereby the network is updated and maintained on into the future. This requires a recurring funding stream.
4-7) Budget cuts should not affect the Wall-Plate program
Tim has been asked to determine 4%, 6% and 8% budget cuts and fully expects to get an 8% budget cut. If CNS gets that it will come from the Provost's recurring funding, but in Tim's budget-cut model this did not touch the Wall-Plate program. From CNS's perspective at this point, Tim expects the program to continue as planned.
4-8) Tough times may cause some units to opt-out
CNS has observed, however, that for every five dollars that CNS spends an end-user department may have to match that with one or two dollars doing wiring cleanups or other sorts of things. Those end-user departments are becoming more and more cautious with their spending. Although this is all being done on an opt-in basis, originally almost everyone during the first year signed on the line for both Wall-Plate and VoIP. Tim thinks that during the second year VoIP will become a tougher sell and he believes there will be some who will not be able to get in the game at all and will choose to wait it out for a while because of the troubled times. Tim believes they will still be busy moving ahead, but it may not be all the same buildings involved which they had anticipated originally. Maybe things will be better by year three.
4-9) Seeking cost estimates prior to fiscal year-end
Dennis Brown said that he has been having some trouble getting information on costs for his department (Horticultural Sciences) joining the Wall-Plate. Dennis said that several e-mails had been sent to Sheard Goodwin about the matter without response. They are trying to get ready to spend end-of-year money. Todd asked Dennis to call him, or Todd would call Dennis this afternoon, and they would get that sorted out.
Summer meeting schedule may be revised
When Dan Miller asked if anyone had any topics to offer for future meetings, Tim asked if any thought had been given to the schedule of our committee over the summer. Dan responded that we might have a month or two off here or there as the current queue of agenda items is not large.
The next regular meeting is tentatively scheduled for Thursday, August 14th.
last edited 10 July 2008 by Steve Lasley