May 24th, 2011 by admin
Posted 5/24/11 By Dan Baldwin, ATEL Sales Director, 800-500-ATEL
As a telecom agent, it's my job to help my business customers choose the right voice and data telecom solutions for business. They already know that there are a half dozen or so main phone companies to choose from. My job is to help them pick the right one, at the right price, to match their needed telecom applications and their "business phone service outage" risk tolerance.
"Business phone service outage", you ask? Yes, believe it or not. In an age where your business phone service can ring simultaneously on three different devices and migrate smoothly from an instant message to a voice call to a video phone conference, your ability to make or receive a business phone call can simply stop entirely in the middle of a business day for your whole company. It has happened recently at many larger business phone companies, including TelePacific Communications and FreedomVoice.
While my business clients and prospects want to know which providers are best for all the new high-demand business telecom "whistles and bells", they also want to believe that I'm doing everything needed to PREVENT or MINIMIZE their business phone service outages by A.) only recommending those carriers that are least likely to go down, and B.) putting some sort of "voice fail over plan" in place that kicks in when the primary voice communications carrier does go down.
To prevent or minimize outages and create believable fail over plans, though, telecom agents need to have already done plenty of homework so that their recommendations match each client's desire for price-appropriate features without undue risk. To make all this happen, I use the following five-point checklist to help all my business clients prevent or minimize their next business phone service outage.
The Business Phone News Checklist to Prevent or Minimize Your Next Business Phone Service Outage
1. Understand What Caused Recent Business Phone Outages
If you don't know what caused recent outages, preventing the next one is impossible. When I was assigned to a ship in the Navy, "safety rule number one" stated that when something bad happens on a ship (someone gets injured or equipment is damaged), the "bad thing" is never categorized as an "accident" but some sort of "preventable mishap".
TelePacific's own business continuity information sheet states that no outage is 100% preventable, but knowing where the weak links are is a key to prevention. The image below shows where mishaps can occur, identified as the areas under the five red letters.
The A area is everything that happens to the incoming or outgoing phone call before it hits or after it leaves the TelePacific controlled network. The B area is represented by the TelePacific "network cloud" - this is the magical area where TelePacific has all its switching equipment in hardened bunkers impervious to nuclear bomb blasts. The C area is the dangerous local area around your business where the "circuits" are that carry your phone calls between TelePacific's network and your phone system. The D area is your physical phone system, and the E area is the network connections between your phone system and your users' phones.
When you read TelePacific President and CEO Dick Jalkut's letter to the left explaining the reason for TelePacific's recent outage involving their SmartVoice SIP voice service, Mr. Jalkut suggests that the outage was caused by a failure in area B.
Unfortunately, the area and nature of the outage prevented the normal automatic business continuity plans from working. In the "redundancy" diagram above, whenever the PRI or SuperTrunk circuit (area C) goes down, the call automatically reroutes to preordered LEC backup POTS lines.
In last week's outage, though, the mishap occurred not in the circuit but in the "brain" that would have known to reroute the call over the POTS lines if the circuit was deemed to have gone down.
(To TelePacific's credit, they were communicating with customers during the March 24-25 outage via many channels, including Twitter, to manually reroute their customers' main numbers to cell phones or other non-affected phone lines so customers could receive their incoming business calls.)
Mr. Jalkut, in his outage letter, makes many statements about the cause of the outage and what TelePacific has done to eliminate or minimize future outages. We'll review the specifics of the statements a bit later.
To the right you'll see a similar "what happened" letter from Eric Thomas, President and CEO of FreedomVoice about two outages that occurred in one week last December with their FreedomIQ hosted VoIP service.
In Mr. Thomas' letter he basically suggests that a design limitation and sales success led to their system crashes. In the TelePacific letter, Mr. Jalkut states that the outage was a result of an external cyber attack.
Whatever the reason, the first step to preventing and minimizing future business phone system and service outages, especially the ones involving SIP trunk services, is to obtain specific documentation from the phone companies that have suffered recent service outages. That way the suspected causes and the proposed fixes can be independently analyzed and debated.
(In addition to the two documented outages above, I've heard very reliable reports that at least two other prominent business phone service providers in California have suffered recent outages. One cited a massive DoS attack that lasted 2 hours, and I've not yet heard the reason for the other. We're trying to get the carriers to "come clean" as TelePacific and FreedomVoice have done for the "greater good".)
2. Understand What Can Prevent Recently Reported Outages
This may seem like it goes without saying - but not really. SIP and hosted VoIP are relatively new technologies and the challenges that SIP and hosted VoIP providers endure to keep their systems up and problem free are ever evolving.
(In contrast, the old TDM or "time division multiplexing" voice transmission technologies that phone service providers are abandoning for SIP/VoIP have been around for over 100 years and were a bit more "hardened" and resistant to attack. But longing for a return to TDM will not solve outage problems, as the low cost and rich features of SIP/VoIP have caused an irreversible migration to SIP/VoIP by ALL the phone carriers a business customer has to choose from.)
With the outage documentation obtained in "step one" above, telecom agents and their SIP/VoIP customers need to get together with the sales engineers of their current or prospective carriers and really "peel apart the onion" to determine if one carrier is truly better prepared to avoid or minimize future known causes of phone service outages.
In looking again at the Eric Thomas letter, Mr. Thomas states that their outages were caused because "our current core switching system has an architecture wherein some ancillary services are interdependent ... (and) a conflict between these dependent services is the root cause of the issues experienced this week." Mr. Thomas states that they've been upgrading equipment since 2009 to prevent mishaps and that "in 2011 we will also be bringing on a 2nd data center site, and ultimately an additional presence on the East Coast."
The "2nd data center" in 2011 quote above seems to suggest that the earlier lack of a second data center was at least partially to blame for the outage. Well that's good information because now a telecom agent or end-user customer can ask prospective SIP/VoIP providers how many data centers they have and if the multiple centers would prevent an outage like the one Mr. Thomas said they had.
Now when I've asked Mr. Thomas' competitors for their thoughts on his documented outages, they quickly say, "It's because he's not using a BroadSoft 'carrier grade' switch at their core which prevents growth and expansion past certain known limits..." OK, that sounds good, as an agent I'll now know to ask, "Can you expand to carrier class capacity as BroadSoft says only their customers can?" (In Mr. Thomas' defense, every non-BroadSoft SIP/VoIP carrier I've ever talked to says they've engineered a "secret way" to get past these rumored limits.)
In looking at the Dick Jalkut letter, the reason for the outage is cited as a "cyber terrorist attack" that was "external" in origin. I interviewed one person with an opinion on the TelePacific outage, Michael Collins, who saw TelePacific's "cyber terrorism label" as a somewhat melodramatic and sensationalistic characterization of what he thought might have been a run of the mill DoS attack by someone simply trying to hack into their SIP equipment to make free phone calls.
We spoke for a while about a set of SIP testing tools with the somewhat less-than-innocuous name of "SIPvicious". Apparently the code was originally created to probe and test SIP/VoIP equipment for security breaches. We also spoke about how the originator of the free SIPvicious code ended up creating an "antidote" for SIPvicious to be used by SIP/VoIP administrators who found themselves the target of an "unauthorized" SIPvicious attack.
We also spoke for a bit about how TelePacific's outage might have been the result of someone having intimate knowledge of TelePacific vulnerabilities. Mr. Collins opined that "a two-day outage from a run of the mill DoS attack would be a surprise" since most carriers spend quite a bit of time protecting themselves from known threats.
In any event, the "takeaways" from step two here are to create a list of outage sources and then match them up to imagined solutions for every outage recently documented.
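One way to keep that list honest is a simple worksheet. The sketch below is only an illustration: the field names and the `open_questions` helper are my own invention, and the entries are limited to what the two letters quoted in this post actually say. Anything the post doesn't spell out is left empty so it shows up as a question for the carrier.

```python
# Hypothetical worksheet: each documented outage cause paired with the fix
# the carrier has described. The structure and names are illustrative only.
outage_worksheet = [
    {
        "carrier": "FreedomVoice",
        "cause": "conflict between interdependent core switch services",
        # From Mr. Thomas' letter as quoted above.
        "mitigation": "equipment upgrades since 2009; second data center planned for 2011",
    },
    {
        "carrier": "TelePacific",
        "cause": "external attack on the SIP voice network",
        # The specifics of TelePacific's fixes are not detailed in this post.
        "mitigation": None,
    },
]


def open_questions(worksheet):
    """Return the causes that still have no documented mitigation."""
    return [row["cause"] for row in worksheet if not row["mitigation"]]


for cause in open_questions(outage_worksheet):
    print("Ask the sales engineer about:", cause)
```

Every row that prints is a question to put to the carrier's sales engineer in step three.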
3. Challenge Current and Prospective SIP/VoIP Carriers to Document Their Experiences and Defenses Against Outages
Then take this list and ask existing or prospective SIP/VoIP providers to give their "this is what we've done about those threats" statement to the agent and the customer in writing. An even better step would be to add an addendum to the agreement that lets the customer out of the carrier contract if they become a victim of an outage the carrier states they are protected against.
For big deals, carriers may actually do this for prospective customers and agents. But even for small deals where they won't provide documentation, it's a good practice for customers and their agents to ask the sales engineers of the prospective SIP/VoIP carriers the hard questions like,
A. When's the last time you probed your network with a tool like SIPvicious and what was the result?, or
B. How many DoS attacks have you endured over the past 24 months and what was the effect on your network?, and
C. How many network outages have you suffered over the past 24 months and can I please see the outage documentation you provided your customers about those outages?
If your prospective SIP/VoIP provider won't come clean about their outages in the same public manner that TelePacific and FreedomVoice have, then my professional opinion is that you should look for a carrier who will. Everyone has outages. It's best to be a business phone service customer of a carrier that is open and honest about their outages with the people enduring them - their customers.
4. Engineer and Have in Place Your Own Fail Over Plans that are Independent of Your SIP/VoIP Carrier
I have a dirty little secret.
In the late 1990s, after the CLEC deregulation created a whole bunch of "wanna-be phone companies" that insisted on selling below their costs to buy up market share, I'd discuss with my customers the advantages of switching their phone service to these companies to save them money - as long as we had a good escape plan for when these low-ball phone companies drove themselves out of business.
It worked pretty well.
Today, telecom agents and their business phone customers need to have the same conversations with one another. Not because there are a lot of phone companies to take advantage of anymore, but because you just never really know when something new and unexpected will take down a customer's phone service due to a problem at the carrier.
TelePacific's network redundancy diagram above shows the most often used fail over plan: one or more local phone company analogue POTS lines are in place, and the carrier automatically reroutes calls to them when its main switch senses that the "transport circuit" between the carrier's switch and the customer's phone system has gone down. This most often happens in California whenever it rains. This fail over is a good thing to have because if the physical transport circuit wire needs to be replaced there can be a delay of several days or more.
This sort of direct physical fail over plan usually runs about a hundred bucks a month, as it usually calls for about 3 extra POTS lines (the same as 3 extra fax lines) run into the business's phone system (which requires wiring and an analogue line card in the phone system). If a $1,200 per year insurance policy like this is called for to prevent the loss of the ability to receive and make the minimum of calls during an outage, then by all means order this sort of backup today.
As mentioned at the beginning of this blog, though, this sort of fail over plan requires that the main phone computer still be working. If the main phone computer goes down, as apparently happened with TelePacific recently, then this plan won't work.
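To make the weak spot in this plan easier to see, here is a minimal sketch of the decision the carrier's switch is making. It is my own simplification, not TelePacific's actual logic, and every name in it (`Route`, `pick_inbound_route`, the three flags) is hypothetical.

```python
from enum import Enum


class Route(Enum):
    PRIMARY_CIRCUIT = "primary PRI or SIP circuit (area C)"
    POTS_BACKUP = "preordered LEC backup POTS lines"
    NO_AUTOMATIC_ROUTE = "no automatic route, manual reroute needed"


def pick_inbound_route(carrier_switch_up, circuit_up, pots_backup_ordered):
    """Simplified model of where an inbound call ends up."""
    # The carrier's switch is the "brain". If it is down, nothing is left
    # to notice a dead circuit and send the call to the POTS lines.
    if not carrier_switch_up:
        return Route.NO_AUTOMATIC_ROUTE
    if circuit_up:
        return Route.PRIMARY_CIRCUIT
    # Circuit is down but the brain is working: fail over if backup exists.
    if pots_backup_ordered:
        return Route.POTS_BACKUP
    return Route.NO_AUTOMATIC_ROUTE


# Rainy-day circuit failure with backup lines in place.
print(pick_inbound_route(True, False, True).value)
# Failure in the carrier's own network: the backup lines never get used.
print(pick_inbound_route(False, True, True).value)
```

The second case is the one the POTS backup cannot cover, which is why the next step looks at options that don't depend on the carrier at all.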
5. Get a Carrier-Independent Phone Number and Move to "the Cloud" with Phone System Independent End-Points
If you have an 800 number as your main business phone number and you use an "independent RespOrg" like www.ATLC.com then rerouting your main toll free phone number during a carrier outage to an alternate working phone number or system (cell phone, answering service, back up location, etc.) takes only minutes. ATL also has the ability to, in some cases, reroute SIP/VoIP DID numbers when an outage is properly planned for. Contact ATL to see what they can do in advance for you.
The ultimate fail over plan is to totally escape the bounds of your phone system and your phone carrier.
To get the ultimate business communication survivability that is independent of your phone system, your phone carrier and even your main office location, look at a solution such as that which is offered by Telecom Recovery.
Telecom Recovery represents one flavor of "cloud" solution that can redirect your inbound calls to different "end points" (also known as cell phones or iPads). Most any hosted VoIP system today is also considered a cloud solution that will help bypass almost all circuit and physical phone system outages, save the sort that happened to TelePacific this past week.
As a telecom end user, or an agent representing an end user's best interest, it's important to ask and demand answers to the brutal questions this blog suggests. It's completely insufficient to settle for answers like, "we have backup solutions".
You need specific answers to specific questions like, "I'm going with TelePacific or XO - whoever can prove to me in detail how they're better than the other guy." If one carrier's sales engineer says, "I'm not familiar with the other carrier's fail over designs so I can't comment," you need to respond with, "Hold on, I'll get him on the phone now and you can ask him (or her)."
Everyone's mean, rude and nasty during an outage. If you want to avoid or minimize your next business phone service outage then you need to be mean, rude and nasty during the carrier choosing process by pitting one carrier directly against the other until you've forced them to reveal their own and each other's strengths and weaknesses.
If you don't have a mean, rude and nasty telecom agent to help you choose your next carrier (and avoid or minimize your next outage), ask around until you find one.
NETWORK OUTAGE REPORTING SYSTEM (NORS) - The FCC portal page for reporting network outages
FCC Disruption to Communications Reporting Rules - This seems to say that outages of 30 minutes or more that affect 900,000 user minutes need to be reported to the FCC
47 CFR 0.461 - How to submit requests for inspection of materials not routinely available for public inspection
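For a feel of what the 30 minute and 900,000 user minute figures above mean in practice, here is a rough sketch. It assumes "user minutes" means users affected multiplied by outage minutes, and it takes both thresholds from my reading of the rules above, so check the FCC rules themselves before relying on it. The function names are hypothetical.

```python
def user_minutes(users_affected, duration_minutes):
    """Users affected multiplied by how long the outage lasted."""
    return users_affected * duration_minutes


def may_meet_reporting_threshold(users_affected, duration_minutes,
                                 min_duration=30, min_user_minutes=900_000):
    # Both conditions must hold under this reading of the rules:
    # the outage is long enough AND it touches enough user minutes.
    return (duration_minutes >= min_duration
            and user_minutes(users_affected, duration_minutes) >= min_user_minutes)


# 30,000 users down for 30 minutes is exactly 900,000 user minutes.
print(may_meet_reporting_threshold(30_000, 30))   # True
# 10,000 users down for an hour is only 600,000 user minutes.
print(may_meet_reporting_threshold(10_000, 60))   # False
```

In other words, a carrier with a small customer base could have a long outage that never shows up in an FCC report, which is one more reason to ask the carrier directly for its outage documentation.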