Improving service quality in Skype for Business
As written on microsoft.com/itshowcase
When Microsoft IT deployed Skype for Business 2015 to support our highly mobile global user base, our goal was to provide the best user experience in the industry. We learned valuable lessons about hardware requirements, managing our complex network, accommodating diverse and remote clients, and running a unified communications platform in a hybrid cloud environment. We also helped develop a Call Quality Dashboard to help other organizations optimize the user experience.
Microsoft is a leader in unified communications—where voice, instant messaging, and conferencing converge to help employees communicate and collaborate effectively from anywhere. In 2011, Microsoft acquired Skype and integrated it into our Lync unified communications solution to create Skype for Business. Skype for Business has a design inspired by Skype and the security, compliance, and control of Lync.
In 2013, Microsoft IT planned to deploy a pre-release version of Skype for Business to the Microsoft global user base. Feedback from these users would help the product team improve the product before public release. To get Skype for Business to work well for our internal users, though, we would need to manage a complex environment. Unified communications is a real-time service that’s sensitive to change, client-to-client or server health anomalies, network latency, packet loss, and jitter.
Also, we knew that our hardware would be insufficient to support peak usage. We knew this because when we upgraded from Lync 2010 to Lync 2013, users experienced poor call quality, dropped calls, and bad connections. In 2014, we had 10 major incidents in which as many as 1,000 Lync users were unable to make calls or join meetings, or were disconnected during a call. We determined that the problem was outdated hardware. The Lync 2013 architecture requires more robust hardware than Lync 2010, but we were still running the old servers. Skype for Business has the same architecture as Lync 2013, so without a hardware upgrade, the user experience would be poor, no matter what else we did.
Together with the product team, we launched the Get to Green program in March 2014, with “green” being the desired state of the service as shown in our metrics. Our goal was to make the end-to-end Skype for Business user experience the best in the industry. In addition to upgrading hardware, we needed to address issues arising from incompatible client drivers and hardware and a variety of networking environments. Also, more and more of our users were connecting to Skype for Business using personal devices and personal wireless networks that we don’t manage. We would need to find ways to improve the way our service performs on these unmanaged devices and external networks.
Creating a plan for great service quality
We got together with the product team to plan the Get to Green program. Our goal was to improve the user experience so there would be fewer dropped calls and better voice and video quality. To succeed, we would need to assess the environment and identify areas of opportunity to improve the service.
We would measure our success by using the Global Employee Satisfaction Survey and the Poor Call Rate (PCR). The employee satisfaction survey is administered bi-annually to a cross-section of employees that represent all roles and regions. It gathers their opinions about Microsoft IT services and resources, including their unified communications user experiences. PCR is an objective measure of call quality, based on a mean opinion score (MOS) for packet loss, jitter, concealment ratio, and round-trip times.
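To make the PCR metric concrete, here is a minimal sketch of how a poor call rate could be computed from per-stream network metrics. It classifies a stream with simple per-metric thresholds rather than a MOS model, and the threshold values, the `CallStream` type, and the function names are illustrative assumptions, not details taken from this article.

```python
from __future__ import annotations

from dataclasses import dataclass

# Illustrative thresholds only (assumptions, not values stated in the article).
MAX_PACKET_LOSS = 0.10      # fraction of packets lost
MAX_JITTER_MS = 30.0        # milliseconds
MAX_CONCEALED_RATIO = 0.07  # fraction of audio samples concealed
MAX_ROUND_TRIP_MS = 500.0   # milliseconds


@dataclass
class CallStream:
    """Network metrics for one call stream (hypothetical record shape)."""
    packet_loss: float
    jitter_ms: float
    concealed_ratio: float
    round_trip_ms: float


def is_poor(stream: CallStream) -> bool:
    """Treat a stream as poor if any single metric exceeds its threshold."""
    return (
        stream.packet_loss > MAX_PACKET_LOSS
        or stream.jitter_ms > MAX_JITTER_MS
        or stream.concealed_ratio > MAX_CONCEALED_RATIO
        or stream.round_trip_ms > MAX_ROUND_TRIP_MS
    )


def poor_call_rate(streams: list[CallStream]) -> float:
    """Return the percentage of streams classified as poor."""
    if not streams:
        return 0.0
    poor = sum(1 for s in streams if is_poor(s))
    return 100.0 * poor / len(streams)


sample = [
    CallStream(0.01, 5.0, 0.00, 80.0),    # within every threshold
    CallStream(0.15, 12.0, 0.02, 120.0),  # exceeds the packet loss threshold
    CallStream(0.02, 45.0, 0.01, 90.0),   # exceeds the jitter threshold
    CallStream(0.00, 8.0, 0.00, 60.0),    # within every threshold
]
print(f"PCR: {poor_call_rate(sample):.1f}%")
```

Because PCR is a simple ratio, it can be tracked the same way for a building, a network type, or the whole service.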
Defining problem areas
To plan improvements that would have the most impact, we assessed the service environment and identified the following areas that affect the user experience the most.
- Our server hardware was outdated. When we upgraded from Lync 2010 to Lync 2013, we used existing hardware. This created problems because Lync 2013 had a new architecture that ran all of the services on each server, rather than running each service on its own server. The old hardware didn’t have sufficient CPU or memory to handle peak load with the new architecture, so users experienced dropped connections and poor service quality. Also, we were running Windows Server 2008 R2, which did not have the performance advantages of Windows Server 2012.
- Our network environment is complex, and use is changing. Our unified communications service runs on multiple networks, such as PSTN, wireless, and the Microsoft corporate network. Our networks were designed to support mostly hard-wired connections, but users increasingly connect to our unified communications service by using Wi-Fi networks.
- We had incompatible client versions, drivers, and hardware. Clients using the service include Windows-based PCs, Android and iOS clients, and a variety of mobile devices. Some of these devices had drivers, versions, and hardware that were incompatible with Skype for Business. Also, we had the further issue that users’ personal (BYOD) devices were unmanaged.
- We have a limited ability to manage remote scenarios. Because Skype for Business is an access-anywhere technology, we can manage it only to the edge of our infrastructure. Yet 50 percent of our users are outside of our data centers. In these cases, we cannot control the environment; we can only influence user behavior.
- We have a mixed environment. At Microsoft, Skype for Business runs on-premises, in the cloud, and on hybrid infrastructure, as shown in Figure 1. On-premises infrastructure creates IT management and support overhead and requires that we use telecommunications providers for voice service. This overhead and complexity don’t support our need for great quality and reliability. Also, in the on-premises environment, we share infrastructure with other services and can’t manage end-to-end service health. Changes made by other services often affect our service quality.
Identifying areas of opportunity
To improve the user experience, we focused our efforts on improving these areas:
- Upgrading server hardware and creating redundancy.
- Improving network performance, particularly Wi-Fi in our buildings.
- Doing a better job managing a wide variety of devices.
- Educating users about the best practices and devices to use with Skype for Business.
- Creating a user feedback loop, so we can quickly identify and correct issues.
- Eventually moving all of our users to the cloud.
Focusing on the remote user experience
We decided to focus on improving service quality for our most challenging group of users: field sales people. Of all our users, they depend most on the Skype for Business service, because they are often away from corporate offices and rely heavily on unified communications to do their work. Without the benefit of our stable corporate network, they often connect over external wireless networks of variable quality, so they are the most affected by quality and reliability issues. We knew that once we got the service working well for them, all of our users would benefit.
The following two tables show the roles that are most affected by service quality, and the percentage of field sales people that are affected by poor PCR, respectively.
Optimizing Skype for Business
Over a period of several months, we made improvements to the server and network infrastructure, client devices, and user support. We’ve also continued migrating more of our user base to the cloud. While we still have a way to go, early results show that our approach is working, and the user experience is improving.
Increasing server capacity and redundancy
For the on-premises deployment of Skype for Business, a key area that we needed to address was server reliability and availability. To improve both, we needed to increase server capacity and introduce redundancy to support the Skype for Business architecture. The old hardware we were using had been designed for Lync 2010, which had a distributed architecture where each capability or service runs on its own server. To increase scalability, the Lync 2013 architecture allows multiple services to run on a single server or across server farms. Capacity can then be increased by adding servers. This architecture demands more server performance, though: more CPU and memory are required to serve peak loads. For redundancy, we would need to add servers.
Skype for Business uses the same architecture as Lync 2013. To increase reliability and performance, we deployed more robust hardware to meet the new requirements. Also, to take advantage of its threading improvements over Microsoft Windows Server 2008, we decided to run the infrastructure on Windows Server 2012 R2 instead. Upgrading to Windows Server 2012 R2 yielded the added benefits of Windows Fabric, which Skype for Business makes extensive use of.
While still running Lync 2013, we upgraded all of our hardware to support the new consolidated architecture, where multiple services run on the same server. We first set up the new hardware infrastructure and then migrated our Lync 2013 servers over to it. This increased server capacity and network bandwidth to support optimal performance at peak load. It eliminated single points of failure and created redundancy to make the service highly available. Once Lync 2013 was up and running on the new hardware, we were able to do an in-place upgrade to Skype for Business.
To do this migration, we started with the backend servers and user pools, and then migrated the front-end servers. We migrated groups of users in a phased manner so that we could monitor and correct issues as we went along. When all users were migrated, we decommissioned the old hardware. After the servers were upgraded, we upgraded the Lync clients to Skype for Business clients.
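The phased approach described above can be sketched as a loop that moves one group of users at a time and checks service health before continuing. This is a minimal illustration only: the `migrate` and `is_healthy` callables, the batch size, and the user names are hypothetical placeholders, not the tooling Microsoft IT used.

```python
from typing import Callable, Iterator, List, Sequence


def batches(users: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive groups of at most `size` users."""
    for start in range(0, len(users), size):
        yield users[start:start + size]


def migrate_in_phases(
    users: Sequence[str],
    batch_size: int,
    migrate: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> List[str]:
    """Migrate users batch by batch and pause if a health check fails.

    Returns the users migrated so far, so that issues can be corrected
    before the remaining batches are moved.
    """
    migrated: List[str] = []
    for batch in batches(users, batch_size):
        for user in batch:
            migrate(user)           # hypothetical per-user migration step
            migrated.append(user)
        if not is_healthy():        # hypothetical service health check
            break
    return migrated


# Demo with stand-in callables: health "degrades" once six users have moved.
users = [f"user{i}" for i in range(1, 8)]
moved: List[str] = []
result = migrate_in_phases(users, 3, moved.append, lambda: len(moved) < 6)
print(f"Migrated {len(result)} of {len(users)} users before pausing")
```

Checking health between batches, rather than after every user, keeps the blast radius of a bad change to one group while still letting the migration move quickly.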
We needed to ensure that the network could support peak load, which meant upgrading our data center circuits. We also configured the appropriate firewall settings, improved our DNS infrastructure, and enabled end-to-end Quality of Service (QoS) on the network to prioritize voice and video traffic.
We also needed to account for changes in the way users access unified communications. With Lync 2010, most of our users had hard-wired connections. By the time we were ready to deploy Skype for Business, most of them used wireless connections. The wireless infrastructure in our buildings was creating a huge bottleneck that we had to fix.
We’ve improved our networks and upgraded our unified communications devices to gain better performance and call quality, as follows:
- To increase the available bandwidth for Skype for Business in our data centers, we moved to dedicated 10 Gbps bandwidth through all edge and core routing and network hardware.
- We enabled network QoS, and configured it to give priority to voice traffic first and video traffic second.
- We opened the appropriate ports to provide optimal performance.
- To increase bandwidth and throughput, we upgraded our building Wi-Fi networks globally from 802.11n to the 802.11ac standard and configured them to preferentially select the 5 GHz radio band over the 2.4 GHz band. All Microsoft IT-approved devices support the new standard, and they are gradually replacing incompatible devices.
- We upgraded all of our managed clients to Microsoft Windows 10, which has improved Wi-Fi drivers.
For details on network planning approaches for Lync Server and Skype for Business Server 2015, see Network Planning, Monitoring, and Troubleshooting with Lync Server.
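As a rough illustration of the QoS policy in the list above (voice first, video second), the sketch below tags traffic classes with DSCP values and orders queued packets by priority. The DSCP code points shown (EF for voice, AF41 for video) are commonly used values and are an assumption here; the article does not state which markings Microsoft IT configured, and the function names are hypothetical.

```python
# DSCP code points per traffic class. Assumed values for illustration:
# 46 = Expedited Forwarding (EF), 34 = AF41, 0 = best effort.
DSCP = {"voice": 46, "video": 34, "other": 0}

# Lower rank is served first: voice, then video, then everything else.
PRIORITY = {"voice": 0, "video": 1, "other": 2}


def tos_byte(dscp: int) -> int:
    """Return the IP TOS byte for a DSCP value (DSCP is the upper six bits)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    return dscp << 2


def by_priority(packets):
    """Order (traffic_class, payload) pairs so higher-priority classes go first.

    sorted() is stable, so packets within one class keep their arrival order.
    """
    return sorted(packets, key=lambda packet: PRIORITY[packet[0]])


queue = [("other", "a"), ("video", "b"), ("voice", "c"), ("video", "d")]
for traffic_class, payload in by_priority(queue):
    print(traffic_class, payload, "TOS byte:", tos_byte(DSCP[traffic_class]))
```

In practice the marking is applied by policy on clients and honored by routers and switches end to end; the sort above only models the resulting ordering.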
Improving device management
We developed a Skype for Business tool called the Call Quality Dashboard to help us track down call quality issues. Some of these issues are caused by devices that have incompatible drivers and hardware. The dashboard lets us drill down and identify exactly which devices are causing problems, even personal, unmanaged devices. We can then work with the users to correct the issues. We’re now able to manage all of our devices better. The Call Quality Dashboard is discussed in more detail later, in Monitoring service health.
Moving to the cloud
We’re gradually moving our users to the cloud-based Office 365 Enterprise E5 service, which includes Skype for Business. By 2017, we plan to move 90 percent of our users to this service (keeping some users on-premises so we can continue to support our on-premises server product). This will resolve many of our current reliability and availability issues. It will also reduce the cost of supporting unified communications.
- Reliability gains. Our on-premises environment is shared with other systems. Some of our reliability problems are caused by changes made for other network-based services and technologies that affect our Skype for Business and Lync servers. Changes to networking, routing, ACLs, hardware, load balancing, firewalls, GPO, and Active Directory can all affect the service. Having our service entirely in the dedicated cloud environment managed by Azure will eliminate these issues.
- Cost savings. Moving to the cloud eliminates the need to support servers in a data center or to support networking. Plus, no in-house expertise is needed to manage this complex infrastructure. The E5 service provides PSTN conferencing and voice calling, so we will eliminate the cost of telecommunications service providers.
We’re migrating our users in steps. Within the United States, we’ve moved almost all of our users to the Office 365 Enterprise E5 service. To support our users outside the United States, we still use the Skype for Business 2015 on-premises solution. This is because, until recently, Office 365 Enterprise E5 was available only in North America. Now the service is expanding globally, and we plan to move all of our international users to it by 2017. We’ll do this in stages as the service becomes available in different parts of the world. As we gradually migrate our international users, we’ll be able to eliminate the on-premises infrastructure in other countries/regions and data centers.
In the meantime, some of our users are hosted on a cloud server, but still have on-premises voice service provided by a telecommunications company. Ultimately, when we move everyone to Office 365 Enterprise E5, we will no longer need the external telecommunications provider, but will receive all of our communications services through Office 365 Enterprise E5.
Creating a feedback loop with users
Telemetry doesn’t tell the entire story. We also collect and prioritize user feedback to reveal blind spots and drive improvements to the product and service. The Global Employee Satisfaction Survey—our main mechanism for listening to users—tells us where we need to improve. In addition, we’ve created an internal SharePoint site called Skype@Microsoft (shown in Figure 3) that gives users ways to send us feedback and requests. It’s the starting point for everything to do with using Skype for Business: community engagement, information, self-service tools, and alerts.
We also gather data from a questionnaire that pops up when a user finishes a Skype call. It lets us know about call quality issues. We view the data in our Call Quality Dashboard, described later.
Helping users help themselves
We depend on our users to make good technology choices. Using the right kinds of devices, peripherals, and Wi-Fi networks with Skype for Business improves their experience. Our Skype@Microsoft SharePoint site gives users help on using Skype for Business, including guidance on technology selection and self-service tools to help them assess how well their client is working. We recommend that they select from a list of peripheral devices that we certified for Skype for Business. The certification process ensures that the devices work well. For the list, see Phones and devices for Skype for Business. We also provide instructional videos.
For our field sales people, our most challenging user group, we’ve also developed an outreach program that includes training on tools, tips, and best practices to get the best Skype for Business user experience. These are summarized in the following figure.
Monitoring service health
We use a number of tools to continuously monitor service health, so that we can correct issues that might interfere with a good user experience.
Call Quality Dashboard
To help us diagnose network infrastructure issues affecting call quality, we developed the Call Quality Dashboard, which is included with Skype for Business Server 2015. For each phone call, it shows the type of call (wired or wireless, internal or external) and provides a measure of call quality. It uses PCR as a key performance indicator and rates calls from 1 to 4 based on packet loss and jitter. We also developed the Call Quality Methodology to use with the dashboard data. It provides a step-by-step approach to improving call quality. This has helped us to speed up our investigations and quickly resolve issues.
Using the dashboard, Microsoft IT managers drill down into the metrics—even to the individual call—to ensure that we’re delivering the best user experience at each location or building. We look at the following information:
- Service health. For both wired and Wi-Fi network infrastructure, internal and external, we look at PCR to see how healthy the service is. For server-to-client or client-to-client call streams, the dashboard provides a MOS for packet loss, jitter, concealment ratio, and round-trip times.
- Client health. For each client device, we look at information about hardware, settings, client version, wireless driver, and peripheral devices, such as headsets and speakerphones. It also shows us whether a particular device complies with our current standards.
We use this data along with the Call Quality Methodology to drive improvements across Microsoft, and so far have reduced PCR from 8 percent to less than 2 percent. We’re training IT managers to use the tools to drive improvements in their buildings by correcting issues with underperforming devices, incompatible drivers and client versions, and insufficient network bandwidth.
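The kind of drill-down described above can be approximated by grouping call records and computing PCR per group. The sketch below is an illustration under assumptions: the record shape (building, network type, poor flag), the building names, and the 2 percent target are hypothetical and are not the Call Quality Dashboard's actual data model or API.

```python
from collections import defaultdict

TARGET_PCR = 2.0  # percent; hypothetical target used only for this example


def pcr_by_group(calls):
    """Compute PCR per (building, network) from (building, network, is_poor) records."""
    totals = defaultdict(int)
    poor = defaultdict(int)
    for building, network, is_poor in calls:
        key = (building, network)
        totals[key] += 1
        poor[key] += int(is_poor)
    return {key: 100.0 * poor[key] / totals[key] for key in totals}


def needs_investigation(calls, target=TARGET_PCR):
    """Return groups whose PCR exceeds the target, worst first."""
    rates = pcr_by_group(calls)
    over_target = [key for key, rate in rates.items() if rate > target]
    return sorted(over_target, key=lambda key: rates[key], reverse=True)


# Hypothetical sample: ten calls per group, with a few poor Wi-Fi calls.
calls = (
    [("Building A", "wifi", i < 2) for i in range(10)]
    + [("Building A", "wired", False) for _ in range(10)]
    + [("Building B", "wifi", i < 1) for i in range(10)]
)
for building, network in needs_investigation(calls):
    print(f"Investigate {building} ({network})")
```

Ranking groups by how far they exceed the target gives site managers a short, ordered list of places to start.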
Performing site investigations
Our IT site managers perform site investigations by drilling down into Call Quality Dashboard data to uncover the source of issues. Once they know the source, they can remediate it. The following screen capture shows a top-level view of the data for one of our buildings. The yellow trend lines in the graphs represent the PCR rates on wired and Wi-Fi networks and by day of week. In this case, they’re all trending down, which means the service is getting healthier. The red sections in the graphs represent calls with a PCR that’s higher than the target desirable state. We drill down for more detail, such as the type of calls involved, the network device drivers being used, the wireless hotspot in use, the wireless channel, and so forth. The user ratings that we capture on call quality are also included in the dashboard.
System Center Operations Manager
We use the management pack for Skype for Business Server 2015 to monitor our servers and get alerts on issues, such as when Skype for Business processes exceed a defined performance threshold.
Key Health Indicators
We use the following Key Health Indicator (KHI) performance counters to get metrics about server health: CPU and memory utilization, and TCP transmit time. Along with other resources, you can download the KHI Guide that outlines the methodology that we use to measure KHIs on servers and our environment.
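As a simple illustration of checking KHI counters against limits, the sketch below flags servers whose CPU utilization, memory utilization, or TCP transmit time exceeds a threshold. The counter names, the limits, and the server names are hypothetical assumptions for this example; the actual thresholds are defined in the KHI Guide, not here.

```python
from __future__ import annotations

# Hypothetical limits for illustration; not the values from the KHI Guide.
KHI_LIMITS = {
    "cpu_percent": 80.0,
    "memory_percent": 85.0,
    "tcp_transmit_ms": 100.0,
}


def khi_breaches(sample: dict[str, float]) -> list[str]:
    """Return the names of counters in one sample that exceed their limits."""
    return [
        name
        for name, limit in KHI_LIMITS.items()
        if sample.get(name, 0.0) > limit
    ]


def unhealthy_servers(samples: dict[str, dict[str, float]]) -> dict[str, list[str]]:
    """Map each server with at least one breach to its breached counters."""
    report = {}
    for server, sample in samples.items():
        breaches = khi_breaches(sample)
        if breaches:
            report[server] = breaches
    return report


# Hypothetical counter samples for two front-end servers.
samples = {
    "fe01": {"cpu_percent": 92.0, "memory_percent": 70.0, "tcp_transmit_ms": 20.0},
    "fe02": {"cpu_percent": 40.0, "memory_percent": 60.0, "tcp_transmit_ms": 15.0},
}
print(unhealthy_servers(samples))
```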
We use tools such as the policy assurance manager tool in HP Network Automation to ensure that routers and switches in the data centers are running a compliant configuration and to ensure QoS is enabled end to end. We can also determine where we need to provide additional capacity to achieve availability and reliability for the network and server infrastructure. We use another internal tool to ensure all the network devices are running the gold code and that they’re meeting our capacity and compliance standards.
We also use tools such as Unify Square PowerMon to measure quality during synthetic transactions. We set up probes and test accounts in data centers.
While we’re continually improving, we’re already seeing improvements in the user experience and also enjoying cost benefits:
- The PCR was reduced to 1.73 percent from 8 percent, mostly due to network improvements and improved Windows 10 Wi-Fi drivers.
- The Global Employee Satisfaction Survey—our main mechanism for listening to users—showed double-digit improvements in user satisfaction. Users have already reported improvements in availability, reliability, and performance. We’ve turned a corner in terms of understanding the key satisfaction drivers for users, and for the last two quarters we’ve made gains in driving service improvement.
- We have double-digit increases in employee satisfaction, with an average 18-point increase in user satisfaction across audio, video, IM, meetings, and sharing.
- We’re saving about $132,000 per day by reducing the cost of using the public switched telephone network (PSTN) and third-party conferencing services, thanks to migrating our users to the Enterprise Voice features of Skype for Business.
- With more than 127,000 of our users enabled for Enterprise Voice, we’ve been able to decommission 70 percent of our old PBX equipment, saving more than $4.03 million over the last six years.
- Over time, we expect savings to increase. As we move more users to Skype for Business in the cloud, our datacenter infrastructure needs will decrease, and we will eliminate the cost of telephone carriers completely, which will reduce overall costs significantly.
- We’re also looking forward to further improvements from new Skype for Business features in coming months, like Keynote for Enterprise Connect, translation services, and better conferencing solutions.
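To put the savings figures above on a common yearly scale, the following sketch annualizes the daily telephony savings and averages the PBX savings per year. The inputs come from the list above; treating the daily rate as constant across 365 days and spreading the PBX savings evenly over six years are simplifying assumptions.

```python
DAILY_TELEPHONY_SAVINGS = 132_000  # USD per day, from the figures above
PBX_SAVINGS_TOTAL = 4_030_000      # USD over six years, from the figures above
PBX_SAVINGS_YEARS = 6
DAYS_PER_YEAR = 365                # assumption: savings accrue every day


def annualized(daily_savings: float) -> float:
    """Scale a daily savings rate to one year."""
    return daily_savings * DAYS_PER_YEAR


def per_year(total_savings: float, years: int) -> float:
    """Average a multi-year total over its duration."""
    return total_savings / years


print(f"Telephony: ${annualized(DAILY_TELEPHONY_SAVINGS):,.0f} per year")
print(f"PBX: ${per_year(PBX_SAVINGS_TOTAL, PBX_SAVINGS_YEARS):,.0f} per year")
```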
Best practices for a great user experience
Use these best practices to improve the user experience with Skype for Business in your organization.
Provide sufficient capacity and bandwidth
Make sure that server capacity and network bandwidth support optimal performance at peak load. Use redundant systems to make sure that the service is highly available. Enable networking QoS, and open the recommended ports for optimal performance. To ensure your infrastructure supports the best possible service, be sure to follow the capacity planning guidelines for Skype for Business.
Put the right tools in your toolbox
Acquire and set up the tools discussed in this paper so you can monitor and manage Skype for Business service quality.
Move to the cloud
To gain performance and feature benefits, plan to move your Skype for Business users to the cloud—Office 365 Enterprise E5. Not only will it cost less, but it will increase your unified communications capabilities. Also, users like the Skype for Business client. Our Microsoft users are much happier with it.
If you haven’t already deployed a unified communications service, you can start by offering a 100-percent cloud-based service through Office 365 Enterprise E5. Not only will you avoid needing to support the infrastructure, but you’ll no longer have to pay telecommunications providers for telephone services. Rather, your users can connect to the Internet using Skype for Business, and Microsoft Azure will route telephone calls for them. This can represent a large saving for your organization.
Listen to your users
Take these steps to ensure a great user experience:
- Understand use cases. Build personas and scenarios. Understand a “day in the life” of each group of users.
- Listen to your users. Create dedicated listening systems.
- Collect and prioritize feedback and use it to improve your service.
Help your users get good results
Make sure that users are empowered with tools and training to get the best possible Skype for Business experience. There are many situations that users can manage better than IT can. Help your users help themselves by giving them guidance and the right tools. Provide real-time notification of incidents and self-service workarounds. Make information on best practices easy to find.
Ensure client health before a meeting starts
Provide tools to ensure that the client is as healthy as possible before a user joins a meeting.
Use the recommended home router and best practices guide
For remote users, provide guidance for selecting and configuring a home router. Have a list of recommended Wi-Fi routers. Use diagnostic tools to make sure the home Wi-Fi network is performing well.
Use approved headsets and peripherals
Recommend Skype-certified headsets and peripherals to ensure the best possible experience for your meetings. The certification process ensures that peripherals work well.