Multi-master, multi-slave clustering module for FreeSWITCH supporting automatic failover and call recovery.
One of the most important things to any telephony service provider is maintaining a high availability (HA) network infrastructure. Most commercial softswitch manufacturers provide HA options with their solutions. However, many service providers today are looking for lower cost options than the traditional softswitch manufacturers, so they turn to open source solutions like FreeSWITCH. The main drawback with open source solutions is the lack of strong, carrier-quality high availability options. A carrier-quality HA solution should accomplish, at a minimum, the following three goals:
- Automatically detect failures
- Automatically move calls from a failed switch to a working switch
- Easily support swapping out systems and performing scheduled maintenance tasks without interrupting calls in progress
These three items are the base features required for an HA solution. However, additional features are useful and desirable. For a complete list of features, as well as a breakdown of the current development status for this FreeSWITCH module, please visit:
Risks and challenges
The main challenge I must overcome to complete this module is finding time to work on it. I work a full-time job, and I have a wife and two children. All of these things demand a lot of time. However, most weekdays I have from around 8pm until 11pm to spend writing code for this module. I also have 10 - 16 hours on the weekends that I could spend writing code for the module.
Most of these extra hours are currently spent on consulting and family activities. Paid consulting jobs take priority when they are available. Raising the requested funds will provide the income necessary to displace all other paid consulting jobs that I might otherwise take during those hours, ensuring that I can work on the project for 20 - 25 hours each week.
If all of these hours are available to me to work on the project, I should have something to demonstrate at ClueCon 2013 (August 6 - 8) that covers the base set of features and is at least "beta" quality code. This means it should be able to automatically detect failures and fail calls over to a slave system, as well as provide a means of taking nodes offline for easy maintenance without dropping calls. There may still be bugs to address, but those will be resolved as more people start testing the code.
If the $50,000 goal is not reached, I will try to relaunch the project with a smaller goal, but that will mean it will most likely not be ready for demonstration by ClueCon 2013.
Obviously, with software projects also comes the challenge of finding and fixing bugs, as well as the challenge of architecting a solid solution to the problem.
I have been working with FreeSWITCH since 2009 and programming in general since I was 5 years old. I authored the core-pgsql support in FreeSWITCH, have submitted several other minor patches to it, and I work with it in my day-job on a regular basis. My co-workers are the core developers and maintainers of the FreeSWITCH project.
In addition to this experience, I have worked in the VoIP industry since 2006 and have experience designing, developing, deploying, and maintaining carrier-grade VoIP networks. Prior to that, I worked for an ISP for four years and was solely responsible for the design and implementation of the ISP's networks. Thus, I have had direct experience dealing with high availability systems and architectures covering everything from the physical layer to the application layer since 2002. I have also written a resource agent for Pacemaker and Corosync which manages a pair of FreeSWITCH nodes as a master-slave resource.
The largest possible obstacle I might encounter is whether FreeSWITCH can easily resurrect calls for a single other FS instance in a timely fashion while each FS slave remains capable of resurrecting calls for any FS master. I have some ideas on how to accomplish this, and I am very familiar with the sections of the core FS code involved in the process, so I do not anticipate any major issues. That being said, nothing is certain until I get to that point in the development process and have the chance to test it. The worst case scenario is that I have to make some modifications to the FS core to do what I need to do. Having already added the core-pgsql support, and working daily with the lead developer of FreeSWITCH, I do not anticipate any complications or major issues with that process should it be required. However, if it is required, it will add significant extra work to the project and delay the release.
None. The point of mod_ha_cluster is to enable FreeSWITCH to perform automatic call fail over and recovery without the need for 3rd party software of any kind.
This is something that needs to be handled mostly by the person / organization deploying the HA module. The HA module supports multiple heartbeat NICs. The entity deploying the module needs to design the physical layer of their network to ensure a full network split can never occur. You generally do that by deploying two or more physical networks. This means you need redundant power, battery backups, multiple physical switches, etc. You need to design and deploy your physical network to ensure that no matter what fails (power, wiring, switch ports, switches, NICs, etc), the systems using the module always have an alternative method of communicating with each other. Typically, two physical networks are sufficient for most users. However, carrier deployments that want to offer 99.999% or better uptime might want to go with three physical networks. This means placing three NICs in each system (one on each network) and configuring mod_ha_cluster to send and receive messages on all three NICs.
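To illustrate why redundant heartbeat networks matter, here is a minimal sketch in Python of how a node might decide whether a peer is still reachable. It is not the module's actual code: the `PeerMonitor` class, the 2-second timeout, and the peer and network names are all hypothetical. The idea it shows is that a peer is treated as failed only when heartbeats have stopped arriving on every network, so losing a single switch or NIC does not look like a node failure.

```python
import time


class PeerMonitor:
    """Track the last heartbeat seen from each peer on each network.

    Hypothetical illustration only; names and timeout are assumptions.
    """

    def __init__(self, timeout_seconds=2.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_heard = {}  # (peer, network) -> time of last heartbeat

    def record_heartbeat(self, peer, network):
        """Note that a heartbeat from `peer` just arrived on `network`."""
        self.last_heard[(peer, network)] = self.clock()

    def live_networks(self, peer):
        """Return the networks on which `peer` was heard recently."""
        now = self.clock()
        return sorted(
            net
            for (p, net), heard_at in self.last_heard.items()
            if p == peer and now - heard_at <= self.timeout_seconds
        )

    def is_alive(self, peer):
        """A peer is alive if at least one network still carries its heartbeats."""
        return bool(self.live_networks(peer))
```

With two networks, a peer whose heartbeats stop on one network but continue on the other is still reported as alive. Only silence on all networks for longer than the timeout marks it as failed.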
The module keeps several seconds of cached messages in a hash table to eliminate / ignore duplicate messages. The first copy received is used and its message ID is stored in the hash table. If another copy arrives before that entry is pruned from the cache, the additional copy is ignored.
This is not some general purpose HA solution. This is a very focused and narrow-minded project. The only goal is to provide FreeSWITCH with just enough information about other nodes to be able to handle automatic call fail over and recovery, and to do it in a robust way. It will not detect failures of resources other than those it manages itself. Basically, anything that you would normally expect FreeSWITCH to be capable of pointing out as a problem, this module can intercept and act on; pretty much everything else is not on the list of things for it to do. Other HA systems like Pacemaker and Corosync are very general-purpose systems which are designed to have other software packages bolted into them. They manage endless different possible configurations for endless different pieces of software doing endless different tasks. The complexity of systems like Pacemaker is insane. This project is not trying to compete with a system like that. This project is a module for FreeSWITCH specifically and everything it does is very specific to just what FreeSWITCH needs. It is a lean, mean, finite state machine! It is not another general purpose HA system.
Pledge $250 or more
Receive a copy of the Pacemaker resource agent I wrote for managing FreeSWITCH as a master-slave resource. This resource agent will be provided for all pledges of $250 or more, upon successful funding of the project.
Pledge $500 or more
You get 1 priority bug fix request during the first year after the module releases, plus the Pacemaker resource agent.
Pledge $1,000 or more
You get 3 priority bug fix requests during the first year after the module releases, plus the Pacemaker resource agent.
Pledge $2,500 or more
You get 10 priority bug fix requests during the first year after the module releases, plus the Pacemaker resource agent.
Pledge $5,000 or more
You will receive unlimited priority bug fixes for the first year after mod_ha_cluster hits release status, plus the Pacemaker resource agent.
Pledge $7,500 or more
You will receive unlimited priority bug fixes for the first year after mod_ha_cluster hits release status. You may also request the addition of a single unplanned feature which will be treated as a priority bug fix request. Any feature requests treated as a priority bug fix request must be approved by me. You also receive a copy of the Pacemaker resource agent.
Pledge $10,000 or more
You will get my personal assistance setting up the module in your environment for 30 days from the time of first contact (after mod_ha_cluster is released). You will also receive unlimited priority bug fixes for the first year after the module is released. Finally, you can request the addition of a single unplanned feature which will be treated as a priority bug fix request. Any feature requests treated as a priority bug fix request must be approved by me. You also receive a copy of the Pacemaker resource agent.