In a clustered environment, the fail-over process covers vital Server parts and session processing.
Server part fail-over
Server part fail-over is implemented by assigning each vital Server part a different priority number on each Server, and then running each part on the Server where it has the highest priority number (provided that the Server is up and running). If a Server running a vital Server part goes down, the Server with the next-highest priority for that Server part takes over. The images below show what happens during a Server part fail-over procedure:
Server part status is supervised by the Server monitor (SM). Every Server runs an instance of the Server monitor, which continuously checks the status of the Server parts running on that Server and communicates the status to the other Server monitors.
If a Server part is missing, a fail-over of the Server part is initiated: the Server monitor on the available Server with the highest priority for that Server part activates the missing Server part.
If the option "auto fail-back" has been enabled (see System settings), a Server part will automatically "fail back" to the Server with highest priority. So if the previously failed Server becomes available again (and the priority has not changed), the Server part will be moved back to the original Server.
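The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the `Server` class, the `choose_host` function and the `auto_fail_back` parameter are hypothetical names, and the sketch assumes that a larger priority number means a higher priority.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    priority: int    # priority of this Server for one Server part (assumed: larger wins)
    up: bool = True  # whether the Server is up and running


def choose_host(servers: List[Server], current: Optional[str],
                auto_fail_back: bool) -> Optional[str]:
    """Return the name of the Server that should run the Server part."""
    available = [s for s in servers if s.up]
    if not available:
        return None  # no Server can host the part right now
    # Without auto fail-back, the part stays where it is while that Server is up.
    if not auto_fail_back and any(s.name == current for s in available):
        return current
    # Otherwise the available Server with the highest priority takes the part.
    return max(available, key=lambda s: s.priority).name
```

With Servers A, B and C at priorities 3, 2 and 1, the part starts on A, moves to B when A goes down, and returns to A once A is available again only if `auto_fail_back` is true.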
During the fail-over process, the following Events are created:
- An Event of Event type __iCore_ServerpartFailoverStarted (when fail-over starts).
- An Event of Event type __iCore_ServerpartFailoverCompleted.
- An Event of Event type __iCore_RuntimeErrorOccurred (if fail-over is unsuccessful).
Session processing fail-over
If a Job that is executing a Component with property Lifetime set to "session" terminates for any reason, the Job manager will reset the state of the Job to "Pending", which restarts the Job. The properties Restart delay time and Max restart attempts (configured on the Component) are also taken into consideration. The Job may be restarted at another Server.
If the connection between the Job manager and the Server running a Component with Lifetime="session" is lost, and contact with the missing Server cannot be re-established within a specified time frame, the session Job will be started on the next available Server.
Under some exceptional circumstances, the Job may actually still be running on the "missing" Server, but for some reason the Servers cannot communicate properly. This situation may cause two instances of the same session Component to run in the same Job concurrently. If this happens, the problem will be detected once communication between the Servers is re-established, and all but one instance of the Component will be canceled.
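The clean-up after such a communication failure can be sketched as follows. The text above does not say which instance survives, so this sketch simply assumes the instance that started first is kept; the `Instance` class and the `resolve_duplicates` function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Instance:
    job_id: str        # the session Job this Component instance belongs to
    server: str        # the Server the instance is running on
    started_at: float  # start time, in seconds since the epoch


def resolve_duplicates(
    instances: Iterable[Instance],
) -> Tuple[List[Instance], List[Instance]]:
    """Split instances into (kept, canceled), keeping one instance per Job."""
    by_job: Dict[str, List[Instance]] = {}
    for inst in instances:
        by_job.setdefault(inst.job_id, []).append(inst)

    kept: List[Instance] = []
    canceled: List[Instance] = []
    for group in by_job.values():
        # Assumption: the earliest-started instance wins; ties break on Server name.
        ordered = sorted(group, key=lambda i: (i.started_at, i.server))
        kept.append(ordered[0])
        canceled.extend(ordered[1:])
    return kept, canceled
```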
If the number of Max restart attempts is exceeded, a System Event of type __iCore_SessionJobDiscontinued is created, and the session Job will no longer be restarted in the current session.
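The restart rules for session Jobs can be illustrated with a minimal sketch. It is written under assumptions and does not reflect the actual Job manager: `SessionJob`, `on_terminated`, the `emit_event` callback and the "Discontinued" state name are hypothetical, and the exact way restart attempts are counted may differ in the product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SessionJob:
    max_restart_attempts: int  # mirrors the Component property Max restart attempts
    restart_delay_s: float     # mirrors the Component property Restart delay time
    restarts: int = 0
    state: str = "Running"


def on_terminated(job: SessionJob,
                  emit_event: Callable[[str], None]) -> Optional[float]:
    """Handle a terminated session Job.

    Returns the delay in seconds before the restart, or None if the Job
    is discontinued for the current session.
    """
    if job.restarts >= job.max_restart_attempts:
        job.state = "Discontinued"  # hypothetical state name
        emit_event("__iCore_SessionJobDiscontinued")
        return None
    job.restarts += 1
    job.state = "Pending"  # resetting the state to "Pending" restarts the Job
    return job.restart_delay_s
```

With `max_restart_attempts=2`, the first two terminations return the restart delay and set the state to "Pending"; the third emits the discontinued Event and returns `None`.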