NovaStar Reference / Database / Replication

Database replication is used to provide redundancy: if the "master" database server fails, the "backup" server compensates by temporarily becoming the master. Replication can also be used to create "slave" servers, primarily for read-only data access and for load-balancing a system.

Introduction

The NovaStar system uses PostgreSQL for its database, and database replication is implemented using the Slony add-on package for PostgreSQL. Slony is used rather than the built-in PostgreSQL replication features due to technical requirements, although this approach is re-evaluated as the PostgreSQL version is updated.
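
Once replication is configured, the state of the Slony cluster can be inspected directly in the database. The following is a minimal sketch; the database and cluster names (both assumed here to be novastar) depend on the site configuration, and Slony stores its state in a schema named after the cluster with a leading underscore:

    # Inspect Slony replication status on a node (database and cluster names
    # assumed to be "novastar"). The sl_status view reports replication lag
    # from this node's point of view.
    psql -d novastar -c "SELECT st_origin, st_received, st_lag_num_events, st_lag_time FROM _novastar.sl_status;"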

Each NovaStar base station server "node" runs a PostgreSQL NovaStar database. The contents of each database are the same except for a small number of node-specific tables, which must differ because the servers have different names, network addresses, and scheduled processes.

Data Collection

During normal operations, data collection programs such as nsrecdata collect data and insert it into the master NovaStar database.

The first level of redundancy in a NovaStar system is that each NovaStar node can run data collection programs, so that if a process is interrupted on one server, data collection continues on the other servers. For example, nsrecdata may run on both the master and backup nodes, with the servers connected over the network. Logic embedded in the data collection programs filters out duplicate data from multiple sources so that the database includes only one copy of each observation (a sketch of the idea is shown below). This redundancy accounts for the possibility that ALERT radio transmissions may follow a travel path to one or both servers, and that the travel path may change for various reasons.
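
The exact duplicate-filtering logic is internal to the data collection programs; the following is only a minimal sketch of the general idea in SQL, using hypothetical table and column names rather than the actual NovaStar schema:

    # Hypothetical sketch: insert an observation only if an observation for
    # the same point and time is not already in the database. Table and
    # column names are illustrative only.
    psql -d novastar -c "
    INSERT INTO observation (point_id, obs_time, value)
    SELECT 1234, '2024-05-01 12:00:00', 1.25
    WHERE NOT EXISTS (
        SELECT 1 FROM observation
        WHERE point_id = 1234 AND obs_time = '2024-05-01 12:00:00');"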

Replication Data Flow

During normal operations, data collected on the NovaStar nodes flows into the master database. The contents of the master database are then replicated to the backup node and any slave nodes. Note that the replication data flow is unidirectional, from the master to the other nodes.
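
The direction of replication can be confirmed by checking which node is the origin (master) of the Slony replication set. A sketch, with the same assumed database and cluster names as above:

    # Show which node is the origin (master) of each replication set, and
    # list the nodes known to the cluster.
    psql -d novastar -c "SELECT set_id, set_origin, set_comment FROM _novastar.sl_set;"
    psql -d novastar -c "SELECT no_id, no_active, no_comment FROM _novastar.sl_node;"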

Failover

When the master node loses its connection to the backup node (because of either a network failure or a hardware failure on the master), a failover is initiated. The failover causes the backup node to take over data ingest as though it were the master, and to discard the old replication configuration.
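
NovaStar performs the failover through its administrative tools; under the hood, a Slony failover is typically expressed as a slonik script along the following lines. The cluster name, node IDs, host names, and connection parameters below are assumptions for illustration only:

    # Contents of failover.slonik -- a sketch of failing over from node 1
    # (the failed master) to node 2 (the backup). Cluster name, node IDs,
    # and conninfo strings are illustrative only.
    cluster name = novastar;
    node 1 admin conninfo = 'dbname=novastar host=master-host user=postgres';
    node 2 admin conninfo = 'dbname=novastar host=backup-host user=postgres';
    failover (id = 1, backup node = 2);

The script would then be run with:

    slonik failover.slonik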

Recovering from Failover

The process for recovering from a failover must be determined carefully in order to prevent data loss.

There are several things to consider:

  1. Replication state (basic checks are sketched after this list)

    1. Has the former master relinquished its replication status?
    2. Has the backup created a new replication configuration on one or both nodes?
  2. Node state

    1. Has the master node physically failed, or is it only disconnected from the backup node (network failure)?
    2. Are all data sources present on both the master and backup nodes?
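
The replication state on each node can be checked with a few basic commands. A sketch, again assuming a database and Slony cluster both named novastar:

    # Check whether a Slony daemon is running on this node.
    ps -ef | grep '[s]lon'

    # Look for a Slony cluster schema in the database; a schema with a
    # leading underscore (e.g. "_novastar") indicates that a replication
    # configuration is still present.
    psql -d novastar -c "SELECT nspname FROM pg_namespace WHERE nspname LIKE '\_%';"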

Restoring Replication after Failover

  1. Recovering from failover:

    1. The backup node has become the new master, but the old master still believes that nothing has changed.
    2. The old master's replication configuration must be discarded using nsdbrepdropcluster.
    3. Replication is then rebuilt via the Administrator interface: Stop, Drop, Setup.
  2. How to restore replication from the above state (see the steps below).

  3. How to restore data from log files.

Assuming that your master actually went down and was not receiving data during the failover event, recovery is fairly straightforward.
Log into the backup node (which has failed over to become the new master).
In the database replication configuration Edit screen:
1. Stop Replication (if the command fails, simply try again)
2. Drop Replication (this disconnects any nodes that are still holding on to the old configuration)
3. Setup Replication
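
After Setup Replication completes, it is worth confirming that the new configuration is active. A sketch using the same assumed names as above; the sl_subscribe table lists the subscriptions that carry data from the new master to the other nodes:

    # Confirm that the subscriptions are in place and active.
    psql -d novastar -c "SELECT sub_set, sub_provider, sub_receiver, sub_active FROM _novastar.sl_subscribe;"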

In the event that your network was disrupted and the master node continued to ingest data during the failover, the data on the two nodes will begin to diverge. In this case, we advise that you contact us and allow us to do a step-by-step check and rebuild of the two databases.