Background: SQL-Hub: Database Merging Tool

Introduction

Burr IST Ltd have been working with a wide range of customers since 2001 on all sorts of reporting, integration and analysis projects using SQL-Hub, JetViewer and other products.

Over the years we have come across a lot of the same challenges and looked at many ways of solving them

These articles explore some of the things that are often discussed and the pros and cons of a number of different solutions. These discussion are intended to give an understanding of some different types of scenarios and how they can be addressed.

Articles...

Approach Y

Periodically combine the data from the multiple sources into a single database

In this approach a 'combined' database is constructed to hold all of the required data. Data from the various source databases is collected by a 'batch job'. The batch job runs periodically to update the data. The overall results are obtained by simply querying the combined database.

Firstly we need to define a structure for the 'combined' ('central' in SQL-Hub language) database and then how this database will be populated.

Scenario A

For scenario A the structure of the central database of the data is fairly clear - it is pretty much the same as that of the source database, but we need to deal with the primary keys.

Primary Keys

It is very likely that the primary keys in the source data bases will have conflicting duplicate values - i.e. in database X there is and order with OrderID=1 and the same in database Y but they refer to a different orders. In some cases the primary keys in the source databases might be GUIDs - in which case there should be no problem, but more commonly though they will be integers or strings. This will mean that these primary keys (and associated joins etc) will not work in the combined database.

We need to find a way round this problem.

One approach is to add an additional field 'DatabaseID' to all of the tables and primary keys in the combined copy - this provides unique primary keys, while maintaining relationships defined by the source primary keys and indicating which database each record came from. The one disadvantage of this is that the DatabaseID will often need to be included in joins made on the resulting combined database.

Batch Job

Once the central database structure is in place a batch job can populate the tables from each of the sources (adding the DatabaseID as it goes if required). If the source database are on the same or linked servers something like this could be done with a union query:

MS-SQL:

    Truncate table CentralDatabase.dbo.Orders;

    Insert into CentralDatabase.dbo.Orders (DatabaseID,OrderID,ProductID,OrderDate,[Value])

    select 1, OrderID, ProductID,OrderDate,[Value] from
    Database_1.dbo.Orders union

    select 2, OrderID, ProductID,OrderDate,[Value] from
    Database_2.dbo.Orders

As usual remove the .dbo for MySQL. The earlier code snippets can be modified to achieve this.

There are of course other (better?) ways of doing this.

SQL-Hub produces a single database, with the same structure as the source databases, but with an additional DatabaseID, containing the data from all the source databases. The source servers don't need to be linked and can even be on a different DBMS from the resulting combined (central) database. SQL-Hub does this "out of the box"

Scenario B

In this scenario the batch process needs not only to transfer the data but also to standardise the content. A standardised data schema for the central database, and rules to populate it from the source systems need to be defined, though this does necessarily not need to cover the entire contents of the source databases - a limited scope may still be very useful. The result of this process are extremely powerful - a single database presenting standardised data from multiple source of different types.

This scenario has been implemented very successfully using SQL-Hub in combination with Burr ISTs Reporting Cube Platform (across multiple continents with sources and on multiple different source DBMSs).

Scenario C

In this scenario the the structure for the central copy of the data is also clear - it is the same as that of the source databases for the types of information required from each different source. SQL-Hub can be used in this context as well.

The advantages of this approach are:

Low and predictable overhead on the source servers
Results are always available independent of the source servers
Very quick and efficient response
Simple to query the resulting data
The complexity of more involved scenarios (e.g. scenarios B and C) is hidden from the analysis/presentation layer

The disadvantages are:

Duplication - multiple copies of the data exist, however this is only a question of disk space not data integrity (see Is Data Duplication Always Bad?)
Latency - the data in the central copy may not be completely up to date

next...