Background: SQL-Hub: Database Merging Tool

Introduction

Burr IST Ltd have been working with a wide range of customers since 2001 on all sorts of reporting, integration and analysis projects using SQL-Hub, JetViewer and other products.

Over the years we have come across a lot of the same challenges and looked at many ways of solving them

These articles explore some of the things that are often discussed and the pros and cons of a number of different solutions. These discussion are intended to give an understanding of some different types of scenarios and how they can be addressed.

Articles...

Implementing data retrieval from multiple databases/servers

Whatever the scenario you are dealing with (whether or not you need to standardise the format or match data) you have a number of options for how to actually implement retrieving and combining the data.

X) Query the multiple sources on demand, on the fly, for each enquiry

X1) Query the multiple sources on demand, on the fly, for each enquiry, using a query across multiple databases
X2) Query the multiple sources on demand, on the fly, for each enquiry, individually, and combine the results programmatically

Y) Periodically combine the data from the multiple sources into a single database from which you can get the for all the questions (queries)

Approach X1

Query the multiple sources on demand, on the fly for each enquiry, using a query across multiple databases

This will only work if all the databases are on the same server or linked servers.

The advantages of this approach are:

No duplication - the data comes straight from the source
No latency - the data comes straight from the source

The disadvantages are:

It falls down if any of the sources is not available
The SQLs statement can become very complex
The response may be slow (or very slow) especially with poor connectivity to remote servers
It unpredictably loads the source servers
In scenario B it will be increasing complicated and possibly unworkable

more about this approach

Approach X2

Query the multiple sources on demand, on the fly, for each enquiry, individually, and combine the result programmatically

This will work ok for servers that are not linked.

The advantages of this approach are:

No duplication - the data comes straight from the source
No latency - the data comes straight from the source

The disadvantages are:

It falls down or provides an incomplete result if any of the sources is not available
The implementation can become complex
The response may well be slow (or very slow) especially with poor connectivity to remote servers
It unpredictably loads the source servers
Core database functionalities (joins, aggregate functions etc) need to be implemented programatically and applied to the resulting data from the different sources which makes for extra work

more about this approach

Approach Y

Periodically combine the data from the multiple sources into a single database

In this approach a 'combined' database is constructed to hold all of the required data. Data from the various source databases is collected by a 'batch job'. The batch job runs periodically to update the data. The overall results are obtained by simply querying the combined database.

The advantages of this approach are:

Low and predictable overhead on the source servers
Results are always available independent of the source servers
Very quick and efficient response
Simple to query the resulting data
The complexity of more involved scenarios (e.g. scenarios B and C) is hidden from the analysis/presentation layer

The disadvantages are:

Duplication - multiple copies of the data exist, however this is only a question of disk space not data integrity (see Is Data Duplication Always Bad?)
Latency - the data in the central copy may not be completely up to date

more about this approach

next...