Introduction
Burr IST Ltd have been working with a wide range of customers since 2001 on all sorts of reporting, integration and analysis projects using SQL-Hub, JetViewer and other products.
Over the years we have come across a lot of the same challenges and looked at many ways of solving them
These articles explore some of the things that are often discussed and the pros and cons of a number of different solutions. These discussion are intended to give an understanding of some different types of scenarios and how they can be addressed.
Implementing data retrieval from multiple databases/servers
Whatever the scenario you are dealing with (whether or not you need to standardise the format or match data) you have a number of options for how to actually implement retrieving and combining the data.
- X) Query the multiple sources on demand, on the fly, for each enquiry
- X1) Query the multiple sources on demand, on the fly, for each enquiry, using a query across multiple databases
- X2) Query the multiple sources on demand, on the fly, for each enquiry, individually, and combine the results programmatically
- Y) Periodically combine the data from the multiple sources into a single database from which you can get the for all the questions (queries)
Approach X1
Query the multiple sources on demand, on the fly for each enquiry, using a query across multiple databases
This will only work if all the databases are on the same server or linked servers.
The advantages of this approach are:
- No duplication - the data comes straight from the source
- No latency - the data comes straight from the source
The disadvantages are:
- It falls down if any of the sources is not available
- The SQLs statement can become very complex
- The response may be slow (or very slow) especially with poor connectivity to remote servers
- It unpredictably loads the source servers
- In scenario B it will be increasing complicated and possibly unworkable
Approach X2
Query the multiple sources on demand, on the fly, for each enquiry, individually, and combine the result programmatically
This will work ok for servers that are not linked.
The advantages of this approach are:
- No duplication - the data comes straight from the source
- No latency - the data comes straight from the source
The disadvantages are:
- It falls down or provides an incomplete result if any of the sources is not available
- The implementation can become complex
- The response may well be slow (or very slow) especially with poor connectivity to remote servers
- It unpredictably loads the source servers
- Core database functionalities (joins, aggregate functions etc) need to be implemented programatically and applied to the resulting data from the different sources which makes for extra work
Approach Y
Periodically combine the data from the multiple sources into a single database
In this approach a 'combined' database is constructed to hold all of the required data. Data from the various source databases is collected by a 'batch job'. The batch job runs periodically to update the data. The overall results are obtained by simply querying the combined database.
The advantages of this approach are:
- Low and predictable overhead on the source servers
- Results are always available independent of the source servers
- Very quick and efficient response
- Simple to query the resulting data
- The complexity of more involved scenarios (e.g. scenarios B and C) is hidden from the analysis/presentation layer
The disadvantages are:
- Duplication - multiple copies of the data exist, however this is only a question of disk space not data integrity (see Is Data Duplication Always Bad?)
- Latency - the data in the central copy may not be completely up to date