Wednesday, March 31, 2010

Data as Service - Thoughts

I am exploring multiple ways to expose data as a service. Let's take a brief look at problem description. You have a central data repository which acts as source of truth for multiple supporting application. These supporting applications usually reads data and manipulate/transform it and do what it needs to do with it.

data-as-a-service.png

This is not very atypical scenario. In most of the organization you will have a central application - usually of the shelf product with specific customization. e.g. ERP systems, Billing systems, CRM systems etc. The data store supporting this application is not flexible to customization. You will also find supporting applications for custom needs of organization to use the central data store. In usual scenarios this application will be handling this via direct read from central data store - potentially giving way to duplication of effort to manage this objects.

These problems have been solved in multiple ways.
  1. Allow each individual supporting application to access data directly or indirectly through data warehouses
  2. Expose the catalog of finely-tuned queries expose by the supporting application as web services.
  3. Define organization wide data directory containing well defined business entities/objects and expose web services to retrieve defined entities.
Types of Services
We can classify this data services in multiple categories
  • Data-Read Services - These are plain and simple data read or data lookup services. This services provide only minor transformation like column renames, etc
  • Transformation Services - These services provide some transformation of data. It operates on the column performs operation like truncation, concatenation, etc.
  • Filtering Services - These services allows opportunity to filter data based on input conditions. e.g provide active accounts, provide prospect accounts, etc.
  • Aggregation Services - These services provides performs certain aggregation function. e.g. sales by region, accounts by regions, etc. Though this logic could have served better purpose if kept in application, there are some performance benefits to do this in the data store layer.

Define Business Entities data dictionary
First and foremost identify organization wide data dictionary for the domain. This data dictionary is the definition of entities that applications needs to use. So there is not duplication. It also reduces confusion. For example, an entity named BillingAccount will mean the billing account for any application using it. This solves a larger problem, in the scenario where the supporting applications use the data directory it defines these central entities in multiple ways. So over time the definition of the same thing becomes convoluted. e.g. account number for application A is 10 digits and same for application b is first 8 digits of that. This results in duplication of data that can not inter operate. Defining a central data dictionary and allowing exchange of that in that contract helps avoid this problem.

Logical Representation
In most scenario the data store that comes bundled with the product is really generic so that product can be customized to different business. eg. ACCTPF having fields like ACCTPFNO, ACCTPFHONO, etc. One side effect can be seen on naming convention of entities and its properties. They are wierdly named and does not make any sense without actually looking at product documentation. My take is that some naming transformation is required. This can be done by "select as" queries or actually overlaying a logical data model on top of the existing one. Of course after performance consideration.

Expose as Web Service
Once you have the logical representation any of following two options can be followed
  • Write Select queries and expose them as either SOAP/REST web services
  • If select queries does not provide necessary transformation/aggregation, write db procedures and expose them as a service.
    • One of the sub variation of this scenario is to use DB specific data structures. (In case of Oracle, use oracle's user defined types). Performance consideration is essential because remember these services may very well be the most atomic activity overarching application will perform and it is essential that each data read is trivial and inexpensive.

Just some thoughts...



Blogged with the Flock Browser

No comments: