|
Criteria for Achieving High Quality, Cost Effective Enterprise Data or How to design, build, and implement a high quality, cost effective, shared Enterprise Database!
Prepared by: Douglas T. Erickson -ENTARCO USA Inc.
Managing the enterprise data is becoming more and more of an essential requirement for any enterprise. Decades of less than even mediocre effort have left most enterprises with a data environment that is beset with problems of data quality, availability, flexibility, and excessive cost. Increasingly, people are recognizing that fundamental changes in how we manage the data are necessary. This change requires a change in the methods of how to manage the data, not just changing the technology. Just changing the technology and not the “habits” that have evolved over the decades will not solve the problem…..it will most likely make it worse. We are already seeing evidence that this is happening.
I think it is also very important to recognize that the management of the data is a very strategic issue. As most people are aware, the consequences of past data management practices are becoming more troublesome and they are persistent. Stop gap, quick-fix, operational and tactical “fixes” may provide some temporary relief but the basic problems continue. And they continue with an increasingly higher cost.
This paper addresses the database component of an overall Enterprise Data Architecture referred to as the Enterprise Database (EDB). This is perceived as a relatively new concept in IT thinking. Given this perception and a lack of experience with developing, implementing, and operating with an EDB, it is not well understood and therefore, perhaps, a controversial component of the Enterprise Architecture. This component is a skill-based component not a technology-based component. In reality, the idea of an Enterprise Database is not really new. What is new is the knowledge and experience that now exists to make it a reality. The concept of an EDB, which implements a non-redundant, highly shareable, cost effective, high quality database environment, contradicts the conventional thinking and assumptions that are often made regarding applications and application databases. Therefore, the predominant predisposition is to resist and discard the EDB concept primarily because most IT people have never had the opportunity to develop their experience and knowledge or the inclination to pursue the implementation and therefore experience the value of an EDB. This situation also exists because IT and the enterprise management have never asked for an EDB!
An EDB, as discussed here, is developed and implemented in accordance with a Conceptual Data Model (CDM) that, at a minimum, conforms to the criteria discussed in this section. Accordingly, I am using the “CDM” as a shorthand means of referring to the specification for an EDB.
The parochial or conventional position is that the CDM/EDB might be a good idea but it:
-
Takes a long time to implement, and
-
Costs a lot of money.
People who have never developed a real CDM nor implemented a real EDB successfully typically make both of these assertions. Since they have never done it they do not know how to do it and they can discard the idea and get others to discard the idea by asserting that it takes too long and costs too much.
Our experience is that it actually takes less time and money to develop, design and implement a CDM and corresponding EDB than it takes to design and implement traditional multiple application databases that continues the legacy of redundant, inconsistent, and less than acceptable data quality. The status quo is a lot more expensive, takes longer, and is the cause of most of the “data problems” experienced by most enterprises today. Also consider that the whole data warehouse movement came into existence to help solve many of the data availability, access, and quality problems inherent in the application database approach.
If the CDM/EDB is highly normalized, the adversaries will say it cannot be implemented because it would not perform and development using the database is too difficult. These assertions usually scare anyone within hearing distance from even thinking about the idea any further.
Many a naysayer will readily agree that the theory and concept are sound but it just cannot be implemented. So, the issue must be one of execution, that is, how to develop a CDM and implement a corresponding EDB.
We now have pretty convincing evidence, based on several years of production use that has demonstrated that a well developed CDM and an implemented EDB that closely resembles the CDM results in pretty dramatic benefits. These benefits come in two basic categories.
-
Dramatically improved data availability and quality, and
-
Dramatically lower data and application development and maintenance time and cost.
If quality data is collected and stored once:
-
It is typically more readily available and the quality can much more readily and easily be ascertained and sustained.
-
The logic (program code) that collects and uses the data is dramatically reduced and is much more readily identifiable for reuse. This dramatically reduces the amount of application logic that has to be specified, designed, developed, tested, and maintained.
One of the major misconceptions is that the entire enterprise CDM must be completed before any EDB implementations can occur. This is false. With a valid modular data modeling approach, significant subject area segments of the enterprise CDM can, should, and are typically implemented in a progressive manner. With a little forethought, these subject area segments can readily be identified, prioritized and implemented. Two of the most common, universally usable subject areas are:
-
Person
-
Enterprise
The EDB should not be viewed as a Data Warehouse (DW), nor an Operational Data Store (ODS) or any variation thereof. The EDB is a full-fledged Enterprise Database that can serve as the operational database for all enterprise applications. A very strategic feature of the EDB is that all future applications can and should be developed using the EDB database.
In addition, if the EDB is designed and implemented in accordance with the criteria discussed below, it will inherently serve the purposes of a DW or an ODS to a very large extent. It will also very quickly evolve to contain the data of record for the enterprise. In other words, it will dramatically reduce the requirements for a DW or ODS and thereby dramatically reduce the time, effort and cost of developing and maintaining them.
The critical factor in implementing a successful EDB is the specification and the design of the EDB. Of course, almost all products of architectural efforts are dependent on the quality or “goodness” of the specification and design of the end product, in this case the EDB. Further, it is an imperative that assuming you have a “good” EDB specification (the CDM), the EDB that is actually implemented must be pretty much an exact replica of the CDM. Deviations from the CDM must be exceedingly rare and when they do occur, they are made based on absolute necessity and in the right manner.
There are several criteria, which should be used to evaluate the “goodness”, that is, the correctness and completeness of a CDM. Of the following criteria, I find that most people’s primary position is to disparage the normalization criteria (No. 7 below) and the temporal data or “keeping track of every fact over time” criteria, (No. 8 below) typically because they cannot meet those criteria. On the other hand, most people will not have a very robust set of other criteria on which to judge the “goodness” of their data model/database offering.
Note: There is only one CDM per enterprise which means that the criteria for evaluating a CDM applies to all data modeled for the enterprise. Also, the term model as used here means to develop a specification or representation of something so that it can be built in accordance with that specification.
-
The data must be modeled to reflect how things exist in the real world, not based on how the data is used.
Example: An address exists because it has been assigned to a place by some recognized authority such as a Post Office. (They have quite a vested interested in the addresses!). You could correctly say, “An address is an address in and of itself.” We do not need to model addresses more than once. We should provide for storing addresses only once and then use them many times. We then would use associative entity types to specify who was using which address for what and when. The address itself would only be specified and stored once.
In other words, it is not acceptable to model Addresses as Person Address where the address attributes of Street, City, State, Zip are part of the Person Address and then model Enterprise Address where the address attributes of Street, City, State, Zip are part of the Enterprise Address. In the real world, a Person and an Enterprise may use the same address.
-
The data model must include the extensive use of natural search arguments to facilitate the access and use of the data.
Search arguments are those pieces of data (attributes) and whose values business people inherently, or can easily know. Further, the search arguments should be the primary key when they meet the criteria for being a key.
-
The data model must incorporate the extensive use of identifying relationships between things.
The combination of using the natural search arguments and identifying relationships significantly enhances the ability to find and access the data. It is also a major factor in simplifying the application code and in improving performance of application code against the database.
The data must be modeled to be atomic.
Every attribute/data element must have one definition and only one definition. This means that:
-
A named attribute cannot mean one thing one place and something else in another place in the CDM.
-
No attribute can have positions within the attribute that have meaning different from other positions in the attribute,
-
No attribute can have data values whose meaning is dependent on the value or values of data in another attribute.
-
No attribute can have values from more than one defined set.
-
Each attribute/element of data must have one name and only one name.
This means that every attribute in the CDM must have a unique name and the name of that attribute cannot be changed to cause it to be represented redundantly in the CDM or the resulting EDB. By convention, the specific name of an attribute is prefixed by the entity type name. Entity type names must be unique and therefore, each attribute name will be unique.
-
The data must be modeled to be explicit.
The CDM must not contain any many-to-many relationships. This means that “many” of something cannot be related to “many” of something else. These situations must be modeled using associative entity types with the appropriate”temporal” data attributes. (Refer to Item 8 below for a discussion of temporal data.)
The CDM must not contain any subtype structures without a very specific explanation of how to implement them. As a general rule, all subtypes get promoted to entity types in order to eliminate the need for optional attributes or fully optional relationships.
The CDM must not contain any fully optional relationships. A fully optional relationship means that the data analysis is incomplete. A fully optional relationship means that the data about the relationship has not been recognized and made explicit. (There may be a very rare case or two where a fully optional relationship is legitimate, but I have never seen one yet.)
-
The data model must be normalized to at least 3rd normal form.
The reason for normalization is to ensure the integrity of the data in the database. Normalization is a technique for ensuring that there are no insert, update, or delete anomalies in the data. It is the basis for ensuring what is known as referential integrity (RI).
Referential integrity means that:
-
An instance of an object cannot exist if its parent object instance does not exist.
Example: If we have a rule that says we must have an instance of a Person for us to have an Employee, we would not allow an instance of an Employee to be created without being able to reference which Person was the Employee.
-
An instance of an object cannot be deleted if it is a mandatory dependent of a parent object.
Example: If we have a rule that says we must have an instance of a Person Name for a Person instance to exist, we would not allow a Person Name to be deleted if it was the only current name we had for a referenced Person.
Any credible database management system (DBMS) will have the built-in capability to enforce referential integrity. However, because it is known that there are people designing databases that will not conform to referential integrity rules, the DBMS’s often provide the capability to shut off the referential integrity protections. In these cases, the referential integrity must be enforced by writing program code that enforces the referential integrity. Every piece of code that potentially would insert, update, or delete the data must ensure that the referential integrity rules are enforced. This is time consuming, expensive, and risky in terms of protecting the quality of the data. Having the DBMS enforce the RI eliminates this risk and expense.
The prevailing thinking in the IT community is that “normalization is okay and you might want to do it in the CDM but it cannot be implemented in the physical database”. The predominate assertion is that normalized databases do not perform. The second assertion is usually that it is too difficult to develop applications using a highly normalized data model/database.
There are a couple of fundamental reasons why people make these assertions.
-
Many in the IT community have never really seen a fully relational data model that was really normalized, much less one that is modeled to keep track of every fact over time; and where the database has been designed and implemented based on that model. This is a major problem in the IT community. Many data models are declared to be normalized but in reality they are not. This is a very tough issue to deal with since to deal with it, it is necessary to be questioning the competence of the people that developed the data models and/or those that judged the data model to be normalized. There are very few highly qualified data modelers in existence today. The very good news is that if you get a very good CDM and you implement the enterprise databases accordingly, you are not going to have to model that data again for a very long time, if ever.
-
The reason most IT people have such a problem with this concept is that their primary paradigm is based on the usage of the data as opposed to identifying, defining, and describing things of interest to the enterprise without regard for the usage of the data.
My basic premise for taking this position is that we have pretty well established that the set of data an enterprise needs is identifiable, predictable, and finite. On the other hand, it is the usage of the data that is extremely dynamic. The same piece of data can appear on hundreds if not thousands of different screens, reports, windows, or any other user view of the data. Understanding this makes it pretty easy to see that the data is stable and finite and it would be most advantageous to collect and store each fact one time and reuse it as many times as possible. This would result in maximum return on investment. This approach turns a recurring expense into a one-time investment in an asset….very quickly!
-
A fundamental principle in operating any enterprise is to minimize cost, especially recurring costs. Usually this involves making an informed decision and a prudent tradeoff regarding what is a recurring cost and what is a non-recurring cost. It usually does not take a reasonably intelligent person too long to figure out that minimizing the non-recurring cost and not the recurring cost is a fast track to oblivion.
By applying this principle we should eliminate the perceived recurring cost that is incurred every time we identify, define, design, and implement a piece of data redundantly. This should not even be considered a recurring cost; it is a redundant non-recurring cost which is totally indefensible. We can eliminate this redundant cost by eliminating the cost of moving and storing the same data over and over. The recurring cost should only be the cost for the access and use of the needed data. (Just an aside. I have rarely heard an IT management person talk about the principle of recurring and non-recurring cost. Therefore, it is unlikely that it is a factor in most IT decision-making.
-
The data must be modeled to keep track of every fact over time.
This criterion is relatively new. Therefore it is not well understood so very few know how to do it, and thus it will be hotly debated. This results in the phenomenon of “if I don’t understand it or I don’t know how to do it, it must not be important or necessary; and therefore you don’t need it”. The underlying motivation here is that if you want something and I do not know how to do it or provide it, you do not need me.
Last year, Chris Date, who is considered (and has been for many years, decades as a matter of fact) one of the leading thinkers and the leading author on the subject of data modeling and database design, published a book on temporal data which was probably the first book that dealt with the issue of keeping track of facts over time. We, of ENTARCO, have been developing and maturing the techniques for modeling data to keep track of every fact over time (temporal data) for over ten years. We have much empirical evidence of the enterprise business need, use, and value of universal and consistent temporal data to the point where the enterprise that now has it extensively implemented would not give it up. It would be their highest priority criteria for any data model and database.
There is also a case for saying that the data model cannot be properly normalized without providing for keeping track of every fact over time. To properly identify an attribute value, we need to know when it came into existence and/or when was it a fact.
The use of this technique has resulted in the following benefits. By applying the technique universally and consistently with a specified set of patterns:
-
It provides an enhanced capability to identify and resolve issues because you can universally and consistently know “who did what when” which is fundamental to identifying and resolving many operational issues.
-
It provides a very useful means to make assessments about the quality and currency of the data. Because data can deteriorate with age, this technique provides a means to always know when the data was created and how old it is.
-
It dramatically reduces the amount of time and effort to deal with the issues of keeping history. The result is we always have the history. Often when history is not captured when it is available it cannot be re-constructed. This is typically a very troublesome issue when developing a data warehouse.
-
It provides the ability to take snapshots of the status of anything at any point in time that provides the ability to see what was true in the past and when, what is true now, and to the extent that future data is entered, you can see what we expect to be true in the future.
-
It dramatically facilitates the ability to time phase the data for analytical and reporting purposes. This inherently supports classifying data about anything of interest to the enterprise people by any time dimension such as:
-
Daily, weekly, monthly, quarterly, annually, semi-annually, bi-annually;
-
Week-to-date, month-to-date, year-to-date, quarter-to-date, inception-to-date,
-
Being able to compare day-to-day, week-to-week, month-to-month, year-to-year data;
-
Or any combination of the above
From a more technical and operational perspective there are some very useful operational benefits of using temporal data. If all business data is kept as temporal data, using inserts where updates are only used to show that older data has been superseded by newer data, then some very costly issues are immediately resolved. Batch processes can now become background processes running in the same environment as online processes in the foreground. Since business data is never updated and lost, then backups are not required before beginning batch processes and data entered by online processes will not be lost since recovery does not require restoration to a prior backup. This eliminates the need for a batch window or for separate copies of databases where online processes are only allowed to access data as of the end of the last batch processing.
-
There should not be redundancy of data.
This means that the same attribute should not be modeled nor stored in any database more than once. This includes:
A. Replicating the attribute in two or more places in the model or in the database, B. Summaries or derivations of data using the attribute, or C. A value representing a fact modeled or stored redundantly by different names.
Note: Attributes that are also unique identifiers or database keys will be stored redundantly for purposes of implementing the relational model referential integrity. This redundancy is, however, controlled by the Database Management System (DBMS) and is necessary for referential integrity purposes.
Other than for database keys, there are a few exceptions to this criterion but they are very specific exceptions and occur only after a very serious evaluation of the need and the way for the redundancy of the data to occur.
First of all, the data, which is considered the base data, that is, the data used to produce the summary or derived data, must always be modeled and implemented.
The few cases under which summary or derived data may be considered for redundant modeling and storage are:
-
If the volume of data needed to be processed to produce the summary or derived data is so massive as to prevent its creation when used, it may be a candidate to be modeled and stored.
-
If the required summary or derived data does not change once it has been created and it cannot be reasonably re-created, it is a candidate to be modeled and stored.
-
If the summary or derived data values need to be preserved and for some reason cannot be re-summarized or re-derived when needed, it may be a candidate to be modeled and stored.
An attribute may be replicated if and when it is absolutely necessary to facilitate accessing some data. This case should be extremely rare since our experience is that it might occur once in a set of 8,000 to 9,000 attributes.
-
The data must be modeled to accommodate change.
There are several modeling principles and techniques for developing a data model, or process model for that matter, that will yield a structure and content that is flexible or adaptable to changes in the structure and content itself and in the use of the database designed and implemented using the data model.
This is not to say that there will not be changes to the data model structure and content but the objective is to minimize the amount of changes that will occur and also to develop the model in a manner that minimizes the impact of change and the cost and effort to accommodate changes. A very well designed CDM and EDB often will accommodate changes in the enterprise without changes being required to the CDM or the EDB and the application logic also.
First of all, modeling the data based on specifying and defining the data based on what the objects or domains of data that exist are and how they exist in the real world, will produce a model that is both highly adaptable and the data structure and content will be highly shareable and reusable.
For example, whether we are talking about Employees, Claimants, Members, Subscribers, Dependents, Licensees, Taxpayers, or Enrollees, these are all Persons about whom there is a very predictable and definable set of descriptive data. A Person Name is data about a Person. It is not data about an Employee, Claimant, Member, Subscriber, Dependent, etc. Therefore, do not model Employee Name, Claimant Name, or Member Name. Since every one of these must be a Person, they all inherit the characteristics of a Person. To do otherwise would be creating an excessive amount of redundant data. If and when we discover that there is another role played by a Person we would require adding a whole new set of data about a Person. By modeling Person and all of the Person data once it can readily be used when a new Person role is needed.
Further, if the data model is highly normalized and we subsequently discover the need for an additional attribute about a Person, it is relatively simple to add that attribute without disturbing all the other data about the Person. And we will not have to add it to all the other roles for a Person, even if we remembered to do that. It is also relatively easy to add new associations with the Person object without disturbing the rest of the data structure and content.
My intent here is just to demonstrate that there are ways to model the data to accommodate change. I do not intend to provide an exhaustive tutorial on all of the principles and techniques for modeling for change in this document, only to make the point that the principles and techniques do exist and it is imperative that these techniques be known and applied to the development of a CDM and incorporated in the design and implementation of the EDB.
-
There must be a standard method for naming entity types and attributes in the model.
We have found that a standard method for naming entity types and attributes is very beneficial for all User and IT people to be able to effectively and efficiently know and understand the content of the CDM and the EDB. The standard should be based on a set of rules that facilitate an understanding of what the data is and what it means. The names should be concise and relevant to the meaning of the object being named.
-
All entity types and attributes must have written definitions in accordance with a standard for format and content.
Each entity type and attribute (database table and table column) must have a written definition that conforms to a standard format and guidelines for content. It is highly recommended that the definitions be short and include the frequent use of examples to help clarify and convey the meaning of the data.
-
The data must be modeled to be able to maintain and enforce valid domain value sets.
This means that the data model must provide for specifying a set of valid values for all attributes for which a set of valid values can be specified. This is typically accomplished by using what is commonly known as Reference Data entity types. However, each Reference Data entity type must only specify values from one domain.
Further, the enforcement of the valid values for a domain must be by placing an associative entity type between the Reference entity type and the entity that is referencing the reference data and using identifying relationship to the entity type that is referencing the Reference Data entity type and the Reference entity type itself. The enforcement of the valid domain values using program code is not acceptable.
-
The data modeling must be based on an approach that achieves the completeness and stability of the CDM and resulting EDB.
Stability and completeness are highly related characteristic of a data model and resulting databases.
Stability is a measure of the ability to sustain the initial structure (organization of the data) of the data model and the resulting implemented databases. If the structure is very good, the ability to accommodate change will be maximized and the on-going cost of maintaining the data model and the databases will be minimized. This feature is also fundamental to reducing or eliminating the creation and implementation of redundant data. Often, new stand-alone application databases are created because of the obstacles and barriers to reusing data in existing databases, such as the cost of changing an existing database, to accommodate another usage of the data. Remember, a data structure based on one usage is not always readily reusable.
Completeness refers to how much of the necessary data about an object of interest is modeled and included in the first database implementation for that object. The more complete the set of data on first implementation, the less time and money will be spent adding newly discovered data to the data model and the databases. Also, the application that is designed to collect the data will be designed to be inherently more complete, thereby further reducing future time and effort to accommodate other related data requirements causing a subsequent change. In addition, many times needed data is not collected and when needed, the “past” data cannot be recovered or it is usually quite expensive.
In the above paragraph I make a reference to “future time and cost” and “other related data requirement”. Very often, adversaries will respond with a comment to the affect that “if we have to consider all the issues and situation that may come up in the future, we will never get done and it will be much too expensive”. This response is really another example of lack of knowledge and experience at data modeling, database design, and even application development. It can also result from short-sighted thinking on the part of the enterprise users and management.
First of all, it takes very little time and effort to consider and accommodate recognizable future requirements. At the modeling stage, many of these situations can and are recognized and accommodated in an hour or so. This approach takes a lot less time and money than making changes later on to accommodate recognizable future, or other data requirements, much less being able to provide for the data, especially the history the enterprise is going to need.
Any data modeling approach that is based on and emphasizes the “use of the data” or “what data do you need” with the intent for the business people to specify what data they need, is going to produce an incomplete and unstable data model. This approach is also very commonly used to design stand-alone application databases and only the data needed by that application will be considered. This is by far the most common approach used to develop data models, the most expensive, and the primary cause of rampant redundancy.
A much more useful approach to modeling data is based on the premise that the data needed by the enterprise is identifyable, definable, and finite. To some people this idea is unfathomable. What they are usually doing is confusing the “usage of the data” with “what the data is”.
The people, places, things, events, and concepts an enterprise needs to know about and what needs to be known about them can fairly readily be identified and defined with a very high degree of completeness and stability of the structure and the content of the resulting data model.
Another concept that further promotes the completeness and stability of a data model is the identification and incorporation of valid business rules in the data model. This concept has yielded such benefits that we have evolved to the principle that says we incorporate all the business rules we possibly can in the data model. Only if the business rules cannot be represented in the data model, do we model/specify them in the business process model and implement them in the corresponding application logic.
-
The EDB implemented based on the CDM must be usable as a production database supporting multiple business applications.
A good EDB can support multiple production business applications (systems) if it conforms to the previously discussed criteria.
Once any portion of the EDB is in place, all new requirements for data must look to the EDB for availability of that data and if the data is not available yet in the EDB, each new requirement is the vehicle for adding that data to the CDM and implementation in the EDB.
Achieving this result will provide dramatic cost reduction, productivity improvements, and capability enhancement benefits for any enterprise with the foresight to move in this direction. In fact, the EDB can and will become the shared memory of what the enterprise needs to know, it will be of very high quality, it will be very cost effective, and it will make the universe of enterprise data readily available for use by all who need it.
Just as other disciplines such as Accounting, Engineering, Materiel Management, have developed a set of principles which provide a basis for doing their work in a consistent, quality manner. I believe the above criteria provide a basis for bringing data management on a par with other, more mature disciplines.
|