Center for Technology in Government
The devil is in the data
Data challenges include knowing...
Introduction
Many agencies recognize the benefits of sharing information across programs and reusing existing data resources to provide citizens with integrated services. But they run face-first into data challenges when creating systems to realize these benefits, and those challenges are daunting. Agencies must bridge the gap between their business challenges (i.e., program initiatives) and the "relevant" data available to support them.
Agencies look to data as the raw material for decision making and planning, the foundation underneath the actions they take. Turning data into actionable information requires an understanding of what must be done and the data necessary to do it. Lakshmi Mohan, associate professor from the School of Business at the University at Albany, states that the key is to distinguish a "must do" from a "nice to do" when designing an information resource. Organizations must focus first on what must be done and then on finding the relevant data resources to do it. Determining the heritage of the data and assessing its timeliness and quality are critical and complex parts of the process of turning data into "actionable information."
Unfortunately, many policy and program initiatives falter or fail because these challenges are overlooked or are overwhelming. Unexpected levels of effort are often required to:
All organizations face these challenges. The participants in the Using Information in Government Program found the challenges fell into four categories:
Fitness for use: the data quality challenge
Giri Tayi, associate professor from the School of Business at the University at Albany, asserts that data quality management means different things to different people depending on their perspective. From the analyst’s perspective, data quality management requires:
From an organizational perspective, data quality management means ensuring quality commensurate with the various uses of data through:
In practice, good data quality management demands both perspectives. Improving the quality of data is costly and time-consuming. Organizations must weigh these costs against the intended use of the data to determine whether they are warranted. A number of steps must be taken before a review of costs can take place. Organizations must first:
If the available data does not meet the requirements, then:
Being clear about what is "good enough" is essential. In order to make reasonable decisions about investments in resolving data quality issues, project managers must define for their organization the difference between "perfect" and "good enough." Since each action or outcome has a cost associated with it, organizations need to decide if the available data are "good enough" for the task at hand. And they need to realize that each notch up the scale toward "perfect" costs time, money, and opportunity. In the Using Information in Government Program, we found these general data quality rules, formulated by Orr (1996), to be useful:
Additional information regarding data quality management issues can be found at:
Common ground: the data standards challenge
The lack of common data standards across these various systems creates a significant barrier to information use. The challenge of creating and implementing unified data standards is compounded when the effort to use information spans organizational boundaries. Creating data standards within this environment requires:
Additional information about data standards development and use issues can be found at:
Information about information: the meta data challenge
This information often resides in the heads of those who were involved in the creation of a particular information resource. They know why a data set was created, what rules governed the creation, who the intended users were, and what it shouldn’t be used for. The creators of the data set may not have written this information down or shared it with others because the value at the time was limited to that particular program or situation. But when others try to use a specific data set outside the confines of the original program, the need for good meta data becomes painfully clear. The information required to guide fitness for use decisions, to determine standards used in data collection, and many other questions about the potential value of a data resource is often unavailable. This leads to unused, or unknowingly misused, data resources.
As efforts to integrate data from across multiple programs and governments are increasing, appreciation for the critical role that meta data performs is growing. Meta data can provide knowledge about the fitness for use of a particular data set for a specific decision or assessment. Meta data are not always available or required in the initial implementation of a stand-alone system. But systems that try to integrate multiple data sets without explicit meta data will be, at best, delayed, and at worst, derailed due to the high cost of creating meta data after the fact.
More information about meta data can be found at:
Understanding the program environment: the contextual knowledge challenge
Program managers are often involved in the initial discussions regarding the functional design of a proposed system, but are not involved in any subsequent system processes until the system is ready for use. Without their involvement, data inclusion and exclusion decisions can be incorrect. Like meta data, contextual knowledge is important to avoid misuse of the data. All relevant program managers need to be involved in the process of deciding fitness for use. Their knowledge of what the data actually represents is crucial when developing systems that utilize existing data or data obtained from outside sources.
Additional information that illustrates these challenges and offers techniques for addressing them:
The following examples from recent data integration projects show how agencies are dealing with data issues:
Integrating disparate data sources
Managing differences in the data
For example, population figures from different sources were used in different calculations. Because of this, the ability to compare across variables was more limited than expected. The limitations of the data could not be overcome by mere technology; it would have required costly changes in each of the underlying data sources. Without program staff who were knowledgeable about the population sources, an error could have been made by allowing unsuspecting users to make invalid comparisons. It was important that the KWIC development team have strong background knowledge about the data, understand its complexities, and manage the data differences.
Determining "fitness for use"
The issues of data relevance and fitness for use took on a different meaning in the project on Assessing IT Investments at the NYS Department of Transportation. This project involved the lead agency broadening its view of the kind of information needed to make these decisions. It not only had to identify the new data elements, but also had to define and specify rules for others to follow when correcting the elements.
Building meta data
In the Homeless Information Management System project, data were integrated and aggregated based on business rules that were specific to the new repository. The meta data from the existing systems were used as the foundation to determine if a data set could be included. New meta data were created to: document the data source, show how the data were aggregated and changed, and define the meaning of the resulting new data.
Communicating the context of the data
Both technical and program staff had to actively participate in the decisions to include or exclude specific data sets. Program staff provided the legal, historical, or operational rationale for the use of particular data elements and codes. Without this kind of contextual knowledge, important distinctions in the data would have been lost.
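The three purposes of new meta data named in the Homeless Information Management System example (documenting the data source, showing how the data were aggregated and changed, and defining the meaning of the resulting data) can be pictured as a small record kept alongside each integrated data set. The Python sketch below is only an illustration under assumptions: the class name `DatasetMetadata`, its field names, and the sample values are hypothetical and are not taken from the project or from any particular meta data standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical meta data record for one integrated data set."""
    source: str                                                # where the data came from
    transformations: List[str] = field(default_factory=list)   # how the data were aggregated or changed
    definition: str = ""                                       # meaning of the resulting data
    not_suitable_for: List[str] = field(default_factory=list)  # known uses to avoid

    def missing_fields(self) -> List[str]:
        """Return the names of the descriptive fields that are still empty."""
        missing = []
        if not self.source.strip():
            missing.append("source")
        if not self.transformations:
            missing.append("transformations")
        if not self.definition.strip():
            missing.append("definition")
        return missing


# Illustrative values only; not taken from any real system.
record = DatasetMetadata(
    source="Program A intake records (hypothetical)",
    transformations=["aggregated daily counts to monthly totals"],
)
print(record.missing_fields())  # ['definition']
```

A check like `missing_fields()` could be run before a data set is admitted to a shared repository, so that gaps in the documentation are caught while the people who know the data are still available to fill them.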