All official NCVHS documents, including meeting transcripts, are posted on the NCVHS Website.

Department of Health and Human Services


Subcommittee on Populations

Workshop on Data Linkages to Improve Health Outcomes

September 18 – 19, 2006

Renaissance Hotel
999 9th Street, NW
Washington, D.C.

Meeting Synopsis

The National Committee on Vital and Health Statistics Subcommittee on Populations Workshop on Data Linkages to Improve Health Outcomes was convened on September 18 – 19, 2006 in Washington, D.C. The meeting was open to the public and broadcast live on the Internet.

All official NCVHS documents are posted on the NCVHS Website.


Committee Members

  • Donald M. Steinwachs, Ph.D., Chair
  • A. Russell Localio, Esq., MA, MPH, MS
  • William J. Scanlon, Ph.D.
  • C. Eugene Steuerle, Ph.D.
  • Kevin C. Vigilante, M.D., MPH (Sept. 18 only)


  • Robert H. Hungate

Staff and Liaisons

  • Dale Hitchcock, ASPE, Lead Staff
  • Nancy Breen, NCI, NIH
  • Irma T. Elo, U PA (Sept. 18 only)
  • Marjorie Greenberg, NCHS/CDC (Sept. 18 only)
  • Debbie M. Jackson, NCHS
  • Edward J. Sondik, NCHS
  • Cynthia Sydney, NCHS
  • Joan Turek, ASPE


  • Maria Agelli, NIH
  • Miguel Buddle, US Military Cancer Institute
  • Cecilia R. Casale, AHRQ
  • Leslie Cooper, NIH
  • Mark Denbaly, USDA ERS
  • Greg Downihy, DHHS
  • Judith Eargle, US Census Bureau
  • Jenny Laster Genser, USDA FNS
  • Miryam Granthon, HHS
  • Erick K. Ishii, VA
  • Hormuzd Katki, NIH
  • J. M. King, NIAMS NIH
  • Kimberly A. Lochner, CDC
  • Jacqueline Wilson Lucas, CDC
  • Salma Shariff-Marco, NCI
  • Jan Markowitz, NAPHSIS
  • Patricia L. Melvin, US Census Bureau
  • David G. Moriarty, CDC
  • Edna L. Paisano, IHS
  • Jennifer D. Parker, NCHS/CDC
  • Kalman Rupp, SSA
  • Philip Steel, US Census Bureau
  • Kangmin Zhu, US Military Cancer Institute


  • An audiotape and PowerPoint presentation slides from this workshop are posted on the NCVHS Website.
  • Information obtained from this workshop will be processed to further the Subcommittee’s  goal of making recommendations to the Secretary of HHS about data linkages that can improve health outcomes.  With a focus on population health measurement, the Subcommittee will continue to review data sources for health risk factors, socioeconomic status, education, income, and other factors; and work to better understand the health of subpopulations and what can be learned from state data.

See official transcript for full Subcommittee discussion.


Monday, September 18, 2006

Importance of Data Linkages

Donald Steinwachs, Ph.D.
C. Eugene Steuerle, Ph.D.
Nancy Breen, Ph.D.

Meeting Purpose and Constraints Blocking Linkages     The purpose of the workshop is to discuss what data linkages are possible; what is and is not being done; what could be done with additional resources; what interagency collaboration is possible; and what constraints block linkages.  Lack of resources and privacy and confidentiality issues represent two of the biggest constraints.  An important question to consider is how to ensure adequate access.  (It was noted that the Census Bureau’s Survey of Income and Program Participation (SIPP), which provided information about labor force and employment patterns and long-term trends, has been defunded.)

Census Bureau Overview of Technical Infrastructure

Sally M. Obenski, Asst. Division Chief for Administrative Records Applications, Data Integration Division

Use of Administrative Records, Policies, and Administrative Controls     The Census Bureau’s guiding mandate (Title 13), its strategic plan, and other legal guidance and protections (Title 26 and the Privacy Act) call for the use of administrative records as extensively as possible.  Such use reduces reporting burden, minimizes costs, and allows for the creation of innovative data sources.  The Census Bureau, which ensures consistent application of policies and administrative controls, has a stringent infrastructure (data stewardship program) that houses administrative records.  Respondent privacy and confidentiality are protected at all times. Personally identifiable information is replaced with an anonymous protected identification key  (see transcript for history of the Census Bureau’s administrative records program evolution [StARS]).

Use of Numident and Race and Hispanic Origin Model     See transcript for details of the Census Bureau’s use of Numident, a transaction file of all social security numbers that provides the Census Bureau with its verification and validation system.  Note that Census 2000 was linked to Numident.  What exists now is a hybrid system that combines Census 2000 race and Hispanicity with high quality Numident age and sex.  This system is now being used by a number of Census Bureau programs, including intercensal estimates.

PVS     See transcript for details of the automated Person Identification Validation System (PVS), developed by the Census Bureau in collaboration with the SSA.  The PVS is the Census Bureau’s record linkage infrastructure; it uses the Numident and SSA file as its reference file, and no identifiable information is passed on.
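The idea of replacing personally identifiable information with an anonymous protected key before files are linked can be illustrated with a minimal sketch.  This is not the PVS matching logic, which the synopsis does not describe; the keyed hash, the secret, the field names, and the sample records below are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret held only by the linkage unit (an assumption, not a real key).
SECRET_KEY = b"hypothetical-secret-held-by-linkage-unit"


def protected_key(identifier: str) -> str:
    """Derive an anonymous key from an identifier using a keyed hash (HMAC-SHA256)."""
    digits = "".join(ch for ch in identifier if ch.isdigit())
    return hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).hexdigest()[:16]


def anonymize(records, id_field="ssn"):
    """Drop the identifying field and attach the protected key instead."""
    out = []
    for rec in records:
        clean = {k: v for k, v in rec.items() if k != id_field}
        clean["pik"] = protected_key(rec[id_field])
        out.append(clean)
    return out


def link(left, right):
    """Join two anonymized files on the protected key only."""
    index = {rec["pik"]: rec for rec in right}
    return [{**rec, **index[rec["pik"]]} for rec in left if rec["pik"] in index]


# Made-up records; the identifiers are deliberately invalid numbers.
survey = anonymize([
    {"ssn": "000-00-0001", "insured": "no"},
    {"ssn": "000-00-0002", "insured": "yes"},
])
admin = anonymize([{"ssn": "000000001", "medicaid": True}])

linked = link(survey, admin)
```

In this sketch the analyst who receives `linked` sees the survey answer and the administrative flag side by side but never the original identifier.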

ACS      Another major enabler is the implementation of the American Community Survey (ACS), which obtains data at smaller levels of geography.  Bureau researchers are working to use the ACS to “model the model.”  See transcript for discussion about how the Bureau processes and anonymizes files for such programs as the intercensal estimates and small area income and poverty estimates.

Uses in the Decennial Census

  1. Explore whether the Bureau can assist the hot deck imputation method by using administrative records to assign age, race, sex, and Hispanic origin when a record can be matched.  This can reduce hot deck work load and improve its standard error.  Because this is promising, the Bureau will determine whether it is operationally feasible in a production environment.
  2. Use administrative records to identify households with coverage problems.  The current coverage follow-up operation is expensive so a research project has remodeled the StARS system to develop probabilities about certain types of housing units in certain types of areas (currently being evaluated).
  3. Enhance the group quarters frame, using Info USA (yellow pages) and ES 202 (business register for states).  Note that seven states and their co-op list have been successfully evaluated.  This effort is being expanded nationally.
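The first item above, using administrative records to assign a missing characteristic when a record can be matched and falling back to hot deck imputation otherwise, might be sketched as follows.  This is a simplified sequential hot deck under stated assumptions: the imputation cell (a tract), the field names, and the sample records are hypothetical and do not reflect the Bureau's production rules.

```python
def impute_age(records, admin_ages):
    """Fill missing ages: administrative value when matched, else a hot deck donor.

    The donor is the most recently processed reported age in the same cell.
    """
    last_donor = {}  # most recent reported age per imputation cell
    result = []
    for rec in records:
        rec = dict(rec)  # do not modify the caller's records
        cell = rec["tract"]
        if rec["age"] is not None:
            rec["source"] = "reported"
            last_donor[cell] = rec["age"]
        elif rec["id"] in admin_ages:
            rec["age"] = admin_ages[rec["id"]]
            rec["source"] = "admin"
        elif cell in last_donor:
            rec["age"] = last_donor[cell]
            rec["source"] = "hot_deck"
        else:
            rec["source"] = "unresolved"
        result.append(rec)
    return result


# Made-up records: id 2 is matched in the administrative file, id 3 is not.
records = [
    {"id": 1, "tract": "A", "age": 34},
    {"id": 2, "tract": "A", "age": None},
    {"id": 3, "tract": "A", "age": None},
    {"id": 4, "tract": "B", "age": None},
]
admin_ages = {2: 36}

imputed = impute_age(records, admin_ages)
```

Every record resolved from the administrative file is one fewer case left to the hot deck, which is the workload reduction the item describes.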

Other Survey Improvements

  • Use of StARS database to develop survey controls that reduce ACS small area variance (Bob Faye)
  • Examination of StARS to CPS match to compare non-responders to responders (analysis underway)
  • Katrina response involved alternative survey controls and county-level tallies using NCOA.  The Bureau is exploring the feasibility of developing the next generation of StARS to produce real-time measurements.


Operational constraints include: file acquisition complexities (complex MOU for third party data; state-by-state negotiation [e.g., poverty data]; differences in content definition, quality, and program rules over time); and file lag time (e.g., MSIS).

Technical constraints (such as getting the right data into the right format; varying rates of validation (e.g., Medicare high; Medicaid lower); coarseness of data; and measuring error) pose additional challenges.  There are questions about what “measuring error” means and challenges to how a confidence interval is placed around an integrated data set.  It is very difficult to update the StARS file, with its hundreds of millions of records, on a quarterly basis to provide a near real-time response; acquiring state-based records and understanding what integrated sets are pose additional challenges.

Actions to Overcome Constraints     Resolving some file acquisition issues (especially for state data) may require OMB or Congressional assistance.  Lag time for general demographics is largely addressed by the national change of address file and possibly a move to the enhanced StARS.  The new Data Integration Division has completely standardized and centralized file acquisition.  There are continual improvements to PVS (including conversion to SAS).  A Data Quality Standards team will measure error in integrated data sets.

Health-related Administrative Records Research

Ronald C. Prevost, Asst. Division Chief for Data Management, Data Integration Division.

Summary  The Census Bureau is committed to: meeting the needs of its customers by enhancing reuse of statistical data through the use of administrative records, survey and census data sets; reducing costs and respondent burden of developing statistics; ensuring the trust of data providers and users as well as the public; and continuing a public dialogue on the direct benefits and cautions of the Census Bureau’s use of administrative data.

See transcript for description of the Census Bureau’s small area health insurance estimates program, which produces a consistent set of estimates of health insurance for all U.S. counties.  Estimates for 2007 will be at the county and state levels by race, ethnicity, age, sex, and income.

Medicaid Undercount Project Summary  Survey estimates of Medicaid enrollment are well below the administrative data enrollment figures.  See transcript for information about reasons for the undercount; how such information is used; a description of the phases and intricacies of the study; and examples.  This project includes the development of a longitudinal data file and is a good example of interagency collaboration that has been working, in phases, to integrate data sets; determine why Medicaid administrative data and the CPS differ so widely on enrollment status; improve information in national files; and examine the impact that state data has on national data.  In its fourth and final phase, the project will match the national Medicaid system to the National Health Interview Survey and look at person coverage.  While the goal of this project is to improve the CPS for supporting health policy analysis (especially refining estimates of the uninsured), conclusions to date indicate that survey measurement error (i.e., misreporting) is playing the most significant role in producing the undercount.  It should be noted that data can be linked without a social security number and that, in the future, the Census Bureau will link more data in this manner in response to privacy concerns.
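Once survey and administrative records are linked, the share of the undercount attributable to misreporting can be examined directly.  The sketch below is only an illustration of that comparison: the field names and the five made-up linked records are assumptions, not project data or the project's method.

```python
def undercount_summary(linked):
    """Compare survey-reported Medicaid coverage with administrative enrollment."""
    enrolled = [r for r in linked if r["admin_enrolled"]]
    enrolled_and_reported = [r for r in enrolled if r["survey_reports_medicaid"]]
    survey_total = sum(1 for r in linked if r["survey_reports_medicaid"])
    rate = 1 - len(enrolled_and_reported) / len(enrolled) if enrolled else None
    return {
        "admin_enrolled": len(enrolled),
        "survey_reported": survey_total,
        # Share of administratively enrolled persons who did not report Medicaid.
        "false_negative_rate": rate,
    }


# Made-up linked records: four enrollees, one of whom does not report coverage.
linked_records = [
    {"admin_enrolled": True, "survey_reports_medicaid": True},
    {"admin_enrolled": True, "survey_reports_medicaid": True},
    {"admin_enrolled": True, "survey_reports_medicaid": True},
    {"admin_enrolled": True, "survey_reports_medicaid": False},
    {"admin_enrolled": False, "survey_reports_medicaid": False},
]

summary = undercount_summary(linked_records)
```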

Value of Integrated Data Sets  Administrative data portrays information and experience that federal and state agencies have about a given set of individuals but often does not address demographics, social or economic information measured by surveys.  Integrated data sets consisting of survey and administrative data represent an “important area of growth” because they portray a more robust and accurate picture for use in policy development, implementation, and evaluation.  They build on what the agency sees and what people experience but control for weaknesses in both arenas.  Integrated data provides better statistics and allows for the development of eligibility models.  Data reuse may be the only cost effective option in the future.

If the Medicaid Undercount Project were to add data to improve research on health outcomes, it would include information from: WIC files; the NHANES survey and the MEPS survey; CPS food security supplement; and the SS-5 (application for social security card) that the SSA collects.  Other important data sets include information related to co-insurance: VA medical health data; TRICARE information; and Indian Health Service information.  See PowerPoint presentation for more on issues to be addressed, which include: universe differences; persons receiving Medicaid in 2+ states; what is “meaningful” health insurance coverage; CPS knowledge; CPS responses and respondent recall; and possible bias.

Policy Challenges  While integrated data architectures are the future of American statistics, the benefits of integrating these data sets must be communicated in light of privacy concerns.  Interagency teams need to ensure accurate results but interagency agreements can inhibit timely access to data.  The question of who “owns” the data is critical.  If everybody owns them, there are possible disclosure risks.  When data sets are blended and released as statistical data files, how best does the statistical agency providing the data remove identifiable information on individuals?  The vision for the future must include the ability of longitudinal databases to locate the address of an anonymized person at a specific point in time while balancing privacy concerns.

Policy Issues When Using Administrative Data for Statistical Purposes 

Gerald Gates, Chief Privacy Officer, Census Bureau

See PowerPoint presentation for information about Census Bureau’s mission, Title 13 restrictions, data stewardship; legal guidance and protection; policies and Record Linkage policy statement; controls that support statistical use of administrative records; unique privacy and confidentiality concerns.

Four Trust Relationships     (see transcript for details)

Key policy issues related to use of data revolve around trust:

  1. Between administrative data provider and statistical data collector
  2. Between statistical data collector and survey and census respondents
  3. Between administrative data provider and program recipient
  4. Between statistical agency and data users

The Census Bureau’s legal ground rules require that data be protected and used for statistical purposes only, to honor privacy and protect confidentiality. The Bureau has a commitment to data stewardship, a formalized program (est. 2001) that acknowledges and addresses legal mandates and ethical requirements of professional statisticians (see transcript for details).  Title 13 is the basic legal framework for Census Bureau programs and Title 26 (IRS code) is also important (see transcript for specifics and definitions of privacy and confidentiality).  The Bureau is also guided by the Privacy Act, the Paperwork Reduction Act, the Freedom of Information Act, HIPAA, and the E-Government Act of 2002.

Four guiding policies that impact administrative records include linkage of decennial census records; record linkage; collaborative arrangements with agencies; and an administrative records handbook (see transcript for specifics of record linkage policy; administrative record controls, review process, and tracking system; and security and confidentiality staff training).

Privacy and Confidentiality Questions and Concerns about Administrative Records Use

  • Is consent for the statistical use of administrative records obtained through the data provider or the survey collector?  What are the conditions for that consent?  Is it opt-in or opt-out, and how does it differ for voluntary and mandatory surveys?
  • Will the public accept a near real-time system to respond to immediate statistical needs? What are the concerns about potential non-statistical uses of this information?
  • Does the public trust the protections around interagency data sharing?  How does the Patriot Act influence that trust?
  • How can more record linkage transparency be accomplished?
  • How can statistical needs be met while assuring confidentiality?
  • Administrative records linked to survey data raise unique confidentiality concerns (holders of the administrative records can easily identify survey respondents, thereby breaching confidentiality).

Access Options

  • Network of Research Data Centers:  Census Bureau provides access to qualified researchers working on Census Bureau programs.
  • Census Bureau is researching techniques used for the Luxembourg Income Study and by NCHS, which allow users to develop programs to run on the matched data.
  • Census Bureau is dialoguing with researchers, program evaluators and implementers about their needs and how best to meet those needs within existing constraints.


Question:  What is the feasibility of getting food stamps and TANF data from various state and county agencies before 2020?

Response:  The biggest hindrances are legal requirements for acquiring data sets.

Greater cooperation might be forthcoming if data systems were linked to certain line items that provide appropriate incentives.  The development of model and uniform state and local laws as well as agreement on best practices were also recommended.

Question:  Could CMS obtain an estimate of race and ethnicity from the Census Bureau that could be linked while maintaining privacy and confidentiality?  What issues would arise if Medicare data in a Data Center were linked?

Response:     The issues have to do with perception about whether confidentiality has been violated.  Data Centers have confidentiality filters to prevent individual data from being disclosed.

On Race & Ethnicity

A discussion ensued about data for race and ethnicity categories.  CMS is making an effort to improve race and ethnicity data.  With some success in obtaining more accurate data from the Indian Health Service, they are working to expand race codes beyond white, black, and other.  A significant challenge is that enumeration of social security numbers is done by states, and many states refuse to give the Social Security Administration (SSA) information about the race and ethnicity of children.  The Census Bureau cannot disclose personal information to CMS (Title 13).  In addition, Medicare covers people with disabilities, who can include children.  As a result, there are many unknowns within the data.  It was pointed out that Social Security numbers (issued at birth) do not include race and ethnicity.  Mr. Prevost suggested that it might be possible to build an imputation model that gives CMS the information it needs to take a best guess at race, based on Medicare information.

Question: To what extent are Research Data Centers utilized?  What ideas do customers convey?  What are models for the future?  How do other agencies approach these challenges?

Response:  To access administrative records data, there is a lengthy approval process although this is not due to the disclosure review.  There was disagreement about how heavily the 12 or 13 Research Data Centers are used, noting that researchers must work on-site.  NSF believes that these centers should be supported as a “niche” approach rather than a bill-based response.  Ms. Turek believes that major changes in law are needed relative to the use of matched data.

Question:  Does Social Security still use SIPP-based models?

Response: Yes.  The SSA is a Data Center that permits use of restricted data but is highly protective of security and privacy.  With restricted data, a MINT program is used, which models income in the near term; and an SSI financial eligibility model called FEM is also used (see transcript for further model information and for information about SSA’s joint research projects).

Dr. Iams noted that the Census Bureau could have made more use of matched data in the past to improve data quality of missing data and imputed data.  He also noted that SSI gets misreported in CPS and in SIPP (an undercount develops in SSI when people getting SSI say that it is Social Security).  The new Economic Well-being Survey system will use administrative records more to improve survey information.

Questions about Data Centers:  What are the restrictions or requirements for an agency to be a Data Center relative to Social Security? Are Data Centers effective?

Response:  See transcript for description.  Reasons why such Data Centers don’t work from a practical point of view were described, including cost, access restraints, and an elaborate project approval process.  Serious consideration must be given to what types of projects can be done at Data Centers and on what terms and conditions.  Some felt that Data Centers are useful despite the fact that the number of researchers who access the data is very small.  It was suggested that more articles be published that show whether Data Centers are working or that illustrate a need for more centers or for easier or less costly access.

Further Challenges and Recommendations:  See transcript for description of the Chapin Hall Project model, which is not a “fix” but which provides a clean set of data in the RDC.  Statistical access alternatives include: delivering PUMS files where possible (with the risk that PUMS data sets can be matched back to individuals, re-identifying them); assessing opportunities to enhance or streamline RDCs; researching assisted electronic access (Luxembourg Method); researching synthetic data; and seeking dialogue with research and policy analysts, program implementers and evaluators.  To enhance modeling by researchers for policy reasons, differences between survey and administrative data must be noted.  Mr. Gibson described further challenges and suggested the development of core research files.  He hopes for a way to work with agencies that can handle metadata (from policy to systems to manuals to actual claims and enrollment database) while trimming unnecessary information.  A relatively narrow relational database can provide answers fairly quickly.

Question:  What are ways to cut through the “Catch-22s” in order to use integrated data sets? How can policy makers be advised to deal with these complicated issues?

Response: One issue is whether linkages to support the research can be made; and another is whether the data needed by researchers can be provided.  More public discussion is needed to educate people to the fact that, under controlled conditions, these linkages are the “right thing to do.”  It is also important to ask whether current legal requirements must be modified.

Standardized agreements and streamlined processes between government agencies were suggested.  What is the quantitative measure of risk and does it need to be rethought?  A discussion ensued about the lack of preparedness for emergencies.

Using Linked Micro-data

Julia Lane, Sr. V.P., NORC, University of Chicago

Benefits and Challenges of Linked Data       Benefits of linked administrative and survey data include improved analysis of existing data; the ability to do reanalysis; an increase in the feasible set of research; and the capacity to capture new information sources.  The challenge is not just to collect data sets but to ensure their use for intended purposes (see transcript for information about public use data).  While synthetic data has great potential to help researchers, it gets rid of outliers (which define what is happening in the economy and with healthcare expenditures) and thereby affects the utility and quality of the data.  These issues are exacerbated with linked data.  The more data, the easier it is to identify someone, so confidentiality remains a problem.  See transcript for discussion of impact of top coding on quality of data analysis.  A big concern centers on the impact of disparate use of data on data quality (of research and policy analysis).  See transcript and PowerPoint presentation for more information on approach and drawbacks of Census Research Data Centers.
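The effect of top coding on analysis can be seen in a small worked example.  The expenditure values and the cap below are made up for illustration and do not come from the workshop.

```python
from statistics import mean


def top_code(values, cap):
    """Replace every value above the cap with the cap itself."""
    return [min(v, cap) for v in values]


# Made-up annual health expenditures with one high-cost outlier.
expenditures = [200, 450, 800, 1200, 95000]
coded = top_code(expenditures, cap=5000)

true_mean = mean(expenditures)  # 19530
coded_mean = mean(coded)        # 1530
```

In this example the single outlier accounts for most of total spending, so capping it cuts the estimated mean by more than a factor of ten, which is the kind of loss in analytical quality described above.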

A Portfolio Approach:  Establishing Confidentiality Protections     When establishing computer protections, a portfolio approach with multiple access modalities is recommended, to be accompanied by an institutionally-binding agreement that delineates a legal framework (see transcript for details).  Such an integrated approach would provide legal, statistical, operational and educational options but would work differently within different agencies.  This model would utilize public use and synthetic data while allowing for remote and on-site access.  Dr. Lane suggested minimizing the amount of statistical protection (e.g., taking off obvious identifiers) but not changing data quality unless the impact on the quality of analysis could be documented.  Statistical protection should be adjusted as requirements for screening people using the data are operationalized and researchers are trained to protect data.  The law is “reasonably” clear about acceptable purposes for use of data.  Researchers must have an authorized purpose that adheres to each agency’s mandate.

Remote Access to Data Centers and Training    For remote access to Data Centers, Dr. Lane suggested the use of encrypted connections and smart cards, with user access restricted to specified, predefined IP addresses.  Citrix technology is being used increasingly.  She recommended a two-day training class with subsequent refresher classes that would include basic information and principles of confidentiality.
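Restricting remote access to predefined IP addresses amounts to an allowlist check at connection time.  A minimal sketch follows; the network range is a reserved documentation range used as a placeholder, and nothing here describes an actual Data Center configuration.

```python
import ipaddress

# Hypothetical allowlist: one approved institutional network (documentation range).
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_allowed(client_ip: str) -> bool:
    """Return True only if the client address falls inside an approved network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


decisions = {ip: is_allowed(ip) for ip in ("192.0.2.15", "198.51.100.7")}
```

Such a check would complement, not replace, the encrypted connection and smart card authentication mentioned above.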

Health and Human Service Agencies

NCHS

Jennifer Madans, Ph.D., Associate Director for Science, NCHS
Christine Cox, Special Asst., Record Linkage, NCHS

The NCHS Linkage Data Program is used to augment information from major surveys (i.e., by longitudinalizing cross-sectional surveys).   Record linkages are used to link population-based surveys with morbidity and mortality outcomes (see transcript for more information).  In addition to record linkage, NCHS develops user tools, documentation, and methodologic reports; conducts bias analyses; and works to improve record matching algorithms.

General concerns include: confidentiality and protection of data, especially with linked files, which are especially vulnerable.  General files are becoming harder to release through public access so there is a reduction of what is put out as public use.  Reidentification by those who provide data is an added problem.

NCHS Data Center’s On-Site and Remote System      The primary purpose of NCHS is to disseminate collected information as effectively as possible, so making data as accessible as possible is a principal driver.  NCHS has a first generation automated remote system called ANDRE as well as on-site access (see transcript for specifics).  The “real challenge” is how to conduct appropriate disclosure review in a remote environment with fewer controls.  How should NCHS deal with multiple program submissions that are fine as individual submissions but not fine as a group?  The trade-off of a remote system is easier access but less information.  In addition, a remote system is expensive to develop.  Also mentioned was a need for more data dissemination funding for such things as basic science methodologic development on the disclosure review process and the capacity to more fully staff Data Centers.  At on-site Data Centers, NCHS is considering assigning a staff person to each researcher using the Center to ensure satisfaction.  A faster process for getting administrative data would be beneficial.
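The problem of submissions that are acceptable individually but not as a group can be illustrated with a differencing example.  This sketch is not the NCHS disclosure review procedure; the minimum cell size, the helper names, and the records are assumptions chosen for illustration.

```python
MIN_CELL = 5  # hypothetical minimum releasable cell size


def released_count(records, predicate):
    """Return a count, or None when a nonzero count falls below MIN_CELL."""
    n = sum(1 for r in records if predicate(r))
    return n if n == 0 or n >= MIN_CELL else None


def differencing_risk(count_a, count_b):
    """Flag two released counts whose difference isolates a small group."""
    if count_a is None or count_b is None:
        return False
    return 0 < abs(count_a - count_b) < MIN_CELL


# Made-up records: six residents of one county, one of them aged 90.
records = [{"county": "X", "age": a} for a in (34, 41, 47, 52, 58, 90)]

q1 = released_count(records, lambda r: r["county"] == "X")
q2 = released_count(records, lambda r: r["county"] == "X" and r["age"] < 90)
risky = differencing_risk(q1, q2)
```

Each query passes the cell-size rule on its own (counts of 6 and 5), but subtracting one from the other reveals that exactly one resident is aged 90 or older, so a remote system would need to review submissions against one another and not only one at a time.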

Challenges     NCHS is exercising due diligence to protect data.  Challenges include:  human subjects/privacy issues such as obtaining informed consent; institutional requirements; balancing resources; improving data access; and standardization across federal agencies (see PowerPoint presentation).  A more systematized way of getting informed consent from data providers must be developed that meets their needs while satisfying institutional requirements for permission to link. More collaboration between providers and receivers of administrative data is called for.  Adults are reluctant to provide Social Security numbers, which creates identification problems.  Institutional requirements for receiving and providing data are very complex (see transcript for details).  More resources are needed for access to products and technical assistance; disclosure; and “smarter front-ends” to surveys and vital statistics.

What Works     Standardization of data linkage increases efficiency such that the process is faster and less expensive; and may help with user documentation.  The importance of best practices for linking, data handling, and getting extracts and documentation, was noted.  Increased collaboration and communication among agencies must happen (maybe under OMB) in a way that does not violate ways of interpreting authorizing legislation.  More linkage projects are needed.  New disclosure methodologies using remote access systems provide more control over an ability to evaluate risk, which will improve user access.

Discussion     See transcript for discussion on redisclosure (bottom line is: due diligence is to “do no harm” to people who have done their civic duty).  It was noted that NCHS links to the National Death Index (NDI) without Social Security numbers but that this link does not provide as much information.  Most NCHS linkages are to survey data.  Discussion ensued about how owners of administrative files view their responsibilities to the people in the file and about the approval process.  Mr. Localio suggested that more research is needed on why people feel constrained about participating in interviews; and then policies should be designed based upon those reasons, relative to who is given access.  See transcript for the intricacies of NCHS’s attempts to link to the state-administered Food and Nutrition Service; and file creation relative to purported use.

Office of Research Dissemination and Information

David Gibson, CMS, Office of Research Dissemination and Information

 See transcript for original mission and its evolution.

Barriers to using CMS administrative and other data include:  lack of unique identifiers for beneficiaries within Medicare and Medicaid programs across types of data (e.g., health insurance claim number or HIC).  Identifiers change over time, which makes it difficult to follow people longitudinally.  The same challenge holds for providers.  Unique identifiers across programs are lacking.  Separation of billing exists and as such, associated diagnostic and therapeutic care go into separate bill types.  The use of “ruleout” or confirmed diagnosis presents challenges to a cross-sectional analysis when a final diagnosis has not yet been made.  Getting prevalence rates is difficult, which causes problems with incidence rates of particular diagnoses.  Different coding systems are complicating factors.

Other problems include: the lack of crucial clinical information that differentiates personal level critical pathways; lack of information on the cause of disability; and inadequate information from CMS’s Medicare status code (MSC).  There is also a lack of comprehensive person-level data on primary and secondary health insurance coverage as well as missing information about socio-economic status and cause of death.  CMS is currently unable to link Part D drug event data to Part A and Part B benefits data.  Sample size is another consideration: ideally, a larger sample than 5% would work better for many conditions.  There is an inability to disaggregate program payments from beneficiary payments as well as an inability to link specific services on claims with provider costs to data provider cost functions.  There is inconsistent use of the unique physician identification number (UPIN).

Section 723       Created by the Medicare Modernization Act, a law that effected major changes to the Medicare program, Section 723 established a research database for the chronically ill.  A methodology called the Enterprise Cross Reference system was developed that allows for the creation of patient-level profiles; links data in different systems such as Medicare, Medicaid, and assessment data; and pulls the information together into a relational database that produces relevant statistics (see transcript for more information).

Future phases include providing data to researchers; incorporating lessons from Phase I; and providing ongoing improvements to the database (i.e., expand data sources and sample; enhance data access tools; establish consultation and technical support group; and create Pivot Tables [statistical summaries]) (from PowerPoint presentation).

Center for Financing Access and Cross Trends, AHRQ

Steven Cohen, Director, Center for Financing Access and Cross Trends, AHRQ

AHRQ’s Center for Financing, Access and Cost Trends, which implements integrated survey designs, has developed a model that improves the accuracy of survey data; both the integrated designs and the model enhance analytical capacity and data quality (see transcript for specifics and examples; and see PowerPoint for Health Outcomes focus, integrated survey design features, and capacity to reduce bias from survey non-response).  Linkage to the Health Interview Survey (HIS) provides precursor information, which allows for non-response adjustments.  Linkage between the HIS and MEPS surveys allows for estimates of the long-term uninsured in MEPS as well as expenditure estimates.  See transcript for information about the medical provider survey; pharmacy verification survey; and the intersection of predictors of expenditures and non-response.  On administrative data, AHRQ works closely with CMS.  See transcript for information about AHRQ’s establishment survey.  The integrated design optimizes sample designs to minimize variance under fixed cost constraints and serves as an imputation source for editing.  These integrated results help reduce respondent burden, improve sample precision, and support modeling research.  See transcript for information about AHRQ’s Healthcare Cost and Utilization Project (HCUP).

Challenges include:  lack of uniformity of encryption methods; the need for more information relative to quality metrics, nursing staffing data, and Secretary Leavitt’s transparency initiative; development of better and more uniform links across states; and confidentiality.  See transcript for Joining Forces Initiative (which expands upon existing administrative data for consumer choice) and the DEcIDE Research Network (which focuses on developing evidence to inform decisions and effectiveness).  Limitations include greater restrictions in data access for public use; competing demands on host sample frames; more frequent survey contacts, which reduce the overall response rate; and a need for greater coordination across data sources and organizations.

Centers for Medicare & Medicaid Services and National Cancer Institute

Martin L. Brown, Ph.D., Chief, Health Services and Economics Branch, NCI

Gerald Riley, Social Science Research Analyst, CMS

The SEER-Medicare linked database represents a joint effort of NCI and CMS (see transcript for background and specifics).  The SEER and Medicare data complement each other in providing information on a variety of cancer control activities (e.g., SEER provides clinical information at diagnosis while Medicare provides a longitudinal set of data).  See transcript for linkage activities as well as specific applications of SEER-Medicare data and conditions of data access [and examples of associated studies].  SEER is linked to CMS’s Health Outcomes Survey (HOS) and will be linked to the Consumer Assessment of Healthcare Providers and Systems (CAHPS) survey.  Medicare data are linked to the Medicare beneficiary survey, the national long-term care survey, the health and retirement study data, social security administrative records, and several NCHS surveys.  See transcript for description of the Validation Project.

Advantages of SEER-Medicare Data      A link to SEER provides rich clinical information at the date of diagnosis.  A link to Medicare provides longitudinal data on medical services and procedures prior to and subsequent to the date of diagnosis.  Pre-diagnostic co-morbidity information can be extracted.  The five percent Medicare file provides matched controls.

Limitations      The biggest limitation is the time lag between when events occur and when data are available.  In addition, data are limited to people over 64 while cancer diagnoses are not, although a person diagnosed prior to age 65 can get into the Medicare program later on.  Problems exist with HMO enrollees and Part D.  Medicare codes limit the examination of cancer screening through SEER.  SEER registry areas are not totally representative.

Improvements are underway to increase timeliness and efficiency in collaboration with CMS and NIH.  NCI is also creating an HMO parallel to the SEER Medicare data system.

Office of the Assistant Secretary for Planning and Evaluation (OASPE) John Drabek, Office of the Asst. Secretary for Planning and Evaluation, DHHS

See transcript for further discussion of study about national health interview survey data linkage with CMS and Social Security data.  The level of complexity of Social Security and Medicare files presents challenges.  OASPE is supporting (and documenting) the development and use of analytical files for four HIS merged surveys (see transcript for example and other research activities).  OASPE believes that conducting analyses of linked data will help them understand the benefits and limitations of what has been done to date (e.g., of such things as time frames).

Challenges include figuring out how to: summarize and integrate case history with survey data; and how to integrate longitudinal interviews with case history data.

Discussion     Incentives, structures, and encouragement of agencies or programs to provide information were discussed, with examples given.  Questions were raised about how end users can link HIS, SEER, and Medicare data and how barriers to linking agency data with non-governmental data can be overcome.  It was clarified that MEPS is a household-based survey.  The importance of supporting research funds to build a user community was mentioned, as were the difficulties of including the VA in cancer and other health data gathering (although the VA is a site for the CanCORS Project).  Currently, there are no mechanisms for tracking Americans overseas.

Social Security Administration (SSA) 

Howard Iams, Ph.D., Research Advisor

SSA, with links to the Census Bureau, SIPP and CPS, is a strong advocate of using linked survey data.  See transcript for SSA data linkage activity and future directions (e.g., use of synthetic data; public release of administrative data for the SIPP; a broader health and retirement survey consent form as well as a more complete release of the survey’s administrative data).  Other possible future activity includes linkage to: 1040 tax records; the National Compensation Survey; and potentially, Medicare bills.  SSA has many linked records (such as disability insurance information).  Also mentioned were records related to Title 16; the Numident; summary earnings; and the Medicare Part D Extra Help application.  Rules for use (e.g., Census and HRS) were delineated.  SSA obtains valid and reliable information from the longitudinal data that it gathers about what people are paid for SSI and Social Security.

Barriers:  Dr. Iams does not think that there are barriers internal to his organization although he mentioned that SSA’s records, which come from administering a program, were not created for research.  As such, the program must be understood in order to use the records.  Another barrier is public reluctance to provide Social Security numbers to the surveys.  There are challenges to getting other agencies to let SSA use their data in SSA’s matched data records.  Disclosure poses other difficulties despite the use of secure files.  Every effort is being made to get synthetic data to work.

Discussion     The question of whether lawyers understand data and its uses was discussed.  Synthetic data was defined, exemplified, and further discussed (see transcript).  Access to 1040 data was further discussed as were W2 data in earnings histories, which have a joint custodianship relationship between Social Security and the IRS.  The validity of Social Security numbers was raised.  Efforts to track lifetime earnings were discussed.

Tuesday, September 19, 2006

Statistics of Income Division, Internal Revenue Service (IRS)

Tom Petska, Director

The presentation conveys a personal viewpoint rather than that of the IRS or the Department of the Treasury.  See transcript for background information.  The intent is to convey information about IRS tax data administrative statistics as a potential source for shedding light on health outcomes.  IRS studies have shown that matched files can be rich analytically but that linking data files is never easy and data quality is never optimal.  Resolving discrepancies between files is labor-intensive, but the IRS does not want to produce a file based solely on matches.  High quality linking variables that are put into a relational database produce good results.  Gathering accurate data from multinational corporations is a challenge.

See transcript for information about provisional data, tax analysis, statistical use, and the authorization process for access to data.  The IRS mandate is to provide only the federal tax information for authorized purposes to the minimum extent necessary (in contrast to the Census Bureau’s mandate) and as such, there are constraints to using IRS tax data.  See transcript for discussion of the Survey of Consumer Finances and the Federal Reserve’s capacity to receive data.  The complexities of restrictions on access to data and disclosure rules were enumerated, although the IRS tries to service requests for tabulations from existing files.  A question about a matching study to verify how well the CPS measures Medicaid as well as private insurance coverage was discussed relative to regulation amendments and criteria agreements (i.e., Title 13, Chapter 5) while noting that the code that isolates health purpose for deferred compensation was not available before 2005.

Department of Veterans Affairs (VA)

Richard Bjorklund, Director, VA

See transcript for background information.  In response to the recent potential breach of confidentiality, the VA has tightened all policies and procedures for distributing data internally and externally.  The VA’s strategic direction regarding linked data was delineated, to include linked surveys that identify perceptions, interests, preferences, and behaviors, as well as customer satisfaction.

VHA-Medicare-Medicaid Integrated Data Base     Much of the presentation was devoted to a description of a large project that integrates Veterans Health Administration (VHA), Medicare, and Medicaid data in a user-friendly system, thus providing opportunities to improve healthcare outcomes and identify best practices.  This project uses risk-adjusted outcomes models; identifies how veterans make decisions; and identifies fraudulent billing practices.  When timely data are available, physicians may be able to use this information on-line to treat patients.  At the corporate level, opportunities can be identified relative to cost, quality of access, and strategy.

Barriers include:  few people have knowledge about these data sets or the ability to access the data; integration is difficult and time consuming; and investment is needed for hardware and storage media.  In addition, potential users have different needs (one system cannot satisfy the needs of all users); the size of potential demand for these data is unknown; and cultural differences pose challenges.  Privacy and security laws and regulations add a large dimension to data management.  Many decision-makers do not have experience with using data outside of their organizations.  The economics of such a database raises questions about fixed and variable costs; does it make sense to outsource variable costs?  Status to date is that demand is growing slowly but will increase dramatically over time as people learn how to access the data.  See transcript for pilot test of user-friendly system.  Moving forward requires benchmarking as well as internal agreement about what to measure.  The next phase should be part of a planning and policy office.  See PowerPoint presentation for information on technical, cultural, economic, and organizational challenges as well as opportunities.

Discussion ensued about what linked data sets can be used for and what pushes their limits too far relative to reliability and accuracy (see transcript for brief discussion).


Christopher Chapman, Prog. Director, Early Childhood & Household Studies, Nat. Cntr. for Education Statistics

Speaking as a data user (rather than as a representative of the Department of Education or the National Center for Education Statistics), Mr. Chapman focused on his Center’s use of administrative data previously collected through NCHS (see transcript for background information and Center activities).  In some data gathering, certain useful information was not obtained due to lack of cross-linking experience.  In other situations, data sets allowing for cross-linking to administrative record systems were relatively limited. The Center’s primary experience with linking existing administrative record systems has been with survey data systems (especially ECLS-B, which does not use a public use data set).  Nevertheless, the National Center has worked hard to make the data more user-friendly to the public (see transcript for examples).  One way to use the administrative record system is to do small-end studies that crosslink survey data (e.g., self-reporting data) with administrative record data.  Another area of research to consider is linking administrative record data on health statistics with the Center’s school-based administrative records systems.

Moving the Work Forward       A key concern has to do with getting the correct identifiers into surveys.  Help is needed to interpret data from health databases.  A better understanding of Medicaid data sets is needed.  See PowerPoint presentation for information on the potential advantage of linking NCES data to administrative health data.

Discussion     There was discussion about the capacity to link data about use of certain drugs and medications to school districts.  Another question was posed about the ease of collaborating with HHS, NCHS, and other agencies on model design for longitudinal studies.  In addition to coordination problems, there are response-burden problems.  Even with good collaboration, there are limits to the types of data that can be collected.  A third question had to do with receipt of school lunch and breakfast and obesity data.

Maximizing Benefits from Linked Data:  Access for Research and Related Issues

Office of Management and Budget (OMB)  Brian Harris-Kojetin, Ph.D., OMB

Laws that impact data sharing were mentioned (e.g., Title 13).   The National Directory of New Hires allows agencies other than SSA to access data for specific purposes.  Broader laws that apply across government agencies include the Privacy Act and the Paperwork Reduction Act.

The thrust of the presentation was about the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA) [see transcript and PowerPoint presentation for description].  In short, CIPSEA provides a ground-level foundation of uniform protection for information gathered for exclusively statistical purposes under a pledge of confidentiality.

Benefits of CIPSEA include: uniform protection across agencies; coverage of all data collected for statistical purposes under a pledge of confidentiality; a strong penalty for disclosure; and exemption from FOIA requests.  A Treasury companion bill (which has not been passed) is important to data sharing provisions within CIPSEA.

Key distinctions in CIPSEA were delineated as were requirements that this act imposes upon agencies.  Only business data are covered (between Census, BLS, and BEA) for data sharing.  CIPSEA offers new protections to agencies without strong legislative protection but it does not restrict or diminish existing protections.  Policies and procedures for access and control are required for use of confidential statistical information by federal agencies.  The responsibility of researchers must be taken into account.

Discussion     The discussion centered on: a possible conflict between CIPSEA and the statute that NCHS uses; whether the CIPSEA statutes work in practice; an evaluation component for CIPSEA; and the lack of tax data in CIPSEA (see transcript for specifics).  Further discussion focused on whether OMB has examined the implication for users and whether new statutory language is needed for this kind of data sharing.  Geography undermines almost any public-use file.  Dr. Iams recommended that administrative data be put onto a national file with no geography (those wanting geography should go to a Research Data Center).

Center for Economic and Policy Research   Heather Boushey, Ph.D., Economist

See transcript for background information.  The Center is concerned about timeliness of and access to data (public use files) as well as security and privacy issues.  Timeliness and access to data have transformed the ability to engage in policy debates at national and state levels.  A current project about effective coverage of benefit programs in ten states exemplifies the usefulness of matched data, which, if available, would significantly impact policy implications.

Concerns include questions about accuracy of administrative data and what the gold standard is.  Dr. Boushey recommends the use of three kinds of data (administrative; survey; and qualitative).

Memorial Sloan Kettering Cancer Center   Deborah Schrag, M.D.

In this overview, Dr. Schrag spoke as a data user rather than as a representative of Memorial Sloan Kettering Cancer Center.  Serving as a representative of academic health services researchers, her work involves a study of relationships between need, demand, supply, delivery, and outcomes of healthcare with emphasis on disparities, access and barriers, technology dissemination, quality measurement, and efficiency of healthcare delivery, including cost issues.  See transcript for more information about the research focus, which examines the implementation gap (difference between clinical efficacy and effectiveness).  An underlying theme is that data are layered (starting with source populations [US citizens with IRS data] but also including providers).  Academic researchers try to link federal data at the bottom of the pyramid with detailed data about providers and facilities at the pyramid’s top (see transcript for specifics and examples).  It is helpful to consider the domains data sets belong to as well as where they lie in the spectrum of pure population-based data.  Nomenclature is needed to define data boundaries.  Dr. Schrag stated that 90% of her time is spent trying to get data, manage permissions, etc. while analyzing data is much less time consuming (see transcript for examples).

Recommendations include: development of a standard taxonomy that delineates the analysis of population, quasi-population, and non-population data so that end users understand underlying healthcare delivery patterns.  Such a taxonomy would help frame development of coherent research policies and rules.  Availability of protected federal agency data sets (i.e., area-level rather than individual patient-level data) is important to the research community.

A wish list includes: help from federal agencies with the research community’s ability to more effectively tap into Medicaid enrollment and claims data.  The ability to access consistent definitions in Medicaid enrollment files is desirable, as is linkage of UPINs on claims data to files that describe physician characteristics.  Also on the list are Part D data from pharmacy claims and access to choropleth maps by various geographic units [useful for common data elements, Census, and survey data results].  See transcript and PowerPoint presentation for definitions and examples of wish list items.

Challenges to collecting data sets include:  the limits of denominator file structure, which preclude an easy identification of cohorts of the chronically poor due to retroactive enrollment, chronic versus episodic poverty, spend-downs (when illness precipitates enrollment), variation in state thresholds in generosity, and definition of an HMO or managed care.  A coordinated federal approach would help with these challenges.

Priorities include: coordination of procedures for obtaining access to data and review processes with various stages; standardization and harmonization of reporting rules; the development of categorization schema for types of linkages; a central clearinghouse with a broad mandate (see transcript for example); facilitation of federation of state data; choropleth maps for use in commons-based systems; and work with states to facilitate analyses of Medicaid enrollment and claims files.  The importance of establishing barriers for researchers that clarify why they need certain data and what they do with them was stressed.

Discussion     UPINs were identified as a way to get cost information when looking at provider data.  Separate survey data are needed for physician income.  The biggest factor to fix is the fact that detailed provider information is not held within government agencies.  There is a need to improve the comprehensiveness and quality of current data.  The states have data that are not accessible or necessarily uniform.

A Broader Perspective on the Role of Linkages

History of Record Linkage   Fritz Scheuren, Ph.D., VP for Statistics, NORC and 2005 President of ASA

See transcript for more information. To optimize systems, survey errors can be prevented by using a detection system.  One of the best ways to fix errors is to replace them with data from a better source.  As such, linkage is an important quality improvement step.  The trade-off between response variance and response bias was noted.  Because a conflict of principles between administrative and statistical agencies exists about privacy and confidentiality, Dr. Scheuren suggested that statistical agencies look at the point of intervention where they receive data and determine whether they can receive data at a different point to aid administrative agencies.  He believes that every agency that does linkage should have an IRB (as does NCHS).  Linkages should be thought of as learning systems.  Diagnostics should be developed for linkage.  He suggested swapping staff to expedite the learning process.

Survey versus Administrative Data  Michael Davern, Asst. Professor, University of Minnesota

Survey data are collected primarily for research and administrative data are collected to administer programs but the goal is to combine these two avenues to produce good health outcomes research.   Linking survey and administrative data holds great potential for health research.  In contrast to administrative data, survey data (which are in the public domain) have identified strengths and limitations.  Documentation (including metadata and paradata) is needed as is research on linked files data elements and this research should be made available in the public domain (e.g., how data are collected; how they get into the data file; where they come from; how the information is produced).

Survey data limitations include concerns about errors in sample frame coverage, sampling error and variance estimation; non-response error (item and unit), measurement error; data processing, imputation/editing; and need for better documentation of metadata.  How are administrative data like or unlike survey data?  Sample frame and frame coverage are not problematic but care must be taken with such things as contact information (see transcript for example).  Sampling error does not pose great problems but missing data or non-response error (i.e. more systematically missing such as in race or ethnicity) are more significant.  When linking keys are systematically missing, they can be a large source of sample loss and bias for merged data.  Administrative data can be collected through many modes and the source of information matters.  Administrative forms are not as user-friendly as surveys and people have different motivations for filling out administrative data forms.  Research is needed into mode effects and longitudinal panel conditioning.  Administrative data may not be consistent and the quality will vary greatly from centralized data collections like social security to state-based programs like Medicaid.  It is essential to understand the universe issues and measurement error on the administrative and survey data sides – and how they differ (see transcript for more information).  Administrative data can be misclassified or have systematically missing linking variables.  Timeliness of producing linked data files presents an ongoing challenge.  Because the ability to conduct research on the quality of administrative data is limited, it is important for agencies entrusted with those data to look at data quality for research (in addition to administrative) purposes (see transcript for examples).  Research into sample loss in linked data files is key.
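The point about systematically missing linking keys can be made concrete with a simple diagnostic.  The following is a minimal sketch rather than a method presented at the workshop; the function `link_rates`, the field names, and the sample rows are all hypothetical.  It compares the share of survey records that link to an administrative file across subgroups, since a large gap between groups signals that sample loss is not random and may bias the merged file.

```python
from collections import Counter


def link_rates(survey_rows, admin_ids, group_field):
    """Return {group: share of survey rows that link to an administrative record}.

    A row links only if it carries a linking key and that key appears in admin_ids.
    """
    total = Counter()
    linked = Counter()
    for row in survey_rows:
        group = row[group_field]
        total[group] += 1
        key = row.get("link_key")
        if key is not None and key in admin_ids:
            linked[group] += 1
    return {group: linked[group] / total[group] for group in total}


# Hypothetical survey rows; None marks a respondent who gave no linking key.
survey = [
    {"link_key": "A1", "group": "X"},
    {"link_key": "A2", "group": "X"},
    {"link_key": "A3", "group": "X"},
    {"link_key": None, "group": "X"},
    {"link_key": "B1", "group": "Y"},
    {"link_key": None, "group": "Y"},
    {"link_key": None, "group": "Y"},
    {"link_key": "B9", "group": "Y"},  # key given but absent from the admin file
]
admin = {"A1", "A2", "A3", "B1"}

rates = link_rates(survey, admin, "group")
```

In this invented example group X links at 0.75 and group Y at 0.25, the kind of gap that would warrant reweighting or further study before the linked file is used.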

Potential for Linked Data includes: improving accuracy of survey data collection of enrollment data (Medicaid, SSI), survey sample frames (Census, MAF), administrative data on race and ethnicity, and health policy simulation; creating small area estimates; and benefits to using information in imputation models and editing.

Discussion  Improved research also benefits administrative purposes (see transcript for examples).  The importance of metadata and paradata was noted.  It was recommended that people not consider linked data products to be good until an evaluation has been completed and published.  Administrative data used to calculate benefits are reliable but at risk if used for other purposes.  A discussion about the reliability of payment data ensued.

Recommendations for incentives to change current structures include: 1) a small fund for data analysis at NCHS from NSF or other outside groups; 2) an increase of earmarked staff to do analysis; and 3) widespread use of documentation software.  One way to provide incentives is to give back useful data.  Collaborative work with states can produce higher quality administrative data.  Efficiency can be improved by eliminating weak data vehicles and strengthening effective ones (see Food Stamp Quality Control data example).  An agency’s return on investment (ROI) should be considered as well as its statistical environment.  Agencies trying to incorporate more methodology into their programs benefit from learning about what has been useful to users and other professionals.

A discussion about death certificates followed.  In most states, death certificates have proximate, contributing, and underlying cause.  Cause-specific attribution is one of the most misused statistics but little guidance about metadata files and training exists (training does exist at the state level but not at medical schools or at the medical provider level).  Interagency funding for data sharing is hard to come by.  A question was raised about whether agencies do enough self-criticism (see transcript for examples).  Could there be an interagency working group to look at data availability and linkages at the working-person level?  It was suggested that such groups (like the Populations Subcommittee) examine record linkage problems where the incentives are (see transcript for more information).  A question was again raised about more integrated ways to work with states and multiple agencies.  Competing agendas between state and federal incentives pose a challenge and must be bridged.

Subcommittee Wrap-Up

Moving Forward  The tradeoff between data quality and privacy versus the social benefit that comes with moving forward was discussed and a question was raised about whether the decision makers are weighing both.  Should someone serve as arbitrator to determine the appropriate tradeoff?  Permitting a greater exchange of linked survey data across agencies with less pain, suffering, or prohibition will hopefully produce better data for policy analysis and evaluation.  For statistical purposes, data linkage should occur in a safe, secure environment that does not violate confidentiality and that permits exchange of government data or government-linked survey data with other agencies.  This kind of exchange cannot happen without legislative changes (e.g., the Title 13 constraint on Census data linkage is very limiting).  Legislation is needed that allows data tied to one agency to be shared for statistical purposes with others (with protections).  The timely exchange of linked data leads to wiser policy analysis and decisions.  ASPE, a policy office, would also like access to the data.  Mr. Prevost will suggest the use of standardized agreements and procedures for data sharing to OMB.

Portfolio Approach, Data Criteria and Timely Access  Public use versus confidentiality was again raised.  Challenges to data linkage accompany expanded access – or a perception of access – which is why a portfolio idea (with its standardized agreements and various “hoops to jump through”) is under consideration.   In the meantime, work must continue to develop primary criteria of data access as well as timely data access in and outside of the federal government.

Data Centers  It was again suggested that Data Centers be adequately staffed to ensure that work can be accomplished within mandated contract timeframes.  Because two months is too long for data turnaround, a negotiated general agreement allowing for certain classes of studies (rather than particular projects) was recommended.

CMS’s ResDAC Project as a Model   The ResDAC Project helps new users of Medicare data as well as SEER-Medicare data users.  The Project might increase demand for data linked to those data sets (see ResDAC Website at the University of Minnesota).

The workshop was adjourned at 3:35 p.m. on September 19, 2006.

See official transcript for full Subcommittee discussion.


Donald M. Steinwachs, Ph.D.
Chair, Subcommittee on Populations
National Committee on Vital and Health Statistics