[This Transcript is Unedited]
DEPARTMENT OF HEALTH AND HUMAN SERVICES
NATIONAL COMMITTEE ON VITAL AND HEALTH STATISTICS
SUBCOMMITTEE ON POPULATIONS
WORKSHOP ON DATA LINKAGES TO IMPROVE HEALTH OUTCOMES
September 18, 2006
999 9th Street, NW
CASET Associates, Ltd.
10201 Lee Highway
Fairfax, Virginia 22030
List of Participants:
- Donald M. Steinwachs, Ph.D., Chair
- A. Russell Localio, Esq.
- William J. Scanlon, Ph.D.
- C. Eugene Steuerle, Ph.D.
- Nancy Breen, Ph.D.
- James Scanlon, Ph.D.
- Joan Turek
- Ed Sondik
TABLE OF CONTENTS
- Call to Order, Welcome and Introductions
- The Importance of Data Linkages
- Census Bureau
- Using Linked Micro Data – Julia Lane
- Health and Human Services Agencies
- Jennifer Madans, Associate Director for Science, National Center for Health Statistics
- Christine Cox
- David Gibson
- Steven Cohen
- Gerald Riley, Social Science Research Analyst, Center for Medicare and Medicaid Services and Martin L. Brown, Chief, Health Services and Economics Branch, National Cancer Institute
- John Drabek, Economist, Office of the Assistant Secretary for Planning and Evaluation
- Social Security Administration
- Howard Iams, Research Adviser
P R O C E E D I N G S (9:10 a.m.)
DR. STEINWACHS: I am Don Steinwachs. I have the pleasure of chairing the
Populations Subcommittee of the National Committee on Vital and Health
Statistics. We welcome you today to a two-day workshop on data linkages to
improve health outcomes.
I think everyone has a copy of the agenda that spans two days. After saying
a couple of words about NCVHS and what the Populations Subcommittee does, I
will be turning to other members of the committee to talk about the motivation
and the direction that this program has been shaped to try and answer some key
questions for us about best practices and successes in data linkages, and
identifying where there are barriers, and overcoming those barriers to improve
data linkages, to provide information necessary for improving the health of the population.
As many of you may know, the National Committee on Vital and Health
Statistics is an advisory body to the Secretary of Health and Human Services on
health information and different data policy. The Populations Subcommittee has
a particular focus within that, and that focus is on population health
measurement. In looking at population health, we tend to emphasize the
distribution of health characteristics in the population, trying to identify
disparities in subpopulations.
The program today is shaped in many ways trying to look at those sources of
data that provide potentially important information on risk factors for health,
socioeconomic status, education, income and other factors, as well as
potentially open doors to understanding the health of subpopulations, because
the data are more comprehensive and may exist in national surveys and
interviews that are done.
There is a picture that all of us have on the committee for what health
statistics might be. About five years ago, a report came out of the
subcommittee before any of us who are here today were members, but Marjorie was
there, and some others, so they can talk about it authoritatively.
It is a report on the vision for health statistics in the 21st century.
That vision talked of bringing together the variety of information that is
needed to look at health risks and health outcomes, to look at disparities in
health and the provision of health services, a very ambitious agenda, one I
don’t think that we can attain without exploiting fully the capacity for data
linkages for existing data, as well as trying to fill the data gaps.
Another piece of work this Populations Subcommittee has taken on is trying
to look at health in minority populations. We found very clearly that in many
minorities, we have insufficient data to be able to say something about the
health status and disparities in health, whether you are talking about Native
Hawaiians, Pacific Islanders, you are talking about the tribes in the American
Indians, and other groups. So when we take on the Secretary’s goal of reducing
disparities in health and health services, we find in many areas we just don’t
have the information. Again, there is hope that possibly through linking data
sets and identifying areas for strengthening data collection, we can improve the data available.
Before turning to the introductions specifically for the two-day program, I
would like to go around and have members of the committee introduce themselves,
so you know who is on the Populations Subcommittee and on the staff.
I am Don Steinwachs. I am at Johns Hopkins University, Bloomberg School of
Public Health and Chair of the subcommittee.
(Whereupon, introductions were performed.)
DR. STEINWACHS: I should also let you know that this two-day workshop is
being broadcast on the Internet, so I ask for you to speak into the microphone.
I know that some of the people who are attending around the table do not have a
microphone. In that case, if there is not a microphone easily accessible to the
individual to ask a question, I would ask that the person who is receiving the
question restate the question. Oh, we have a portable mike. I should have never
doubted that technology would be here.
Why don’t we take advantage and also ask the audience here to introduce
themselves for the benefit of people on the Internet.
(Whereupon, introductions were performed.)
DR. STEINWACHS: I would like to turn now to Gene Steuerle and Nancy Breen,
who along with Joan Turek shaped the agenda, and many of you got to talk to as
the agenda came into place for the workshop and have been trying to think
through these issues.
Gene, I don’t know whether you are going to start or Nancy, to make some
comments about the specific expectations for the workshop.
DR. STEUERLE: I would like to also extend my thanks to all of you for
coming. I realize that any activity like this is always time consuming. I can
see people’s brains now wandering back to what they have got sitting on their
desks that is not being done right now, so I really appreciate that.
I know we have got a couple of speakers that are still in the audience.
This is really meant to be a roundtable. The configuration of the room is what
it is, but please come up to the table. We are not just asking the speakers
around this time, but anybody who is a speaker in a future session, please come
up, because among the things we want to have is a dialogue. We want someone
from say Census to turn to somebody from say IRS and say, if we could only have
merged these data sets, we could have achieved the following. Or, we have a
problem just like that with respect to privacy, or something like that. So we
really do want it to be a dialogue.
I want to be clear that the National Committee on Vital and Health
Statistics, of which I am a member and Don is the Chair, our role is mainly
advisory. The main person we advise, although it doesn’t have to be the only
person, but the main person we advise is the Secretary of HHS. That indeed is
one of our final products as we write these letters to the Secretary and say
the following things should be done.
The purpose of this meeting is not just to talk about what data linkages
are possible or are being done, but we really want the data advisory group to
say, here are things that maybe aren’t being done, or here are things that with
a little bit of additional resources could be done, or here are some real
constraints blocking the following linkage.
I ask all of you to think about it in the following sense. Our ultimate
goal of course is to improve the health of the U.S. population. We are not
linking data sets or we are not raising the issue because we are researchers
and we like bigger data sets, which maybe we do, but hopefully we are doing it
because we have a very specific reason.
I ask you to think about that also in terms of the power of the advice. If
the advice says here are some data sets that are linked, it says one thing. If
we could link these data sets, it is likely we could understand better the
following. That would lead to an improvement in health for the U.S. population.
It carries a lot more weight.
I realize that a number of you, particularly from the agencies, are
resource constrained. That is one of the issues you will probably raise. You
will say, we could do the following were it not for additional resources. There
is probably a certain extent to which politically you are constrained from
saying my boss or someone else in the Administration or Congress was too dumb
to realize I needed more resources, but you can say here are the benefits if we
could do the following, and we can let other people worry about whether those
costs can be met.
We already know and anticipate that two of the biggest constraints that
people will raise once they come up with ideas are with respect to resources,
but also privacy and confidentiality. If you get to that issue, we would like
you to help us also tease out whether there are alternative ways around the
problem. Maybe there are ideal ways we want to share data sets, which I think
is crucial to the extent we can, but even if we can’t, to what extent are we
able to bring in outsiders into agencies, to what extent are we blocked.
I know in dealing with groups that I have advised, at times they are
blocked from bringing people inside the agencies for a lot of reasons that have
little to do with the law or anything else, but just — I don’t want to say
narrow, but almost Catch-22 types of rules. We would like to identify those. So
really, to ask you again to think about the ultimate goal; it is to improve the
health of the population and to help us sort out what could be done and what could not.
In all the sessions we have tried to leave a fair amount of time for
discussion and dialogue. So again, I would like to ask all of you who are not
the formal speakers to jump in. People back here who are too modest to come up
to the table, we really want your advice and input.
I think what we find as we go through is that all of us probably know —
because we are talking about information and data, something in which we can
have an infinite demand, if we could have an infinite supply barring all costs.
So a lot of us know an example here and an example there. The hope of this
hearing is, we can go beyond an example we might gather one by one and put them
together to be able to weave from the examples a broader story. So that is
again the reason we encourage and hope that you will participate. I myself, and
I know I speak for other members of the committee, are deeply appreciative of
the time you have given to us.
DR. BREEN: I would also like to welcome everybody, and thank you for taking
the time to come here to this workshop on data linkages.
We feel it is really important. Gene and I didn’t really know each other
until we both came up with this idea that we should start to investigate what
was going on in the federal government more systematically on data linkages,
and try to determine what are the best practices, and also see if we could do
more along those lines. The examples that both of us knew for data linkages
were so fruitful and were bearing such great research, that it seemed like it
was something that we should try to move forward more systematically. So
once again, I am delighted that people have taken the time to come.
In the course of developing this, I want to take a minute to thank Joan
Turek, who is sitting next to me. As we were talking about this at the
committee, Gene and I had some ideas, and we had a list, but we didn’t know
everybody in the federal government that was doing this. Joan, it turned out, did.
DR. STEUERLE: And does.
DR. BREEN: So she filled the gaps for us. I would really like to thank Joan
for what she has given to this. Thank you, Joan.
MS. TUREK: You see what happens when you work for Uncle Sam for 37 years.
DR. BREEN: And so productively, thank you. One of the things that she did
was to make us recognize that the Census Bureau is doing a whole lot in this
area, which I had no idea. I worked at the Census Bureau years ago, and they
were collecting survey data and they were collecting the census. They were
collecting a lot of different data, but as far as I knew, they weren’t doing
much in administrative records.
At the National Cancer Institute where I work, we started working with them
a few years ago on the National Longitudinal Mortality Study, so I knew that
they were starting to work with administrative records, but Joan made us
realize that there is a whole group of people who were systematically working
through how they could use survey and administrative data to improve not only
the quantity but, even more important perhaps, the quality of the data that we have.
In conversations — can I take this as a transition to start to introduce
the first speakers?
DR. STEINWACHS: Please do.
DR. BREEN: Okay. I guess I will start by introducing the speakers and then
say a little bit about our conversations and what I think they are going to be
talking about, just to bring out some of the high points, and then they will
provide more detail on what they are actually doing.
I would like to introduce Sally Obenski. I think you have been reorganized
recently. The title that she had about a month ago, and also that Ron Prevost
who is sitting next to her, had about a month ago has been changed. This is
such a timely and hot topic that Census has reorganized so that there is now a
division which is the Data Integration Division.
David Johnson, are you chief of that division?
MR. JOHNSON: No, no.
DR. BREEN: Would you like to be promoted to chief of that division today?
You are in that division?
MR. JOHNSON: No, I am the director in my division, the new division, the
DR. BREEN: So you are here. One thing I might want to mention because
people may be interested in this, it wasn’t the impetus for this, but an issue
that is related to access and a number of researchers have been concerned about
was the defunding of SIPP, the Survey of Income and Program Participation, a
really important panel survey that the Census Bureau was doing, which gave us
right up-to-date information on employment, on who was in and out of the
labor force. We learned a lot about patterns and long term trends that we could
never figure out from any of the other data sets that we had.
David Johnson — I hope I have this part right, David — is the program
manager on the dynamics of economic well-being system, the DEWS, which is going
to be engineered in order to provide the information that SIPP was providing to
us. So he is here and available to answer questions related to that, so I want
to thank him for that.
Then the third speaker, the formal speaker from the Census here today is
Gerry Gates, who is the Chief Privacy Officer. He is here at the end of the
Of course, we want to know here today how we can improve data systems by
improving data linkages, and how we can improve interagency collaboration to do
that. As you will see, Census is working with all kinds of agencies within the
federal government in order to try to do that, and quite effectively.
Another question that Gene brought up is, how can we ensure adequate
access. This is a big issue. As Sally said, we don’t have all the answers, but
we are very committed to working with stakeholders to figure out how we can
best do this. So it is really important that people around this table, at home
on the Internet, and subsequently, work with Census in order to try to understand
how we can best get access, because that is a big question that is remaining.
This information is confidential. We don’t want to violate anybody’s
confidentiality, of course. They won’t give us additional data if we do. But we
do want the information to be available to people, because otherwise it is not
helpful in improving population health and reducing and eliminating health disparities.
Sally is going to give us an overview. Let me just give you Sally’s new
title. Both of them are now in the Data Integration Division, and they are both
assistant division chiefs. Sally is the assistant chief for administrative
records applications and Ron is the assistant division chief for data
management. Sally will provide an overview as I said, and then Ron will talk
about — he has done some interesting research, in which he has found by
looking at survey data and administrative data there is a big mismatch. So
coming from that point, how can we use this finding to improve the data sets
through the linkage mechanism. It is a real quality improvement question, which
when Gene and I started thinking about this wasn’t really where we were
thinking it would go. We were thinking along the lines of just getting more data.
So I want, without further ado, to introduce the speakers and let them talk
about these issues. We will let all three of the speakers speak, Sally, Ron and
Gerry, and then we will have some time for questions after that. So I want to
turn the mike over to Sally. Thank you.
Agenda Item: Census Bureau
MS. OBENSKI: Thank you. As Nancy mentioned, what I would like to do is
provide an overview of the technical infrastructure that is enabling the
expanded use of administrative records and record linkage at the Census Bureau.
I also will provide a 30,000-foot view of projects that are using linked
data sets. We are also going to talk a little bit about operational and
technical constraints, whereas the policy constraints are going to be discussed
both by Ron and Gerry.
As many of you know and others may not, Title 13, which is our guiding
mandate, states that we are to use administrative records as extensively as
possible. Also, our strategic plan calls for the use of administrative records
both to reduce reporting burden and to minimize costs, and also to come up with
innovative data sources. We are also governed under a plethora of legal
guidance and protections that include Title 13 and Title 26, Privacy Act, and
so on and so forth.
Whenever any of us discuss using administrative records, we always make
sure that the audience is aware that we have commitments to our data providers,
to our data users, and to the public. In order to use other parties’ data we
take it very seriously. We have a stringent infrastructure known as the data
stewardship program that Gerry is going to talk about, in which the use of
administrative records nests. We ensure that there is a consistent application
of policies, and we have numerous administrative controls, including that before a
project is approved it undergoes quite a bit of scrutiny. We have checklists.
We have to make sure that we are protecting the privacy and always the
confidentiality of the respondents.
Another point that is important, which I will make again later, is that although we
acquire data sets that have personally identifiable information on them, those
identifiers are stripped from the data sets before they are used,
and they are replaced with a protected identification key which is an anonymous
key that allows us to do record linkage while maintaining privacy.
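The general idea of stripping identifiers and substituting an anonymous linkage key can be sketched in a few lines of Python. This is a minimal illustration, not the Census Bureau's actual method; the keyed-hash approach, the field names, and the key length are all assumptions made for the sketch:

```python
import hashlib
import hmac

# Hypothetical secret held apart from analysts; real key management is far stricter.
SECRET = b"held-separately-from-analysts"

def assign_pik(record):
    """Replace identifying fields with an anonymous protected identification key."""
    # A keyed hash of the identifying fields yields the same key for the same
    # person across files, so records can be linked without exposing identity.
    identity = "|".join((record["ssn"], record["name"], record["dob"]))
    pik = hmac.new(SECRET, identity.encode(), hashlib.sha256).hexdigest()[:16]
    # Drop the identifiers; keep the key and the analytic content.
    stripped = {k: v for k, v in record.items() if k not in ("ssn", "name", "dob")}
    return {"pik": pik, **stripped}

raw = {"ssn": "123-45-6789", "name": "DOE JOHN", "dob": "1970-01-01", "county": "059"}
anon = assign_pik(raw)
```

Downstream analysts would see only a record like `anon`, which carries the linkage key but none of the identifying fields.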
What this gives you is a snapshot of the program evolution, the
administrative records program evolution. The Census Bureau has used
administrative records since the 1940 census, where we used them to identify the
first differential undercount. However, in terms of a formal program, it
really came in the aftermath of the 1990 census, in which we showed a severe
and serious differential undercount. Out of that came 17 program designs, of
which one was, could we supplant direct enumeration using administrative data.
So a small group of people began researching. There was a privacy
conference that was held in ’93. There were numerous surveys to look at the
feasibility of such an undertaking. About the mid-90s we began developing a
prototype of what is known as a statistical administrative record system, or STARS.
Now I would like to give you some of what I consider to be the enabling
technology and methods that have allowed us to expand the use of administrative
records. Early on in the program, it became obvious that in order to do
anything seriously with administrative data, we would have to have a database
made up of national files.
So STARS is such a database. It was first prototyped in 1999 to be tested
in the Census 2000 evaluation experimentation program. STARS is comprised of
seven national files, including IRS files, HUD, Indian Health Service,
Medicare and Selective Service. As you see in terms of the record count, it is
quite a large file. The way that STARS was designed was to conform to the short
form census, in that it has short form demographic characteristics: age, sex,
race, Hispanic origin. Its address part conforms to our master address file. So
this database was created as a prototype in 1999 and tested in 2000.
It was tested in what we call the administrative record experiment or AREX.
What came out of this experiment even surprised the developers of STARS. It was
not so much that it could ever supplant a census, but it validated the
conformance of STARS to census, in the fact that we had demonstrated that we
had captured 85 percent of the census addresses and 95 percent of the persons.
So STARS has been recreated every year ever since, and improvements continue to
be made on it.
Let me talk a little bit and digress a tiny bit about the Numident. The
Numident is an incredibly important file that we acquired in the late ’90s from
the Social Security Administration, and it is the transaction file that has
every SSN that has ever been requested. This is about 803 million records.
What we have done is to collapse this file into the best unique
records. We use it for two things. Initially, the file did provide our
demographic data for STARS. Secondly and incredibly importantly is that it
provides our verification and validation system.
Before I talk about the validation system I do want to swerve back to the
demographic data. As I said, initially we used in STARS demographic data from
the Numident. The problem with the race data in particular was that it only
captured white, black and other, and had no ethnicity. Furthermore, race data
on children is no longer being captured on social security records when
children are born.
So to build on this, we in fact built on work that was done by Barry Biden
in the late 1990s, in which he linked the current population survey to the
Numident to start building a race model. We went the next step. We linked the
Census 2000 to the Numident. Where we had a match we brought over race and
Hispanicity, and where we didn’t have a match we modeled the differences. What
this has allowed us now is a hybrid system in which we have Census
2000 race and Hispanicity and we have the very, very excellent high quality
Numident age and sex.
So the result of this fixed a very substantial weakness in STARS, which was
the race data, and it is currently being used by a number of Census Bureau
programs, including the intercensal estimates.
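The hybrid assignment she describes, taking age and sex from the Numident, carrying over Census 2000 race and Hispanic origin where a record matched, and falling back to a model where it did not, can be sketched as follows. The function and field names are invented for illustration, and the real modeling is far more elaborate:

```python
def hybrid_demographics(numident_rec, census_match, race_model):
    """Build demographic fields from the best available source for each item."""
    # Age and sex come from the high-quality Numident data in every case.
    out = {"age": numident_rec["age"], "sex": numident_rec["sex"]}
    if census_match is not None:
        # Matched to Census 2000: carry race and Hispanic origin over directly.
        out["race"] = census_match["race"]
        out["hispanic"] = census_match["hispanic"]
    else:
        # No match: impute race and Hispanic origin from a fitted model.
        out["race"], out["hispanic"] = race_model(numident_rec)
    return out

def toy_model(rec):
    # Stand-in for a fitted imputation model.
    return ("White", False)

matched = hybrid_demographics({"age": 34, "sex": "F"},
                              {"race": "Black", "hispanic": False}, toy_model)
unmatched = hybrid_demographics({"age": 60, "sex": "M"}, None, toy_model)
```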
Next is the person identification validation system that we call the PVS. The Social
Security Administration, with whom we have worked very closely over the years in
a number of venues, requires that any file that is going to be linked to their
data must be validated. Originally we worked with SSA on their validation
system to do such activities. Over the last few years we have been working with
them in order to develop our own completely automated system, which has been
approved by them. That is what this PVS is.
The importance of this system can’t be overstated, because it is in fact
our record linkage infrastructure, if you will. We use the Numident, the SSA
file as our reference file. When we have a file come in the door, regardless of
whether it is going to be linked to SSA data or not, we run it as part of a
huge quality control check against their reference file. We link addresses, we
match on name, address and date of birth. We search within the address, and
then if we don’t find the person in the address, we go ahead and search by
name. Then we append the record with a unique protected identification key. It
is this PIK that is used by the Census Bureau record linkers in order to do
their work throughout the Census Bureau. There is no identifiable information
passed to anyone.
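The search sequence just described, looking for the person within the reported address first, then falling back to a name search, and attaching a PIK only on a validated match, can be illustrated with a toy matcher. The production PVS uses far more sophisticated comparison and blocking rules; everything here is a simplified assumption:

```python
def pvs_match(incoming, reference):
    """Toy two-pass validation against a reference file, returning a PIK or None."""
    # Pass 1: search within the reported address for a name and date-of-birth match.
    for ref in reference:
        if (ref["address"] == incoming["address"]
                and ref["name"] == incoming["name"]
                and ref["dob"] == incoming["dob"]):
            return ref["pik"]
    # Pass 2: the person was not found at that address, so search by name and
    # date of birth anywhere in the reference file.
    for ref in reference:
        if ref["name"] == incoming["name"] and ref["dob"] == incoming["dob"]:
            return ref["pik"]
    # No validated match: the record gets no PIK and is excluded from linkage.
    return None

reference_file = [
    {"pik": "P001", "name": "DOE JANE", "dob": "1980-05-02", "address": "10 MAIN ST"},
    {"pik": "P002", "name": "ROE RICHARD", "dob": "1975-09-30", "address": "22 ELM AVE"},
]
```

A record that matches at its address validates in pass 1; a mover validates in the name pass; a record that matches neither gets no PIK and, as noted below, is dropped from linkage work.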
Finally, the other major enabler was the implementation of the American
Community Survey. As you all know, the ACS provides a large timely sample of
essentially decennial long form data. It is essential in order to start getting
data at smaller levels of geography.
What we have been working on with some of our researchers is the idea of
using the ACS to refine the model. We have gotten the rules from deeper surveys
such as the SIPP or the CPS, but then in order to push them down to lower levels
of geography we use the ACS.
Now I am going to talk very, very quickly and in general about the work that we
do in processing our files and anonymizing them. These files are used across the Census
Bureau by numerous important programs, including the intercensal estimates and
the small area income and poverty estimates. I would like to remind folks that
these systems are administrative records based and are responsible for the
allocation of billions of dollars of federal funds. Also, as some of you are
familiar with the national longitudinal mortality study, we support them with
our processing, and also the LEHD program.
Now I would like to talk a little bit about some research that we are
involved with that is looking at uses in the decennial census. We have three
major programs underway that are being evaluated in the 2006 census test. One
of them is to see if we can assist, not replace but assist, the hot deck
imputation method by using administrative records to assign age, race, sex,
Hispanic origin when we can match a record.
What this does is, it reduces the work load that falls to the hot deck,
which improves its standard error. This looked very promising, and we are
checking to see if it is operationally feasible in a production environment as well.
The second use was to use administrative records to identify households
with coverage problems. What this is, is a systemic problem where a number of
misses, omissions, come at the within-household level. In a given household we tend
to miss people. This is for a whole lot of reasons.
So to ameliorate this, we send out a major field operation, which is the
coverage followup operation, and it is very expensive. We have a research
project underway in which our STARS system was remodeled. We used some modeling
to develop probabilities that certain types of housing units in certain types
of areas would be under covered. This is also currently being evaluated.
Finally, we have been involved for several years now looking to see if we
can enhance the group quarters frame, which was a very challenging endeavor in
Census 2000. We are looking at using Info USA, which is the yellow pages, and
also the ES 202, which is the business register for states. We also have
evaluated seven states and their co-op list, and that program proved so
successful that it is being expanded nationally.
What are some other kinds of survey improvements? Ron is going to talk
about one very important case study, but just a few others that are worth
mentioning. Bob Faye has been doing some very exciting work in seeing if
he can use our STARS database to develop survey controls for reducing ACS small
area variance. This is looking highly promising.
A second body of research we have had underway was to take a look at a
STARS to CPS match and to look at non-respondents and to see if there is any
difference between them and the responders. The linkages occurred, but the
analysis is still underway.
Another thing that was very exciting was our response to the aftermath of
Katrina. As you all know, the effect on the federal statistical system
highlighted an inability to react in a real-time basis. As luck would have it,
we were in the process of acquiring the national change of address file from
the U.S. Postal Service.
So we got together with the data linkage experts. Everybody said this is
our single best chance to get something out there. So they graciously agreed to
give us an extract, and we used that file to come up with some alternative
survey controls and later on we produced county level tallies using NCOA. We
also were very interested in getting FEMA’s files, but we didn’t have them in
time for the immediate response.
So as an outcome of this, we are looking into the feasibility of developing
the next generation of STARS, which maybe could produce some real-time estimates.
Here I have given you a very brief snapshot of record linkage and some
exciting projects. What are the constraints? In order to use third party data,
this requires an extraordinarily complex memorandum of understanding in order
to ensure that all parties are protected.
Just to give you an example, after Katrina we had immediate discussions
with FEMA. OMB wanted us to have the FEMA files, we wanted them. It should have
been a slam dunk; it took nine months. So the other problem is that when you deal
with some of the federal poverty data, they tend to be state based. Dealing
state by state as I think Julia could attest is quite an undertaking.
Also, there are differences in content definition, quality and program
rules. We have come up against this in some of our projects when we are trying
to build eligibility models, and they sometimes differ at the county level, let
alone the state.
Also, lag time. This is a big, big problem. Most of our files lag by about
a year. For example, one of our more important health ones lags by about four
years before we get the national file. As we all are seeing, more applications
require more near-time, real-time response.
Technical constraints, getting the right data in the right format. Even if
we write these very detailed specifications, it many times takes our analyst
many conversations and the data going back and forth to get it right, because
it is coming from two different views, the administrative view versus the
survey integration view.
Also, we do have varying rates of validation, for example, Medicare very
high, Medicaid lower. Just to note, if the record does not validate, we do not
use that record. If we can’t put a PIK on it, we do not use that record in our
processing.
Also, something we came up with in terms of looking at SIPP administrative
data is the coarseness of administrative data compared to the nuances of
teasing out what we want from the survey. Finally, measuring error: what does
it mean, how do we put a confidence interval around an integrated data set. It
is very challenging.
What are we doing to overcome the constraints? Resolving file acquisition
issues, especially among state data, may require OMB or Congressional assistance.
The lag time for general demographics is, we believe, largely addressed by the
national change of address file and possibly moving to this enhanced STARS.
We have under the new Data Integration Division completely standardized and
centralized file acquisition. The next bullet is speaking to continual
improvements in our person validation system, including converting it to SAS,
which has made it much more accessible to our analysts. We have identified a
data quality standards team that is going to be looking at measuring error in
integrated data sets.
What are our conclusions? New files and innovations are clearly leading to this
expansion of administrative records uses, but new challenges continue to arise.
The idea of having to regularly update a file like STARS that has got hundreds
of millions of records, and then it is going to be updated on a quarterly
basis, is very, very difficult. Also, just understanding what integrated data
sets are. But I believe, as would everyone here from the Census Bureau,
that we are at the incipience of a new generation of products and services.
That concludes my talk.
MR. PREVOST: I guess I will go into my part of the presentation now, which
is talking about health related administrative records research at the Census
Bureau.
What I am going to do is, I will start with a brief commercial announcement
about the small area health insurance estimates program. We will also talk
about the Medicaid undercount study description, its preliminary results, what
our next steps are. We will discuss a little bit about related research we are
doing, the benefits of integrated data sets, and the policy challenges.
The U.S. Census Bureau has embarked upon a small area health insurance
estimates program that produces a consistent set of estimates of health
insurance for all counties in the United States. The intent of the program is
to have published estimates for these counties and states by age, under 18 and
total, with confidence intervals. Right now, we are investigating model
improvements and expanding the age categories for which we can estimate.
Here are just a couple of examples. This is just a brief display. Policy is
implemented locally, and there are significant differences in local area
ability to have insurance coverage. This slide here is a quick snapshot, I
don’t expect you to read it or understand it all, but it is just a blurb to see
what the differences are in the U.S. for the total population without health
insurance. We have a similar situation for children without health insurance.
These health insurance coverage estimates are created by combining survey
data with population estimates and administrative records. Currently, race,
ethnicity, age, sex and income categories are being investigated for counties
and states. State level estimates include, for example, the uninsured black or
African-American population under age 18 and at or below 200 percent of
poverty. We are also looking at county level estimates
such as the uninsured children under the age of 18, again under that poverty
constraint. This project is partially funded by the Centers for Disease Control
and Prevention’s national breast and cervical cancer early detection program.
What is forthcoming is health insurance coverage estimates in 2007. We are
going to be providing updated county and state level estimates by age, state
level estimates by race, ethnicity, age, sex and income categories. Then
depending on future funding, the SAHIE program plans to produce county and
state level model based estimates as an annual series.
This is just one way in which we have used administrative records at the
Census Bureau. We use them as Sally showed earlier for inter-censal estimates,
for the small area poverty estimates, et cetera, and so this is an example of
those uses.
What we are coming up with now for the next part of the presentation will
be the Medicaid undercount project. I see that there are a couple of the
collaborators in the audience, Mike Davern and Dave Baugh. They are the experts
here, I am merely the rapporteur. We have had a great collaboration I believe
between our agencies, between the Centers for Medicare and Medicaid Services,
the states who have helped provide their data. We have the Assistant Secretary
for Planning and Evaluation, and those of us at the Census Bureau. Can’t forget our
sponsors. Our sponsors are the Robert Wood Johnson Foundation and also ASPE.
I think this is a really great project because of the type of collaboration
that we have. When you are integrating data sets, you need to bring in
expertise from both sides, who is expert in the survey data, who is expert in
the administrative data and the use of those data.
What is the Medicaid undercount? I’m sure you all are familiar with this.
Survey estimates of Medicaid enrollment are well below the administrative data
enrollment figures. Why is that? In preliminary numbers for calendar year 2000,
the current population survey estimated — and this is using the more
conservative estimates, not the published numbers — 25 million persons that
were in the system. The Medicaid statistical information system or what Sally
referred to earlier as MSIS estimated that there were 38.8 million persons
enrolled. I’ll show you how we get to these numbers a little bit later on.
So there is a substantial undercount in the CPS relative to the MSIS; in this
case the CPS estimate is only about 64 percent of the MSIS figure.
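The arithmetic behind the quoted figures can be sketched as follows. This is a minimal illustration using only the numbers given above; the variable names are ours, not the Bureau's:

```python
# Illustrative arithmetic only, using the CY2000 figures quoted above.
cps_estimate = 25.0    # millions of persons, CPS (conservative estimate)
msis_enrolled = 38.8   # millions of persons, MSIS enrollment

# The undercount is usually expressed as the CPS estimate as a
# share of the MSIS enrollment count.
coverage_ratio = cps_estimate / msis_enrolled
print(f"CPS captures {coverage_ratio:.0%} of MSIS enrollment")
```

Run as written, this prints a coverage of about 64 percent, matching the figure quoted in the talk.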
Why do we care? We care because we want to better serve our customers. We
want to improve our surveys, and we want to enhance performance indicators and
provide feedback loops. These numbers are used for policy simulations by
federal and state governments. They are the only source for the number of
uninsured. They are also the only source for the Medicaid eligible but
uninsured population, et cetera. So this undercount calls the validity of
survey estimates into question, and this study is intended to understand its
sources.
What could explain the undercount? There are universe differences between
the administrative records that are collected at the states and the survey data
collected by the Census Bureau. There is measurement error. There are
administrative and survey data processing, editing and imputation errors, and
there are survey sample coverage areas and survey non-response biases.
So we came up with a bunch of hypotheses. These hypotheses were drilling
down into these sources of error. Why would there be persons included in the
MSIS but not in the CPS? There are persons living in group quarters, and group
quarters are defined differently by the two agencies CMS and Census Bureau.
There are persons who don’t have a usual residence, but still receive services.
There are persons who receive Medicaid in two or more states. This occurs
obviously because if you move within a given month and you apply in your new
state, you are going to show up in both states, and that is what we have to
account for.
We have to look at what is meaningful health insurance coverage. There are
persons with restricted Medicaid benefits. There are persons that have only
Medicaid coverage for one or a few months. So what does it mean to be insured?
On the CPS side we have respondent knowledge. Because of plan names,
enrollees in Medicaid prepaid plans may not know they have Medicaid
coverage. We also are hypothesizing that Medicaid enrollees who didn’t use
Medicaid services may not consider themselves covered by Medicaid. We also have
the issue of proxy responses for other members of the household that may be
incorrect, especially for non-family members or when there are households that
have multiple families, and the respondent for the survey is answering for
the others.
We developed this study in several phases. The
first phase was to develop a validated national CMS enrollment file and to
determine the coverage and validation differences between the MSIS and the
MEDB, which is the Medicare enrollment database, to determine the
characteristics of these databases and to look at dual eligibles, which we
thought might be a factor here, and also to conduct a national Medicaid to
CPS person match, and this way we would determine why the Medicaid and CPS
differ so widely on enrollment status, and then we would build a suite of
tables detailing explanatory factors and characteristics.
I should mention that this study is building a longitudinal data file. We
are collecting data from calendar year 2000 through 2002. These
are the first results from 2001.
In later phases, in order to enhance the study, we determined that there
were some variables that we would like to receive that weren’t in the national
files. So we are currently working with several states to get their local
information to see how we can improve the information we have on the national
files, and how that can be used to conduct a variety of research on both our
master address file, CPS, the American Community Survey, otherwise known as the
SS-01 in this case for the 2001, and to also look at Medicaid addresses to see
if there was any specific frame bias.
Finally, we want to take a look at the impact that the state data has on
the national data to see if it provides any further explanatory factors that
could potentially apply in the national environment.
In the later phase of the study, the fourth phase (we have about a year
left in the study, give or take), we will be matching the national
Medicaid system to the national health interview survey as well to look at
person coverage, and then each step along the way we will be documenting the
results in papers.
Preliminary explanations. We have enforced the CPS group quarters
definitions on the MSIS data where we have administrative data address
information. That is, we were able to locate a person at a specific address in
the Census master address file. If that address was defined as being a group
quarters, we eliminated those people, because the CPS is a household-based
survey of the civilian non-institutionalized population, so there are
components of the population that are not included by definition in the
universe. We are also looking at duplicative persons in different states, and
understanding the covariates for the misreporting.
Here is a brief graphic that shows that there is a major overlap in the
universe, but there are folks on both sides that are not included, particularly
those who are under group quarters, those who were deceased during the time
period, and obviously we couldn’t survey them in the CPS, those that did not
have valid records on either side, and those that were in two states. So we had
to build a common universe out of that in order to conduct the study.
We removed the dual eligible cases defined as a group quarters by Census,
and then we ran the data through the Census Bureau’s personal I.D. validation
system that Sally had discussed earlier. We removed these duplicative valid
records, and then we removed the MSIS enrollees that were not enrolled in full
benefit coverage.
So how does this break down? We started with 44.3 million MSIS records in
the year 2000. We had one and a half million of those records that were in
more than one state or were in a group quarters, and we had four million that had
partial benefits. Partial benefits, this is a situation where you only had
received Medicaid for a day, or where you only had use of family planning
services, et cetera. I know there are a whole variety of these things, and
Dave Baugh, who is the master here, can tell us if you have more questions on this.
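The universe trimming described above can be sketched in miniature. This is a toy illustration only, with invented field names (`states`, `group_quarters`, `full_benefits`); it is not the actual MSIS record layout or the Bureau's processing code:

```python
# Toy records standing in for MSIS enrollment rows; fields are hypothetical.
records = [
    {"id": 1, "states": 1, "group_quarters": False, "full_benefits": True},
    {"id": 2, "states": 2, "group_quarters": False, "full_benefits": True},   # enrolled in two states
    {"id": 3, "states": 1, "group_quarters": True,  "full_benefits": True},   # group quarters resident
    {"id": 4, "states": 1, "group_quarters": False, "full_benefits": False},  # partial benefits only
]

def in_common_universe(r):
    # Mirror the steps described above: drop multi-state enrollees,
    # group-quarters residents, and partial-benefit enrollees, so the
    # remaining records share a universe with the CPS.
    return r["states"] == 1 and not r["group_quarters"] and r["full_benefits"]

kept = [r for r in records if in_common_universe(r)]
print([r["id"] for r in kept])
```

Only record 1 survives the trimming, which is the same logic that takes the 44.3 million MSIS records down toward a universe comparable to the CPS.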
On the sample loss side, nine percent of all MSIS records did not have a
valid record, and were not eligible to be linked into the CPS. On the current
population survey side, 6.1 percent of the respondents’ records were not
validated, but more importantly, roughly 22 percent refused to have their data
linked. What this meant was, they refused to provide a social security number.
Therefore, the way that we interpreted it was, if you do not provide an SSN, we
will not link your data.
However, the effectiveness of the ID validation system that has been
developed by Census Bureau has allowed us to change the method of collection of
information from respondents, so that we are not asking for social security
number anymore. There really is a two-part question. One is, what is your
social security number and the other is, can I link your data. We really should
separate those two out. And because we can link data without a social security
number, and we know that is particularly sensitive, we have eliminated that
from our demographic surveys.
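The two-part distinction made above (what your SSN is, versus whether your data may be linked) matters because validation can proceed on other fields. The following is a toy illustration of that idea, not the Bureau's actual person validation system; the reference file, the key fields, and the PIK labels are all invented for the example:

```python
# Toy stand-in for validating a respondent against a reference file
# without an SSN. The real system uses far richer probabilistic matching.
reference = {
    ("DOE,JANE", "1970-03-02"): "PIK-0001",
    ("ROE,JOHN", "1962-11-15"): "PIK-0002",
}

def assign_pik(name, dob, consented):
    # Consent to link is checked separately from the identifying fields:
    # a respondent who refuses linkage is never matched, even if the
    # name and date of birth would validate.
    if not consented:
        return None
    return reference.get((name.upper(), dob))  # None if not validated

print(assign_pik("Doe,Jane", "1970-03-02", consented=True))
print(assign_pik("Doe,Jane", "1970-03-02", consented=False))
```

The first call validates and returns an anonymized key; the second returns nothing because consent was withheld, which is exactly why the SSN question and the linkage question can be separated.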
So in the future, hopefully we will be able to link more data that way. I’m
sure Gerry will be talking about the privacy implications and how Census Bureau
is addressing that.
Here is an example of the validation differences that you see across the
United States in the records coming from the Medicaid system. Those areas in
red and those areas in black, you can’t see the black too well, but there are a
number of them in California and also up in Montana, are areas of the country
where we had the worst validation rates. So if one was conducting a study in
the state of Montana or the state of California and attempting to apply that
study to the United States, you would get a very different view.
In California in some cases, I believe almost one third of the records were
not validated. California has different rules for their
Medicaid systems. They serve a slightly different population than the rest of
the United States, so we are hoping that the state data that they provide us
will assist us in being able to do further linkages and improve it so that we
don’t have these validation issues.
In the state of Montana, the reason we had issues linking the data was that
in some cases, I think it was particularly children, they used state IDs, case
numbers, in the social security number field, at least in this year. I don’t
know how long that continued.
We matched the respondents using the reported data only. We have 12,341 CPS
person records that matched into the MSIS; 1,906 had imputed or edited CPS
data, which was about 15 percent of the total.
If you look now at why the disjuncture occurred between what we saw in the
administrative records and what we saw in the survey: 60 percent of the
respondents in CPS responded that they had Medicaid; nine percent responded
that they had some other public type of coverage, but not Medicaid, even
though the administrative record showed that they were in Medicaid; 17 percent
responded that they had some type of private coverage but not Medicaid; and 15
percent responded that they were uninsured.
So basically what you can gather from this is that people really don’t know
the source of their health insurance coverage.
The factors that were associated with this error were the length of time
that they were enrolled in the system and how recently they were enrolled. For
example, and we have seen this in other studies as well, when you ask the question,
have you been enrolled in X for the last year or over the last calendar year,
if they are currently enrolled in that month and you ask them the question, you
get a very good response. We had a similar study in food stamps, and if it was
within one or two months of the survey month that the persons had been enrolled
and participated in the program, they were showing only a ten to 20 percent
response error. If it had been six months since they had received benefits from
the program, we were showing response errors in the range of 60 to 80 percent.
Poverty status impacts Medicaid reporting, but it does not impact the
percent reporting that they are uninsured. This gets back to that stigma issue.
Stigma does not seem to be a factor here. As a matter of fact, it is the folks
that are at the higher levels of income who still qualify for the program who
are most likely to misreport. They think that they are getting private health
insurance and not Medicaid.
Adults 18 to 44 are less likely to report disenrollments, and adults 18 to
44 are more likely to report being uninsured. Overall, the CPS rate of those
with Medicaid reporting that they are uninsured is higher than in other
studies, and the CPS rate of those with Medicaid reporting Medicaid is lower
than in other studies.
The work that is remaining. I already discussed briefly the other phases of
the project, where we will be bringing in the state files. We will be using the
MSIS data to enhance the study. We will also be bringing in the analytical
extract, the MAX file, to take a look at differences between those who are
enrolled and those who are enrolled and receiving benefits, which may have a
big difference in the way that the responses are occurring.
We hope to soon be working with the national health interview survey, and
then we will be doing a comparison measure of error in the CPS to the state
survey experiments, and then also looking at how well the NHIS does, because
the NHIS asks the question very differently than the CPS does. The question we
have is, if the NHIS does better, is that telling us that we need to change the
way in which we are asking the question.
We will also be evaluating how well the CPS edits and imputations work both
at the micro level and the overall macro level. We will be evaluating
additional state level Medicaid data, and then we will be looking at coverage
area and
survey non-response bias.
As I said, these are preliminary results. They are subject to change after
further investigation. At the moment, we conclude that survey measurement
error is playing the most significant role in producing the undercount. Some
Medicaid enrollees answer that they have other types of coverage, and some
answer that they are uninsured. The overall goal of this project though is to
improve the CPS for supporting health policy analysis, especially refining the
estimates of the uninsured.
The use of integrated data sets is really an important growth area. You
couldn’t get the sort of analysis that we are doing by looking
at aggregate data from administrative records or aggregate data from the
current population survey, or any other survey for that matter. You have to
look at the unit level to determine why misreporting or disjuncture is
occurring between the two systems.
There are really two sets of truth. The administrative data show you the
information and the experience that federal agencies and state agencies have
in working with a given set of individuals. But they only collect that
administrative data; they don’t capture all the demographic, social and
economic information that you really want to have in order to get a complete
picture for use in a
multitude of areas, including the development of policy, the implementation of
policy and the evaluation of your activities.
There are some related research examples I just wanted to share with you
briefly. We had a similar experience, where we started down this path three
years ago with the Maryland food stamp study, where we worked with some folks
at the Jacob France Institute in Maryland. We matched the American Community
Survey data to the food stamp recipient data, and we found that we had 50
percent response error. That is, the Maryland food stamp recipient counts were
50 percent higher than our survey estimates were showing. I understand there are
other surveys out there that have come up with similar results.
We were able to explain in our linkage study 85 percent of the discrepancy
by looking at these individual records. In fact, the misreporting was 63
percent of that discrepancy, much of it due to the temporal biases that we were
seeing that I mentioned earlier, and also to the fact that there seemed to be
a serious disjuncture when the survey respondent was a person who was not
receiving the benefit, or who had not applied for it.
Other related research that we are doing in integrated data. We are working
with the University of Chicago Chapin Hall Center for Children on a child care
subsidy study. The other thing that this integrated data set does that you
would never get from administrative data alone is that it allows you to develop
eligibility models, those people who are eligible for a program but are not
necessarily participating. So we are looking at developing eligibility models
in the study, and then the researchers will examine the effects of this on
employment and self support, that is, the outcome measures, at our research
data centers.
We were asked earlier what data we would add to the study if we could, to
improve research on health outcomes. We have identified a few data sets that
we think are of particular importance, and if we could bring them together
with the three-year longitudinal file we already have, that would be really
important. They would be the WIC files, files from the
NHANES survey, the MEPS survey. We have not yet linked in the CPS food
security supplement, but that would be great, as would bringing in
relationship data from the SS-5 information that the Social Security
Administration collects. The SS-5 is your application for a Social Security
card; this currently is not on our extract.
There are other federal health insurance data sets that would be important
to look at. I think there are future things we could look at, like
co-insurance and its effects; and we could bring in the VA medical health
data, TRICARE information, and the Indian Health Service information. I think
if you could get a good picture of what was happening in
the federal sphere, then we could take it to the next step perhaps and see if
there were any data out there for private health care coverage.
So the value of integrated data sets is that they really provide a more
robust and accurate picture of what is going on. They build on both views of
the world, what the agency is seeing and what the persons are experiencing,
while controlling for the weaknesses of each. They provide better statistics
for input into simulations, for predictions and for funds distribution. And as
the demand
for data increases, and frankly we are all experiencing a budget decrease, data
reuse may be the only cost effective option for moving forward in the future.
So in doing this, we have a bunch of policy challenges: communicating the
benefits of integrating these data sets versus the privacy concerns, and the
need for interagency teams to ensure accurate results. I think
this team that we had on the Medicaid study was a great picture of that,
because it wasn’t until you had all the expertise from every side to bear on
one specific problem that you could really address it.
Then there are interagency agreements. As Sally said, what ends up happening
with many of these research activities is that you set a research plan for two
years, and you spend a year and a half trying to get the data. There needs to
be a more effective way of moving forward.
Then once the data are put together, who owns them? Big question. Everybody
owns them. Then there is the potential growth of possible disclosure risks. If
you were to try to blend these data sets together and then release them as a
data file, how do you do so without the administrative agency who had provided
you that data being able to back out the survey respondents’ information? That
is certainly something we can’t allow to happen. I’m sure Gerry will talk about
that.
Then there is the need for these longitudinal databases to find an
anonymized person at an address at a specific point in time. That really should
be our vision in the future. But how do you do that and balance it with
privacy?
In conclusion, integrated data architectures are the future of American
statistics. There was recently a paper released at the UN ECE conference. The
Norwegians, who have been working with similar systems, as have a number of
other European countries, presented a paper that said they are changing their
approach to the way they are doing work, and it is not going to be register
based anymore. They believe that the blending and the
integration of data is the way that they want to work. So I think we are in
good company there.
As I said earlier, as the demand for data increases and budgets decrease,
data reuse may be the only cost effective option for us. We have to overcome
technical and policy related challenges, and this approach will support
evidence based public policy research and decisions.
MR. GATES: Good morning. As you heard, my name is Gerry Gates. I am the
chief privacy officer at the Census Bureau. It is a position I have held for a
little over a year now. Prior to that I was chief of the policy office, and for
over ten years prior to that I was the administrative records program officer
at the Census Bureau, responsible for coordinating administrative records
access and use. I have had some fairly close relationships with several people
in this audience over the years, acquiring administrative data for our
programs.
What I want to talk to you about today is some of the policy issues
associated with acquiring and using administrative data for statistical
purposes. I think the key policy issues related to these uses revolve around
trust. I have identified four trust relationships which I think are critical in
determining how we will be allowed to use administrative data for statistical
purposes.
The first is between the administrative data provider and the statistical
data collector. In reaching agreements on using administrative data there is a
consideration of several key issues. The first is whether the providing
agency trusts the statistical agency to protect the data and use it according
to legal and formal agreements. The second issue relates to whether the
providing agency
believes that they have the legal authority to provide that data.
Another issue involves how the providing agency will be protected from any
public backlash associated with the sharing of that information. Another issue
involves what are the risks that the uses of these data may reflect negatively
on the data provider, in terms of whether or not the quality of that data is
sufficient, and whether the program goals are being met.
Another issue involves whether the provider has any say in how the
published data are protected, and what role they will play in that. Finally,
there is the issue associated with whether the provider is able to make use of
the results. In addition to cost reimbursement, will there be any quid pro quo,
will any data be returned to the data provider.
The second trust relationship is between the statistical data collector and
the respondents to our surveys and censuses. This raises issues about what
consent was reached in terms of permitting the data collected in surveys and
censuses to be linked with administrative data. Was that agreement a specific
consent or was it a notification? Was it an opt-in or an opt-out agreement?
Were there any conditions associated with that agreement? And finally, how
transparent is the process to the public at large; are these linkages being
done in such a way that they are well known, or are they not known?
The third relationship is between the administrative data provider and the
program recipient. This relationship is very similar to the relationship
between the statistical agency and the survey respondent. It is an agreement as
to how that information they provide will be used.
Finally, there is a relationship between the statistical agency and the
users of the data. These data are only valuable if they are being made publicly
available or being made available for research purposes. This involves an
agreement as to the quality of the data for the intended use, as well as how
accessible those data are.
So that frames where I want to go with this discussion today, to talk a
little bit about how a statistical agency makes policy decisions in light of
these trust relationships that it has to accommodate.
The Census Bureau’s mission includes a prominent statement about our
responsibility to honor privacy and protect confidentiality, as you can see
here. This demonstrates that when we use and collect information, we find it
critically important that we address our relationship of trust with our
respondents.
Our law specifically says that we can use the information that people
furnish us for no purpose other than the statistical purpose for which it is
supplied, and that we cannot make any publication where the data furnished by
any individual could be identified. So this sets the ground rules. When these
data are collected they must be protected, and they must only be used for
statistical purposes.
So when the data come in, they come in from an administrative agency, and
for our statistical purposes they cannot go back out for administrative
purposes. The law prohibits it.
The Census Bureau has decided to address these trust relationships through
its commitment to what we call data stewardship. It is a formalized program we
have had in place since 2001, reflecting our management commitment to comply
with our legal requirements and to acknowledge and address our ethical
obligations as professional statisticians.
We have established a formal structure to accomplish this. It is titled our
data stewardship executive policy committee. Through that committee and its
subcommittees, we set policy related to the acquisition of data, the protection
of data and the use of the data. Three formal committees report to our senior
level committee, which is chaired by our deputy director. One of those
committees is our privacy, policy and research committee, one is our
administrative records planning committee, and another one is our disclosure
review board. So as you can see, administrative records are a focal point of
this data stewardship program.
As I said, the program is built around our foundation of values and
principles. What is key to this is that management decisions reflect not only
our legal requirements, but consideration of the ethical obligations to
individuals.
We have established controls to formalize and ensure that our policies are
met. This is done through the implementation of privacy impact assessments, a
requirement of the E-Government Act of 2002. What we have done is, we have linked
all our data stewardship policies to the assessment, so that managers have to
acknowledge the compliance with specific policies in addressing risk to privacy
throughout the process, from initiation through collection and processing of
the data.
As I said, we have established many new policies that respond to the issues
associated with this, and we continue to develop new policies in response to
our changing environment. Basically, data stewardship is making commitments. It
is a commitment to our data provider to manage and safeguard their information
in accordance with legal and policy requirements. It is a commitment to our
data user community that the administrative records will result in high quality
data products. Also, it is a commitment to the public that we will maintain
confidentiality of personal information, and also make sure that it will only
be used for statistical purposes.
We have legal guidance and protections that are in place. Title 13 of the
United States Code is the basic legal framework for the Census Bureau programs.
As Sally mentioned, Section 6 of Title 13 specifically authorizes us to obtain
records from existing sources, rather than collecting that information again.
So we obtain our legal authority for the collection of administrative records
from that section of the law.
Section 9 of Title 13 is the provision that we must keep the information
confidential. Title 26 becomes very important to the Census Bureau, because
that is the IRS code. The Census Bureau makes use of administrative records
from the Internal Revenue Service, something we have done for over 40 years.
The authority for that is in Section 6103-J of the IRS code that permits the
Census Bureau to obtain tax information for its statistical surveys and
censuses.
We also have specific recognition in the Privacy Act that statistics are a
routine use for information collected by the federal government. The Privacy
Act specifically singles out the Census Bureau as a routine use, so
administrative agencies can provide information to the Census Bureau under
that provision.
We also are guided by the Paperwork Reduction Act, which instructs that we
collect the information to the minimum extent possible, and use information
that is already available. This is a companion to Section 6 of Title 13.
Let me start with some definitions that I think are important to understand.
We talk about confidentiality and we talk about privacy, and they both are
important to the use of administrative data for statistical programs. I think
it is important to understand this distinction.
I get this definition from the IRB Guidebook. Confidentiality pertains to
the treatment of information that an individual has disclosed in a relationship
of trust — again we go with the trust — and with the expectation that it will
not be divulged to others in ways that are inconsistent with the understanding
of the original disclosure without permission.
As I said, there are some legal requirements for confidentiality. The basic
one is Title 13. There are also reflections of confidentiality in the security
guidelines established by the Government Information Security Reform Act and
the Federal Information Security Management Act, which is also part of the
E-Government Act of 2002, and confidentiality is also impacted by Federal Information Processing Standard (FIPS) 199. So these all provide a framework under which we must protect and secure not only the information that we collect from our respondents, but also the data that we obtain from administrative sources.
Now, information privacy, on the other hand, is defined here by Alan Westin
in 1967 as the claim of individuals, groups or institutions to determine for
themselves when, how and to what extent information about themselves is
communicated to others. So this is the individual’s control over their
information and how that information is used. The requirements for ensuring
privacy come from Title 13, in the sense that Title 13 says that we can only
use information for statistical purposes. We have to tell people that that is
how we are going to use their information. The Privacy Act instructs us that we
have to tell people about the authority we have to collect the information, the
purpose for the collection, how the information will be used, and whether our
asking for that information is mandatory or voluntary.
We are also guided by the Freedom of Information Act, which says that
private information cannot be disclosed to requesters. Also, HIPAA has a role
to play here as well as the E-Government Act of 2002, which as I said
establishes requirements that agencies must conduct privacy impact assessments
to ensure that information is protected.
A little bit about Census Bureau policies as they relate to administrative
record use. These are all policies associated with all our privacy principles.
This is the basis for our data stewardship program. We as an agency have acknowledged these principles: whatever we do must be necessary for our mission; we will be open and transparent about what we do; we will have respect for the individuals who provide us the information; and we will maintain the confidentiality of any information we gather. So these are the overarching principles of our data stewardship program.
We have established policies to implement those principles, or to ensure that we are in compliance with those principles. The highlighted policies, as you can see, are mission necessity, linkage of decennial census records, and number five, record linkage. These are policies that we have established
related to our linkage of information. I am going to talk a little bit more
about the record linkage policy in a minute. Then there are some policies
related to multiple principles including collaborative arrangements with
agencies which certainly impact arrangements with administrative agencies.
Finally, there are the administrative records policies and procedures, a handbook for how we will establish agreements to use administrative data and for how we will manage and handle the administrative data that we receive.
So those are the four guiding policies which impact administrative records.
Now, a little bit about the policy on record linkage. This policy establishes six principles for conducting Census Bureau projects that use record linkage.
The first is mission necessity. What that says is that the linkage must be
necessary and consistent with the Census Bureau’s legal authority and mission.
The second principle is best alternative. That principle says Census Bureau
will examine alternatives for meeting the project objectives and determine that
record linkage is the best alternative, given considerations of cost,
respondent burden, timeliness and data quality.
The third principle is public good determination. The Census Bureau will
weigh the public benefits to be gained from the information resulting from the record linkage against any risk to individual privacy that may be created by the linkage, and determine that the benefits clearly outweigh any risks.
Next is sensitivity. The Census Bureau will assess the public perception of the level of risk to individual privacy posed by the particular linkage, and create an appropriate level of review and tracking.
Openness. The Census Bureau will communicate with the public about its
record linkage activities, how they are conducted and the purposes and benefits
derived from them.
Finally, consistent review and tracking. Record linkage activities will
undergo a consistent review process using the criteria set forth in this policy, with centralized tracking by the Census Bureau.
Now, the policy establishes a checklist of questions that have to be asked
related to each of these principles, and risk points are assigned. If the risk
points are high enough, it is considered a highly sensitive record linkage and
it needs specific approval from our data stewardship executive policy
committee. So there is a thought process that goes into this, and an assessment is made that yes, this is a high-, moderate- or low-sensitivity project.
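The checklist-and-risk-points review just described could be sketched, purely as an illustration, like this; the questions, point values, and thresholds below are invented for the example and are not the Census Bureau's actual criteria:

```python
# Hypothetical sketch of a risk-points checklist for a proposed record
# linkage project. The questions, weights, and thresholds are invented
# for illustration; the actual criteria are not given in this discussion.

CHECKLIST = {
    "uses_personal_identifiers": 3,    # linkage keys include name or SSN
    "links_health_or_financial_data": 2,
    "results_released_as_microdata": 3,
    "data_leave_secure_environment": 4,
    "public_has_been_notified": -2,    # openness lowers perceived risk
}

def sensitivity_level(answers):
    """Score a project and classify it as low/moderate/high sensitivity."""
    points = sum(weight for question, weight in CHECKLIST.items()
                 if answers.get(question))
    if points >= 7:
        return "high"      # would need executive policy committee approval
    if points >= 4:
        return "moderate"
    return "low"

project = {
    "uses_personal_identifiers": True,
    "links_health_or_financial_data": True,
    "public_has_been_notified": True,
}
level = sensitivity_level(project)   # 3 + 2 - 2 = 3 points -> "low"
```

A highly sensitive project (one scoring above the top threshold) would then be routed for the specific approval the speaker mentions.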
There are controls that we have established that support our administrative
record uses. Sally has mentioned this, but I think this comes under her
purview, centralized data acquisition and agreements. We have one focal point
within the Census Bureau for establishing the contacts with the administrative
agency to acquire data and to establish agreements. We also have a centralized
review process, so that all projects using administrative records go through a formal review to ensure that they are compliant with the legal requirements and the agreement between the agencies. We require need-to-know access; only those people who need to have access to this information are permitted access. We remove identifiable information immediately and replace it with an anonymized key.
We have an administrative records tracking system, a computerized system that allows us to control a project from its inception to its completion, to ensure that anyone who works on that project understands what the rules are and how the information can be used.
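The step of stripping identifiers and substituting a key, mentioned a moment ago, can be illustrated with a minimal sketch; the keyed-hash construction, the field names, and the secret are all assumptions for the example, not the Bureau's actual procedure:

```python
import hashlib
import hmac

# Hypothetical secret; a real system would manage this key securely.
SECRET_KEY = b"example-secret-not-for-real-use"

def linkage_key(ssn, name, dob):
    """Derive a stable anonymous key from direct identifiers (keyed hash)."""
    message = f"{ssn}|{name.upper()}|{dob}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()

def deidentify(record):
    """Drop direct identifiers, keeping only an anonymous key for linkage."""
    cleaned = {k: v for k, v in record.items()
               if k not in ("ssn", "name", "dob")}
    cleaned["key"] = linkage_key(record["ssn"], record["name"], record["dob"])
    return cleaned

record = {"ssn": "123-45-6789", "name": "Jane Doe",
          "dob": "1960-01-01", "income": 41000}
cleaned = deidentify(record)
# 'cleaned' retains the analytic content (income) plus the derived key,
# but none of the direct identifiers.
```

Because the key is derived the same way on every file, records from different sources can still be linked to one another without any file carrying names or SSNs.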
We have file receipt logs and audit trails to ensure that we are complying with the agreements. When our data have to be used off site, we do independent site reviews to make sure that the required security measures are in place.
Finally, we do security and confidentiality training. All of our employees
on an annual basis must take confidentiality training under Title 13, and
training specific to Title 26, which is tax data, and also IT security
training. So this is a mandatory annual training.
Let me talk just a minute now about some unique privacy and confidentiality
concerns related to our administrative records use. A lot of this involves
perception as much as it does reality, I think. That is why these issues get
very complicated. We have to be sure that what we are doing is perceived to be
the right thing to do.
The first concern is whether consent is needed for the statistical use of administrative records, whether it comes through the data provider or through the survey collector, and what the conditions for that consent are. Is it opt-in or opt-out? Does it matter whether the survey is voluntary or mandatory? These are important decisions that we have to make, to make sure that it is transparent what we are doing and what permissions are being given to do it.
Another unique issue involves, will the public accept a near real-time
system to respond to immediate statistical needs. Past uses of administrative records were quite different from today's uses, with the new systems that we have developed. The data that we had before got old very quickly. We have
had matched systems from data that is five years old. Today that becomes much
more current information, so that raises additional concerns for how that
information could be used, not how it will be used, but how it could be used.
The next is, does the public trust the protections around the interagency
data sharing. Today's climate is different than it was prior to 9/11, in terms of acceptance of record linkage and data sharing activities among government agencies.
Next is how can we be more transparent about record linkage activities.
Events in Canada several years ago led us to conclude that these things cannot
be done in secret. They have to be very, very publicly known, because the
public will react very negatively to information they find out about something
that is happening that wasn’t publicly acknowledged. So we have to be very
transparent about what we are doing.
Finally, how can we continue to meet the statistical needs while assuring
confidentiality. That is what I want to talk about next. It is critically important that we make these linked data accessible for research purposes. That
is where the value is. We have to determine our options.
As Ron mentioned, administrative records linked to survey data raises
unique confidentiality concerns because of the fact that that administrative
data exists somewhere else. So we are limited in the amount of public use micro
data sample files that we can publish. We don’t want to discount them, but we
have to understand that they are not going to be of as much value as they would
have been if they were just the survey data. So we have to look at other options.
We have been assessing many other options for providing access. As Sally
mentioned, we have a network of research data centers, and we provide access to
qualified researchers for work supporting the Census Bureau’s programs through
these research data centers. So this provides an opportunity for those people
who are in geographic proximity to the regional data centers and have projects
that meet these criteria to access some of these linked data. We are looking at
options for streamlining and enhancing those centers.
We also are actively researching techniques like those that have been used for the Luxembourg Income Study for many years, which allow users to develop programs
to run on the matched data. They don’t actually access the matched data, but
they can submit those programs, and they are run on the matched data and
products are released to them. So that is something NCHS has done a lot of work
on too, and the Census Bureau is continuing its research.
We are also researching the development of synthetic data. John Abowd, who is currently working with the Census Bureau, has done quite a bit of research on synthetic data, which is modeled data, so that we may be able to release linked data that are not the actual data but maintain many of the properties of the actual data.
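As a toy illustration of the synthetic-data idea just described, one can fit a simple model to confidential values and release draws from the model instead of the records themselves; the data and the model here are invented for the example, and real synthetic-data systems use far richer models:

```python
import random
import statistics

# Confidential values (invented for the example).
confidential_incomes = [32000, 41000, 28500, 55000, 47250, 39000]

# Model: incomes ~ Normal(mu, sigma), with parameters estimated from
# the confidential data. This only illustrates the idea of releasing
# draws from a fitted model rather than the records themselves.
mu = statistics.mean(confidential_incomes)
sigma = statistics.stdev(confidential_incomes)

random.seed(0)  # reproducible draws for the example
synthetic_incomes = [max(0.0, random.gauss(mu, sigma))
                     for _ in confidential_incomes]

# The synthetic file preserves broad properties (location, spread) of
# the original without reproducing any respondent's actual record.
```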
Finally and probably most important, we have to continue the dialogue we
have established with researchers and program evaluators and program
implementers about what their needs are and how best we can meet those needs
under the constraints that we have to live with.
So in conclusion, what I wanted to say today is that the Census Bureau is
committed to meeting the needs of its customers, by enhancing the reuse of
statistical data through our use of administrative records and survey and
census data sets. We are also committed to reducing the cost and respondent burden of developing statistics, and ensuring the trust of the public, our data
providers and our data users, and finally, continuing a public dialogue on the
advantages and cautions surrounding our use of administrative data.
With that, I want to thank you.
DR. STEINWACHS: Nancy and I and all of us want to thank all three of you
very, very much. This has been exciting. It opened my eyes to a lot of things I
didn’t know, but probably other people around the table did, and that is part
of the sharing.
Nancy and I discussed, maybe we could take a five-minute break to let
people have a human break, at least stand up and stretch, and come back and
start the dialogue, both questions, comments and interchange.
DR. STEINWACHS: If we get started, we will get a chance to get a discussion
going and answer peoples’ questions.
Let’s get started. Let’s open it up to questions and comments, both from
those around the table and those in the audience. We have a microphone that we
can pass around.
MS. GENTZLER: I am Jenny Gentzler from Food and Nutrition Service at USDA.
I would like to hear some more discussion about the feasibility of getting food
stamp and TANF data from umpty-ump state agencies, and sometimes even if it is
administered by the county agencies. I know that is a linchpin of the new dues
system. So I would really like to hear some more discussion about whether we
could have all of those matches done and have the data available before 2020.
DR. STEINWACHS: Who would like to take that?
MS. OBENSKI: Even though there are technical challenges with different state data, our experience to date is that we can overcome the technical challenges. By far the single biggest hindrance or obstacle is the set of legal requirements around acquiring the data. For example, in our child care
subsidy program, which is a pilot project that we are undertaking with
researchers from Texas, Illinois and Maryland, it took a year and a half. This
is working with researchers who have established relationships with the state
entities. This wasn’t the Census Bureau calling up cold.
So although there are technical data challenges, I think I can speak for my analysts in saying that those can be overcome; but the policy and legal obstacles, the challenges of making sure that all parties are protected, are quite formidable.
MR. PREVOST: I just wanted to add, we were involved in a child care subsidy
conference last week, where a similar question came to us. I think if you look
at the way that data are submitted to CMS as an example, an agency would almost have to have a centralized focus within the federal government, somehow related to the regulations around a program and perhaps tied to the funding for those states, to be able to collect, or to require the collection of, that information in a standardized format in order for this to occur, because otherwise it becomes voluntary. I guarantee you, out of the 51 or more entities
out there, somebody will not want to participate for one reason or another.
MR. LOCALIO: You gave us an interesting response. Let me just tell you what
happened last week. The full committee had a meeting. One of the issues had to
do with the liability — I guess it was NCHS — or the decreasing ability of
NCHS to get vital birth and death data from the states and local entities that
produce those data.
We had a speaker from New York City, Steve Schwartz. He called himself Dr.
No, because he tells so many people that they can’t get access to the data.
Then we had a presentation that indicated some of the difficulties that this
poses for people who previously got data but now they cannot on births and
deaths. Obviously this is very important for a researcher, who needs to know
how many people in the study have died. That is your outcome, survival.
So what people didn’t hear is what I asked him after we broke for lunch. I
introduced myself and I said, suppose Health and Human Services told New York
City that their further receipt of any Medicaid funding from the federal
government was contingent upon their cooperating with NCHS in providing the
data as they used to. His reaction was very defensive. He said, oh no, Tenth
Amendment would prohibit that because that has traditionally been a state and
local activity. Then I reminded him that in the early ‘70s, every state
had a 55 mile an hour speed limit because they were told 55 miles an hour or
you don’t get any highway funds.
So I would submit that a lot of the way around some of these obstacles
which are most unfortunate and not in the best interests of the country as a
whole is to do what I think you alluded to. That is, to suggest that more
cooperation will be forthcoming if these data systems are linked to certain
line items that provide appropriate incentives.
Now, the other thing that came up which I think was not nearly as
threatening as what I mentioned is to sponsor model state and local laws that
would be appropriate and consistent with the needs of everyone and have
uniformity, and to say, we have this model, this requires somebody to get out
there and think about it and write it, and that is not always easy. But if you
have model laws, it then becomes a little bit easier to convince state and
local entities, many of which have a lot of difficulty passing a budget every
year, that this is worthwhile if you do some of the work for them.
So I just wanted to relate that to people, that this is my conversation on
Thursday of last week. I am wondering if anything was said about me back in New
York City on Friday.
DR. STEINWACHS: Anyone representing New York City here?
DR. MADANS: Lucky for you, Steve was still in Hyattsville on Friday, and it
hadn’t gotten back to New York.
I just wanted to clarify a little bit about what is happening there. This
is really a data release issue that we are facing with vital statistics. There
are some items we are not getting, but that is a different issue.
NCHS is getting the information. What has changed is what and how we are
allowed to re-release it. It does feed directly into the issues that have been
brought up. I have a feeling some of the presentations you are going to hear
today and tomorrow will get boring, because if you take out Census and put in
NCHS, other than examples, the issues are all the same. But if you nod off, we
This whole issue about what is safe to release and what are our
requirements, what are the various parties’ requirements, how do they mesh, is
one of the major issues. To the extent that something like a model law, or some
way that we can ease the way that we work together is really important.
Every time we start a different linkage project, it is as if no one has
ever linked data. We start from scratch, dealing with the same issues over and
over again. Nothing changes, again, substitute the different data sets. So
getting agreement on best practices, on what is reasonable to do, what is
acceptable to do jointly, I think, would make our lives much easier, because after a while this is just work. We are not getting anywhere. We are just redoing the
same thing over and over and over again.
DR. STEINWACHS: Let me ask a question that falls into one of the areas this
subcommittee has been dealing with, and that is trying to improve the reporting
of race and ethnicity. It sounded to me as if I just had to come to the Census and I had both good reporting and filling in of the missing blanks by statistical imputation.
Let me just ask the question as part of my education. You have that data
well developed and are updating it and monitoring it. On the flip side of that,
there are people who analyze the Medicare claims data, administrative data, to
look at disparities in health care services. But mainly they have what Sally
was saying: they have white, black, other, because the source of the data dates back through Social Security or original enrollment in the Medicare program.
What would be involved if CMS came to you and said, could we get from you
an estimate of race and ethnicity that we could link to each of these people
and we could update that. I was trying to get a sense of how does that fall in
privacy and confidentiality. Does that raise issues? In a sense, they have it
but they don’t have the measure that we use today. They don’t have the
categories. But it would help me to get some sense of what the possibilities
are, maybe have a little discussion of something like that kind of reverse
request, where CMS was coming to you.
MR. GATES: Maybe I will start this discussion. I think the issues associated with assigning race and ethnicity have to do with, if you are going to do it on an individual level, how good it is. If you are assigning race and ethnicity based on a model that is a hundred percent accurate, then you have confidentiality concerns because of where you are deriving that information from. If it is derived from the Census, let's say, that information is protected, and if you are creating a perfect model, then you have a real issue in terms of whether you really complied with the confidentiality requirements. So that would be a major concern.
So I think the issues associated with this have to do with perception
about, have you violated the confidentiality.
DR. STEINWACHS: Let me ask it another way. What if there were some auspices
under which you took Medicare data into your data centers and did a linkage?
What would be the issues if you went the other way?
MR. PREVOST: I think I can speak to that a little bit. Part of it is, that
is what we have done. The auspices under which that would work, we have our
research data centers, which would provide an environment that was protected,
where the individual, if they had a project that had been approved, could come
in, could work on that project and the data that were ensconced in that data
center, and then there would be a confidentiality filter, being people who man
the research data center, who would take a look at the results to see if those
results were presentable outside. By that, I mean that they would not be
disclosing any given individual’s information.
So we have worked under that. We have also worked under situations like in
our current areas, where we are jointly conducting this research with our
partners. We do confidentiality research to make sure we are not disclosing
anything, and then we provide it as a group.
MR. GIBSON: I apologize for interrupting. I’m just going to add some
technical details here. We have had the privilege of trying to work with Sally
and Ron before. My name is Dave Gibson, and I am representing Spike Duser, who
is out this week and has asked me to speak on his behalf.
DR. STEINWACHS: Thank you.
DR. BREEN: CMS.
MR. GIBSON: I’m sorry, I’m from CMS, I apologize. We have tried to
establish some relationships with Census. It wasn’t for want of trying on Ron
and Sally’s part. We really appreciate their professional demeanor and
everything that they were trying to bring.
We have a lot of problems with administrative data. I’ll be talking about
that some this afternoon. But I would like to correct something, the fact that
you are saying that some of the data for the other race and ethnicity
categories are not broken out.
We have done a better job in terms of especially the Indian Health Service.
We have a relationship with them where they populate our data, and if you are a
member of a recognized tribe, that information will trump anything that is in
our database. We have had mailings out to people in that other category, who we think are mostly Hispanic, and people who have Spanish and Hispanic
surnames. We have done mailings to them and tried to get information back to
populate the expansion of our race codes to go beyond just white, black and other.
That was a constraint that, as you know, was based on the old SS-5. The data that is coming in currently is collected in more detail; we are collapsing it back down in some cases, but in our enrollment databases we are trying to keep that information.
So there is some effort going on to try to improve the race and ethnicity
data. I think the thing that you have to worry more about is the fact that, with enumeration of SSNs at birth being done by the states, many of the states now are refusing to give SSA the race and ethnicity of the child. So whereas most of the Medicare population, about 85 percent, are people who are aged, there are disabled people in that category, and some of the ESRD beneficiaries especially could be children. You are going to find a lot more of these unknowns creeping into the data over time.
So I don’t know if that is helpful or not, but those are just a few little technical details.
DR. BREEN: Could I ask David a followup to that? I was wondering, the
techniques you said you were using working with the Indian Health Service,
which of course would be a really good source, but wouldn’t it be more cost
effective to work with the Census in order to get information from them rather
than to send out a mailing? That sounds like an expensive way to do it.
MR. GIBSON: I believe we did it as a one-shot effort. I don’t recall; I would have to talk to folks in the enrollment area to see if they have had a followup on that. But my awareness of it was only that it was done one time. Yes, it
probably would be more cost effective to try to do some sort of a — you are
talking about some sort of a statistical match?
DR. BREEN: Yes.
MR. GIBSON: That would probably be much more helpful, I agree.
DR. BREEN: And you would probably end up with better results.
MR. GIBSON: I would suspect we would.
MS. TUREK: Gerry, is this a Title 13 issue? If they were to do a match, can you give them your race data from the Census to match into their files?
MR. GATES: Are you saying if we gave it to them?
MS. TUREK: To CMS; may you do that?
MR. GATES: No, we can’t. It would have to be done by us.
MS. TUREK: Then you couldn’t share the final data set with them. You
couldn’t give them the micro data, right?
MR. GATES: No, not the micro data. We could give them back tabular data,
but we could not give them back micro data.
DR. IAMS: If I may point out, I am from Social Security. Since the mid-'80s, the social security number has been issued at birth. We don’t have race and ethnicity, so your black-white data is disappearing from the data set. At
some point in time people will be on Medicare, and you won’t have anything.
DR. BREEN: Is it not possible to fill in the blanks and return the data?
DR. IAMS: I’m not sure what you are suggesting.
DR. BREEN: If there were missing information, to update that information,
or if there were incorrect information, to update that information at the
Census Bureau with the CMS data, and then return the data corrected.
MR. GATES: No, that would not be possible under our statute. We could not
disclose that personal information to CMS.
MR. PREVOST: While it would be technically possible, and we have done it within the constraints of the Census Bureau, we would not be capable of delivering that data back to the agency.
MS. TUREK: Title 13; they don’t give that to anybody.
DR. STEINWACHS: Let’s continue. I want to get everyone in.
DR. STEUERLE: Continue this conversation. I don’t want to interrupt.
DR. STEINWACHS: Others who want to join in on this conversation?
DR. DAVERN: Mike Davern from the University of Minnesota. Would it be
feasible, and I know you guys have talked in the past about building an
imputation model, in which you could give CMS the information it needs to take
a best guess at what the race might be on their file, based solely on Medicare
MR. PREVOST: Yes, it is certainly possible to do that. We would have to
conduct a wide variety of research to make sure that we weren’t in particular
cases re-identifying somebody.
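An imputation model of the kind being discussed could, for instance, combine surname-based probabilities with neighborhood composition in a naive-Bayes fashion, in the spirit of published surname-geocoding methods; every probability table and name below is a hypothetical placeholder, not a real Census estimate:

```python
# Illustrative sketch of probabilistic race/ethnicity imputation that
# combines a surname-based prior with neighborhood composition. All
# probabilities are invented placeholders, not real Census estimates.

SURNAME_PRIOR = {   # P(group | surname), hypothetical
    "GARCIA":   {"hispanic": 0.90, "white": 0.06, "black": 0.04},
    "WILLIAMS": {"hispanic": 0.02, "white": 0.48, "black": 0.50},
}
TRACT_SHARE = {     # P(group | census tract of residence), hypothetical
    "tract_A": {"hispanic": 0.60, "white": 0.30, "black": 0.10},
    "tract_B": {"hispanic": 0.05, "white": 0.70, "black": 0.25},
}

def impute(surname, tract):
    """Combine surname and geography evidence, naive-Bayes style."""
    prior = SURNAME_PRIOR[surname.upper()]
    share = TRACT_SHARE[tract]
    scores = {group: prior[group] * share[group] for group in prior}
    total = sum(scores.values())
    return {group: score / total for group, score in scores.items()}

probs = impute("Garcia", "tract_A")
best_guess = max(probs, key=probs.get)   # -> "hispanic"
```

Because such a model only returns probabilities, not protected individual records, it is the kind of product that might be shareable where record-level Census data is not; the re-identification research Mr. Prevost mentions would still be needed before release.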
DR. STEINWACHS: Other comments on this?
MR. J. SCANLON: Aside from the data acquisition issue, which will probably
be with us for a long time, you mentioned on the dissemination end you had
research data centers, you had other ways of releasing the data. It seems to me
that is probably an area to focus in as well.
Do you have any indication of to what extent research data centers are
utilized, and when you do talk to customers what kind of ideas do they give
you? Apparently you can’t use remote access for a number of reasons, but
someone can request certain tabulations or regressions or other things. You can
provide the product itself. But how much is it utilized? What sort of models
are you thinking of for the future, and how do you relate to the way other
agencies are approaching this?
DR. DAVERN: I think one of the issues on the research data centers is the
process involved in getting approval for the particular project, especially
with administrative records data. It has to go through an approval process with
the disclosure review board and all those types of things, so it takes a long
time to get approval for the research data centers.
I know there is a large interest in people using them, and there are a lot
of people who are using the research data centers. I don’t know what the
breakout is between those using administrative data and those just using some
of the other Title 13 data. I think there are two different ways to do it. But
the research data centers are used heavily. There are only 12, 13 research data
centers, and they are looking at expanding the access to those.
MR. J. SCANLON: Do you physically have to present yourself?
DR. DAVERN: You physically have to go to the research data center, correct.
DR. SONDIK: I am doing the wrong thing because I am reading ahead here.
DR. STEINWACHS: I think that is against procedural rules, isn’t it?
DR. SONDIK: It probably is, but I noticed this before, and given that you
just asked the question it can come up again. But in the talk that is going to
be given by Julia Lane, there is a slide that says research data centers' drawbacks: low and declining utilization. It said, fewer than a hundred active projects.
DR. STEINWACHS: Let’s call on Julia.
DR. SONDIK: I was struck by this. There is a judgment there that that is
low, but it struck me as — I don’t know what it means across 12.
DR. LANE: My background is, I was at the Census Bureau for seven years, and
then I was at the National Science Foundation, in the economics program, which
continues to fund the research data centers.
That judgment is not mine, by the way. That is the judgment of Brad Jensen,
who used to be the director of the research data centers, and that is a direct
quote of his testimony to CNSTAT. NSF had very similar concerns, because NSF funds it. When I was there, there was a lot of concern about the low usage, which last year was under a hundred projects spread over eight centers, nine if you count the California one.
So the big issues that we heard when we put it before review panels were, first of all, that there was a delay in the review process, which could take up to a year, which is a real problem for graduate students and so on. Physically having to go on site
to do the work is a big deterrent for research purposes, and certainly for
program agencies. Joan might want to speak to that.
So we at NSF came to the conclusion that it is very important to support it as a niche approach, but it is certainly not going to be a broad-based response.
That is a little bit what I will be talking about this afternoon at Joan and
MS. TUREK: Let me give an example of very intensive data use. I manage the
transfer income model which is used to look at the impact of changing federal
programs. It has been around since the ‘60s, so it has probably got $70,
$80 million in it over all that time.
We use the CPS fortunately and not CEPR. We would be in real trouble. We
run the micro database through that model and assign people to different
buckets, depending upon their characteristics. We also can change the code in
that model daily when we are doing something for the Hill. At one time we had
an open line to one of the subcommittees and were doing runs while the
subcommittee was in process.
When they change the programs, we have to write new code. So for us to use
the matched data we would need to be able to have our model sitting out
somewhere where we could change code, do runs and get the results in real time.
I think that is going to require some major changes in law. I wouldn't want you to do that to any of the data sets that I am using until you have got the rules in place so that I could use the new data; I would want to keep what I have until then, or you are going to affect one of the major policy functions of our office.
Do you want to say something about social security? Because you had SIPP based models.
DR. IAMS: We continue to have SIPP based models. We are a data center that
follows the rules and regulations of Census and the Internal Revenue Service,
and treat it as if it is gold at Fort Knox. We are highly protective of our
security and privacy, and we do run these models. We have an ongoing
relationship on social security reform with the White House, that is receiving
it on a flow basis. But that is facilitated by having Social Security
Administration being a data center that permits us to use the restricted data.
MS. TUREK: Which we would not be, because we are a policy office.
DR. IAMS: Yes, you are a different agency.
DR. BREEN: Is there any access by the public or researchers to your data
center or that data that you are talking about, Howard?
DR. IAMS: Yes and no. The two main models we use with restricted data, one
is called MINT, modeling income in the near term, and the other is our SSI
financial eligibility model we call FEM. We have had researchers funded through
our retirement research center grant system use MINT data for various
descriptive projects. I think people from the Urban Institute have done that as
grants. But that is really a statistical adjustment to five SIPP panels, and I
am not sure how broadly based the Social Security Administration wants to make
that available. We get into an agency proprietary aspect.
DR. STEUERLE: Howard, just quickly, you and Tom Petska do at times — I’m
not saying it is done as much as you like, but you do also tend to have joint
projects. You have had external research combined with internal research.
DR. IAMS: Yes, we have had joint projects with external researchers, some
of whom we fund. Offhand I can’t think of anyone we don’t fund.
We have researchers with our retirement centers. We fund projects through a
cooperative agreement to three centers. The Social Security Administration
funds projects at National Bureau of Economic Research at the Michigan RRC and
at Boston College. We have had researchers come and use matched data, getting a
grant from us and using matched data at our center, and we have CBO who has
come and used matched data.
They are using the administrative data matched to the Census Bureau data. I
am speaking about that this afternoon. We have an array of administrative files
that are matched to it.
On a slightly different note, I have felt that the Bureau in the past could
have made more use of these matched data to improve the data quality of their
missing data and imputed data. Our data on being paid a social security
benefit, that is what Treasury sent them, and our data on an SSI check, that is
what we sent them. In the CPS or in the SIPP, when you have a matched record
and someone fails to provide an answer, you could provide a statistical
assignment from our administrative record that is going to be better than using
a hot deck from some other person, who may or may not be related.
We found in our studies that SSI gets misreported in CPS and in SIPP.
People call it social security, so you have more social security beneficiaries
in these surveys than we have as beneficiaries for matched people, because some
of those people are getting SSI and say it is social security. Then you have an
under count in SSI.
If you were to do that assignment, you probably could carry it further,
because with SSI usually comes Medicaid. So you could assign Medicaid based on
SSI. But there are a number of things.
I believe in the new economic well-being survey system, they are planning
to make more use of administrative records in improving the information
collected in the survey than has been done in the past. It is not a case
where you have to go get privacy permission from 51 states. We do have it, and we do have
an exchange and a series of MOUs, and we exchange data with Census for statistical purposes.
DR. W. SCANLON: I just had a question for clarification. What are the
restrictions, or what are the requirements for an agency to be a data center,
in terms of social security, qualifying for it to be one? Is it conceivable
that CMS could qualify to be one?
DR. IAMS: If I could be colloquial, we basically take your first-born
child. I am joshing, since this may be recorded.
There is a series of background checks that are required. There are a
series of sworn statements that have to be made. There is a training program
that IRS maintains that people have to go through. There is a training program
that Census maintains that people have to go through as well. Then our data center walls off
the data; people can’t print from the data, they have to print to a printer,
and one of our employees goes and reviews the output that has been printed, and
you can’t take the data away.
Then to get to our data center, you have to go through two armed guards on
two different floors to get to it. Then you would have to know what you are
doing when you got to the computer, in terms of getting access to the file and
all the standard protections.
So there is a set of requirements that Treasury has laid out for being a
restricted data center that protects their data. You have to go through these
safeguards, and users have to go through these types of things. I don’t know if
Tom Petska tomorrow is planning on touching on this at all. But it really is a
very tight control, in terms of getting access and use. I think the Census
Bureau maintains the same type of thing at their RDCs.
DR. DAVERN: Let me say a little bit about that. We started the data center
network probably about 12 years ago through our Center for Economic Studies.
The process for identifying new research data centers is handled through an
arrangement between the Census Bureau and the National Science Foundation to
determine where the best places to locate the centers are, based on the research
community in that area, and how we can best establish a partnership
with an academic institution or whoever we decide to partner with to establish
a secure environment for accessing these data.
As Howard said, there is a very tight process for determining what projects
will be accepted through the research data centers. You should know that people
who access these data are considered — we call them special sworn status
persons, and they are given special sworn status because they are helping the
Census Bureau conduct its activities under Title 13. So we go through a process
of identifying how that project is going to support our programs by improving
our data, improving our knowledge about how the survey is functioning. So that
is all part of this process.
But there is a formal process for identifying new research data centers.
The arrangement that Howard is talking about between the Social Security
Administration and Census Bureau was established in 1967, way before we started
formally introducing research data centers. It was in recognition of a
collaboration that these two agencies had for better working together on these
projects that were of great interest to the Census Bureau in working on its
income programs and other programs.
So I think it is important to understand how this process has come to be
where it is.
MR. LOCALIO: I just want to make a couple of comments on data centers from
the perspective of a working statistician.
I am probably currently the statistician on 25 projects. I have meetings
every day, even when I am not in the office, sometimes by phone. To do an
analysis and to do it right sometimes takes years. It is impossible to get
funding out of anybody to finance staying in a hotel room for six months and
So I would suspect that a lot of work that could be done is not getting
done, and I would suspect that a lot of the work that is being done is not very
good. I can tell you right now, putting on another hat, that is as an associate
editor for a journal where I act as a statistical reviewer, a lot of this stuff
is not very good and I send it back to the federal agency as well as other
academic institutions. I say, it has to go back and you have to redo this
analysis. So if you are working in a data center, you would have to make
another arrangement to go back to the data center and rerun everything and
rearrange your schedule.
So I can understand why there are a hundred current projects and it is
dwindling. That is no surprise to me. I am amazed there are a hundred active projects.
If you think the data center or a dozen of them is an answer to one of the
issues, it is just not. It won’t work. What it does is, it trivializes
statistical science to the point that people are writing a couple of lines of
code in SAS, and then running it. You may be writing thousands of lines of
code in SAS or R or S-Plus or some of the other programs, thousands, and you are
rewriting them and rewriting them to get it right. These data sets are complex.
There is misclassified data and missing data.
See all this gray hair? This is not an easy life. It doesn’t pay very well, either.
So I think we have to think much more seriously about what types of
projects can be done at data centers and on what terms and conditions. It is
better than nothing, but if it is to work, there has to be much more thought to
increased flexibility as well as adding to their numbers.
By the way, my views are not new. Marjorie, remember that e-mail I sent in
response to NCHS’s data center request for comments? That was in the spring of
2005. It caused some laughter because I tend to try to kid around a little bit
when I am serious.
So those are the real issues when you are working with data. By the way,
the data sets also tend to be big, and everything is bad.
MS. OBENSKI: I’d like to partially respond. We know that that is a problem,
and we certainly can appreciate the fact that these data sets are
extraordinarily complex. I have watched the very best analysts run and rerun.
One of the models that ameliorates it, does not fix it, is for example in
the Chapin Hall project, which is very complicated; it is TANF, UI wage, child
care subsidy and the American Community Survey. We are working very closely
with the researchers, but we are doing the data analyses and the runs and the
reruns and getting it right. What we are delivering to them in the RDC is a
fairly pristine data set that they should be able once they get access to do
pretty good research on without the runs and reruns. We intend to keep involved
with them to help expedite that.
Again, it is not a fix, but we believe that it should help things in the
MR. PREVOST: I just wanted to add a little bit more to it. I think one of
the other things Howard said was important. What could be done is for edits and
imputes of the data to occur to the PUMS micro data sets themselves, therefore
affording access to individuals as they would any other PUMS file.
But as Gerry I’m sure can say, and I said in my speech earlier, this
certainly raises the disclosure issues of being able to reidentify people. I
think with PUMS data sets as a whole — as computer technology and everything else
increases, the ability to match back to an individual and to reidentify them is
growing. It is certainly a concern that we all have as statisticians.
So anyway, certainly one model in the short run could be doing this with
edited data in a PUMS if it passed disclosure proofing. But then also to
provide the differences that we are seeing between the survey data and what is
being collected in the administrative data to enhance the modeling that is
being done by researchers for policy reasons. I think you need both prongs of
that in order to make that function work.
DR. STEINWACHS: We are going to need to wrap up here, so one comment, Gene,
and then we will close up and go to lunch.
MR. GIBSON: I am afraid in some ways, the changes that CMS is planning may
necessitate something along the lines of what we are talking about with a data
center, and whether we would qualify, I don’t know.
We have been used to giving out our files as flat files for both the claims
as well as the unloaded EDB or else the denominator file. We intend to go to an
integrated data repository, which is basically a relational database with
everything in it. There would be only one source of that data, and you would go
to that data, so there wouldn’t be these multiple copies of flat files that we
would be giving out as either an EBCDIC or a SAS file to our users.
So that has us as researchers inside concerned, because we have found that
when we go against relational databases like the NMUD which is used for our
claims — and even we don’t have access to them, but we have seen when you use
queries against them, it can take a long time to get a response back. That is
typically caused because the records are very wide, they are variable length.
We would like to entertain the notion of shortening them, trimming them off
on the front and back and coming up with what I would call core research files,
which would be very, very deep, but very, very narrow. Trying to get our data
processing friends to buy into that is another story, but that would be another
We could work with the groups who know how to handle data such as Census.
Where we are typically hurt is by the lack of meta data associated with the
instructions from our CWF friends in policy when they implement policy, or our
CWF friends in OIS, that is our data processing group, when they go out to our
standard systems and come up with procedures for fiscal intermediaries, carriers and
providers. The meta data somehow gets lost, and it is not associated with the data.
So if there could be some way of working with agencies who know how to
handle meta data all the way from the policy to the systems to the manuals to
the actual claims and enrollment database, such as with the Census folks who
or SSA or NCHS, and then think about trimming off these humongous files for
the variable length packed decimal stuff that is not really needed, we could do
a lot, lot more in terms of being more responsive. A relational database, we
found, if it is relatively narrow you can get answers out of it fairly quickly.
DR. STEUERLE: I have had the fortune of being associated with a number of
agencies that are very interested in data from SOI to Social Security now to
HHS, and on numbers of occasions have made contact to Census.
It is probably unfair, but I would say that probably the view of the
statistical community, at least around Washington and maybe around the country,
is that Census is king. At times it ends up being the Lion King and at other times
it ends up being King Kong. It is just big, it has some really good people
doing wonderful things, but at times the nature of the agency is such that it
accidentally steps on people that it doesn’t want to, or knocks over buildings
by accident or something.
As an economist, I tend to think of this as very much based on — I ask
myself, what are the incentives of each agency. It was interesting in your
talks, because a couple of things stood out. One was, the word counting came up
numerous times in your analysis. Even when you listed the reasons you would do
things, I think the public good was listed third or fourth on a list. They
weren’t necessarily hierarchical lists, so maybe that is unfair, but it wasn’t
listed first, it was listed third or fourth.
I think the reason for that is, the Constitution gives you authority on
counting, and that gives you great sway if you go to Congress, because
Congressmen care about how many people live in their districts.
If you could justify things on the basis of counting, you can often go out
and do a number of things. The further you get from counting and the closer you
get to research, including research that might show policies as being
ineffective or policies as varying by states, and in some states being
effective and ineffective, the harder it is for you to get the easy
The other thing I sense, and this is not Census, I think part of my point
about making the king statement is, I think the reason people turn to Census is
just because you are larger than anybody else. They hope you are going to solve
the problem, even if it really is an IRS problem or a Social Security problem
or an HHS problem.
This question about public acquiescence and public buy-in. That is a tough
one. It reminds me of the story of when the private person asks to do something
that might create some harm, so you do something that might at some level
reveal somebody’s privacy. It is called a type one error. One person objects,
and somebody in the press plays it up, and it gets bad news.
The consensus of many agencies in government of course is, you don’t want
bad publicity. It is almost like that is the first incentive. The second
incentive is in some cases to serve the public.
However, there are times when that is overcome. That is when type two error
becomes publicly visible enough. The case that you gave was real time analysis.
All of a sudden there is a hurricane, we have — really, let’s be honest, we
have not done a good job on integrating data sets. We did not adequately serve
the public. Again, not to blame it on FEMA. I don’t even know who would be
responsible. But all of a sudden, the type two error gets large enough that
people say, privacy concern is nothing when we couldn’t serve the population of
This leads me to asking a last question. What do you see as ways of cutting
through the Catch-22s, the dilemmas that many people have pointed out here with
some specific examples, on where we could use a better integrated data set? By
the way, I don’t agree fully, Russell, that these data centers aren’t useful,
because I think it has been an attempt of these agencies to — it is the one
way they figured they could cut the ice. I can give some examples where I think
it has made a difference, even though for every one example I could give you
where it has worked, you could give me ten where it doesn’t.
But it is a step. So what we are encouraging you to do and what I hope you
will do is give us advice on what are some steps we could do. Do we need more
people working on state protocols? Do we need advocates within agencies who
would be research advocates, not having the advocacy coming from the outside,
but somebody inside saying this would serve the public good, let’s figure if
there is not a way to solve this problem and get around our 47 constraints, or
dealing with Tom Petska at IRS, or dealing with Russell.
I am really asking you, what do you see that we could advise policy makers
on how to deal with these issues? I have been involved with too many draftings
of laws, and I know what happens. It is what former press secretary Jim Brady
used to call the BOGSAT method, a bunch of guys sitting around a table. One
says I think this looks good, now all of a sudden it is the law, but it is not
that it was written in stone or was even well analyzed in the way it was
drafted. Do we need people in agencies who work on how we can redraft the laws
to deal with these things? Do we need advocates for the public good? That is
the one thing we can use to give the type two error its importance.
It does seem to me that there are cases, and I don’t want to use real-time
analysis to be the only example, there are cases where we are dis-serving the
health of the public by not doing the things that as analysts or researchers we
know we can do, but as bureaucrats or law abiding citizens we know we can’t do.
DR. STEINWACHS: He asks an easy question, doesn’t he?
MR. GATES: That gave us a lot to think about. I would like to separate that
into two areas, if I could. One is the issue about whether we can do the
linkages to support this research. The other is, can we provide the data to the
researchers to make it happen.
I think in terms of whether we can do the linkages, there are a lot of
issues we can deal with, but the perception issues are the really critical
issues, about how do we convince people, convince everybody, that this is the
right thing to do. We should be linking these data for these purposes, because
we can do it in a very safe and controlled way.
I think we have to have more of a public discussion about that, about the
fact that these records for statistical purposes under these controlled
conditions are a good thing.
The other issue about whether we can provide these data for the researchers
is also a critical issue. It comes down to, as I said before, how do we
encourage the research and still maintain the trust that we have built up. We
did build up trust, we don’t want to lose that trust. We cannot lose the trust
of the public that gave us the information. Either they gave it to the
administrative agencies or they gave it to the statistical agencies.
So we have to figure out under current legal requirements whether or not we
need to modify the legal requirements. I don’t know the answer to that, but
that has got to be worked out in terms of how we can do it in a way that keeps
the information safe.
I think what we have tried to do, at least with these research centers and
moving towards other ways of permitting remote access or doing synthetic data
that may give micro data that is useful to researchers, are ways of
accommodating what we think are the best ways to get the data out to the
research community in a way that is safe.
There are probably other ways that we could consider too, but I think that
is really going to be the hard question, and it is a question that we have to answer.
I’m not sure what the right answer is, but the more we can discuss this,
and the more we can think about, is there something else besides those things
that gets those data that we need for the research, those linked data that we
need for the research out to the researchers. I don’t know.
DR. DAVERN: I think what we need is, we need researchers to be patient with
us, and work as hard as they can to get access to the research data centers to
use the data, and show that it makes the difference.
Our experience with the SIPP: when we came out and said we are going to get
rid of the SIPP, we are not going to collect it, we are going to do a new
dynamic system, a lot of people out there, the research community agencies,
pointed to a lot of research that was used heavily and widely and said, we need
this data, these data are important.
We have a couple of projects out there in the research data centers that
are doing these linkages, are doing matches to evaluate that, and we need more
of those out there doing this to show results that get articles published or
articles used by agencies to show that these things are working, to show that
we need more of these research data centers or increased ability to access
these, and to work with us to get those out there and to keep working on that.
We have also contracted with CNSTAT to do a panel that looks at these
administrative data linkages, particularly as it relates to this new dynamic
system. One of the key questions is, how do you get these data out there to
the public to be used, and how do you deal with the privacy concerns, and are
there other ways, like synthetic files or other types of things, that can be done.
So we are working with CNSTAT to look at these issues, but I think what we
need really are people who are willing to go through that process, no matter
how difficult it is, but that is the process we have, to show that this is an
important thing to look at.
MR. PREVOST: I just wanted to add on, I think there is a third part. The
third part comes beforehand, it is the prequel, if you wish. Gerry was talking
about getting the data and linking it, and then providing access to the data.
But what I would submit is that there is a government business model out
there. We are all working to try to serve our clients, the general public, as
well as we can. The way that the agencies are set up right now is that we are
all working within these stovepipes of our own business models. I think some of
the things that have come out of E-Government have started to take a look at
generalized data systems and processes around the federal government.
One of the things that could be done is to come up with standardized
agreements and standardized processes. I don’t care if it is with Census or
not; if it is somebody sharing data with the National Center for Health Statistics,
for example, there should be a streamlined process, so that if you have a project that
is two years long, you don’t spend 90 percent of your time trying to get
the data. That is the first part, and I think that is something that we could do.
MS. PARKER: I am Jennifer Parker from the National Center for Health
Statistics. I just wanted to say something about the research data centers. I
went to three epidemiology conferences this year, and by and large nobody
wanted to use the research data centers. The graduate students in particular
said they didn’t have money, and the academics said they didn’t have the money
to place somebody in the centers.
If we want people to use them, we have to identify funding, because people
can’t write grants if they don’t know what they are going to do. So they need
preliminary exploratory analysis to get the job done. The graduate students,
they don’t have any money. So if this community wanted to promote that, then we
are going to have to come up with money for people to use them.
MS. OBENSKI: I would like to respond from a different perspective. I think
that there are big, big questions that are arising, because we are in a new day
here with these integrated data sets. We are addressing policy problems and
questions that we really haven’t had to address in the past.
But I think that if you talk about how do we get the word out there, I
think a project like Mike Davern’s project on the Medicaid under count study,
where we have been working collaboratively with experts from multiple entities, all
coming at it from a different vantage point, but all trying to answer the same
very complex question.
I think it is somewhat unique. As the director said, we met with ASPE and
our team last November. It is probably a unique experience to bring in these
other agency entities as part of the team. What it has done is, it has enabled
us to do a project that the Census Bureau never could have done as well on its own.
So I think more of these, this model, is the way I believe it is going to
be in the future.
DR. STEINWACHS: Ed, you get the final word before lunch.
DR. SONDIK: Then it had better be brief. I think it is really interesting
that this discussion focused on the data centers and not what I thought it was
going to focus on, which was technical issues related to matching
administrative data to survey data or whatever. I think that is very important,
and we really should take note of that.
Second, I completely agree with Jennifer about the cost issue. We are
thinking about that in NCHS. There are several reasons for the cost. One is cost
recovery and the other is regulating demand, if you will. So we are rethinking that.
Third, we have a way of doing remote access with our data center. You are
going to talk about that. So I think that is really important. That is an issue
of trying to deal with Russell’s point, easing the access.
Gene’s point struck me that you are raising a very fundamental question. It
is this tradeoff between access, increasing access, and at the same time
preserving the confidentiality. The way we work it in this country, as we all
know, we have a set of different agencies that have different rules, but
basically they are pretty much the same. But there are alternatives.
The National Center for Education Statistics licenses the data. What has
always troubled me with the licensing, even though there are penalties for
violation of the license in some way, is what is the impact of that, what is
the cost of that to the individuals involved, to the agency, or whatever.
If we are going to rethink this, I think we have to think very
fundamentally about this, and understand what risk really is. I find it
fascinating. I would be surprised if any of the agencies had any quantitative
measure of risk. I know we don’t have a quantitative measure of risk. By that,
I mean what is the probability that we have a leak, whatever it is, or a
successful hack or hacking, whatever the term would be. I think if we are going
to rethink this, we have to think about that.
But the point about the need for all of this in time of an emergency
actually raises a very interesting question in terms of our preparation for
dealing with emergencies. I think that one of the points that has come out of
this is that we are not necessarily — we are not prepared for that to the
extent that we should be. It may very well be that we can prepare and have
maybe even special authorities that can be invoked in time of particular
emergencies, which again would bring this back to the legal side.
A lot of food for thought here.
DR. STEINWACHS: I very much want to thank Sally, Ron and Gerry. Thank you
for leading off. You have certainly gotten this discussion going. I hope you
may be able to stick around. I think it says very much that there is a lot of
common ground here and certainly common interest in trying to reach the same goals.
In the information you have there is an identification of places to eat. I
think there are about four of them listed in the food court. Originally we had
you back by one. If you could come back by 1:10.
(The meeting recessed for lunch at 12:18 p.m., to reconvene at 1:10 p.m.)
A F T E R N O O N  S E S S I O N (1:10 p.m.)
DR. STEINWACHS: We are very fortunate to have Julia Lane with us, even more
so because I understand she needs to rush off and catch a flight due to the loss
of a family member; your uncle died. So we appreciate your fitting
this in at a time and on a schedule that are not easy.
Julia is a senior vice president at NORC, the National Opinion Research
Center. Julia is going to talk about using linked micro data. Julia, the floor is yours.
Agenda Item: Using Linked Micro Data – Julia Lane
DR. LANE: Thanks very much. When Gene and Joan asked me to do this, I was
happy to be part of it. I think what we are doing is really important.
I spend a lot of time matching administrative and survey data with my
partner in crime there, Ron Prevost. Seven years of matching data from five
different agencies, SSA, IRS, Census, DOL and HHS, and then 50 states. So I
have the scars to show it; it took a very long time.
What I wanted to talk to you a little bit about, I am a health economist,
just to give you the benefits of what we saw in terms of putting linked micro
data together, the generic issues and then some of the challenges, and then
very much in light of the discussion we had before lunch, maybe some suggested solutions.
Obviously, as with much of the stuff that was discussed before lunch, clearly when
you have linked administrative and survey data, you can get much better
analysis of existing data. So for example when you are trying to explain
earnings and employment outcomes for individuals, when you have linked
employer-employee data, instead of just being able to explain 30 percent of earnings and
employment variation, you are able to explain 85 to 90 percent. So the
explanatory power of existing data is far enhanced.
You can also do new analyses that you thought you might not be able to do
before. When you put multiple sources together, the feasible set of research
increases. In my case for example, when we had information on firms, rather
than just the supply side of the labor market, we could look at the demand
side. In health, if you were interested in looking at just patients, you could
also look at health care providers and geographic information and so on. So
there is a rich new set of analyses that can be done.
The other thing, from my time as a rotator at the National Science Foundation:
it turns out that the capacity to capture new sources of information is
increasing. We now have MRIs on individuals, we have got biomarkers. At NORC we
are doing work for NIA, which captures biomarker information on the NSHAP
survey. You can capture people in video and text. All of a sudden it starts to
be able to explain other aspects that you couldn’t explain from just admin or
even survey records per se, just the straight data that were captured. So for
example, it might be that increased earnings potential is due to excessive
testosterone or something, just hypothetically. At least, that is what my
husband tells me. That is what makes him more productive than me. That is why
he makes more.
This came up a little bit in the discussion before. This is simply good
government. Enormous amounts of energy and taxpayers' dollars have been used to
collect these data. The more that you can leverage that investment in data
collection, the better off the taxpayer is.
Of course, all of that information that is collected — this is a Toles
cartoon that you might have seen — might cause some privacy concerns and
privacy issues. You may recall the stuff about medical ID chip.
With the data collection, and I thought Gerry and Sally and Ron did a very
nice job of describing what those challenges are, we really have a serious
challenge with providing access to the data. If you think about data utility as
being a function of not just the quality of data that are collected, but also
the number of people who access it and the number of times it is used, the big
challenge is not just to collect a fabulous data set, but to have people use it
and use it in the way for which it was intended, for which taxpayers are paying
those millions of dollars.
It used to be that statistical agencies handled this by producing public
use data. The problem is that the increased likelihood of reidentification of
public use data, together with an increased understanding of the consequences
of public use and the quality of analysis that is done, and I’ll give you an
example of that in just a minute, means that not only is the quality of public
use files declining, but the likelihood is that fewer public use data sets will be released.
In particular, when you think about a public use data set for health issues,
the very skewness of the distribution means that the types of approaches that
are used to protect the data, in particular the top coding, will have serious
implications for the quality of analysis that is done.
So for example, I was trying to remember this morning, I should have looked
it up before I came, but I think either David Cutler or David Meltzer shows
that there is five percent of the population that is responsible for 80 percent
of the Medicare expenditures. So if you cut off the top spenders because you
are worried about top coding, you are not going to have much analytical
capacity associated with that.
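The skewness point can be illustrated with a small simulation. This is a hypothetical sketch with synthetic numbers, not real Medicare data; the lognormal parameters and the cap are arbitrary choices:

```python
import numpy as np

# Synthetic, heavily skewed "expenditure" data (illustrative only).
rng = np.random.default_rng(0)
spending = rng.lognormal(mean=6.0, sigma=2.0, size=100_000)

# A small share of people accounts for most of the total.
cut = np.quantile(spending, 0.95)
top5_share = spending[spending >= cut].sum() / spending.sum()

# Top coding for disclosure protection caps every value above the cut.
topcoded = np.minimum(spending, cut)

print(f"top 5% hold {top5_share:.0%} of total spending")
print(f"mean: {spending.mean():,.0f}  top-coded mean: {topcoded.mean():,.0f}")
```

With a distribution this skewed, capping the tail removes exactly the observations that drive the totals, which is the analytical loss being described.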
We have heard a lot about synthetic data this morning. Synthetic data also
certainly has an enormous potential to help researchers. But one problem that
you have with synthetic data is obviously by its very nature, because it is
using a distribution
to synthetically impute values and replace existing values with imputed
values, it is going to get rid of outliers. If outliers are the ones that are
defining what is happening in the economy and what is happening with health
care expenditures, as in the example I just gave, you have really affected the
utility and the quality of the data.
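A crude sketch of that outlier problem: fit a parametric distribution to data containing a few genuine extreme values, then draw synthetic replacements. Real synthesizers are far more sophisticated than this, and the numbers are invented, but the tail behavior is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
# Mostly ordinary values plus a few genuine, analytically important outliers.
real = np.concatenate([rng.lognormal(6.0, 1.0, 50_000),
                       [5.0e7, 8.0e7, 1.2e8]])

# "Synthesize" by fitting a lognormal and redrawing every value.
mu, sigma = np.log(real).mean(), np.log(real).std()
synthetic = rng.lognormal(mu, sigma, real.size)

print(f"real max: {real.max():.2e}  synthetic max: {synthetic.max():.2e}")
```

The fitted distribution essentially never reproduces the extreme spenders, so any analysis driven by the tail is lost in the synthetic file.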
These issues are particularly exacerbated with linked data. Clearly once
you add in admin records, they are wonderful because they add so much richness
to the data, but you immediately have a much increased risk of reidentification.
The more information you get on people, the more likely you are to be able to
reidentify someone, and then you go to jail, which is not a very good thing.
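The mechanics of that risk are easy to see in miniature. In this hypothetical sketch, each extra linked attribute shrinks the set of people who share a combination of values, until records become unique:

```python
from collections import Counter

# Invented records: (sex, birth year, ZIP, occupation).
people = [
    ("F", 1961, "22030", "nurse"),
    ("F", 1961, "22030", "teacher"),
    ("M", 1958, "22030", "teacher"),
    ("F", 1961, "22031", "nurse"),
]

def unique_fraction(records, n_attrs):
    """Fraction of records whose first n_attrs values match no other record."""
    counts = Counter(r[:n_attrs] for r in records)
    return sum(1 for r in records if counts[r[:n_attrs]] == 1) / len(records)

for n in (2, 3, 4):
    print(f"{n} attributes: {unique_fraction(people, n):.0%} unique")
```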
Then the other thing, and this was also pointed out this morning, admin
records are often received from enforcement agencies. So as soon as you link in
the survey data which are protected, and you have got to protect the
confidentiality of the respondents, once you add in admin data, the enforcement
agencies retain those, and they can very easily reidentify the individual.
So that is a problem.
I wanted to take a little bit of a detour and just give you a sense of the
impact of top coding, something as simple as top coding on the quality of
analysis that is done. What I want you to think about is that the data utility
is going to be a function not only of the quality of the data, but also of the
number of researchers that are going to look at it.
Public use files have the advantage that lots of people have access, but
the quality of the data isn’t very good. So what happens here, here is an
example of an earnings regression, where earnings have been top coded. This is
on the current population survey. You want to look at the effect of the
black-white earnings differential over time. Because it was top coded at
different levels over time, the impact of the estimate on the returns on the
black-white earnings differential are quite different. The same thing in terms
of the returns to education.
I am not going to go through it in any detail, but basically the
black-white earnings differential, depending on how you correct for that top
coding, how you statistically correct for the top coding, you might say that
the black-white earnings differential is .35 log points or .63 log points. If
you correct in different ways, you might say that the change in the gap was .06
or .15. Those are huge differences, right?
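The attenuation being described can be reproduced with simulated data. This is a hypothetical sketch, not the CPS: log earnings are generated with a known return to schooling, the top decile is capped, and the estimated slope shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)
school = rng.uniform(8, 20, 20_000)          # years of schooling
log_earn = 1.0 + 0.10 * school + rng.normal(0, 0.5, school.size)

cap = np.quantile(log_earn, 0.90)            # top-code the top decile
log_earn_tc = np.minimum(log_earn, cap)

def slope(y):
    return np.polyfit(school, y, 1)[0]       # OLS slope of y on schooling

print(f"uncensored slope: {slope(log_earn):.3f}")
print(f"top-coded slope:  {slope(log_earn_tc):.3f}")
```

Because the censored observations are disproportionately high-schooling, the top-coded estimate is biased toward zero, just as in the black-white gap example.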
So there is a lot for policy makers to decide: is the racial earnings gap big, or still big but twice the size? Is the earnings gap closing rapidly or is it closing
slowly? And because of the increased noise in the data, which is by definition
what they do to protect public use data files, because that biases coefficients
down, when you see, let's say, a racial discrimination coefficient vanish or double, do you know whether that is
really the case in the underlying data or is it just that more noise has been
added all the time? The same thing happens when you look at the return to
education, which is obviously another policy area.
So those are two labor economics issues, but you can see how the same
issues would come up in health. So here we have a situation where lots of
people have access to data, but you don’t know what is happening to the quality
of the data over time, and you don’t know what is done to it. Typically the
agency can’t tell you what is done to it, because they don’t want people to
back out their disclosure protection techniques.
So one alternative approach is to say, let’s have really high quality data
at the Census Bureau or at research data centers. Then what happens is, people
can go into the research data centers and they get the good data, and they can
do the analysis on that.
I have to say, I am a big supporter of any access in general, so the line
that you picked up I think was a little bit taken out of context. But I wanted
to make it clear that there were issues with this approach as well.
For those of you who are not aware of it, the Census Bureau pioneered this model, and many statistical institutes throughout the world have picked it up. Researchers physically go on site, and they are monitored by employees. It is a very expensive program. I think the Census Bureau alone puts in about three million a year for eight or nine sites.
There is an elaborate project approval process. By law all these projects
must provide a benefit to Census Bureau programs. They are required to have
special sworn status and there are penalties if they reidentify individuals.
There are all kinds of access constraints. I'm not going to go through those,
since they are in your handout.
As you eloquently pointed out, there are lots of problems associated with
that. You have got very high quality data, but the N of researchers that access
the data is very small. Furthermore, one of the issues that you run into is
that by and large at least at NSF, what we found is that the absolutely top
quality researchers would not go to the research data centers because they had
other things to do with their time. So the best researchers are doing their analyses elsewhere. Statistics Denmark, for example, set up a very nice remote access system for Danish data. As Brad said, this procedure is
expensive, very fragile and very tenuous. One of the issues is the link to the
review process. Ron Prevost and Sally mentioned the work that they are doing
with Bob Gerter at Tapinol. That has taken two and a half years, and they are
still not approved yet. A similar type thing that we had with NSF supported
researchers take a very long time to get through. Then it is expensive in terms
of time and in terms of money.
The panel was concerned because of disparate use, so you have got
well-endowed institutions who can afford to set up the research data centers,
but if you are at the University of West Virginia or Iowa State or in Oklahoma, it is much more difficult for you to have access, so there was a
disparate impact. Then the concern was no remote access.
As a researcher I was concerned about this. My colleague within the Census
Bureau, Pat Doyle, was very, very concerned about the impact on the data
quality, the quality of the research and the quality of the policy analysis
that was done. Ironically she was focused very much on the SIPP.
At her memorial session at the ASA meetings, they asked me to write a paper
on it. What I wrote down was many of Pat’s ideas, except informed by some of my
time at the National Science Foundation. What I thought was, why don’t we —
given that we know there are all these problems, Pat and I co-edited a book in
2001 in which we said this is a real issue. It is not just for health, it is
for just about every area in social sciences. This is a major issue. How might
one go about thinking about how to structure this?
Let’s think about it in terms of learning from other disciplines. I got put
in charge of doing the cyber infrastructure initiative for social behavioral
and economic sciences, so I met lots of computer scientists and engineers and
so on. It turns out that obviously in other areas of NSF, for example in the
computer science area, there is this whole research funding associated with
cyber trust. That is funded very much by agencies like DARPA.
What they are worried about is setting up secure computer access for
extraordinarily sensitive information. Let me give you an example of how that
is applied. When you are Joint Chiefs of Staff for example at DoD, you need to
think about troop movements in Iraq. They don’t go to a research data center to
look at the data. They log online with tight protocols to access the data. They
have gone through training and so on, but the cyber trust initiative has
invested substantially on setting up secure protocols whereby the Joint Chiefs
of Staff for example can access troop movements online.
I should say that the cyber trust doesn’t just look at the physical
computer security. Any computer scientist, everyone I talked to at NSF, would
say anyone can hack into a system. So what they also tried to do was to set up
human protocols as well, so figure out adoptable protocols. You are going to be
surprised to hear that computer scientists worry about this, but they learned
— you know when they tell you to make up a password, they say you have to use
some exclamation marks and some numbers and some capital letters and some small
case letters. So you are making up this completely unmemorizable password. So
you write it down and you stick it to your computer.
So thinking out the humanly adoptable protocols as well as the computer
science protocols is a very large portion of what they are worrying about, and
they are funding that. Joan Feickenbaum, who many of you might know at Yale,
has been heavily involved in the Portia project. There are lots of commercial
applications. Financial services, very sensitive financial information gets
accessed through the web. People don’t physically go on site to look at very,
very sensitive financial information. They have figured out how to solve these problems.
So can we think a little bit out of the box here? Rather than just one magic bullet, let's think about a portfolio approach.
Let’s think about setting up a set of computer protections. At DoD for example
there are actually three levels of computer protections. The cyber trust people
can tell you more about that. Try and minimize the amount of statistical
protection, take off obvious identifiers, but try not to muck around with the
quality of the data too much unless you can document what the impact is on the
quality of the analysis.
This is going to work differently for different agencies. Particularly with
the Census Bureau we have got Title 13, public use or not public use. But you
can have degrees of statistical protection, the law says within a reasonable
doubt. So you adjust the statistical protection by adjusting the screening
which you have for people coming in to use the data.
Kathy Wolman asked me to go and talk to the Conference of European
Statisticians a couple of years ago because they were worried about
confidentiality protection. So I did the opening speech. There were about a
hundred chief statisticians from around the world, and over and over again they
said, we will give the data to people we trust. In other words, there is a very big difference between giving access to a lot of people, the great unwashed, and giving access to a subset of people who have gone through a screening process, who maybe have an institutional bond and so on. So think about putting
those sets of requirements together.
I was and still am a researcher, but instead of assuming that researchers know how to protect data, actually have them go through an intensive training class.
Thinking back on how I treated data before I went to the Census Bureau, I don’t
want to tell you about it because I know this is being taped. I never did
anything wrong. But when you go to the Census Bureau and you work with
statistical agencies, there is very clearly a cultural confidentiality that
gets inculcated in you. So figuring out ways to train researchers so that they
understand that this is part of the mandate.
In other words, think about setting this portfolio approach up. What you
might have is multiple access modalities. You might have public use files and
synthetic data so you can get all your code run and muck around a little bit.
Then if you work through the remote access in a subset of cases, and then if
you wanted to go on site and work from the research data centers, go on site
and do that.
Instead of thinking about the silver bullet, one approach, think about an
integrated approach, legal, statistical, operational and educational, and think
about as we were saying having a consortium of agencies as Ron and Sally and
Gerry said. This shouldn’t have to be invented de novo by each agency.
You might think about developing a set of legal options, a set of
statistical options, a set of operational and a set of educational options, and
then different agencies, and different studies within agencies, can choose among them. For example, the
CPS within the Census Bureau will have a different set of rules than the annual survey of manufacturers, which is business data. So you might have different
sets of options that are chosen.
You could think about how the remote access might work. Instead of requiring people to physically go on site, use 21st century technology
and think about encrypted connections and smart cards. You can restrict user
access from specified predefined IP addresses. You can have something like
Citrix technology, which is becoming increasingly used, or you could have your
own developed approaches.
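One of those layers, restricting connections to predefined IP addresses, might be sketched like this. The address ranges are reserved documentation examples, not real institutions:

```python
from ipaddress import ip_address, ip_network

# Pre-registered institutional ranges (illustrative documentation blocks).
APPROVED_NETWORKS = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

def connection_allowed(client_ip: str) -> bool:
    """Admit only connections originating from a pre-registered network."""
    addr = ip_address(client_ip)
    return any(addr in net for net in APPROVED_NETWORKS)

print(connection_allowed("192.0.2.17"))    # registered range -> True
print(connection_allowed("203.0.113.5"))   # unknown address  -> False
```

On its own this is weak protection (addresses can be spoofed), which is why it would be only one element of the portfolio, alongside encryption, smart cards, and screening of users.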
For statistical protection you might remove obvious identifiers. You limit
access to the data that is approved and you could have the statistical
techniques chosen by the agency.
I strongly believe in training researchers. I think it makes a big
difference. The Title 13-Title 26 IT security training that they do at the
Census Bureau is very good, but that is web based. You would probably need
maybe a two-day training class before people are allowed to access the data,
and then maybe refresher things. But just basic principles of confidentiality,
why is this done, basic information about the laws governing the agency and why those laws work.
So I really do think that — it has been five years since Pat and I edited
that book on disclosure protection by statistical agencies. I don’t see that
there has been much new that has been generated since then. Yet the pressures
are increasing. The pressures for high quality analytical work, the enormous
promise of matching administrative and survey data is there. We have the
potential to do what I think is needed to be able to understand what is going
on in different parts of the economy. I would suggest that we might want to
think about using both the cyber trust and human cyber infrastructure. The San
Diego Super Computer Center is doing some work, we are doing some work, and so on.
So now I will shut up.
DR. STEINWACHS: Julia, thank you very much. I think you may have at most
five minutes to answer any questions or comments. So do people have questions
or comments for Julia?
DR. BREEN: You mentioned the portfolio approach. By that, do you mean an
array of different kinds of things that amongst them would provide protection
for the data and appropriate training and all of that stuff? It is a concept, is that right?
DR. LANE: Yes. Because public use files did so well for so long, we have
been very focused on statistical protection for public use type data sets. What
I am saying is that the problem of public use all by itself is that you have to
protect against any onslaught. So what you need to do is to think about a
portfolio where you don’t just use statistical protections, you also think
about limiting access to authorized researchers setting up computer security
protocols like Statistics Denmark has done very, very successfully, so that
people can’t hack into the data, and you limit their ability to re-identify the
data by matching to information that is on the Internet, and legal protocols so
that if they do go ahead and try to reidentify, you have got some legal
framework within which to prosecute them. There is an institutionally binding agreement as well.
DR. STEUERLE: Julia, thanks again for doing this. You are one of the many
stars we have here, and I really appreciate your coming.
I have two questions that are related to the ones I asked earlier. In your
experience at Census, are there really advocates within agencies for trying to
promote the research? I’m not saying researchers that want to do it, but I’m
thinking in the legal side, advocates who take the public good interest?
It seems to me that when you get down to the ultimate legal question, even
when I go through your matrix, somewhere, sometime, somehow, there is somebody
who could be identified. If I tell you eight characteristics, down to the letter, of an observation in my data file, and you can match that up and find the
ninth one, there is probably no circumstance at which at some level somebody
couldn’t be identified.
So the incentive of the legal community, if you are given the question of,
should you allow this to happen, is almost always going to be against trying to
figure out how to provide access, no matter to whom, even though we know that
at another level, a government worker could lose a veteran’s file. So the
chance of something happening is not necessarily related to people doing
research. So I am just curious, how to get around this.
DR. LANE: In most agencies it says reasonable means. So the big question is
whether a portfolio approach would match anyone’s definition of reasonable
means, and I would argue that it would.
I am not a statistician, we have got Tom Petska here who knows far more
than I, but my reaction would be, the law is reasonably clear as to what the
purposes are for which you can use the data. So you have to fit within each
agency’s mandate. It has to be an authorized purpose. Without that you are
breaking the law, so you don’t want to go to jail.
The question is the degree to which — the interpretation that is put on
that. You raised the very good point earlier that a very narrow approach would
just be, you can only count. A broader approach is to — and if you take a look
at the IRS Census criteria agreement, there are nine categories, and some of
them are quite broadly written, under which you can do research as long as its
predominant purpose is to improve economic and demographic censuses and surveys.
So with a broad view, I think a lot of work can be done, but you still have
to find that authorized purpose. Is that a fair statement, Tom?
DR. STEINWACHS: My big job is time watcher, so I worry about you and the
plane. Do you think this is a good time?
DR. LANE: I think I had better run.
DR. STEINWACHS: Again, thank you very much.
DR. LANE: Thank you very much for inviting me.
DR. STEINWACHS: I am going to ask Gene Steuerle to introduce the next panel.
DR. STEUERLE: We have a stellar lineup from our own Health and Human
Services agency, with whom we work closely in the National Committee on Vital
and Health Statistics. I have to confess, I know a couple of you but I don’t
know all of you, so I am going to be reading your titles, I apologize.
Jennifer Madans, Associate Director for Science. David Gibson of CMS. Steve
Cohen, the Director of the Center for Financing Access and Cost Trends at AHRQ.
This is going to be our first panel. Our second panel, depending on how long we
take to get to the presentations, is going to be Martin Brown, Chief of the
Health Services and Economics Branch at NCI; Gerald Riley is a social science
research analyst at CMS. John Drabek is an economist with the Office of the
Assistant Secretary for Planning and Evaluation, DHHS.
We are going to start with Jennifer and David and Christine.
Agenda Item: Health and Human Services Agencies
DR. MADANS: We'll start with me. I am going to start this and then hand it
over to Chris who runs a linkage unit, and then pick up at the end on some of
the policy issues.
What we are going to do is just briefly go over the NCHS data linkage
programs, past, present and future, look at our current access procedures, some
of the challenges that we see in conducting linkages now and in the future, and
maybe some ideas about how to solve them.
Obviously we are doing record linkage for the same reason everyone else is
doing record linkage. It certainly increases the accuracy and the detail of the
data that we can collect.
We generally use it to augment the information that we collect on our major
surveys. One of the early reasons for doing the linkages was to longitudinalize
what are really cross-sectional surveys. By doing the linkages we can follow
people over time without spending the money to recontact them. So it reduces
the burden on the respondent, it is cheaper, and it should get us better
information. So I think we are all doing it for basically the same very good reasons.
Just to give you an idea of some of the potential that has come out of our
record linkages, this is just a handful of examples, a lot of them having to do
with linking our population based surveys with morbidity and mortality outcomes.
We can talk about some of these databases that these studies are based on.
We do two types of record linkage. I think most of the time we are talking
about the first type here, and that is what we are going to focus on for this
talk, but I did want to mention the second one very briefly. This is where we
are linking at the person level or in some of our provider based surveys at the
facility level. So we find some external data source that generally is a census
of some kind, whether it is the Census, which we don’t link to, or all of the
CMS records or all of the information from the American Hospital Association,
where we can do a one to one link with our unit record on our survey, which is
the person or the facility, with a record or set of records on the external
file. We have done this at the person level and as I say at the facility level.
But a lot of our linkages deal with contextual data. All of our surveys are
geocoded, and we do a large amount of linkage of geographic data from a variety
of sources with these population based studies and also the facility based
studies. So this is where we might use census tract characteristics from the
Census. We can link those into the population based surveys. We have data from
EPA. We get information at the state level on things like Medicaid eligibility
requirements. So it is not at the individual level, but it is more the
contextual or geographic level.
A lot of the same issues in terms of access and identifiability that you
have on the first kind of linkage also apply to the second kind of linkage, but
the examples we give for today will be at the person level.
Now I’ll turn it over to Chris.
DR. COX: The NCHS record linkage program got its start back in the early
1980s, which is about when Jennifer and I arrived at NCHS, if that is okay to
say. We had both just graduated from college.
We began bringing together survey data with health care utilization data by
linking the national medical care utilization expenditure survey to Medicare
MADARS data, the Medicare Automated Data Retrieval System, for those of you who
have been around for a very long time as well. We did the same thing with the
epidemiologic followup study to the first national health and nutrition examination survey.
We were pretty much pioneers in linking to the national death index with
the NIFA survey. In fact that survey had respondent verification through either
proxy interviews or death certification collections, so it has become what we
call at NCHS a truth data set, and we use it to develop our probabilistic
record linkage algorithms.
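A truth data set like that is what lets you calibrate the agreement and disagreement weights in a probabilistic matcher. Here is a toy Fellegi-Sunter style score; the m and u probabilities are invented for illustration and are not NCHS's actual parameters:

```python
from math import log2

# field: (m = P(agree | true match), u = P(agree | non-match)) -- invented values.
FIELDS = {
    "ssn": (0.95, 0.001),
    "dob": (0.90, 0.01),
    "sex": (0.98, 0.50),
}

def match_weight(agreements):
    """Sum of log2 likelihood ratios; high scores suggest a true match."""
    w = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        w += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return w

strong = match_weight({"ssn": True, "dob": True, "sex": True})
weak = match_weight({"ssn": False, "dob": False, "sex": True})
print(f"strong candidate: {strong:.1f}  weak candidate: {weak:.1f}")
```

Against a truth data set, one can tune the m and u values and the decision thresholds to trade off the false matches and missed matches discussed later as type one and type two errors.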
We expanded the record linkage program in the 1990s to include a routinized
approach to linking the national health interview survey to the national death
index to collect mortality data, to allow for, as Jennifer said, the longitudinal
study of our cross-sectional data.
Then around 2000 NCHS reorganized and created
a data linkage unit where these services were centralized and systematized, so
we could gain from the expertise in that unit and conduct most of our record linkages there.
We were very productive. As of the year 2006 we have linked a variety of
our large national public health surveys to three focus areas, mortality data
collected from the national death index, Medicare data from CMS (the denominator and standard analytic files) covering the period 1991
through 2000, and retirement and disability benefits data from the Social
Security Administration. That actually goes from 1962 through 2003. So it takes
data back before the survey contact and extends the period of followup beyond it.
The surveys we were able to link to, for those of you who aren’t speaking
NCHS acronym today, that is the national health interview survey, the
longitudinal study of aging, the first, second and third national health and
nutrition examination surveys and the 1985 national nursing home survey, which
is a facility survey.
We have completed other data linkages. We link infant death records to the
birth certificates to allow for study of birth characteristics of infants who
die within the first year of life. We link information from the American
Hospital Association to annual survey of hospitals on facility characteristics
to the national hospital discharge survey.
We are trying to develop an ongoing program. These agreements we think we
have mostly standardized, if such a thing exists. It is pretty easy to
standardize an agreement to match with the national death index outside NCHS.
We take it easy on each other, we only filed about 30 forms back and forth to
share data within the organization. But we intend to keep linking the HIS from
1986 through our current year, which is about year 2005 available at the
moment, to NDI data. Our HANES surveys are now yearly surveys, so we will
continue linking them, and our national nursing home surveys will be linked to
mortality information. That is cause of death and date of death data.
We also will continue hopefully linking to Medicare enrollment and
utilization data from CMS. Right now we do have the HIS 1994 through 1998 link
to that data. We hope to extend that through more current years of HIS, and the
same for the HANES data.
We would like to start adding the national nursing home surveys to that
agreement. And of course we will continue to link birth and infant death files.
We plan lots of interesting linkages. I was congratulating Ron and Sally
earlier; I think we are about to have a one-year anniversary on the negotiation
of an agreement to add HIS to the CMS Medicaid project that they discussed
earlier. We are nearing that one-year mark, on trying to get that agreement
finalized between our two agencies to share that data, so we are very excited.
We think it will probably happen this year.
We also intend to be linking the 2004 national nursing home survey to the
CMS minimum data set, and that will allow us to pick up characteristics of the facilities the residents are staying in, as well as characteristics of the residents in those facilities. We hope to be linking the current NHANES to
food assistance programs such as food stamps and other food assistance
programs. We’ll see.
We don’t only do record linkage. We also spend a fair amount of resources
developing user tools, documentation, methodologic reports, where we describe
how the linkage took place and what kind of probabilistic matching algorithms
we use. We conduct bias analyses so that when researchers use this data they
understand what the limitations are and where we are not matching people and
what kind of people we won’t have linked data for. We also continually evaluate
and try and improve our record matching algorithms to make them just a little
tighter and reduce type one and type two errors.
I’m turning it back over to Jennifer.
DR. MADANS: As you can tell, there are no problems with data access for a linked file.
I’m sorry Julia is not here, because we were having a conversation in the
hall about this remote access system. I’ll talk about ours just very briefly.
We have the same kind of confidentiality concerns that the folks in the Census
Bureau discussed this morning. As everyone else has mentioned, our general
files are becoming harder and harder to release through public access because
of not only the kind of information we collect, but the kind of information
everybody else is putting out on the web. So we are having a reduction in what
we can put out as public use. But the linked files are particularly vulnerable
and we worry about them more than we worry about many of the other files.
We try to put out as much as we can on public use. There is the problem of
reidentification by the people who gave us the data, which is an added problem,
but we know that there will be files that are not appropriate for public use.
This is the kind of thing where we make an announcement, the data are on the
web. We have no restrictions. There is no control, there is nothing.
We also have a data center. There are three means of access in our data
center. We only have one data center in Hyattsville, so anything that you can
say about the nine that the Census Bureau has, it is much worse when you only have one.
I can announce that after many years of negotiation we have
reached an agreement with the Census Bureau to as a pilot project allow access
to our data in their data centers. So we feel like we have expanded access nine
times over what we had before so it is very discouraging to hear that nobody
likes their nine any more than they like our one. But it is still better. I
agree that there are problems with data centers, but they aren’t going to go
away, and I think we have to make them work better.
Our data center, like the Census data centers, does have on-site access. We
have a nice little room in NCHS. You have to go through two guards as well,
and it has all the other operational characteristics that the Census Bureau's
have.
We also started a remote system when we started our data center in
‘98. This I would call a first generation remote system. It is automated,
so it is remote and automated. It is called ANDRE, and I never remember what
that stands for. You submit programs through e-mail. Those are collected by the
system, the system evaluates them for illegal operations. You can’t do things
like list cases, and it won’t run them if it sees something illegal. It puts
them through a batch processing. The output is again looked at in an automated
way to see if there is any disclosure risk for what is sent back, and if there
is none it sends the output to the person requesting it.
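The screening step described above, where submitted programs are rejected if they contain disallowed operations like listing cases, can be sketched roughly as follows. The rule patterns here are invented for illustration; ANDRE's actual screening rules and the language of submitted programs are not described in the talk.

```python
# A minimal sketch of the automated screening a remote submission system
# like ANDRE might perform. The disallowed patterns below are hypothetical
# examples (SAS-style case listings), not the system's real rule set.
import re

DISALLOWED = [
    re.compile(r"\bproc\s+print\b", re.IGNORECASE),  # prints raw cases
    re.compile(r"\bfile\s+print\b", re.IGNORECASE),  # writes records out
    re.compile(r"\blist\b", re.IGNORECASE),          # case listings
]

def screen_program(program_text: str) -> tuple[bool, list[str]]:
    """Return (ok, violations) for a submitted analysis program."""
    violations = [p.pattern for p in DISALLOWED if p.search(program_text)]
    return (len(violations) == 0, violations)

ok, why = screen_program("proc freq data=linked; tables age*dx; run;")
print(ok)   # tabulation only, so it passes
ok, why = screen_program("proc print data=linked(obs=10); run;")
print(ok, why)
```

In the real system a similar check would run again on the batch output before anything is e-mailed back, since the program screen alone cannot catch every disclosure risk.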
A lot of the things Julia was talking about in terms of the front end are
very similar. You have to be a registered user. We will only send it to certain
e-mail addresses, all of those things. For our system we want to get to the
second generation, have a web-based system. We have to worry about the
firewalls and all of those things.
What she didn’t talk about, and I was curious to hear her talk about, is,
for us the real challenge is not the front end of that system, it is the back
end of that system. It is how do you do the appropriate kind of disclosure in
that kind of environment. A lot more control when somebody is sitting in our
facility. We have a lot less control when they are across the country. The
response of most of the system is that they let you do less remotely than they
will let you do if you are actually sitting there.
One does have to worry about things, like how do you deal with multiple
submissions of programs, where each one is fine but together they are not. You
have some of this when you are on site as well, but it is not as bad as when it
is remote. If I can do my commercial here, I’m not sure these problems are
insurmountable, but I am sure that we haven’t put the kind of resources that we
need to into solving them.
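The multiple-submission problem mentioned above, where each program output is fine on its own but the set together is not, is essentially a differencing attack. The toy data and minimum cell size below are invented to illustrate the mechanism:

```python
# Illustrates why two individually safe outputs can be unsafe together:
# a differencing attack. The microdata and threshold are invented.
rows = [  # (county, age, income) - toy records
    ("A", 34, 52000), ("A", 41, 61000), ("A", 58, 47000),
    ("A", 62, 90000), ("B", 45, 55000),
]
MIN_CELL = 3  # suppress any released count below this threshold

def safe_count(pred):
    n = sum(1 for r in rows if pred(r))
    return n if n >= MIN_CELL else None  # None means suppressed

# Each query alone passes the cell-size rule...
q1 = safe_count(lambda r: r[0] == "A")                # released: 4
q2 = safe_count(lambda r: r[0] == "A" and r[1] < 60)  # released: 3
# ...but their difference isolates the one person in county A aged 60+.
print(q1, q2, q1 - q2)
```

Catching this requires tracking queries across submissions, which is exactly why remote review is harder than reviewing a single on-site output.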
My other favorite soapbox is that we are reaching a point where we spend a
lot of money on data collection, but we missed a balance in terms of resources
for data collection versus data dissemination. The questions that we are being
faced with now in terms of dissemination are going to require some more funding
than we have been giving them in the past, and unfortunately if you have a flat
budget, if you are lucky enough to have a flat budget, if you are spending more
on dissemination you are not collecting as much data as you were before.
I think our users have to understand there are tradeoffs here. We can make
it easier to get access, but there will be less to have access to. That is a
fact. The question is, what do you cut, and that leads to many, many
difficult discussions.
Let me also say about data centers, I think all of us who are running these
data centers do understand some of the frustrations that our users are facing,
and are trying to solve some of those, especially the front-end stuff, getting
proposals approved, facilitating people's access as they wander the maze of
the approval process.
I sometimes say we are going to get a Walmart greeter. Everyone will be
assigned a person whose job it is to make their stay in our data center a happy
one. I think we are all moving towards that. It will solve some of the
problems. It certainly won’t solve all of them. You still have to come to
Hyattsville or one of the other seven data centers. So I personally think a
remote system is going to be more beneficial, but it is going to be a lot more
expensive to develop.
We also have staff in our RDC that can provide on-site programming and help
people out, but I know most researchers like that the least, because they want
to get their hands on the data, they want to be able to play with it and see
what is happening. But it is an option. So I think we are beginning to get
that portfolio that Julia was talking about, different ways of accessing the
data.
I think the main theme we are worrying about is disclosure review, and
how do we do that in a way where — to answer your question, Gene, I think
there is an issue of due diligence. I think if you force any of us into a
corner, we will agree that we cannot protect one hundred percent, although that is
what we try to do, but when push comes to shove, we have to be comfortable that
we did everything that we could, and that an outside person looking at what was
done to try to protect the data will say that we exercised due diligence.
Obviously it is a judgment call, but I think that is where we are comfortable
in saying that we have done what we can.
So where are the challenges? These have also been mentioned. First we have
to get the informed consent. We have to get the informed consent from the
person who is giving us the data. So we have to meet their needs, and we have
to do that in a way that will satisfy NCHS’ institutional requirements for
permission to link, what do we need to get from the person so that we can go to
our IRB and make a case that we should be able to do this linkage. We need to
satisfy the provider’s institutional requirements for permission to link, and
we have to be able to get that informed consent from the — in our case, from
the subject whether it is a person or a facility, where we on the one hand can
tell them about the importance, also be fair about what the risks are and how
we are going to protect them.
So those are hard things to do. One of our concerns is to get a more
systematized way of doing that in a way that will be acceptable to the
community. Up until now we have all been doing this on our own. There is not as
much collaboration among the agencies, both the providers and the receivers of
these administrative data as there could be. Every time we start a new linkage,
it is as if we have never done a link before. We start out from square one; do
you have the right permission, is it the right permission for us. To get the
lawyers involved is always a bad thing. So we really need a better way of
working this out.
This just shows you what is happening to our ability to get adults to give
us their social security number, similar to the Census Bureau's experience. If we got the
number, we took that as approval to link, and it is going down. There was a
change in how we did this in 2001, but even at 41 percent which is the better
number, that is not enough to have a viable data set. So using this as a
mechanism for getting permission isn’t going to get us very far. We are also
thinking of asking a more explicit question about linkage, because most people
don’t want to give you the social security number when they don’t want to link.
But we have to do that not only so that we will get the number, but in a way
that is protective of the subject and acknowledges their right not to be
linked. These surveys are not mandatory, so we have to provide true informed
consent and get their permission.
As has been mentioned many times, the institutional requirements among
those of us who are receiving and providing data are very complex. We have some
differences in our privacy requirements and how our privacy requirements are
interpreted, the legislative mandates. Two years to reach an agreement, we don't see that as odd. It
takes a long time to develop these, but once you develop them you can extend
them. But it would make life a lot easier for everyone if this was done in a
more collaborative way.
Again, the balancing of the resources. These linked files are fairly
expensive to produce, to provide the right user documentation. They often don’t
come in a way that is easy to analyze. You have to transform them into
analytical data as opposed to administrative data. You have to tell people what
is good and bad about them and what mistakes they can make. We sometimes don’t
have the right expertise to do that, so we have to work closely with the
provider. Then what kind of resources are appropriate to put into assisting
users with these kinds of data versus other things we are doing, like
collecting new data.
We try again to create public use files from the linked data. It is very
difficult. When we get data, for example from CMS, and we create a file from
that data that CMS itself cannot take and reidentify, there is
not much left on that file, but there are some things, and we need input from
users on what that something should be. We are making decisions about what to
take off and what to leave on and how do we satisfy the most use.
If you can get someone started on a public use file first, that may go a
long way to making their stay in the data center much more pleasant, because
they have
experience with the file. We also use sworn agent status, and we are trying to
think of ways that we can use that authority for different kinds of data to use
different kinds of access mechanisms. We are also looking at whether or not we
can use perturbation on these linked files. I think we are also looking at
trying to create synthetic files.
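Perturbation of the kind mentioned above amounts to adding random noise to sensitive values before release, so exact values cannot be matched back against an external file. The noise scale and data below are invented for the sketch; real disclosure review uses much more elaborate methods (swapping, top-coding, synthetic data).

```python
# A minimal sketch of numeric perturbation for disclosure limitation:
# add Gaussian noise to a sensitive field. Scale and values are invented.
import random

def perturb(values, scale, seed=0):
    """Return values with independent Gaussian noise of the given scale."""
    rng = random.Random(seed)  # fixed seed so the release is reproducible
    return [v + rng.gauss(0, scale) for v in values]

incomes = [52000, 61000, 47000, 90000]
released = perturb(incomes, scale=2000)
# Aggregates are roughly preserved even though individual values change.
print(round(sum(incomes) / len(incomes)), round(sum(released) / len(released)))
```

The design tension is exactly the one in the talk: more noise means less disclosure risk but also less analytic value in the released file.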
So again, we really want to share the knowledge and experience across
agencies so that we are not reinventing the wheel, and we can learn from each
other.
The standardization should increase efficiency. We may be able to do it
faster, it will be cheaper. It might help us do the user documentation if the
providers are thinking about possible other uses for their data. We saw big
changes in the CMS data when we first started linking back in the ‘80s. I
remember, we couldn’t figure out how to read the file, the first one we got, to
now when things are much, much easier. And again, the development of standards
and best practices for linking, for data handling and how to get the extracts
and the documentation.
We would like to increase collaboration and communication among agencies. I
think this has to be done, maybe under OMB, I don’t know, but there has to be
some way of allowing us to work together without violating certain ways of
interpreting our authorizing legislation. We have to find safe havens where
things can be shared in the development.
We want to develop more linkage projects. If we see a useful file, we
are going to try to incorporate it into our data systems. I think if we can
expand the access to RDCs through the development of new disclosure
methodologies so that we are more comfortable in using remote access systems,
we have more control over being able to evaluate whether something is a risk, I
think that would go a long way toward improving user access.
DR. STEINWACHS: One way might be to have a few questions now or comments,
and — it’s up to you, Gene.
DR. STEUERLE: Let’s go ahead and take some questions now. Then we will have
the next set of speakers, then we will have questions there. Somewhere in
between we need to take a break.
MS. TUREK: I’m just curious, on the re-disclosure, why can’t you have an
agreement with them that they won't do that? That would seem to be relatively
simple.
DR. MADANS: I think all of us would interpret our requirements as saying we
can't allow a redisclosure to take place, because if we were to do that, we
could let anybody who says they won't try to re-identify have access to the
data.
MS. TUREK: But presumably the only person that could do it would be the
agency that supplied it, right?
DR. MADANS: That particular file. But we have many data files. Most of our
files, and this may not be true for Census, are not identifiable on their own
once we take out the names and addresses, those straight identifiers, when we
get them out of the field. They are only identifiable if you link them to
something else that is in either the public or private domain.
In the case of a university user, we don’t give them our linked file with
the NDI, because they would be able to do re-disclosure. But if they signed
something that says they won't do it, it should be the same thing as another
agency doing it. If an agency that has programmatic or administrative
responsibility did do the re-disclosure, there is harm to the subject. I guess
our bottom line is, do no harm to people who have done their civic duty.
I think this is this due diligence thing. If there was a breach, we would
feel in that case it would be our fault there is a breach. So whether or not
you can say they can have anything, or we won’t put out anything, it is what
steps can you go through. There are legal requirements such that if somebody
at CMS did this, they could be fired, fined, whatever. We try to do
statistical perturbation so they can’t. But I think it is the responsibility of
those of us who have made a pledge to the people allowing us to link their data
and giving us sometimes very sensitive data, that we will use a variety of
mechanisms to make sure that they are not harmed. We have thought that someone
signing a piece of paper that says they won’t do it is not sufficient
protection for that person.
DR. STEINWACHS: You were mentioning about the declining willingness to give
the social security number. Your explanation sounds not surprising. Sally or
Ron were talking about a capacity now to match to unique individuals without
the social security number. I wondered whether there was any possibility
here of a short cross-conversation: is there any way NCHS could use what you
have, or is that somehow not in the feasible domain? I can already see, Gerry
is ready to take this.
DR. MADANS: We are having that conversation. I don’t know. That is one of
the things we are trying to work out. We can do similar things and we do link
to the NDI without the social security number. We can do that. It is not as
good a link, but we don't have access to as much information.
So we started these dialogues, but we are on new ground here, whether or
not some of these things are possible. We are hoping that we can do things more
jointly. We have lots of joint projects. It does seem a reasonable thing to do.
So maybe next year when you have this we will have signed all these things and
can move forward.
But I think it is important for us to do it in a proper way. So we try to
work out what are we asking for, what are the problems, what are some of the
reasons you wouldn’t want to do it, and then have an open conversation.
DR. STEUERLE: A lot of your comments on disclosure seemed to me to be
related to surveys, where you had to get permission or you felt implicit that
you have got permission for people to not use the data in certain ways.
But among the other linkages that we are talking about here are linkages of
administrative data sets. I can imagine some private researcher going out and
gathering some data from hospitals somehow, if there was even such a data set
that exists, he wants to link to Medicare data on hospital payments or
something like that.
I am thinking there are a lot of pieces of data where we have never
actually gone to the public and asked for permission, but we may have gathered
the data in some form they filled out in order to apply for a program.
There, correct me if I am wrong, but the agencies each use a different
standard: it is what we interpret the law as allowing us to do. Are you also
saying that
there is often an implicit agreement in these administrative data sets that if
I filed for whatever it is, if I filed for Medicare, that that data is
therefore usable by some other agency who also has administrative data records,
that they could be merged?
DR. MADANS: Most of what we are linking are survey data. We have one data
set that is registry data, where our relationship is with the states.
I think where this becomes an issue is how is the owner of that
administrative file, how did they view their responsibilities to the people who
are in their file. That is why we have these very long negotiations. So if we
want to link to the Medicare file, we think we are in a very good position,
because we ask people and they said we could link to Medicare. But then we go
to the agency and they say, we view our requirements as not allowing you to
link. We never had that conversation with Medicare, but we have with others.
Even though we have an explicit statement that says we are going to link to
these kinds of records, and sometimes we even mention the kind of record that
we want to link to, their legal interpretation is that they have to get
approval, even though we have approval.
So it takes awhile, especially when this is the first time it is happening
with that particular agency or that file, to work through why do you think —
like IRB discussions; what is the risk, what is the benefit, how can you
protect. It takes a while to go through that.
But I think our interpretation has always been, if we have permission, then
we should be able to go into an administrative file and pull out the records
for that person.
Now, in terms of linkages where it is not based on a survey, you are just
taking two administrative files and linking them, we don’t do that very much.
So I don’t know much about it.
DR. STEUERLE: There may be other questions for the group when HHS finishes.
There are questions about administrative data sets we don’t combine at all,
independent of the ones that you are thinking about combining at this point.
DR. BREEN: Jennifer, you said that — thank you very much for your talk,
both of you — you mentioned that you would put more resources into data
dissemination, that you thought that was an area where you could use more
resources. But what exactly would you do with the resources? Have you thought
about what you would like to do and how you would like to expand that?
You said you were going to increase the number of data access locations to
utilize those of the Census Bureau, but knowing you, you probably have some
other ideas on what you want to do.
DR. MADANS: I think we should be spending more on this. I don’t know how Ed
feels about this. For me there needs to be some basic science methodologic
development on the disclosure review process. That is where I would put my
first $200,000 or whatever it is going to be.
We are committed to trying to use the Census data centers. There are costs
associated with that. I think there are costs associated with our own data
center, it has been underfunded. So that is two very easy things, to more fully
staff the data centers so that the process is simpler, easier for the user, and
then to develop these disclosure review processes.
I think making it easier on the front end is a staffing question that is
not rocket science. We have somewhat fewer requirements than the Census Bureau does
in terms of how do you qualify to use our data center. We have found it very
difficult to say people can’t use the data center. I think that Julia mentioned
something about, we give it to people we trust. That is not a way we can work.
We have to have very clear rules about who gets it, and we can’t have a lot of
judgment about who can access the data if they have a reasonable proposal.
But I think we can fix that. I think that is a staffing and money issue. I
think it is an increased knowledge issue to deal with how do you do disclosure
review in a faster way.
Then we didn’t talk about this, to really use the administrative data and
use it in a way so that it really does augment the survey data, you have to
figure out a way to get the administrative data faster. That was brought up as
well. So that is money that we are not going to spend, but I think money that
our users need to get someone to spend for the providers of the data so that
they can get it out quicker.
But I think in general, within each of our data divisions we need to put
more resources, more staff primarily and more developmental money into doing
different products, easier access to the products, more technical assistance.
As they get more complicated, people get frustrated because they didn’t quite
understand how to use them. There are some things I don't know how to fix,
and some I do. The ones I do, if we had the money we could solve; they are
pretty straightforward. But the thing we don't know so much about is how to do
disclosure quickly and efficiently.
DR. SONDIK: I completely agree with that. I think one of the areas where we
could really put some more resources is into smarter front ends to the surveys
and to the vital statistics, for that matter. I view that as part of
dissemination.
I think trying to deal with some of these issues in the RDCs that have come
up — I just want to make one brief comment. I was struck by the prior
presentation and this one; maybe it is a bias on my part, but I had a sense
that in the prior presentation, disseminating the data and having the RDCs
was seen as useful, but not necessarily as the primary purpose of the data
collection being addressed. Maybe that is a little harsh.
That is not true with NCHS. I think the primary purpose of NCHS is to
disseminate the information that we are collecting as effectively as we can. In
some sense it is the same problem, but in another sense I think maybe it is a
little bit different emphasis. But if we collect NHANES or HIS and we sit on
it, then we are not doing our mission.
So the idea of making this data as accessible as possible is really a
principal drive for us. Jennifer has said on many occasions that this is our
principal mission, and I agree with that. That is in a way why this is so
important.
MR. LOCALIO: I am citing here the Institute of Medicine report on expanding
access to research data. This is page 12. One of the sets of research that they
cite has to do with evidence of increased public concern about privacy and
distrust of government assurances of confidentiality. In other words, if you
don’t get people participating in these interviews, it becomes a problem.
I just want to tell you, I read this waiting to see a physician. I had to
wait a couple of hours. In order to get to see a physician today, you have to
give your social security number.
But anyway, one of the things that struck me while I was reading this, in
these studies it doesn’t appear that people, when they are asked about their
attitudes about trust, were asked about the various reasons that concerned
them. Are they concerned that the information they are going to get is going to
be turned over to the Immigration and Naturalization Service, very valid
concern these days, to the Homeland Security Department, to the Internal
Revenue Service, to law enforcement. I would say that those are concerns that a
lot of people have. Having their information linked, de-identified and then
given to some researcher who couldn't care less about who these people are is a
very different level of concern.
It seems to me that you have a one size fits all set of guidelines here,
that we need to partition. So I would encourage people to do more research,
more methodological research on why people feel constrained about participating
in interviews, for example, exactly what the reasons are, and then design
policies based on those reasons, in terms of whom you give access to data to.
MS. GENTZLER: I wanted to ask Jennifer, you had touched briefly on your
experience trying to link with food and nutrition assistance programs, given
that they are state administered mostly.
DR. MADANS: We have just started that. There was that recent IOM or
somebody’s report that said we should link HANES to the food stamp and WIC and
a variety of other things. So we have something jointly going with the Census
Bureau to try to do a pilot on that. We have been talking with the folks at the
Food and Nutrition Service to figure out how to do that link. That is one of
the ones where it is going to have to work through some conflicting
interpretations about how you do it, and the state involvement and what kind of
approvals do you need and who has to give the approvals and where do you get
them.
So we have started that, and again, next year we can tell you what
happened. I think there are those basic concerns, there are a lot of other
concerns with doing something on a state basis. It is not so bad for us because
we are only in a certain number of states every year, but you have to negotiate
with every state every time you do something; it gets to be a little time
consuming.
If we were in all 50 states, this probably wouldn't be the way to do it, but
HANES isn't in all 50 states every year. I think that this is again where
we are at the beginning. It probably won’t take us three years, but it probably
will take us a little time to figure out whether or not this kind of linkage is
going to work, and the data when it is linked with the survey data is useful.
Again, there is a cost to doing these linkages. The final question, after
you do it for awhile, is: is anybody using it, is it helping them? That
sometimes is a function of the quality of the data, or not so much the quality,
but is it appropriate for the research questions that you can address with that
data set. It may be that that is not the case, so why put everyone through the
process. But yes, that is one where we are at the beginning.
MS. TUREK: I am struck by the fact that the confidentiality and privacy
laws seem to be agency specific. I think my big example of this is, the HIS
only puts out range for its total family income question, and you have to go to
the research center to get the dollar amount, where on the CPS I can get the
dollars of income for detailed sources of income on 20 sources. I don’t see
where the risk on the two surveys is all that different.
It seems to me that this is a case where some consistency across agencies
would also be helpful.
DR. MADANS: I don’t disagree with you on that particular issue. I think
that there should be consistency.
But it is not that item again. When you make a public use file, it is a
tradeoff of where do you want your details. So when those files were created,
it was decided because it is a health survey to give up some detail on that
item, as opposed to detail on another item. I don't know where the tradeoff was
for that, but it is a balancing act.
I think the folks who create the files try to do the balance in the best
way they can with their knowledge of what the data are going to be used for,
and they try to create the files in a way that they will meet the needs of the
greatest number of people, which sometimes means people like you don’t get
exactly what you need.
But in this particular instance, we have heard you. They are going back to
look at that to see if we can change that and would it affect the other items.
But when our disclosure review board looks at a file, they look at a whole
file. They may go back to the program and say, okay, you can have this or you
can have this, but you can't have both. That decision might have been where
the top coding came in.
DR. STEUERLE: I think we are going to have to cut here and go to our next
speakers. I have got the list here, but I don’t know how many people we have
got left from HHS speaking. So just so I know how many presentations we have,
can you raise your hands so I can count? Five formal presentations.
Why don’t we go then immediately to the next presentations. That would be
David Gibson and Steve Cohen.
MR. GIBSON: My name is David Gibson. I work in CMS Office of Research
Dissemination and Information. I am here to replace Spike Duzor, who is the
project officer on this particular project that you asked me to speak to you
about today. To the extent that I can answer questions, I will certainly try to
do so.
First of all, you have to realize that when Medicare and to a large extent
Medicaid came into being, we saw our mission not as doing research; it was
basically to pay claims. Originally it was oriented to acute care, at least on
the Medicare side. It was oriented towards cost based or reasonable charge
reimbursement. There was little concern for the impact of administrative
decisions on research and how to use the data.
Consequently, we organized our databases in ways that often made it
difficult to do research. But as time went on, we saw that we wanted to come up
with more rational ways to make payment, for instance, to bundle the payment
for a number of services together rather than paying for each service
individually. We also saw that we wanted to match the services that we were
providing and that we were covering to what the needs of the population were.
We also realized that when we were in this mode of trying to pay claims
quickly, that we often made payments when it was inappropriate or even
fraudulent. We were often in what they call a pay and chase mode. We made the
payment, and then we had to go back and get the dollars, and that was often
hard when a home health agency or a DME was using a former BP station as their
place of business.
What are some of the barriers that we have in using CMS administrative and
other data to study health outcomes? Spike asked me to go through some of the
list of items. Many of you probably know many, many more; I don’t pretend that
these are exhaustive. Then I want to describe to you a database that we are
building that we hope will address some of them, though not all of them.
One of the big problems that we have in the Medicare and Medicaid systems
is a lack of unique identifiers within programs across types of data. The
biggest one we have is the health insurance claim number or HIC that we use to
identify beneficiaries in the Medicare program. We use an identifier that as
most of you know is a combination of a CAN, a claim account number, which is
the social security number or an RRB, Railroad Retirement Board number, for a
beneficiary, and then a beneficiary identification code, or BIC, that relates
that person to the account holder.
Unfortunately, some of these identifiers change over time. You have
situations where a wife gets benefits under her husband's account, and over
time she may earn enough under her own account to get benefits on her own
record. Then the CAN portion of her health insurance claim number will
change.
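The identifier structure described here, a claim account number plus a beneficiary identification code, can be sketched in simplified form. The HIC values and BIC suffix meanings below are invented for illustration; real HICs and BIC codes follow SSA conventions not detailed in the talk.

```python
# A simplified sketch of the Medicare HIC structure: a claim account number
# (CAN, typically the wage earner's SSN) plus a beneficiary identification
# code (BIC). All values below are hypothetical.
def split_hic(hic: str) -> tuple[str, str]:
    """Split a HIC into (CAN, BIC), assuming a 9-digit CAN prefix."""
    return hic[:9], hic[9:]

# The same person can appear under two different HICs over time, e.g. a
# wife first entitled on her husband's account who later qualifies on her
# own record - the CAN portion itself changes:
before = "111223333B"  # husband's SSN plus a spouse-type suffix
after = "444556666A"   # her own SSN plus a primary-claimant suffix
print(split_hic(before), split_hic(after))
# Matching on HIC alone would treat these as two different beneficiaries.
```

This is why longitudinal follow-up needs a cross-reference of old to new HICs rather than exact-match linkage on the identifier.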
This causes problems in following people longitudinally. If you have a
situation where you are working with a five percent sample, you already have a
fairly small sample. It is robust, but think about if you want to follow a
cohort over a long period of time. If one to two percent of the people cross
reference out of your five percent sample, think of multiplying .05 times .98
times .98, and you keep seeing the effect of that diminishing of the population
that you are interested in.
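The attrition arithmetic just described can be made concrete. The 5 percent sample and the roughly 2 percent annual cross-reference loss are the figures quoted in the talk; the ten-year horizon is an arbitrary choice for the sketch.

```python
# The diminishing-cohort arithmetic from the talk: start with a 5 percent
# sample and lose about 2 percent of the cohort each year to identifier
# changes (the 0.05 * 0.98 * 0.98 * ... effect).
def effective_fraction(start, annual_retention, years):
    """Fraction of the population the sample still represents after n years."""
    return start * annual_retention ** years

for years in (1, 2, 5, 10):
    frac = effective_fraction(0.05, 0.98, years)
    print(f"after {years:2d} years: {frac:.4f} of the population")
# After 10 years the 5 percent sample behaves like roughly a 4.1 percent one.
```

The compounding is the point: each year's loss is small, but a long follow-up multiplies them together.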
So we have that problem of unique identifiers for beneficiaries. We have
the same problem with some of our providers. If you look at some of our data
that is provider oriented, say hospital oriented, you will see that the number
of hospitals has supposedly dropped drastically over time. What that is, is a
recategorization of those hospitals as critical access instead of short stay.
Second, we have a lack of unique identifiers across programs. We have
Medicare, we have Medicaid. As the speaker before me mentioned, we have
assessments now for home health, for skilled nursing facilities, for rehab
facilities and for swing beds. These numbers are assigned by the state, not by
the Medicare or the Medicaid programs, so you have a different set of
identifiers for bennies for this very rich source of information that will help
to give more information about the facility and about the patient.
What are some other barriers that we have? We have the separation of
billing and associated diagnostic and therapeutic care into separate bill
types. By this we mean, if you look at the services that we have, you can see
that many times if you wanted to create an episode you would have to go to as
many as seven different bill types to try to find the information that is
related to that particular episode.
We also have the use of what we call ruleout or confirmed diagnoses on
certain types of bills. Some of the bills that we have are not at the point
where a final diagnosis has been made. So if you are looking at the principal
diagnosis, and even sometimes the secondary diagnosis, you are going to be
seeing diagnoses where the physician is trying to determine whether the
patient really has that particular condition or not. So that diagnosis and the
presence of that diagnosis in the field doesn’t necessarily mean that you would
want to include this person in a particular category.
So many times, if we are doing cross sectional analyses looking at patients
who have diabetes or some other condition, we are in fact including bills
where the physician was trying to confirm whether the patient really had
diabetes or not.
So what does this cause? This causes problems with trying to identify
persons by diagnosis in a cross sectional analysis. Just getting prevalence
rates is often difficult for us, because we are not sure that we really are
looking at those people who in a cross sectional time frame have a particular
diagnosis. It also causes problems with us trying to find when certain
conditions started, the incidence rates associated with particular diagnoses.
We also have problems with the fact that we use different coding systems.
Many of you have encountered this. If you are looking at someone who had a
surgical stay in the hospital, you will find that on the hospital sides we use
the ICD-9 procedure codes. On the physician or professional side you will see
that we use what we call the HCPCS, the Healthcare Common Procedure Coding
System, which is basically based on the CPT-4 that the AMA puts out.
Other problems are the lack of what many people think is most crucial,
clinical information that determines and differentiates personal level critical
pathways. In other words, if a physician has decided that a person has a
particular disease and they want to adopt a particular regimen or protocol for
that patient, it would be helpful for us if we had that clinical information.
We do not have that.
Next is the lack of information on the cause of disability for the
disabled. About 15 percent of our people are disabled. They are under 65. We
can be looking at the diagnoses on the claims to try to figure out what was the
cause of the disability that they have, but we do not actually know the cause
for disability for this individual. So we do not have from SSA the actual cause
for why the person is entitled to Medicare.
Another problem that we have with our data is, you will notice that most of
the time the data that you will get from us will be listed by what we call MSC
code, Medicare status code, which will just tell if they are aged, if they are
disabled, or if they have ESRD. This will not tell you for the aged population
if they formerly had a disabled status. For instance, many of our disabled
beneficiaries aged in: if they survived long enough, they moved into our aged
population. They are quite different, we find, from the aged folks who come in
at 65, or who postpone coming in when they are eligible at 65 because they are
covered by a working aged situation. They may be 69, 70 or even older before
they sign up, and they have very, very few causes of disability. Yet our data
tends to group them together if you just look at things like age or Medicare
status code.
What else? Lack of comprehensive, and by that I mean both breadth and
depth, of person level data on primary and secondary health insurance coverage.
We group people who are working aged and are covered by their employer sponsor
plan with those people who use Medicare as their primary source of insurance.
Consequently you may miss a lot of their utilization if you go out looking in
the claims, because the provider if they are going to Blue Cross Blue Shield
may not even bother to send the information in. So when you compute rates and
whatnot, you find that you often underestimate what the actual cost is for
those people who use Medicare as their primary, because you are grouping people
in who you shouldn’t be including.
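The dilution effect just described is easy to see with made-up numbers (a hedged illustration, not actual Medicare figures): beneficiaries whose claims go to the employer plan contribute little or nothing to observed Medicare spending, so averaging over everyone understates cost for the Medicare-primary group.

```python
# Hedged illustration with invented dollar amounts: mixing working-aged
# beneficiaries (whose claims may never reach Medicare) into the
# denominator deflates the per-person cost estimate.
def mean_cost(costs):
    return sum(costs) / len(costs)

medicare_primary = [5000.0, 7000.0, 6000.0]   # costs Medicare observes
working_aged = [0.0, 0.0]                     # claims went to the employer plan

true_mean = mean_cost(medicare_primary)                    # 6000.0
diluted_mean = mean_cost(medicare_primary + working_aged)  # 3600.0
print(true_mean, diluted_mean)
```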
Next is the lack of information on socioeconomic status. That would be an
ideal thing to add in. We also are unable to get the cause of death, and that
has been discussed earlier today in other presentations.
This one is an odd one. Since we have just gone into Part D, we assumed that
we would be supplementing the Part A and Part B benefits with our Part D
event data. Unfortunately we are unable to do that. The law prohibits us from
doing that. We have a notification that we are hoping to get out soon that will
allow us to start linking the Part D drug events for those people. It is not
all the people who had drug benefits that we are going to be getting their drug
events. If they are in a situation where they are getting coverage by their
employer, their employer may get a subsidy, but that doesn’t mean we are going
to get the drug events. We are going to only hopefully be getting the events
from the folks who are going through PDPs.
Next is the size of the sample. Traditionally we give out the five percent
sample, but for many conditions — and I will be talking about the chronic
condition warehouse in just a couple of minutes — it would be ideal if we had
a larger sample. But right now we have a five percent sample that we are
dealing with, and that is primarily because of the size of the physician
claims. There are about 800 to 900 million physician supplier claims a year,
and these are very complex records going often over 2,000 and 3,000 characters
per record, and it often becomes difficult to do large scale analyses where you
would like to do small area analysis.
Next is the inability to disaggregate the program payments, the payments by
other payors such as some claims for the working aged, and the beneficiary
payments. If the services are all bundled, such as under the DRG in the
inpatient PPS system, it is hard to disaggregate them to get accurate revenue
functions for the providers. If you wanted to disaggregate an entire stay that
involved 20 or 30 different revenue centers and different types of services,
the payment information is only stored on what we call the fixed portion of
the claim.
Next is the inability to link specific services on claims with provider
costs to develop provider cost functions. I think this was mentioned earlier.
Wouldn’t it be nice for institutional providers to link the claim back to the
cost report, take the payment amount, and using some methodology, whether it is
cost to charge ratios or some other methodology, allocate the provider’s cost
for providing that service to all the bundled services within that entire stay.
On the physician side, it would be good to have information at the office
setting for determining the cost to the physician for providing a service. We
do not have that.
Next is the inconsistent use of the UPIN or the unique physician
identification number. We have a problem, that many times the physicians in a
practice will all use the UPIN number for the main doc. Especially we see this
with radiologists, pathologists and to some extent anesthesiologists, where
they all use the UPIN for one doc. So you may see one service only for many
physicians, or you may see 25,000 for a limited set of physicians.
Now, admittedly for pathology they may see a large number of samples, but
many times we think this is the result of using the same UPIN number for all
the physicians in a practice.
DR. STEUERLE: David, if I could interrupt you for a second, this is partly
my fault, but we have seven speakers on this session.
MR. GIBSON: I am going too long, I apologize.
DR. STEUERLE: It is not entirely your fault. We ran over in the morning
meeting, and we haven't enforced the time on anybody yet, so we are starting
to enforce it now.
MR. GIBSON: I apologize. Let me talk about Section 723. There are a lot of
these barriers to using data. I would like to mention a couple of them that we
are trying to address with Section 723. That is the final one down here.
723 was created by the Medicare Modernization Act. It was signed in
December of 2003. It made a lot of major changes in the Medicare program. I
talked about the outpatient prescription drug benefit, basic changes to
Medicare Advantage, to the managed care program, but it also mandated studies
and demonstrations to improve the effectiveness of the Medicare program and
the quality of care for program recipients.
What that law did was establish a research database for the chronically
ill. It was recognized that, while the program had previously been geared
towards acute episodes, the chronically ill account for the great bulk of
program payments. In fact, they estimated that closer to 80 percent of the
dollars are accounted for by the chronically ill.
So in this database that would be geared towards the chronically ill, they
wanted to provide the capacity for improving quality of care, for reducing the
cost of care, et cetera.
So what we did is, we went out, we talked with some clinicians. We
developed algorithms for defining 21 chronic conditions, using the claims to
identify these individuals. These are just some statistics that I
just quoted to you. The idea here was to identify individuals who had these
chronic conditions, and to eliminate some of the barriers that I mentioned
earlier, such as, we developed a system to do away with the problem of cross
referencing, and the fact that many people change numbers over time.
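A claims-based condition flag of the kind just described can be sketched as follows. This is a hypothetical illustration only: the real CCW algorithms specify exact diagnosis codes, claim types, and look-back windows that are not reproduced here.

```python
# Hypothetical sketch of a claims-based chronic-condition flag; the
# actual CCW algorithms (codes, claim types, look-back periods) differ.
def flag_condition(claims, code_prefixes, min_claims=2):
    """Flag a beneficiary when enough claims carry a qualifying diagnosis."""
    hits = sum(
        1
        for claim in claims
        if any(dx.startswith(p) for p in code_prefixes for dx in claim["dx"])
    )
    return hits >= min_claims

# Two claims carry an illustrative ICD-9 diabetes prefix ("250"), so the
# beneficiary is flagged.
claims = [{"dx": ["250.00"]}, {"dx": ["401.9"]}, {"dx": ["250.02"]}]
print(flag_condition(claims, {"250"}))
```

Requiring a minimum number of qualifying claims is one common way to guard against the rule-out diagnoses discussed earlier.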
We developed a methodology that would also allow us to link data in these
different systems. There is a diagram here that describes that. It is called
the enterprise cross reference system. The idea here was to link Medicare,
Medicaid and assessment data. We assigned one unique number within each of the
systems, and then across the systems we allowed a methodology that will
identify individuals in both systems and pool their information together and
put it into the chronic condition warehouse.
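The cross-reference idea can be sketched as a crosswalk table that maps every historical identifier to one stable internal ID, so records filed under old and new claim numbers pool to the same person. All names here are illustrative, not the actual ECR design.

```python
# Minimal sketch of a cross-reference table (illustrative only): each
# historical claim number maps to one stable internal ID, so a
# beneficiary's records pool together even after the number changes.
crosswalk = {}
_counter = [0]

def stable_id(historical_id, same_person_as=None):
    """Return the stable ID for an identifier, linking it to a known one."""
    if historical_id in crosswalk:
        return crosswalk[historical_id]
    if same_person_as in crosswalk:
        # New number, same person: reuse the existing stable ID.
        crosswalk[historical_id] = crosswalk[same_person_as]
    else:
        _counter[0] += 1
        crosswalk[historical_id] = _counter[0]
    return crosswalk[historical_id]
```

With this in place, a record filed under a widow's new claim number resolves to the same stable ID as her earlier records.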
This allows us to create patient level profiles. It allows us to identify
and keep a history of people who have these 21 chronic conditions, and in
addition, if we want to search the database for other conditions or run ad hoc
queries, it allows us to do that as well.
The difference that you see in this database versus the others that we have
had is that most of the databases we have worked with in the past have been
flat files. This one is a relational database.
I am trying here to pick up a couple of points that might be of particular
relevance. I think the idea that I want to emphasize in this database is that
it will allow us to link data across the different programs, Medicare and
Medicaid and the assessment data, pull it together into a relational database
and allow us to query it and develop statistics off of it. That is what we are
in the process of doing right now.
It is based on a five percent sample. We have loaded 1999 through 2004 at
the five percent level. We are testing to see if we can go a hundred percent
with the 2005 data, and if so, we are going to load the 2006 as well. One of
the things that is particularly appealing to me is that instead of waiting for
the standard analytic files, we are going to use the claims as they come right
out of our system.
I talked about the ECR a little bit. Core research files, that is the idea
of paring down our records so that they are not so long. I mentioned the
algorithms that we have gotten for the 21 chronic conditions.
The Iowa Foundation for Medical Care, that is one of the QIOs. It is a
recognized data center for QIOs. They are also the maintainer of our
assessment data.
Moving on, we are working also with the University of Minnesota. They have
a group out there called Resdac, that works with researchers to try to help
them to see if the database is of particular use to them.
I think that is it. I hope on the last part I didn’t go through too
quickly, but I got the feeling I was running a little over time on that.
DR. STEUERLE: Again, I have to apologize both to you and to our main
speakers. I think we probably crammed way too many speakers into this session.
But to make sure that all the speakers are able to give their talk, I'll ask
them to stay within maybe ten minutes for the remaining presentations.
MR. COHEN: Given that I have a half hour presentation, I am really going to
cut to the highlights. I have some handouts, so you could get into more detail
later.
In addition to everything that we have covered so far, this presentation
takes a more intensive look at integrated survey designs: analytical
enhancements through the integration of surveys, both with administrative data
and with other survey data.
I will go over briefly some activities in the Department and in AHRQ in
terms of implementing integrated survey designs to enhance analytical
capacity, achieve data quality enhancements, and gain some efficiencies in the
collection of data, and how this model also helps improve the accuracy of
survey data. There will be a few examples at the end in terms of the
application of this AHRQ data portfolio to improving health outcomes, and the
limitations of the approach. Many of the limitations in terms of the
confidentiality issues have already been covered. The data portfolio
activity fits in very nicely with the mission of the agency to improve the
quality and the safety, the efficiency and the effectiveness of health care
for all Americans.
On to this integrated survey design model. The model itself looks at a core
health survey, and it looks to other existing larger surveys or administrative
databases from which the survey sample could be derived. Rather than having an
independent large scale screener, which would be very costly, this integrated
survey design makes use of ongoing surveys, so one can have that
predispositional information and use it in a very cost efficient manner. It
also makes use of the capability of linking secondary data, and if there was a
prior host survey, record of call information in terms of where there were
reluctant respondents, where there were multiple contacts to get the higher
response rate, and after the fact quite a bit of detail in terms of
sociodemographic factors for non-response adjustments.
In terms of the core survey in the Department that has used this, it is
the medical expenditure panel survey. It is linked to the health interview
survey. The medical expenditure panel survey is designed to provide estimates
of health care use, expenditures, and sources of payment.
Some of the core issues Julia Lane talked about were outlier cases in the
expenditure distribution: what percent of the population is tied to 25
percent of the expenditures. So in terms of confidentiality issues and linkage,
we would opt to do everything possible to give the greatest accuracy on those
high expenditure cases and be constrained in terms of some geographical
dimensions for data linkage, so one can get at the core analytical variables,
one could go to either the AHRQ data center or the NCHS research data center,
and I will talk about how the health interview survey and the MEPS are linked.
In addition to those linkage enhancements, we have a survey much like MEPS
which is core, and you have a predispositional survey such as the health
interview survey, a general purpose public health survey, 40,000 households,
110,000 individuals. In addition to screening perspectives, you have an extra
time point to enhance predispositional information, longitudinal analyses.
So we draw up the health interview survey with all its strengths and
analytical capacities. It facilitates a very timely effective over sample of
cases, joined to the medical expenditure panel survey for two years. I am going
fast, but there are a few other compelling additions that have helped both
organizations.
NCHS has been on the hook to get data to us fast track, really fast track,
so we can get the information and then sub sample. From that they have
developed a fast release of insurance coverage estimates. They are putting it
out. I think the 2005 estimates came out roughly like June of this year, and I
think they do quarterly estimates. So both organizations benefitted by demands
to get data out quickly to make it more usable to the research community.
Again, very efficient over samples of policy relevant groups.
So that is the first core example of the integrated design. MEPS recognizes
that while households can give you quite a bit of very accurate information
on their use, on their insurance coverage, on their access to care and their
perceptions, there are a lot of problems in terms of health expenditures,
which involve very, very demanding questions. So the underpinnings of
the design makes use of the recognition of a lot of item non-response, and it
links to other follow-back surveys. So you get permission to go to medical
providers to get details on health expenditures and the diagnoses and the
events.
We have an insurance component. There are two stages of linkage that I will
go into later, but this gives the depth of information on what we are spending
for health care, where the asymmetry is in terms of those expenditures, how
that impacts on take-up and overall access to care, and how that translates to
health outcomes. The insurance component is linked to the household component
to help support what the premium costs are.
A little bit more of this integrated design. As I said, the health
interview survey serves as the sample frame for the household component of the
MEPS.
We also benefit by a Census Bureau business register, which has tremendous
timeliness, is the best frame available in terms of having good coverage rates,
great information for modeling and over sampling, an annual survey of roughly
40,000 establishments. It builds off all the precursor information and allows
for tremendous modeling design work. Then the linkage from the core surveys to
follow-back surveys, and as we have seen in other examples, linkage to
administrative data.
Instead of a typical survey where you just have information at the county
level or the MSA level, where you are drawing in the analytical units, here you
have precursor information from the health survey on the sociodemographics, so
you can form non-response adjustments. A few equations have to be part of any
such adjustment.
But in terms of one other thing about the linkage between one survey and the
other survey, this goes to one of the very core outcome measures, beyond the
SF-12 that we also have in the MEPS. This looks at, in the last year of the
MEPS a self assessment of health, and then it looks at of those people, were
they ever uninsured over a two-year period, were they consecutively uninsured
for two years, and through linkage between MEPS and HIS, MEPS can give you two
years of coverage with several rounds of data collection to minimize recall.
At the first round, if a person is not covered, it has a question with a
two-year recall, very, very risky to put that out without any verification.
With the linkage to the health interview survey where they have a point in time
and they also go back in time, we can make edits, and now we can put out
estimates of the long term uninsured in MEPS. You can see how that could factor
into certain analyses. Again, on expenditure estimates, one other dimension in
terms of precursor information from HIS to MEPS.
Concentration of expenditures is critical. Persistence of the concentration
of expenditures is critical. So year one, year two. A lot of analysts come to
us, groups like Kaiser Permanente, they are looking at the top one percent of
the population and what can be done in terms of more efficient use of services.
By having HIS information, we can look at two prior years in terms of
predicting a third year. Two surveys married, and the whole is greater than the
sum of its parts.
A little more detail on the medical provider survey, compensating for item
non-response on expenditures, gold standard for expenditure estimates, greater
accuracy and supports imputation. For our households, we get permission to go
to their hospitals, their associated physicians in the hospital, office based
physicians, home health agencies and associated pharmacies. The pharmacy
verification component is critical as we wait for CMS to come on line with the
Part D data to inform a number of the comparative effectiveness analyses. In
the meanwhile, MEPS for both Medicare beneficiaries and the population in
general, the civilian non-institutionalized population, has a pharmacy
component.
We have units of building blocks from the household and the provider. We
take the provider when we have it from both sides. We take the provider when we
only have it from the provider. We take the household data when it is from the
household, but we recalibrate based on those two points in time.
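The source-precedence rule just described reduces to a simple fallback: use the provider-reported amount whenever it exists, otherwise use the household report. A minimal sketch (names are illustrative, not MEPS code):

```python
# Sketch of the precedence rule for expenditure building blocks:
# provider-reported amounts are the gold standard; household reports
# fill in when no provider report exists.
def best_expenditure(household_amt, provider_amt):
    """Prefer the provider amount; fall back to the household amount."""
    return provider_amt if provider_amt is not None else household_amt
```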
I am going very, very quickly. I hope this is helpful.
I am going to slow up a little bit. The non-response is based on
imputations from our medical provider survey. Hot deck imputation; we build on
provider expenditure data. We use model based predictors in terms of defining
the hot deck cells. We look at predictors of expenditures, predictors of
non-response. The intersection of the two is what we prioritize on. Then we go
to factors associated with expenditures. Then if we have room for other
variables, we will go into the item non-response.
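Hot-deck imputation of this kind can be sketched briefly: a record with missing expenditure borrows a donor value from a "cell" of similar records that did report. This is a hedged, generic illustration; the cell keys here stand in for the model-based predictors MEPS actually uses.

```python
import random

# Generic hot-deck sketch (not the MEPS implementation): group reported
# values into cells by predictor variables, then fill each missing value
# with a randomly chosen donor from its cell.
def hot_deck_impute(records, cell_keys, field="expenditure", rng=random):
    cells = {}
    for r in records:  # first pass: collect donor values per cell
        if r.get(field) is not None:
            cells.setdefault(tuple(r[k] for k in cell_keys), []).append(r[field])
    for r in records:  # second pass: impute from the matching cell
        if r.get(field) is None:
            donors = cells.get(tuple(r[k] for k in cell_keys))
            if donors:
                r[field] = rng.choice(donors)
    return records
```

In practice the cells are defined by the intersection of predictors of expenditures and predictors of non-response, as described above.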
On the pharmacy verification component, the household gives us entree to
the pharmacies. We get details on the pharmacy itself and the use. Then the
pharmacy gives us the medication name, the NDC code, quantity dispensed,
strength and form, source of payment, amount paid.
This is a fictitious person, so there is no confidentiality violation: Sandy
King. She might exist, but she doesn't exist at this address.
You can see some of the wealth of data that comes with that expansion from
the household data to get entree to the pharmacy. The sequence continues: we
can then link that NDC code to other proprietary databases that get us more
granularity in terms of the therapeutic class and the sub-classes. So it
really enhances analytical capacity. We can link to
databases like the FDA, the year of approval for the drug, whether it is a
brand or a generic indicator.
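The NDC enrichment step amounts to a lookup join. A hypothetical sketch (the lookup contents are made up; the real proprietary databases carry far more detail):

```python
# Hypothetical NDC enrichment join; lookup contents are invented for
# illustration only.
ndc_lookup = {
    "12345-678-90": {"therapeutic_class": "antihypertensive", "generic": True},
}

def enrich(event):
    """Attach therapeutic class and generic flag when the NDC is known."""
    return {**event, **ndc_lookup.get(event["ndc"], {})}
```

For example, `enrich({"ndc": "12345-678-90", "qty": 30})` carries the therapeutic class and generic flag forward with the pharmacy event; unknown NDCs pass through unchanged.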
Some of the outcome analyses. We look at trends in out of pocket burdens
across all major population subgroups, prevalence of potentially inappropriate
prescribing patterns. We also look for substitution effects over time: newer,
higher priced medications, but maybe some reduction in utilization overall,
looking at outcomes. Trends in use by therapeutic category, and more and more
work on predicting models of future year expenditures.
We just had a meeting with the Medicaid chronic disease directors. Very
concerned in terms of looking at chronic disease, Medicaid expenses, looking at
things that can inform them in terms of cost avoidance. I don’t know if there
are any cost savings, but just something that is informative.
On administrative data, we work very closely with CMS. This is not a
linkage exercise, but it shows how administrative data yield national
estimates. To be reconciled to MEPS, they cover infrastructure, research and
development, and they cover the nursing home and institutionalized population.
We have to bring the national health accounts down to MEPS, so we can see if
the two data sources are providing concurrent information. If they are not, if
the reconciliation shows areas of disparity, we have to look to see how each
of those data sources, as they speak to one another at a higher level, can be
improved.
We also link to CMS data through our IRB, getting information from Medicare
to validate household reports on the use and going into the details of the type
of service and some complexities with separately billing doctors.
A little bit more on our establishment survey. The estimates that come out
from this look at differentials nationally and state by state in terms of
take-up of coverage, the cost of the coverage. These estimates go into the
national health accounts, GDP estimates of the premium cost for the provision
of health insurance
coverage. We benefit by a linkage to the business registry. With that comes a
lot of conditions. We cannot release the micro data to the public. This data is
residing at the research data center, gets tremendous use, but it is in a
secure environment at the RDC. But we do release tabular data.
The integrated design allows for the detail of the sampling to optimize
sample designs to minimize variance for fixed cost constraints. Serves as an
imputation source for editing, for small area estimation models, for table
production and also just like HIS and MEPS, non-response adjustments.
These integrated results, rather than more surveys, have some beneficial
effects: reduction in respondent burden, sample precision improvements, and
modeling research.
Just as we close out, let me just give a little flavor to some of the other
elements in the AHRQ data portfolio. We are having some discussions, and this
might feed into some of the questions in terms of what is permissible on
linkages.
This is a different part of the data portfolio, the health care cost and
utilization project; 37 state partners, roughly 90 percent of all payor
hospital discharges in the U.S., inpatient data, ambulatory surgery and
emergency department databases. The patient enters the hospital, which
generates billing records. The state partners send them to the major data
contractor tied to this, which produces a standardized administrative data set.
The standard linkages on the administrative data would be the AHA data,
more details on characteristics of the hospitals, information on the providers
and secondary linkages that we have heard from all the other speakers so far.
The identifiers are encrypted. There are some limitations with this
encryption, different states do different things, so we don’t have it across
the board. But some states are better than others, and you can have some more
episodic, I wouldn’t say longitudinal, but episodic type analyses across the
databases. Some of the state partners in their own right make use of linkage to
vital records, to disease registries, to state program files to enhance the
data.
Some of the challenges as I said are, the encryption methods are not
uniform. You don’t have consistency across time, and some sensitivity to some
of these supplemental linkages.
Some of the analytical outcomes with this database would be looking at
racial and ethnic disparities in readmissions for diabetes, the incidence and
cost of motorcycle injury to inform decisions on state helmet laws, the
financial status of safety net hospitals, and the impact of motor vehicle
exhaust on pediatric asthma admissions, certainly marrying a whole bunch of
disparate data sources for an analysis.
In terms of pushing further, the department needs more information,
particularly with Secretary Leavitt’s transparency initiative, and looking
further in terms of quality metrics. Building on the capacity when it does
become more visible on electronic medical records and supplements to the claims
data, better links across the different states that are uniform. More
information on hospitals that are missing right now to help in terms of
decision making by consumers on organizational culture, clinical integration
and the availability of data from health information technology, and more
quality metrics and nursing staffing data is always critical.
Where we need to be. There is an initiative in the Department called
Joining Forces to expand on existing administrative data for consumer choice.
We need information that is more timely, that is less expensive, that is
actionable. Right now we have quite a bit of administrative data, but a lot
more work needs to be done in dealing with some of the confidentiality issues.
Some of the examples we are pushing on would be more HIT applications for
timeliness, more augmentation on clinical detail, on what condition was present
in admission. I think our colleagues from CMS said that would be very, very
helpful, too. And some lab values and more cross-site data, while ensuring all
the confidentiality commitments are adhered to.
We have some staff from HCUP, so if you have some other questions, they are
here to answer some questions.
Just before I close out, the agency with CMS and others in the Department
have quite a bit of activities going on with MMA. The agency in particular has
a role in terms of studies on comparative effectiveness analysis. One of the
programs at the agency is the DEcIDE research network, which focuses on
developing evidence to inform decisions about effectiveness. The purpose of
this program is to expeditiously develop valid scientific evidence about the
outcomes,
comparative clinical effectiveness, safety and appropriateness of health care
items and services.
Several well-known academic centers are a part of the network. The charge of
the DEcIDE network is to analyze existing health care databases for comparisons
of effectiveness and outcomes of treatment, analyze existing disease, device
and other registries, and improve the accuracy of those data sources. We have
spoken on ancillary and secondary data sources for some of the linkage.
We have a data center. It is the second one in the Department. Some of the
models that NCHS is pushing for getting more and more information into the
hands of the analysts to inform policy and practice are directions AHRQ
intends to go in as well. There is quite a bit of coordination in the
Department on that front.
Some linkage variables that you can look at; it is probably hard to read on
this screen. I think the presentations did great justice to some of the
limitations in the availability, so I am not going to repeat that.
Let me just close. I might have done 25 minutes in ten, I don’t know, but I
went very quickly. I did cover the capacity of integrated survey designs, the
ability to reduce non-response, related enhancements and data quality,
analytical capacity, some attention to MEPS but some of the other data
resources, both within AHRQ and the Department.
DR. STEUERLE: Steven, that was not only a brilliant presentation, but a
tour de force on your speaking ability and speed.
MR. RILEY: Good afternoon. Martin and I are going to team up to describe
the SEER Medicare linked database, which represents a joint effort of the
National Cancer Institute and CMS. We will also briefly discuss some other
linkage projects. I think we are planning to go about ten minutes each, if that
is okay. We will try to make it a little shorter.
I will give some background information on the linkage, and Martin will
discuss some of the uses of the data and some of the access to the database
for researchers.
Before I begin, I should acknowledge Joan Warren of the National Cancer
Institute, who helped prepare this presentation. She has played a very central
role in developing and improving the linked database.
SEER Medicare consists of cancer registry data from NCI's Surveillance,
Epidemiology, and End Results (SEER) program linked to Medicare records on an
individual basis. The linked database has been in existence since 1991, and
has become a significant resource for cancer related health services research.
Under SEER, NCI contracts with individual cancer registries to collect and
identify information on all incident cancer cases in their reporting areas,
with the exceptions of non-melanoma skin cancer and in situ cervical cancer.
The participating registries attempt to identify cases treated in all
settings, and they examine death certificate records and autopsy records to
identify additional cases. Incident cases only are reported, and recurrences
are not captured. The SEER program began in 1973 and has covered 11 geographic
areas since 1992, representing 14 and a half percent of the U.S. population.
The program expanded to four new areas in 2001 and now covers about a quarter
of the U.S. population.
This map shows the SEER reporting areas. Until 2001, the program covered
five states and six metropolitan areas. In 2001 the states of New Jersey,
Kentucky, Louisiana and the remainder of California were added.
SEER areas were not selected to be representative of the U.S. population,
but were selected for the quality of their cancer registries and the diversity
of their populations. Analyses of the elderly population in SEER areas have
shown that there are lower percentages of whites and people living in poverty,
and a higher percentage of urban dwellers compared to the U.S. in general.
This slide shows some of the detailed clinical data that are collected
under SEER. Each individual is assigned a unique case number. If an individual
is diagnosed with more than one primary cancer while residing in the SEER area,
information on each cancer is recorded separately under that person’s case
number. Information on month and year of diagnosis is collected, as well as
site of cancer, histologic type and extent of disease at diagnosis.
The SEER program staff uses information on extent of disease to assign a
stage at diagnosis. SEER also captures the type of cancer-related surgery, if any,
and any radiation therapy that is given or planned as part of the first course
of treatment. Vital status is also tracked over time and includes cause of
death for most cases.
Most of the Medicare administrative records are included in the linked
database. Enrollment data provide information on entitlement, demographics,
Medicaid buy-in status and managed care enrollment. Individual claim records
are included for all types of covered services.
Most claims records are available from 1991 forward. The continuous
Medicare history sample file contains longitudinal data on a five percent
sample of Medicare beneficiaries from 1974 onwards. In addition, a five percent
random sample of Medicare enrollees has been identified who are not in SEER but
who reside in a SEER reporting area. These individuals serve as a cancer-free
comparison group for studies on cancer screening and other topics.
It is our intention to add enrollment in Part D prescription drug plans in
future updates of the database, assuming those data are available. We hope to
add Part D drug claims if this becomes feasible. As Dave said, we are working
to try to get access to that now.
The SEER and Medicare data complement each other in providing information
on a variety of cancer control activities. The data are useful for patterns of
care studies that control for the effects of comorbidities. Post diagnostic
surveillance can be measured for many years after the initial course of
therapy, and recurrences can often be identified from the claims data. End of
life care can also be analyzed, including the use of hospice services, which
are covered by Medicare.
I will briefly describe our linkage activities. NCI receives files with
personal identifiers for cancer cases directly from the SEER registries. NCI
checks the files, and if they appear satisfactory, they are forwarded to a CMS
contractor, who matches the data to the Medicare enrollment database. For all
cases that are successfully matched, the unique health insurance claim numbers
or HICs are identified and all Medicare claims for those individuals are
extracted. The contractor then removes all identifiers including the HICs, and
the claims and enrollment data are sent to NCI’s contractor for creation of
analytic files. All the analytic files retain arbitrarily assigned SEER case
numbers to distinguish the individual cases.
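As a rough sketch, the flow just described (identifiers in, matched claims out, identifiers stripped, arbitrary case numbers retained) might look like the following. The field names and the SSN-only match here are simplifying assumptions for illustration, not the actual NCI/CMS contractor process.

```python
import uuid

# Illustrative sketch of the de-identification flow described in the
# transcript: match SEER cases to Medicare enrollment, extract claims by
# HIC (health insurance claim number), then strip identifiers and keep
# only an arbitrarily assigned case number. All field names and the
# SSN-only match are simplifying assumptions.

def build_analytic_files(seer_cases, enrollment_by_ssn, claims_by_hic):
    analytic = []
    for case in seer_cases:
        match = enrollment_by_ssn.get(case.get("ssn"))
        if match is None:
            continue  # no Medicare enrollment record found for this case
        record = {
            # An arbitrary case number distinguishes individuals after
            # the HIC, SSN, and other identifiers are removed.
            "seer_case_number": uuid.uuid4().hex,
            "cancer_data": case["cancer_data"],
            "claims": claims_by_hic.get(match["hic"], []),
        }
        analytic.append(record)
    return analytic
```

Only matched cases survive into the analytic files, and no personal identifier travels past the matching step, which mirrors the division of labor between the CMS contractor and NCI’s contractor described above.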
The linkage is updated every three years, with the next update scheduled to
begin in August 2007. We do this every three years because it is a pretty time
consuming complicated process to do the updates.
MR. BROWN: And when we update, we update the entire thing retrospectively
because of these changes in enrollment status that were discussed.
MR. RILEY: The current files contain SEER cases diagnosed through 2002, so
the next update will carry us through 2005.
I will just briefly describe the matching algorithm that is used to match
the SEER and the Medicare records. In most cases, social security number is
reported by the registry. That is the most important variable that we use in
matching. We do require some agreement on corroborating variables such as first
and last name, month of birth and sex. Agreement on sex is relatively important
to prevent us from inadvertently matching records for husbands and wives. If
SSN is not available or doesn’t match, we use first and last name, date of
birth, middle initial and date of death to match records.
Our matching criteria in the absence of SSN are rather strict, because it
is relatively easy to get false positive matches with such large databases
involved. Our match rates for persons aged 65 and over are quite good. We have
been able to find a Medicare record for about 94 percent of the elderly in the
SEER database. This varies somewhat by race and ethnicity. The match rate for
Hispanics is only about 88 percent, and that for Asians is 90 percent. The
match rate for persons under age 65 is rather meaningless, because we don’t
expect to match most SEER cases in that age range.
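The two-pass deterministic matching just described (SSN with corroborating variables, then a stricter name-and-dates fallback) can be sketched roughly as follows. The field names and agreement thresholds are illustrative assumptions, not the actual algorithm used by the CMS contractor.

```python
# Illustrative two-pass deterministic match. Field names and agreement
# rules are assumptions for illustration only; they do not reproduce the
# actual CMS contractor's algorithm.

def corroborates(seer, medicare):
    """When SSN agrees, still require agreement on supporting variables."""
    checks = [
        seer["first_name"] == medicare["first_name"],
        seer["last_name"] == medicare["last_name"],
        seer["birth_month"] == medicare["birth_month"],
        seer["sex"] == medicare["sex"],  # guards against husband/wife mix-ups
    ]
    # Hypothetical rule: sex must agree, plus at least two other fields.
    return checks[3] and sum(checks[:3]) >= 2

def strict_match(seer, medicare):
    """Fallback when SSN is missing or disagrees: stricter criteria,
    since false positives are easy to get in databases this large."""
    return (
        seer["first_name"] == medicare["first_name"]
        and seer["last_name"] == medicare["last_name"]
        and seer["birth_date"] == medicare["birth_date"]
        and seer["middle_initial"] == medicare["middle_initial"]
        and seer["death_date"] == medicare["death_date"]
    )

def link_records(seer, medicare):
    """Return True if the SEER case links to the Medicare record."""
    if seer.get("ssn") and seer["ssn"] == medicare.get("ssn"):
        return corroborates(seer, medicare)
    return strict_match(seer, medicare)
```

The design choice mirrors the trade-off in the testimony: the SSN pass can tolerate some disagreement among corroborating fields, while the no-SSN pass must be strict to keep false positive matches down.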
The next table shows the number of linked cases for some of the most common
sites of cancer. There are about 300,000 prostate cancer cases and over 200,000
cases of breast, lung and colorectal cancers, so this is a very large database.
These top four cancer sites account for 60 percent of all linked cases. The
database is large enough to support detailed studies of many less common types
of cancer as well.
Before Martin describes some of the applications of SEER and Medicare, I
will briefly describe some other database linkages involving SEER and Medicare
data. SEER has been linked to CMS’ Health Outcomes Survey, or HOS. The HOS
gathers data on health status measures among the Medicare managed care
population. The survey is administered annually to 1,000 Medicare enrollees in
each Medicare Advantage plan. The survey is used to assess the ability of plans
to maintain or improve the physical and mental health of their Medicare members
over time. The linkage with SEER will permit better studies of quality of life
for cancer patients in Medicare managed care.
There are also current plans to link SEER with the Medicare Consumer
Assessment of Healthcare Providers and Systems, or CAHPS, survey. CAHPS measures
the experiences of beneficiaries in their health plans, in fee-for-service
Medicare, and with providers like hospitals and nursing homes. The survey is
used to monitor quality of care and to measure the performance of Medicare
health plans and providers. So the linkage of CAHPS with SEER would permit more
detailed studies of patient experiences with cancer care.
Medicare claims and enrollment records have been linked to many other
databases besides SEER. Medicare administrative data are routinely linked to
the Medicare current beneficiary survey. They have also been linked to other
surveys, like the national long term care survey and health and retirement
study data. Data have been linked to Social Security administrative records
under an interagency agreement between CMS and SSA, and Medicare data have been
linked to several NCHS surveys under an interagency agreement, as Jennifer
mentioned a little while ago.
These data linkages have greatly enhanced the value of survey data by
adding information on use of health care services that is difficult to obtain
from surveys alone.
I will mention one attempted linkage with SEER that did not prove very
useful. NCI considered linking SEER with Medicaid data to get better
information on health care use and costs for cancer patients with low incomes.
This linkage is difficult to begin with because the Medicaid data are state
specific, which complicates the privacy and the technical issues involved.
A validation project was conducted by NCI with SEER and Medi-Cal data
from California. Medi-Cal claims were found to be not very useful, primarily
because of heavy enrollment in managed care plans. That is, the state does not
give claims data for Medi-Cal managed care enrollees, so the information
available on them is minimal. NCI has therefore dropped its effort to link SEER
with Medi-Cal data because the claims are not complete enough.
So Martin is going to talk about some of the specific applications of SEER
Medicare data, and also some of the conditions of access to the data.
MR. BROWN: There are three or four contexts in which we use the SEER
Medicare data. This whole project was instituted ten or more years ago, because
at NCI we are associated with the SEER program that Gerry mentioned. This was
seen as a natural extension of our mission to do surveillance research.
Part of what we do is what we call inhouse research at a research
institute. We are an extramural branch, but we do inhouse research as part of
our federal mandate to do surveillance.
Secondly, we have done a lot of work to publicize the existence of the SEER
Medicare database, to provide technical assistance through various mechanisms
such as a very extensive webpage, conferences, outreach, Joan is up at Harvard
today, giving an extensive seminar, for example, and a funding mechanism. We
have developed a large stable of extramural researchers who use SEER Medicare
data.
Then we have hybrid studies in which we have developed over the years this
partnership with extramural researchers. Quite often we will have a need to do
a particular analysis at NCI. We will get some of the clinical or health
services expertise from the outside and we will get together and partner a
study. We oftentimes have people come to us and say, can you do this for me. If
we think it is a worthwhile thing we will say yes. For example, we have been
working with the Association of American Medical Colleges’ Center for
Workforce Studies to do a study of supply and demand for the oncology workforce
over the next 20 years.
So that is the context in which these things have come about. We now have
over 200 published articles that have used SEER Medicare data.
A couple of examples that fall into these various categories are trends in
use of cancer-related services, procedures, resources and costs; descriptions
of disparities in use of cancer care; patient, physician and health system
determinants of patterns of care; and volume-outcome studies.
For example, this paper by Gerry Riley looked at differences between HMO
and fee-for-service settings in stage at diagnosis and treatment approaches
for early stage breast cancer. This was done in 1999 when there was a lot of
heat and emotion about whether the quality of care was substandard in so-called
managed care or HMO settings versus fee for service settings. So at least by
these parameters, that certainly is not the case.
The next one is an example of a study done by an extramural researcher
which looks at trends in use of adjuvant androgen deprivation therapy among men
with prostate cancer. This is a treatment that was shown in clinical trials to have
potential benefit in the early ‘90s. So the question is, what is going on
with this treatment, and this is one way we can track the dissemination of this
kind of treatment. This is the kind of treatment we can’t capture very well in
our routine SEER data collection, for example.
Then the next one is dear to my heart. This is an example of estimates of
cost of treatment as defined by Medicare payments. We can have a long
discussion of whether Medicare payments are a true or not true proxy for cost,
but compared to almost anything else that is available they are pretty good.
This is an example done by an extramural researcher. We do a lot of inhouse
research on this topic. I won’t go into details, but it is interesting because
it shows that the incremental cost of treating colorectal cancer, relative to
what the cost would have been had that person not been diagnosed with
colorectal cancer and instead lived and died of another disease, is inversely
related to stage: the more severe the stage, the lower the incremental cost. In
fact, it can be a negative cost compared to not having this disease. So in the
context of cost effectiveness analyses this has some obvious implications that
are oftentimes not well appreciated.
This is an example of health disparities done by a fellow who worked with
me, which showed a pretty robust disparity in African-American men in receiving
— not men, just African-Americans, in receiving surveillance colonoscopy after
initial treatment for colorectal cancer. The level of detail isn’t shown on
this diagram. Not only were African-Americans less likely to receive a
surveillance colonoscopy, but they were much less likely to receive a
colonoscopy exam versus a sigmoidoscopy plus a barium enema exam. That level of
detail you can get with the Medicare data would be hard to come by otherwise.
Finally, Deb Schrag who is in the audience is the first author of this
paper, which is one of many studies that have used SEER Medicare data to look
at the issue of volume and outcomes for cancer specifically. The marginal
analysis is very interesting, because it shows that there is a hospital volume
effect and also an individual surgeon volume effect, and the two of them
together explain more than either one separately. There has been some very good
descriptive work done using SEER Medicare data, but also some very good
methodological work that suggests some of the limitations of these analyses as
well.
Those are some examples of the types of studies that have been done.
This shows that there has been growing interest in and use of the SEER
Medicare data, as reflected in the number of data requests and also the number
of publications.
I don’t know if we need to repeat this one. The advantages of SEER Medicare
are obvious from what you have heard from several speakers today. The link to
SEER does provide very rich clinical information at the date of diagnosis which
you couldn’t get from Medicare alone. Going to Medicare allows us — despite
all of the problems that were mentioned earlier, nevertheless it provides a
pretty good longitudinal set of data that we can then follow from the original
incident cohort that we define in the SEER data.
The limitations are legion. Most of those have been mentioned. It is
limited to individuals over the age of 64. Our cancer diagnoses are not. We can
link someone diagnosed prior to the age of 65 once they get into the Medicare
program later on. We have used that to great effect, actually.
There is a problem with HMO enrollees that you know about, and Part D, we
will see what happens. A lot of people would like to use SEER Medicare to look
at issues of cancer screening. There are limitations about that, because the
Medicare codes for things like mammography and colonoscopy don’t always allow
us to accurately identify whether the exam was a screening exam or some other
type of exam. The SEER registry areas are not totally representative, although
they are large, they are national.
Probably the biggest limitation in our viewpoint, and this has also been
mentioned, is the time lag. We have a significant time lag between when the
events occur and when we can make the data available, three or four years. We
would definitely like to do something to shorten that time lag, and we are
doing everything we can to explore potential ways to do that.
We are in touch with CMS and with NIH about various efforts to improve the
timeliness and efficiency of using CMS data for research purposes. We also have
other efforts that are complementary, for example, we do have a large HMO based
research network. We had a grant and we recently received a fundable score, and
its basic purpose is to create an HMO parallel system to the SEER Medicare data
using these large HMOs.
This maybe speaks to some of the early concerns about how do you get SEER
Medicare data. I won’t go into any detail on this, but maybe we can talk about
it in the discussion. This is not a public use database, but it is data that is
available to outside users upon filing a data use agreement that has a certain
number of requirements that we then enforce. We have not found it to be a big
problem so far. We have had lots and lots of users and we haven’t had any cases
of the kind of abuse that would cause problems.
DR. STEUERLE: Thank you both for rushing through and taking time to let
your colleague John make the last presentation. So John, we are going to let
you go next.
DR. DRABEK: My name is John Drabek. I work in OASPE at HHS. I have a few
comments about one of the studies that was displayed in Jennifer and Chris’
slides. That concerns the linkage of the national health interview survey data
with CMS and social security data.
This project has taken a number of years to bring to this stage. It first
of all required people from four agencies, NCHS, CMS, Social Security and
OASPE, to meet and discuss how to accomplish such a match that would satisfy
the privacy concerns of all the agencies, yet still yield a database that was
usable for researchers. We were able to do that, and we were able to agree on
the technical details of how the match needed to be conducted and how the files
needed to be sent back and forth between the agencies.
The data are now available at the NCHS data center. There is extensive
documentation on the NCHS website under the data linkage page. There is a large
amount of data that are available. It is four years of the national health
interview survey linked with extensive Social Security and Medicare histories.
The data are useful from the standpoint of linking people in the HIS to
those administrative record sources, but surveys that are built off the
NHIS are also available to be analyzed, particularly the LSOA with its
follow-up interviews of aged respondents. Although MEPS started being linked
with the national health interview survey late in the process, the potential
to link to MEPS is there as well.
There are a couple of things I want to draw attention to. One is the need
for help in being able to analyze the data; the Social Security and
Medicare files are more oriented around program operations. So learning how to
use the Social Security and Medicare files adds a level of complexity that is
considerable on top of the HIS.
To help in this process, our agency, OASPE, provided funds for the
development of analytical files for the four HIS merged surveys, and also
developed documentation. We are providing additional funds this year to make
further support available to users.
Gerry Riley, who just presented the SEER data linkage, has done one study
with a linked database, looking at people in the waiting period, those who have
received SSDI and are awaiting their two-year interval before they are eligible
for Medicare, and looking at their characteristics and seeing how they differ
from other individuals. So that is a unique project that could only be done
with this type of database.
We are also in the process of awarding a research contract to use the
database to look at three areas. One is understanding the interaction of the
population with disabilities and the SSI and SSDI programs. There are many more
people who have disabilities than those who are in the Social Security
programs, and what their characteristics are, the severity of their disability,
and things like that are important.
The second area is to take advantage of the fact that we have a survey, say
the disability survey conducted in 1994. We can go up to 2003 and look at what
happened to people after that in terms of their interactions with Social
Security. So we can see how many people who had conditions in ‘94 applied
for or received disability benefits in later years.
The third area is understanding family and caregiving support for people
with SSI and SSDI and other disabilities. There are extensive questions about
disability and use of services, household structure and things like that in the
‘94 and ‘95 HIS that go beyond the normal core data in the HIS.
We have supported this, because we think it is essential to have good data
for policy research and program evaluation. Linking databases in this manner is
a relatively new activity, and the design and use of the products is not well
understood at this point. We feel that only by conducting analyses with the
linked data will we understand the benefits and the limitations of what has
been done so far.
In particular I think we need to get a better handle on the different time
frames available with various parts of the database. We have the health
interview survey in 1994, but we have extensive data before and after from the
case history files. Figuring out how to summarize those case histories and
integrate them with the survey data is a challenge that hasn’t been faced before.
Similarly, if you have longitudinal interviews, how do you integrate them with
case history data? So hopefully we will learn better how to do that in the
future.
DR. STEUERLE: Thank you again for helping us to get back onto schedule. I
now have to apologize to everybody; we will probably have to let you take your
own break again today, because I want to make sure — we have Howard on at the
last session, and I want to make sure we give him his full time. So I am going
to go right to questions, and if people feel you need to step out, please do
so, and we are just going to continue straight on through.
I am going to take the liberty of asking the first question, and I can
address it to any of you. As I heard your presentations, one thing that stood
out in my mind, and I was comparing your presentation to those of the Census
Bureau, it seemed to me that one thing that seemed to make a huge difference
was whether the particular part of your agency or particular program you are in
or whatever had positive incentive to do something.
So the Secretary decides he wants a DEcIDE program. That creates an
enormous weight within the Department to go out and make things happen, which,
had he not done that, might have been left much more to some outside researcher
pressing to have something done, which maybe the inside lawyers would say we
might get into trouble for doing.
The SEER program, I’m not sure what the underlying legislative impetus was
there, but cancer has always been a hot topic. So it appears that if in the
system somewhere there is somebody who creates this incentive that information
should be provided on something, that it often is the motivation for breaking
through so many of these barriers we are talking about. Am I correct or not?
MR. COHEN: You really are on target, but let me give you two examples where
you are on target. The integration of the MEPS and the HIS were part of a major
initiative within DHHS, I think it was ‘95, as part of Vice President
Gore’s REGO II activity, reinventing government. There was an initiative to look
at all the departmental surveys in a very short period of time, and that
created opportunities and synergies, and that was a catalyst.
In more recent activities, there are Secretarial initiatives. The Department
gets a lot more efficient by bringing the core forces together and getting it
done.
With that said, it has a momentum of its own. There are other
opportunities, once these things are put in play, that other users see things
that weren’t on the table before, but because of this coalescing there are
these opportunities that present themselves.
I would say there are probably other cases where you have independent
forces coming together, but there is a framework that makes this all jell when
it is a mandate or an initiative and you get the support that is needed from
the leadership to bring people together, because we are all so busy meeting all
our objectives. That is first and foremost, but many times you get these
additional opportunities.
DR. MADANS: I guess we have had a different experience, although the
MEPS-HIS link clearly was something that was started by the Department.
I think the other linkages that we have done actually haven’t had a real
push. We have had a lot of support from the Department, particularly from
OASPE, to do them. The earlier links with Medicare data and the original
thinking about it I think was coming from the agencies. We have a lot of
conversations among the agencies in looking for enhancements.
But that said, it is much easier to do if there is a structure within the
Department that helps you do it and encouragement for it, so that has been very
helpful.
MR. BROWN: I think the fact that we at NCI have a surveillance mandate was
important. We asked whether this mandate should just cover incidence and
survival, or should it cover expenditures. It was expenditures that was first
published,
because we also had constant demands from Congress and still have constant
demands from Congress to answer questions like how much does cancer cost, what
is the economic burden of cancer. As great as MEPS is, and we actually use
MEPS and that linkage for quite a few studies, it doesn’t have the
numbers to get at specific cancer sites.
Some of us, and Gerry was there, Ed was there and other people were there,
said this is something we need to produce. Then later on, once it was on its
own, and it was a struggle to get this thing moving to the point where it is
now, the whole quality of care tsunami struck. Then it turned out there was a
huge interest and a lot of extramural researchers have taken advantage of this
to use the SEER Medicare database not for what we thought was going to be its
main purpose of looking at expenditures, but for looking at cancer care and
health care quality.
MS. TUREK: One interesting point. You can get the linked NHIS Medicare file
and do analysis, so you can analyze how similar questions were answered on the
two surveys. I get it for income. I thought it was wonderful as a user, and it
is in public use form. I just wanted to commend both organizations for making
that available in a way that protected confidentiality, but really enriched the
data set a lot.
DR. STEUERLE: The speakers are also free to ask questions or comment on
these others if they have something that they felt was missing or needed to be
said.
DR. SCHRAG: I am an end user, so I am going to ask my question from that
perspective. We heard about two examples of data sets that are linked: HIS
with Medicare, and SEER with Medicare. But of course, what end users really want
is to be able to link all three.
So two questions to put on the table. We want to link data sources. There
are many examples where two agencies have gotten together and created a useful
link, and Medicare is most frequently at the nexus, but we want to link three
and four, which gets a bit more complicated.
The other challenge I would put on the table is, we in the extramural,
extra government, extra academic research community often want to link agency
data with data that I will call — they are non-governmental, but they are
extremely informative. We heard a couple of examples, AHA data, data from the
American Board of Internal Medicine, about where different kinds of doctors are
located and what their qualifications are. Those would be the biggest examples.
But there are whole different sets of rules that govern those private data
resources from the governmental resources. As an extramural researcher, it is
very difficult to navigate those waters when you are trying to link data sets
that cross public-private boundaries. I think that is where some of our
greatest challenges lie.
The only time we have been successful at it is when we find cooperative
investigators within the agencies, like Martin’s branch in particular for those
of us in cancer, who recognize the importance of those private sources and
accomplish the linkages for us, so that we can get our grubby greedy little
hands in there, carefully of course, with confidentiality.
Those are some of the challenges that we face. It is great to hear about
two, but we don’t want two sources. We want three, four.
DR. MADANS: The HIS link is the three, because it is SSA, CMS and us. So it
is going across.
We did do SEER once. It was with NHANES I, I think. But SEER isn’t everywhere.
So by the time we did the link we were down to no cases. So we don’t have a big
enough sample in any state. That is part of the problem. There are lots of data
sets, but there are probably not that many that would meet your particular
needs.
I think you are right about the proprietary data. If someone comes to us
and says we want to take your survey and we want to link in our proprietary
data, they can do that in our data center. But they have to make the
arrangements with the other entity, and that means you can’t share it, because
they are paying for it.
That would be an area where it would be nice to have some government wide
leadership so that we are not negotiating those things over and over again.
Some of the AHA stuff we buy so we can put it out. We know other people have
linked in. But you’re right, it gets very complicated.
DR. STEUERLE: Any other questions?
DR. BREEN: I have one question.
DR. STEUERLE: Yes, please.
DR. BREEN: This is a question for Steve. You said that the sample frame for
MEPS from the NHIS was at the household level. I always thought it was at the
person level, so that you had all of the data that you asked of sample adults
included in your MEPS database. Is that not true?
MR. COHEN: It is actually a household based survey. There are person specific
measures that would be factors we would look at, but MEPS is a household
survey, so once we take the household in from the HIS you could reconfigure it
to person based analysis, to health insurance eligibility units, to tax filing
units. You could reconfigure nuclear families, secondary families.
I think one of the difficulties and challenges is when one gets to something
like a condition oversample, where you have a sample adult in the household.
It is a bit tricky that everybody in the HIS didn’t get the same set of
questions, so it is a little different than sampling on what is in their core,
in terms of some of the demographic, the insurance, some of the other measures.
But that said, once we sample a household, everybody comes along for the ride.
DR. BREEN: Thanks.
DR. SONDIK: To add a point to the answer to your question, I think one of
the things is that all of these areas have a very strong analytical staff
behind them. So this is not something where it is just, here it is, here is the
directive, go do it, and you get people whose interest isn’t primarily in this
area using it. I think that is really important.
The other point is one that Martin mentioned, and it is certainly true with
AHRQ, unfortunately it isn’t true with NCHS, but I wish it were, which is
support of research funds that can be earmarked to build a user community which
can then spread. I think that is really important, because the evaluation of
these things is in the doing. Even though we have strong staffs involved, the
real evaluation is when you get people outside using it as well. I think that
is really an important point.
So if we did get such a directive, I would want to be sure that the
directive has the promise of funds along with it. It doesn’t have to be a lot.
We are talking about tiny amounts of money here, relatively speaking, that can
seed the research community.
DR. STEINWACHS: You were talking about linkages. Medicare is an ideal
population, and Medicaid for some purposes.
One of the health issues that keeps coming up over and over again is the
effects of people going to war, which potentially means the veterans. I know the
HIS asks about veteran status, but I don’t know whether anyone has explored
trying to identify the veterans who are part of surveys.
Then there is also VA utilization, which probably would be a small piece if
you linked with HIS, I don’t know. But they are many times among the people
with heavy disabilities, if you were focusing on individual disabilities.
So it is a question whether any one of you who have been dealing with
health issues have tried to bring the VA into this, either on cancer or any
DR. COX: I can sort of answer it. Before the VA event, we were in
negotiations with them to try and link with their population of VA data to the
HIS. I think we had some analysis, about ten to 15 percent of the HIS had
served in the armed forces at some point in their lifetime, and we were going
to add data about the dates of service and whether there was active duty in wartime.
We were very close. We were actually talking about the amount of money it
would take to do that kind of work, and we stopped. So we are letting VA get
their IT policies into place and recover from the situation they got themselves
in. I think it will come back at some point. But they need to get their house
in order first.
MR. BROWN: We have two experiences. One is, we have a project called
CanCORS. It is a large study that is recruiting 5,000 newly diagnosed
colorectal and 5,000 newly diagnosed lung cancer patients and following
them intensely over a one-year period. VA is one of our sites for recruiting patients.
I don’t know if we have any specific information on their status in regard
to their military experience, but they are a participating site in that study.
The other, over the years we have had several discussions not with the VA,
but with the military health care system, the Tricare system. It has been
disappointing, mainly because of this problem of not being able to identify the
enrollment status in secondary and tertiary payors. So it is hard to set up
any kind of longitudinal study if you don't know who your
denominator is or where they are at any given time. But we have some continuing
discussions scheduled on that topic.
DR. STEINWACHS: Just out of curiosity, I assume in the SEER registries
there are veterans that may or may not have been contributed by the VA as a
source. Is there any information collected in SEER about veteran status as
described here as having military service or service during wartime?
MR. BROWN: Not that I know of. But you’re right. Some of the reporting
units to SEER are part of VA, but there have been some issues lately about that
as well, which we are negotiating.
MR. PREVOST: At Census prior to the event, we were having lots of
discussions with the VA as well. If one were to use our surveys in trying to
attribute status in a linked environment, there were many concerns that the VA
had about how well veterans are responding to our questionnaires and saying
that they were veterans, particularly if they were in a reinstatement status.
So we were going to begin doing some quality assessment work as
well, to see how well the veterans were responding to those questionnaires and
if they could be used. But anyway, I am hoping that at some point in the near
future we will be able to re-start those talks.
DR. STEINWACHS: At any given time there are a lot of Americans overseas,
and some of them are there for long periods of time. Do they get captured
at all? A household interview would not capture them; you have to be a resident
of the United States. And when you talk about births and deaths, they have to
occur in the United States, too. But if you are in the military overseas and
die, would that come back into the U.S. data?
I was just wondering what that link was, because some people are over there
serving us, working for us. Some people are over there on their own choice.
MR. COHEN: For some of those core surveys, there would unfortunately have
had to have been an injury, and they would have had to return to the States.
But I can tell you, issues like this are coming to the table in health
care, although it is very, very rare. You are hearing of people going out of
the country to get health care. There are tradeoffs in terms of quality, but
there are cost savings, in several disparate countries. If this becomes a bit
more prevalent, I’m sure we will want to track this and get a sense of this. So
right now, we don’t have the mechanisms.
DR. STEUERLE: I would like to thank each of you, and just remind you that
this committee forum is pretty much an open forum. We invite you to stay for
other sessions. Also, remember to keep in mind that the goal of our committee
is mainly to advise. We mainly advise the Secretary.
So if there are things you think we should be advising, please let us know.
I have worked many years in government, many years outside. I have often viewed
the role of advisory groups or consultants as finding the information that is
already inside, and just rewriting it and replaying it a little bit. The good
ideas that you have had that haven’t quite made it to the top, our role is to
help them get there. So thank you again for your patience, and thank you for
especially dealing with the fact that we crowded you into this one session.
Howard, I think that leaves you the end game.
DR. IAMS: Yes. It is always a challenge to be the last on a long day. I
will try and cover the topic and do it relatively briefly.
I need to alert everybody in truth in advertising that I am an advocate of
using linked survey data. I have been actively doing this since 1984, and I
started using the New Beneficiary Survey in 1986, and then the SIPP matched
survey. As for the phrase end user, I have been an end user and have co-authored
at least a dozen analyses, a number of which are published in the Social Security Bulletin.
I work for an agency that is a proponent, a strong advocate. The Social
Security Administration in the 1960s started a survey called the retirement
history survey, which had administrative data from benefits and earnings linked
to it, and followed people into retirement over a decade.
The Census matched data was started in the early '70s. I believe Fritz
Scheuren, who is in the audience, and Dorothy Projector were heavily involved in
starting that work. It was continued by Denny Vaughn, and I continued it
after Denny Vaughn left the agency.
We just awarded our ninth retirement research consortium. We awarded
roughly 70 or 80 projects, about $5.5 million to three centers at NBER, the
University of Michigan and Boston College. At least half of those studies are
using the health and retirement survey and a fair portion of them end up using
matched survey data. So my agency not only started it, it continues to fund and
support it, and we have an analytical group that continues working on it.
Kalman Rupp, who is in the audience, has been active with the financial
eligibility model which we developed with matched data. We use that for
modeling SSI, for doing QMB estimates, and recently for estimating
Medicare extra help eligibility and the participation rate of people
in Medicare extra help.
I have been actively involved in using SIPP matched data to project the
future retiree population, the baby boom, which has been extended through the
21st century. I call it modeling income in the near term. John Sabelhaus at
CBO says it ought to be modeling income in the late term, MILT, but anyway,
this has been providing social security reform information to the White House
in designing alternatives for over two years, and also to the House Ways and
Means Committee, and to the GAO. We are actively supporting it. We think it
does a lot of things.
We currently have linkages to the Census, SIPP and CPS. You have heard a
lot about those linkages and how that works. We have supported data
improvements for SIPP and CPS. We are currently supporting a Census effort to
get more matched data for the 2004 SIPP panel using the administrative address
record process, which you heard about this morning. They are in the process of
contacting people, and we are contributing funds to support that.
We also enhance the health and retirement survey and try to promote it as a
data set that is of use to the community. We have funded enhancements. We are
currently funding a projection of lifetime pension wealth and lifetime social
security wealth using administrative records, which will be released to the
user community for people in the early baby boom. A similar activity was done
for people born in the Depression.
We have studied bias from using only matched data. We find there is a bias,
and we are going to try and promote or create something for the health and
retirement survey to allow the analytical community to deal with that bias.
So I think we are very active in pushing this and supporting it. We have
tried to create a user community. I’m not sure I mentioned, but at least 158
HRS studies have been funded by us, and many of them are using this stuff.
What future directions? That was a question that was raised, future
directions I see. We are trying synthetic data. There has been an effort with
John Abowd at Census to create synthetic administrative data for the SIPP that
would be released publicly. It involves all the SIPP panels for the '90s.
I’m not sure how well it will work. I suspect it will work for some things
and not for others. I am somewhat skeptical about its use for the decision to
retire and take up social security benefits, for example, which can be an
idiosyncratic kind of thing, as opposed to measuring general wealth and income.
I think we are going to see — I mentioned the assignment of linkage from
government address files which Census is doing. That is being tried for the
first time in the SIPP or CPS. That was done on ACS, but I don’t think it was
done on SIPP or CPS.
There is a broader health and retirement survey consent form that allows us
to release more administrative data. Previously the lawyers who negotiated that
kind of stuff made things very narrow, and would only allow release up to the
moment that the consent was signed. We now have an agreement that the consent
will include matching through 2021, which means that we will be able to follow
these people forward into the future, which is what we currently are doing with
the 1984 SIPP, for example. Those folks were interviewed a long time ago.
We have earnings records, yearly earnings records through 2004. We have social
security benefits through December of 2005. We have SSI benefits paid by Social
Security through December of 2005. We keep updating our administrative file.
Each spring we go and pull in another year of data, so we are constantly
creating a very useful longitudinal data set from our administrative records.
Possible future linkages. I would love to see a linkage to the 1040 tax
records, so we can identify the validity of some of the income being reported.
I would love to see a match to the national compensation survey, which is a
survey of employers about their pension plans. Potentially Medicare bills.
I did that once with the new beneficiary surveys. As Gerry said it was a
very painful process because of the dirty data set. So I don’t know, I would
probably have to be pushed to do Medicare bills.
We currently have linked a lot of SSA records. We have a master beneficiary
record, which is our Title 2. That is what most people call social security.
For the disabled, we have the latest disability insurance information,
including the first and second diagnosis of why someone was disabled. This is
close to the ICD-9. Social Security has some slight differences.
I would just point out that in order to get disability from Social Security
you have to have been examined by a doctor and proved that you are unable to
work, that this is a medically determinable condition that lasts 12 months or
will kill you.
It is a growth industry. We now have eight million beneficiaries, 6.2
million of them are disabled workers, 154,000 are spouses and 1.6 million are
children of workers.
So this is a fairly concrete establishment that somebody has poor health. At
this point, roughly a third of the folks are mentally impaired. Back in the old
days when we did the New Beneficiary Survey it was ten percent. So that is
quite a shift.
The SSI record, Title 16, we have every month the benefits since the
program started. We have seven million SSI people. Four million are working
age, two million are aged, and about a million are kids.
We have the Numident. Census told you about the Numident for giving you
the social security number. We use it for date of death. That is where death is
recorded, so we can tell when people have departed our data set.
Summary earnings record, that is the record of — actually, you never
depart our data set. You always are in our data set. You depart active
updating, shall we say. The summary earnings record is the record of earnings
under social security taxes. That is what these people will receive their
benefits on. That is what they are going to get paid from, so you have the real thing.
The detailed earnings record is our phrase for the W-2 tax record.
We have the summary earnings record from 1951 through 2004; we have the
detailed earnings record from 1982 through 2004. Recently researchers have
discovered that they can pull out the amount of money that has been deferred
for 401K type pensions. It is measurable on the form. They are starting to
measure it, and it is a much more valid source of information on participation
in defined contribution pensions than self reports and retrospective
information might be.
We do have Medicare Part D extra help application information. It is
feasible, but we haven’t actually matched it up. I think that basically is it.
So we have what people are being paid by an agency that pays benefits to 46
million people under Title 2 and seven million under SSI. We have what
they really have earned, according to what is filed with the tax authorities.
This is a fairly useful data set, and we run it longitudinally for all of the
surveys we deal with, which is SIPP and CPS and HRS.
Rules for use. You have heard the Census rules for use fairly extensively,
so I won't mention that.
The HRS is widely used because the researchers can do it in their own
location. They have to have an approved research project. They have to have a
federal grant or use a secured data center at Michigan; most get a federal
grant. They have to have an isolated computer, they have to agree to
restrictions on access and public release. They can’t use geography with the
SSA records except at the Michigan secure site. Social Security funds the HRS
to send out a contractor on unannounced visits to evaluate whether the
restrictions on use of the computer and the data are being complied with. They
do get visited, even at their own locations.
Validity and reliability. I think that it is hard to doubt that our record
of what you got paid for SSI, and our record of what the Treasury issued to you
in social security payments, is more reliable than the self-reports. We have
done several studies, which are in the SIPP working paper
series about the extent of bias and misreporting. It is not extensive, but it
does have an effect, particularly on the SSI.
I think another thing that the data provide is that people are ignorant of
a number of the details of the social security program. For people doing
retirement research, sometimes those details are very, very important. I
highlighted one thing that the man on the street would say is important, how
much money did you contribute to your pension account. I doubt there are that
many people who remember the dollar amount in the last 20 years.
Another reason is, it is a fairly high cost thing to collect longitudinal
data. These are fairly high quality data: earnings from 1951 to the current
year, W-2s from 1982. The SSA benefits start in 1962, the SSI benefits in 1974.
What is new? We have got this publicly synthesized administrative data
which is somewhere in disclosure review. If the world works right it will be
available in the next couple of months. For the health and retirement survey,
we have a much more complete release of administrative data. A past practice
had been to release a selected set of items. I think it was way too minimal.
We had evolved at Social Security a whole set of files that we developed
for the SIPP and for our modeling. We are now releasing most of that
information with the health and retirement survey. It was sent out to Michigan
for the 2004 interview in the spring. There are some users who are starting to
use it now.
That’s it on that.
We were supposed to talk about barriers. I didn’t put barriers in my thing,
I would just as soon not write it down. I don’t think there are any barriers
internal to my organization. The only issue with my organization is that these
records come from administering a program. They are not created for research,
they are created to administer the program. So in order to use the records very
well, you have to understand the program, and there are not very many people in
the research community who understand the details of the program.
I think if you stick to how much was the payment and how much were the
earnings, and when did the benefits start, everybody is going to be okay, but
they do have their idiosyncrasies. We are doing internally an effort to
document all these files and all these data. The last version was yay thick. We
are updating it, and we should have this in the next six months. This effort
has been going on for about nine months, involving specialists in the MBR,
specialists in the SSR, specialists in these different files, and we will make
that available to the research community who wants it.
Another barrier is, there is a public reluctance to provide social security
numbers in these surveys. The last HRS got a 50 percent response rate for this,
the SIPP got a 60 percent. We are going to fund HRS to do a better job and
perhaps that will enhance things, but it is a problem. These are such valuable
data, it would be tragic not to have them.
There is a difficulty getting other agencies to let us use their data in
our matched data records. I think I expressed this morning, I consider our data
security equivalent to breaking into Fort Knox. I really am serious about that.
My agency is obsessed with making sure the public can't get access to people's
earnings and benefits records. I think that the computer security is incredibly
difficult to break through. We use much more advanced methods. Anyway, it is
just very, very secure.
So I have a hard time understanding why we wouldn’t be able to bring in a
data set and match it up for statistical purposes. For three years we have been
trying to get the Bureau of Labor Statistics to let us use their national
compensation survey. We first tried to get a copy to statistically match, not
an exact match but a statistical match. We did it on our own. We worked through
Census with CIPSEA. That all has failed, so now they are drawing up some
agreement and our lawyers are talking to their lawyers, and I don’t know if
that will ever occur.
For three years we have been trying to get the 1040 records for a year of
the CPS and a year of SIPP, so we can assess the validity of reports of asset
income. Our statistical measures suggest that there is an underreport: the
percentage reporting asset income is underreported by 15 percentage points.
This is a very big deal to Social Security, because there is a statistic we
calculate about how many people rely entirely upon Social Security for their
income. If you just had one dollar in asset income you wouldn’t be in that
statistic, and we think it would cut our level in half if we had valid
information. But we have been unable to get that, for some reason.
Disclosure you know is a problem. We use secure files. We support
financially the synthetic data. We have made every effort to try and get the
synthetic data to work. It would be nice if it would, I don’t know if it will.
We expect to have a contractor evaluate its utility when it finally is made
available. So I think those are the barriers that we face, and those are the
strengths and weaknesses of where things stand. I will stop.
DR. STEUERLE: Questions, comments?
MR. LOCALIO: You made an interesting comment about, their lawyers are
talking to your lawyers. One of the problems I have felt since I have been on
the committee for the last four years is that the lawyers never talk to the
data people. There are some good reasons for this. Lawyers tend to hate data,
they don't understand data. You can see on the end of my tag here what I used
to do in my former life, before I climbed the data mountaintop and saw the
promised land.
But I am finding that there are these lawyers out there who make
pronouncements and write rules and regulations and policies and do not touch
base with the people who have to use the data. Then they don’t understand the
practical problems or the implications of what they do. I don’t think we have
any agency lawyers here today, is that correct?
DR. IAMS: I want to defend our agency lawyers. We have been having to go
through the agency lawyers on interagency agreements now for about two years.
We are able to communicate with them fairly well.
This last one with BLS, maybe you are right, because I don't think they
understand CIPSEA at all. It is partly under CIPSEA and it is using CIPSEA
language. But they were able to call over and talk to each other and come up
with some sort of agreement.
I noticed that BLS wants to review every piece of information that is
created by my agency using their data. If that means that they have to approve
what goes to the White House, I don’t think we will have a deal.
It has been a very long and drawn-out process. For the most part I can
understand what you are saying.
MS. TUREK: When you say synthetic data, I think of something polyester.
DR. IAMS: It’s close.
MS. TUREK: What happens? How do you create synthetic data? Do you try to
maintain the same distribution? If a person is in a program, do they stay in
it? I suspect that we could have a whole conference on what you mean by
synthetic data.
DR. IAMS: When it was first sold to us four years ago, they were going to
statistically change the survey data and keep all of our administrative records
intact and whole.
What is now being reviewed by the disclosure board has statistically made
up every single one of our administrative records, and all of the survey data
except three data items. So you will have to go review the work of the
statisticians who did the econometric modeling, which says that they think they
retained the relationships.
As a multivariate analyst type, I am dubious and have been dubious, which
is why I wanted what they originally said they were going to do. But in order
to meet disclosure review, they got pushed and pushed and pushed until now it
is all statistically made up, for the most part.
DR. STEUERLE: I should indicate, at Treasury for years we developed — and
still we had to use, I don’t think they called them synthetic data, but they
were statistically matched data sets. So they weren’t exact matches, they were
statistically matched. Tom Petska who will talk to you in a second has been
involved with that somewhat.
The notion was, you preserve the variance-covariance matrix for both the
original administrative data set and, say, a survey data set you had, but for
the link you used some sort of transportation algorithm. You have a cost
function, and you minimize the distance between wages or interest, stuff like
that. Then you
hope that the things you didn’t minimize on were okay, which is where I think
Howard was becoming quite skeptical.
But sometimes, if you are in the government and you have to make cost
estimates, or you are in perhaps Medicare or some other place, you have to have
a more complete data set. Sometimes we have no other choice.
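[Editor's illustration: the distance-minimizing match Dr. Steuerle describes can be sketched in a few lines. This is a simplified nearest-neighbor variant, not the full transportation formulation with marginal constraints, and every variable name and figure below is invented, not taken from any actual SSA or Treasury file.]

```python
# Simplified statistical match: each survey (recipient) record borrows the
# donor variable from the closest administrative (donor) record, where
# "closest" is squared Euclidean distance on the variables both files share.
# A real transportation-based match would also constrain how often each
# donor record can be used, so that marginal distributions are preserved.

def distance(a, b):
    # Squared distance on the shared variables (wages, interest income).
    return (a["wages"] - b["wages"]) ** 2 + (a["interest"] - b["interest"]) ** 2

def statistical_match(recipients, donors):
    matched = []
    for r in recipients:
        best = min(donors, key=lambda d: distance(r, d))
        fused = dict(r)
        fused["benefits"] = best["benefits"]  # variable only the donor file carries
        matched.append(fused)
    return matched

# Invented example records.
survey = [{"wages": 30_000, "interest": 200},
          {"wages": 80_000, "interest": 5_000}]
admin = [{"wages": 31_000, "interest": 150, "benefits": 9_000},
         {"wages": 79_000, "interest": 4_800, "benefits": 15_000}]

fused = statistical_match(survey, admin)
```

[Relationships among variables that were not part of the cost function carry over only by hope, which is exactly the source of the skepticism voiced above.]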
DR. IAMS: I think it will work very well for certain kinds of research
questions. If you want to know if someone earned a lot of money in their
lifetime and worked a lot of years, these data should be just fine. They are
probably pretty close to the original.
If you are trying to do what a lot of labor economists are doing, which is
predict exactly when someone applies for social security, when do they leave
the labor force, that point of decision is going to be statistically smudged.
I’m not sure it will be as good to research that subject.
DR. STEUERLE: Another way to say this, we have no idea what the standard
deviation on any variable you create is.
DR. PETSKA: I’d just like to say a couple of things. Gene, that is exactly
right. Treasury is still doing statistical matches. Treasury has a very strong
interest in the CPS, and they cannot get the identifiable version of that, so
they have been doing statistical matches for many years, back when Gene was
there in the Tax Reform Act of 1986 and so on.
But in regard to your comments about access to 1040 data, I don’t know if
you were trying three years ago or for three years or whatever, but I would
certainly be willing to talk more about that. As I think you know, I will be
talking tomorrow morning at 9 o’clock in the first presentation. Access to tax
data is closely regulated by Title 26, 6103 of the Internal Revenue Code. It
requires an authorized statutory purpose.
Social Security is clearly in the code, but the question is, what about the
purpose? Even though you get access to certain files, what about other files
and the content? We agonize with Census Bureau repeatedly, and Gerry can vouch
for this, about not only what files, but literally an item by item basis. There
is a regs request to expand the items that is pending right now. The Bureau of
Economic Analysis has the same thing.
So we can certainly talk about that off line and see where it is now. It
has gone from your lawyers to our lawyers, and that is why it hasn’t gone
anywhere. Maybe reason can prevail.
The last comment I just wanted to make very briefly was, for such things as
W2 data which go into earnings histories, there is a joint custodianship
relationship. This is shared by the agencies. So if you appeal to Social
Security to get access to the earnings histories, that has to come to us as
well because those data are co-owned by the two agencies.
DR. IAMS: We have had a very amicable and longstanding relationship in
dealing with those.
MR. RILEY: One additional comment on the synthetic data. As we try to match
administrative records data to survey data, this is the problem. We can either
release no administrative records, we can force people to come to an RDC or come
to the area or figure out some way to get remote access, or we can create
these synthetic files, which is one of the things I think they were trying
to do with the SIPP.
One of the big problems with creating a synthetic file after the data set
is released is, you haven’t done any top coding to begin with, and you are
trying to add these other data on, and it makes it much more complicated. So I
think this is one of the things we are trying to do with the new system, is
design it such that we know beforehand what is going to be released and what is
not going to be released.
This is also one of the issues we have asked the CNSTAT panel to look at:
are synthetic data good, what is going on with them, can we assess their
quality in addition to the administrative data. So this is one of the
things that really needs to be looked at, which is why we need to get the data
set out, so people can review this and get some information.
DR. STEINWACHS: I was just curious from the researcher side. Are there
examples in which the analyses of synthetic data sets in terms of policy
analysis have been done and published in the peer review literature? Or is
there a pushback from reviewers about, once you indicate it is a synthetic data
set versus quote-unquote a real data set?
MR. RILEY: I think there are some examples. John Abowd has a number of
papers. They have used both the synthetic data and the actual data.
We currently have a variable that we have looked at, the monthly benefit
amount, which is from the Social Security records, matched to the 2004 SIPP
data to compare the means. The key is the distributions and the covariances.
You look at the means by about 20 different characteristics, and they are very
similar, the synthetic data versus the actual data.
So this is the stuff we are looking at. But I think if you go to the LEHD
site on the Census website, you will see some papers that have tried to
evaluate the synthetic data.
MR. LOCALIO: I want to follow up on that. I have seen the technical papers
in the field, but I don't think I have seen an applied paper using synthetic data.
I can tell you right now, if I saw one, I would be open minded about it
only because I have read the technical papers. But I would say that it is well
beyond the capability of most journals to review papers like that. It is too
new. It hasn’t been tried out enough.
MR. RILEY: Applied journals.
MR. LOCALIO: Applied journals, that is correct.
DR. STEINWACHS: This may take you in a little bit different direction, but
when we all talk, we always think about the social security number as the thing
to link on if you can. There have been a number of comments about the
difficulty of collecting it and so on.
I am just wondering how good the social security number is. If you say
there is one person for the entire life, I always think of the witness
protection program, where I assume it changes. I know that is a very small
number hopefully of people. But there must be things that look funny when you
look at social security numbers, where people have picked them up, used them,
maybe don’t use them. Some people have two or more of them eventually in their
lifetime. Is that a tenth of a percent that you think are funny numbers, or is
this more pervasive in our system?
DR. IAMS: I’m not in a position to say. I think the Office of the Inspector
General at Social Security has had some statements or studies made about the
validity of social security numbers. You know that it is a concern at the
Homeland Security Administration.
I would say, we have something called the earnings suspense file, which has
millions if not billions of dollars sitting in it. It comes about when
someone's tax record comes in and something is wrong with the record; it
doesn't quite match the name with the number, and it can't get deposited in the
earnings record.
Now, that is not an area I know a great deal about, but this was going on
when I first went to work for the agency back in the 1980s when our agency lost
a quarter of its work force. The suspense file went way up. I recall, at one
point IRS refused to work on it and SSA refused to work on it, because they
were both short of administrative funds in the '80s. Anyway, I would just say
it empirically exists. It is not trivial. If you go to the Inspector
General reports, I know they have had work done on the earnings suspense file.
That might give you a sense.
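[Editor's illustration: the posting logic behind the suspense file Dr. Iams describes can be sketched as follows. This is a toy version of the idea only; the identifiers, names, amounts, and matching rule are fabricated and do not reflect SSA's actual edit checks.]

```python
# Toy posting step: a wage report is credited to an earnings record only when
# the (SSN, name) pair agrees with the reference file; otherwise the report
# lands in a suspense list instead of anyone's earnings history.

reference = {"123-45-6789": "DOE JOHN", "987-65-4321": "ROE JANE"}

def post_wage_report(report, earnings, suspense):
    ssn, name, wages = report
    if reference.get(ssn) == name:
        earnings[ssn] = earnings.get(ssn, 0) + wages
    else:
        suspense.append(report)  # name/number mismatch: cannot be posted

earnings, suspense = {}, []
post_wage_report(("123-45-6789", "DOE JOHN", 30_000), earnings, suspense)
post_wage_report(("987-65-4321", "ROE J", 25_000), earnings, suspense)  # mismatch
```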
DR. COX: I can say that when we linked the HIS data to SSA data, we did a
validation step first. While I don't think it is a huge problem, we see two
areas of concern. One is elderly women who are using their husband’s social
security numbers because they don’t have their own, or they receive benefits
under his work history. We do see lower validation rates for SSNs provided by
DR. STEUERLE: Howard, turning to an area that you and I have worked on over
the years, one thing you do have in Social Security is an ability to try to
measure lifetime earnings. Although there are gaps, it is probably one of the
best data sources anywhere to look at relative status of populations over long
periods of time.
In fact, the former head of CMS did a study right before he took that job
in which he tried to estimate whether Medicare was distributed in a progressive
manner according to income. I can't remember the study, but I remember looking
at it and thinking, if anything was a synthetic data set, that was it, because
that is what he had.
You have Social Security which if linked to Medicare could come much closer
to answering that question. So beyond answering that narrow question of that
linkage of Medicare data to lifetime earnings to see studies on the
distribution of Medicare benefits, are there other efforts being made to try to
figure out how to use a lifetime earnings measure of well-being for a whole
variety of health outcomes and measures?
DR. IAMS: I think that our Retirement Research Consortium has a number of
papers and analyses that use lifetime earnings. Because the Health and
Retirement Study is in the public domain with minimal restrictions, there is a
lot of use.
The economists who study that type of thing were using this when everything
was in the public domain back in the 1960s, before Watergate led to the
restrictions of the Privacy Act. They carried on that type of work.
Now, sometimes they have health as an outcome. Many times they are doing
labor force retirement. Let me ask Kalman Rupp, since you do more in the
disability and health area: are some of the people studying the outcome of
disablement or health impairment, or something along that line such as
Medicare, using lifetime earnings as a predictor?
MR. RUPP: (Comments off mike.)
DR. IAMS: I guess you have tapped a subject for the user research community
to get into.
DR. STEUERLE: Does the SEER data file have lifetime earnings as a
prognosticator of cancer?
DR. IAMS: You haven’t matched it up to earnings records, have you?
DR. BREEN: We actually tried to link it with Social Security Administration
records at one point. Maybe we tried to be too detailed, but not all of the
data were electronic, so they weren’t available for all the years.
Largely it was available pretty much only for white males, though I think
that will change as baby boomers retire. There weren’t earnings histories for
most of the people.
DR. IAMS: That sounds weird. I have been using these data for 20 years, and
that has not been the case when you match it up to SIPP. We have something
like an 85 to 90 percent match to SIPP and a 75 percent match to CPS, and we
don’t have the
kind of holes you are talking about.
DR. BREEN: What is the age range? Because cancer is a disease of older
people.
DR. IAMS: Our data don’t start until 1951, so we have earnings on people
after 1950. If you were looking at people born around the time of the First
World War, there would be a lot of missing earnings.
DR. STEINWACHS: I am told that our contractual arrangements for the room
end this afternoon at 4:45. You may be thankful for that.
I do want to thank all the speakers this afternoon. This has really been
fantastic, a wonderful learning experience and a great exchange. Thank you
very much.
DR. STEUERLE: Can I just add one thank you to Cynthia Sidney, who helped
DR. STEINWACHS: Hear, hear. I hope you will be here at 9 o’clock tomorrow
morning, when we will learn from the IRS how to solve all these
problems. We will move on to the Veterans Administration, education and some
other areas. Thank you very much.
(Whereupon, the meeting was recessed at 4:46 p.m., to reconvene Tuesday,
September 19, 2006 at 9:00 a.m.)